12/4/23, 8:21 PM
-__BD2
University of Stirling
ITNPBD2 Representing and Manipulating Data
Assignment Autumn 2023
A Consultancy Job for JC Penney
Structure
You may structure the project how you wish, but here is a suggested guideline to help you
organise your work:
1. Data Exploration - Explore the data and show you understand its structure and relations
2. Data Validation - Check the quality of the data. Is it complete? Are there obvious errors?
3. Data Visualisation - Gain an overall understanding of the data with visualisations
4. Data Analysis - Set some questions and use the data to answer them
5. Data Augmentation - Add new data from another source to bring new insights to the
data you already have
1. Data Exploration
Data exploration was carried out to understand the dataset, including its structure, content,
and overall characteristics, and to gain insight into the types of variables, their data types,
and the relationships between them. The CSV and JSON files are then loaded into
DataFrames using pandas' read_csv() and JSON decoding methods.
The required libraries to be used for this consultancy analysis are imported below.
In [1]: # We import our libraries needed for this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from textblob import TextBlob
In [2]: # Data Exploration showing the structures and relations of the data
# Load the CSV files into dataframes
products_df = pd.read_csv('products.csv')
reviews_df = pd.read_csv('reviews.csv')
users_df = pd.read_csv('users.csv')

# Displaying basic information about the structure of the Products data
print("Products Data:")
print(products_df.head())    # Displays the first few rows of the Products dataframe
print(products_df.info())    # Displays information about the data types and missing values
print(products_df.columns)   # Displays the column names of the Products dataframe

# Displaying basic information about the structure of the Reviews data
print("\nReviews Data:")
print(reviews_df.head())     # Displays the first few rows of the Reviews dataframe
print(reviews_df.info())     # Displays information about the data types and missing values
print(reviews_df.columns)    # Displays the column names of the Reviews dataframe

# Displaying basic information about the structure of the Users data
print("\nUsers Data:")
print(users_df.head())       # Displays the first few rows of the Users dataframe
print(users_df.info())       # Displays information about the data types and missing values
print(users_df.columns)      # Displays the column names of the Users dataframe
Products Data:
                            Uniq_id  SKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                         Description  Price  Av_Score
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125

RangeIndex: 7982 entries, 0 to 7981
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Uniq_id      7982 non-null   object
 1   SKU          7915 non-null   object
 2   Name         7982 non-null   object
 3   Description  7439 non-null   object
 4   Price        5816 non-null   float64
 5   Av_Score     7982 non-null   float64
dtypes: float64(2), object(4)
memory usage: 374.3+ KB
None
Index(['Uniq_id', 'SKU', 'Name', 'Description', 'Price', 'Av_Score'], dtype='object')
Reviews Data:
                            Uniq_id  Username  Score  \
0  b6c0b6bea69c-baeac73c13d  fsdv4141      2
1  b6c0b6bea69c-baeac73c13d  krpz1113      1
2  b6c0b6bea69c-baeac73c13d  mbmg3241      2
3  b6c0b6bea69c-baeac73c13d  zeqg1222      0
4  b6c0b6bea69c-baeac73c13d  nvfn3212      3

                                              Review
0  You never have to worry about the fit...Alfred...
1  Good quality fabric. Perfect fit. Washed very ...
2  I do not normally wear pants or capris that ha...
3  I love these capris! They fit true to size and...
4  This product is very comfortable and the fabri...

RangeIndex: 39063 entries, 0 to 39062
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Uniq_id   39063 non-null  object
 1   Username  39063 non-null  object
 2   Score     39063 non-null  int64
 3   Review    39063 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.2+ MB
None
Index(['Uniq_id', 'Username', 'Score', 'Review'], dtype='object')
Users Data:
   Username DOB          State
0  bkpn1412   -         Oregon
1  gqjs4414   -  Massachusetts
2  eehe1434   -          Idaho
3     hkxj-   -        Florida
4     jjbd-   -        Georgia

RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Username  5000 non-null   object
 1   DOB       5000 non-null   object
 2   State     5000 non-null   object
dtypes: object(3)
memory usage: 117.3+ KB
None
Index(['Username', 'DOB', 'State'], dtype='object')
Computing summary statistics for the numerical columns of the datasets
In [3]: # Summary Statistics
print(products_df.describe())  # Displays summary statistics for numerical columns in the Products dataframe
print("\nReviews Data:")
print(reviews_df.describe())   # Displays summary statistics for numerical columns in the Reviews dataframe
print("\nUsers Data:")
users_df.describe()            # Displays summary statistics for numerical columns in the Users dataframe
[describe() output: count, mean, std, min, 25%, 50%, 75% and max for Price and Av_Score
in the products data and for Score in the reviews data; Out[3] shows count, unique, top
and freq for Username, DOB and State in the users data (top state: Massachusetts);
the numeric values were not preserved in this export]
Data Size of the datasets
In [4]: # Data size
print("Data Size:")
print("Products Dataset Size:", products_df.shape)  # Displays the data size of the products dataframe
print("Reviews Dataset Size:", reviews_df.shape)    # Displays the data size of the reviews dataframe
print("Users Dataset Size:", users_df.shape)        # Displays the data size of the users dataframe
Data Size:
Products Dataset Size: (7982, 6)
Reviews Dataset Size: (39063, 4)
Users Dataset Size: (5000, 3)
Loading the JSON files for reviewers and products
In [5]: # Specify the path to the JSON file
json_file_path = "jcpenney_reviewers.json"
data_list = []  # Initializes an empty list to store JSON objects

# Read the JSON file line by line
with open(json_file_path, "r") as json_file:
    for line_number, line in enumerate(json_file, start=1):
        try:
            # Load each line as a JSON object and append it to the list
            json_data = json.loads(line)
            data_list.append(json_data)
            # Display each JSON object in the list
            print(json_data)
            if len(data_list) == 3:  # Break out of the loop after the first 3 objects
                break
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON at line {line_number}: {e}")  # Handles malformed lines
{'Username': 'bkpn1412', 'DOB': '-', 'State': 'Oregon', 'Reviewed':
['cea76118f6a9110a893de2b-c0']}
{'Username': 'gqjs4414', 'DOB': '-', 'State': 'Massachusetts', 'Revie
wed': ['fa04fe6c0dd5189f54fe600838da43d3']}
{'Username': 'eehe1434', 'DOB': '-', 'State': 'Idaho', 'Reviewed':
[]}
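The same line-delimited files can also be read in a single call with pandas, which avoids the manual loop. A minimal sketch on an in-memory payload (the two records below are illustrative, copied from the printed output above, not the full file):

```python
import io
import pandas as pd

# Two records in the same line-delimited shape as jcpenney_reviewers.json
jsonl = io.StringIO(
    '{"Username": "bkpn1412", "State": "Oregon", "Reviewed": []}\n'
    '{"Username": "gqjs4414", "State": "Massachusetts", "Reviewed": []}\n'
)

# lines=True tells pandas to parse one JSON object per line
reviewers = pd.read_json(jsonl, lines=True)
print(reviewers.shape)  # (2, 3)
```

With the real file, `pd.read_json("jcpenney_reviewers.json", lines=True)` would produce the same DataFrame that the loop builds via `data_list`.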
In [6]: json_file_path = "jcpenney_products.json"  # Specify the path to the JSON file
data_list = []  # Initializes an empty list to store JSON objects

# Read the JSON file line by line
with open(json_file_path, "r") as json_file:
    for line_number, line in enumerate(json_file, start=1):
        try:
            # Load each line as a JSON object and append it to the list
            json_data = json.loads(line)
            data_list.append(json_data)
            # Display the first JSON object only
            if len(data_list) <= 1:
                print(json_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON at line {line_number}: {e}")  # Handles malformed lines
{'uniq_id': 'b6c0b6bea69c-baeac73c13d', 'sku': 'pp-', 'name_title': 'Alfred Dunner® Essential Pull On Capri Pant', 'description': 'You\'ll return to our Alfred Dunner pull-on capris again and again when you want an updated, casual look and all the comfort you love. \xa0 elastic waistband approx. 19-21" inseam slash pockets polyester washable imported \xa0 \xa0 \xa0', 'list_price': '41.09', 'sale_price': '24.16', 'category': 'alfred dunner', 'category_tree': 'jcpenney|women|alfred dunner', 'average_product_rating': 2.625, 'product_url': 'http://www.jcpenney.com/alfred-dunner-essential-pull-on-capri-pant/prod.jump?ppId=pp-&catId=cat-&&_dyncharset=UTF-8&urlState=/women/shop-brands/alfred-dunner/yellow/_/N-gkmp33Z132/cat.jump', 'product_image_urls': 'http://s7d9.scene7.com/is/image/JCPenney/DP-M.tif?hei=380&wid=380&op_usm=.4,.8,0,0&resmode=sharp2&op_usm=1.5,.8,0,0&resmode=sharp', 'brand': 'Alfred Dunner', 'total_number_reviews': 8, 'Reviews': [{'User': 'fsdv4141', 'Review': 'You never have to worry about the fit...Alfred Dunner clothing sizes are true to size and fits perfectly. Great value for the money.', 'Score': 2}, {'User': 'krpz1113', 'Review': 'Good quality fabric. Perfect fit. Washed very well no iron.', 'Score': 4}, {'User': 'mbmg3241', 'Review': 'I do not normally wear pants or capris that have an elastic waist, but I decided to try these since they were on sale and I loved the color. I was very surprised at how comfortable they are and wear really well even wearing all day. I will buy this style again!', 'Score': 4}, {'User': 'zeqg1222', 'Review': 'I love these capris! They fit true to size and are so comfortable to wear. I am planning to order more of them.', 'Score': 1}, {'User': 'nvfn3212', 'Review': 'This product is very comfortable and the fabric launders very well', 'Score': 1}, {'User': 'aajh3423', 'Review': 'I did not like the fabric. It is 100% polyester I thought it was different.I bought one at the store apprx two monts ago, and I thought it was just like it', 'Score': 5}, {'User': 'usvp2142', 'Review': 'What a great deal. Beautiful Pants. Its more than I expected.', 'Score': 3}, {'User': 'yemw3321', 'Review': 'Alfred Dunner has great pants, good fit and very comfortable', 'Score': 1}], 'Bought With': ['898e42fe937a33e8ce5e900ca7a4d924', '8c02c262567a2267cd207e35637feb1c', 'b62dd54545cdc1a05d8aaa2d25aed996', '0da4c2dcc8cfa0e-b00d22b30', '90c46b841e2eeece992c-c']}
1.b Data Integration
Data integration was carried out to combine the information from the products_df dataset
and jcpenney_products.json to create a comprehensive, unified view using their
common key.
The column names in products_df ('Uniq_id', 'SKU', 'Description') and in
jcpenney_products.json ('uniq_id', 'sKU', 'description') did not correspond, so the
columns of products_df had to be renamed.
In [7]: # Define a dictionary to map old column names to new column names
column_mapping = {'Uniq_id': 'uniq_id', 'SKU': 'sKU', 'Description': 'description'}
# Use the rename method to replace column names
products_df.rename(columns=column_mapping, inplace=True)
# Now, 'products_df' has updated column names
products_df.head()
Out[7]:
                            uniq_id  sKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                         description  Price  Av_Score
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125
In [8]: # Renaming the unique id column for the reviews dataset
# Define a dictionary to map old column names to new column names
column_mapping = {'Uniq_id': 'uniq_id'}
# Use the rename method to replace column names
reviews_df.rename(columns=column_mapping, inplace=True)
# Now, 'reviews_df' has updated column names
reviews_df.head()
Out[8]:
                            uniq_id  Username  Score  \
0  b6c0b6bea69c-baeac73c13d  fsdv4141      2
1  b6c0b6bea69c-baeac73c13d  krpz1113      1
2  b6c0b6bea69c-baeac73c13d  mbmg3241      2
3  b6c0b6bea69c-baeac73c13d  zeqg1222      0
4  b6c0b6bea69c-baeac73c13d  nvfn3212      3

                                              Review
0  You never have to worry about the fit...Alfred...
1  Good quality fabric. Perfect fit. Washed very ...
2  I do not normally wear pants or capris that ha...
3  I love these capris! They fit true to size and...
4  This product is very comfortable and the fabri...
Merging the products_df dataset with the products JSON file using their common column ('uniq_id')
In [9]: json_file = pd.DataFrame(data_list)  # Converts the list of JSON objects to a DataFrame
# Common key to merge on
common_key = 'uniq_id'
# Merge the JSON DataFrame with the CSV DataFrame based on the common key
merged_data = pd.merge(products_df, json_file, on='uniq_id')
print(merged_data.head())  # Displays the first few rows of the merged data
merged_data.to_csv('merged_data.csv', index=False)  # Writes the merged data to a CSV file
                            uniq_id  sKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                       description_x  Price  Av_Score  sku  \
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625  pp-
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000  pp-
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625  pp-
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500  pp-
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125  pp-

                                    name_title  \
0  Alfred Dunner® Essential Pull On Capri Pant
1  Alfred Dunner® Essential Pull On Capri Pant
2  Alfred Dunner® Essential Pull On Capri Pant
3  Alfred Dunner® Essential Pull On Capri Pant
4  Alfred Dunner® Essential Pull On Capri Pant

                                       description_y list_price sale_price  \
0  You'll return to our Alfred Dunner pull-on cap...        ...        ...
1  You'll return to our Alfred Dunner pull-on cap...        ...        ...
2  You'll return to our Alfred Dunner pull-on cap...        ...        ...
3  You'll return to our Alfred Dunner pull-on cap...        ...        ...
4  You'll return to our Alfred Dunner pull-on cap...        ...        ...

        category                 category_tree  average_product_rating  \
0  alfred dunner  jcpenney|women|alfred dunner                     ...
1  alfred dunner  jcpenney|women|alfred dunner                     ...
2       view all       jcpenney|women|view all                     ...
3       view all       jcpenney|women|view all                     ...
4       view all       jcpenney|women|view all                     ...

                                         product_url  \
0  http://www.jcpenney.com/alfred-dunner-essentia...
1  http://www.jcpenney.com/alfred-dunner-essentia...
2  http://www.jcpenney.com/alfred-dunner-essentia...
3  http://www.jcpenney.com/alfred-dunner-essentia...
4  http://www.jcpenney.com/alfred-dunner-essentia...

                                  product_image_urls          brand  \
0  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
1  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
2  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
3  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
4  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner

   total_number_reviews                                            Reviews  \
0                     8  [{'User': 'fsdv4141', 'Review': 'You never hav...
1                     8  [{'User': 'tpcu2211', 'Review': 'You never hav...
2                     8  [{'User': 'pcfg3234', 'Review': 'You never hav...
3                     8  [{'User': 'ngrq4411', 'Review': 'You never hav...
4                     8  [{'User': 'nbmi2334', 'Review': 'You never hav...

                                         Bought With
0  [898e42fe937a33e8ce5e900ca7a4d924, 8c02c262567...
1  [bc9ab3406dcaa84a123b9da862e6367d, 18eb69e8fc2...
2  [3ce70f519a9cfdd85cdbdecd358e5347, b0295c96d2b...
3  [efcd811edccbeb5e67eaa8ef0d991f7c, 7b2cc00171e...
4  [0ca5ad2a218f59eb83eec1e248a0782d, 9869fc8da14...
Conducting a data exploration for the merged data ('merged_data')
In [10]: # Displaying basic information about the structure of the 'merged_data'
print(merged_data.head())    # Displays the first few rows of the merged dataframe
print(merged_data.info())    # Displays information about the data types and missing values
print(merged_data.columns)   # Displays the column names of the merged dataframe
[merged_data.head() output: the same five Alfred Dunner rows already shown under In [9] above]
Int64Index: 7982 entries, 0 to 7981
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   uniq_id                 7982 non-null   object
 1   sKU                     7915 non-null   object
 2   Name                    7982 non-null   object
 3   description_x           7439 non-null   object
 4   Price                   5816 non-null   float64
 5   Av_Score                7982 non-null   float64
 6   sku                     7982 non-null   object
 7   name_title              7982 non-null   object
 8   description_y           7982 non-null   object
 9   list_price              7982 non-null   object
 10  sale_price              7982 non-null   object
 11  category                7982 non-null   object
 12  category_tree           7982 non-null   object
 13  average_product_rating  7982 non-null   float64
 14  product_url             7982 non-null   object
 15  product_image_urls      7982 non-null   object
 16  brand                   7982 non-null   object
 17  total_number_reviews    7982 non-null   int64
 18  Reviews                 7982 non-null   object
 19  Bought With             7982 non-null   object
dtypes: float64(3), int64(1), object(16)
memory usage: 1.3+ MB
None
Index(['uniq_id', 'sKU', 'Name', 'description_x', 'Price', 'Av_Score', 'sku',
'name_title', 'description_y', 'list_price', 'sale_price', 'category',
'category_tree', 'average_product_rating', 'product_url',
'product_image_urls', 'brand', 'total_number_reviews', 'Reviews',
'Bought With'],
dtype='object')
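Since the CSV and JSON product tables are expected to describe the same 7982 products, the merge can be checked rather than assumed. A small sketch on hypothetical miniature tables (the three `a1`..`a3` rows are invented), using pandas' `indicator` and `validate` options:

```python
import pandas as pd

# Hypothetical miniatures of the CSV and JSON product tables
csv_side = pd.DataFrame({"uniq_id": ["a1", "a2", "a3"], "Price": [41.09, 12.50, 9.99]})
json_side = pd.DataFrame({"uniq_id": ["a1", "a2", "a3"], "brand": ["Alfred Dunner"] * 3})

# indicator=True adds a _merge column recording where each row came from;
# validate="one_to_one" raises MergeError if uniq_id repeats on either side
checked = pd.merge(csv_side, json_side, on="uniq_id", how="outer",
                   indicator=True, validate="one_to_one")

# Every row should come from both sides if the tables cover the same products
unmatched = checked[checked["_merge"] != "both"]
print(len(unmatched))  # 0
```

Applied to `products_df` and `json_file`, a non-empty `unmatched` frame would reveal product ids present in only one of the two sources.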
2. Data Validation
Data validation was carried out to reveal any missing values in the datasets and to ensure
that the data is accurate and free from errors. In addition, it verifies that relationships
between variables or columns are consistent and align with expectations.
In [11]: # Data Validation to check the quality of the data and possible errors
# Checking for missing values in the Products DataFrame
print("\nMissing values in Products Data:")
print(products_df.isnull().sum())  # Displays the sum of missing values for each column

# Checking for missing values in the Reviews DataFrame
print("\nMissing values in Reviews Data:")
print(reviews_df.isnull().sum())   # Displays the sum of missing values for each column

# Checking for missing values in the Users DataFrame
print("\nMissing values in Users Data:")
print(users_df.isnull().sum())     # Displays the sum of missing values for each column

# Checking data types consistency in the Products DataFrame
print("\nData types in Products Data:")
print(products_df.dtypes)  # Displays the data types of each column in the Products DataFrame

# Checking data types consistency in the Reviews DataFrame
print("\nData types in Reviews Data:")
print(reviews_df.dtypes)   # Displays the data types of each column in the Reviews DataFrame

# Checking data types consistency in the Users DataFrame
print("\nData types in Users Data:")
print(users_df.dtypes)     # Displays the data types of each column in the Users DataFrame
Missing values in Products Data:
uniq_id           0
sKU              67
Name              0
description     543
Price          2166
Av_Score          0
dtype: int64

Missing values in Reviews Data:
uniq_id     0
Username    0
Score       0
Review      0
dtype: int64

Missing values in Users Data:
Username    0
DOB         0
State       0
dtype: int64

Data types in Products Data:
uniq_id         object
sKU             object
Name            object
description     object
Price          float64
Av_Score       float64
dtype: object

Data types in Reviews Data:
uniq_id      object
Username     object
Score         int64
Review       object
dtype: object

Data types in Users Data:
Username    object
DOB         object
State       object
dtype: object
Data validation for the merged data (merged_data)
In [12]: print("\nMissing values in Merged Data:")
print(merged_data.isnull().sum())
Missing values in Merged Data:
uniq_id                      0
sKU                         67
Name                         0
description_x              543
Price                     2166
Av_Score                     0
sku                          0
name_title                   0
description_y                0
list_price                   0
sale_price                   0
category                     0
category_tree                0
average_product_rating       0
product_url                  0
product_image_urls           0
brand                        0
total_number_reviews         0
Reviews                      0
Bought With                  0
dtype: int64
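Beyond missing values, the stated validation goal of checking that relationships between tables are consistent can be made concrete: every Username in the reviews data should ideally exist in the users table. A sketch on hypothetical miniature frames (the 'ghost001' record is invented to show a violation):

```python
import pandas as pd

# Hypothetical miniatures of the reviews and users tables
reviews = pd.DataFrame({"Username": ["fsdv4141", "krpz1113", "ghost001"]})
users = pd.DataFrame({"Username": ["fsdv4141", "krpz1113"]})

# Usernames that appear in reviews but have no matching user record
orphans = set(reviews["Username"]) - set(users["Username"])
print(sorted(orphans))  # ['ghost001']
```

Run against `reviews_df` and `users_df`, a non-empty `orphans` set would flag reviewers who cannot be joined to a user profile.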
Fixing the missing values in the columns (Price and Description) for ('products_df')
In [13]: # For Price
products_df['Price'].fillna(0, inplace=True)
# For Description
products_df['description'].fillna('No Description', inplace=True)
# For SKU, it's best to drop it to keep the analysis smooth
column_to_drop = 'sKU'
products_df.drop(columns=column_to_drop, inplace=True)
print(products_df.isnull().sum())
uniq_id        0
Name           0
description    0
Price          0
Av_Score       0
dtype: int64
Fixing the missing values in the columns (Price, Description and sKU) for the merged data
In [14]: # For Price
merged_data['Price'].fillna(0, inplace=True)
# For Description
merged_data['description_x'].fillna('No Description', inplace=True)
# For SKU, it's best to drop it to keep the analysis smooth
column_to_drop = 'sKU'
merged_data.drop(columns=column_to_drop, inplace=True)
print(products_df.isnull().sum())
uniq_id        0
Name           0
description    0
Price          0
Av_Score       0
dtype: int64
Performing a Sentiment Analysis on Product Description for the Merged Data
For this part, a natural language processing (NLP) technique is used: sentiment polarity is
extracted from the description text to determine the positive, negative or neutral
sentiment associated with specific products.
In [15]: # Function to analyze sentiment using TextBlob
def analyze_sentiment(description):
    if pd.isna(description):  # Check for NaN values
        return 0  # or any default value based on your preference
    blob = TextBlob(description)
    return blob.sentiment.polarity

# Apply sentiment analysis to the "Description" column
merged_data['Sentiment'] = merged_data['description_x'].apply(analyze_sentiment)

# Creating a new column for sentiment labels (positive, negative, neutral)
merged_data['Sentiment_Label'] = merged_data['Sentiment'].apply(
    lambda score: 'Positive' if score > 0 else ('Negative' if score < 0 else 'Neutral')
)
print(merged_data[['description_x', 'Sentiment', 'Sentiment_Label']])  # Displays the results
                                          description_x  Sentiment Sentiment_Label
0     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
1     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
2     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
3     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
4     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
...                                                 ...        ...             ...
7977  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7978  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7979  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7980                                     No Description   0.000000         Neutral
7981                                     No Description   0.000000         Neutral

[7982 rows x 3 columns]
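The tiny negative polarities of order 1e-17 in the first rows are almost certainly floating-point rounding noise from TextBlob's averaging rather than genuinely negative sentiment. If desired, the labelling rule above could apply a small tolerance around zero; a sketch (the tol value is an arbitrary choice, not from the assignment):

```python
def label_sentiment(score, tol=1e-9):
    # Scores within ±tol of zero are treated as Neutral to absorb float noise
    if score > tol:
        return 'Positive'
    if score < -tol:
        return 'Negative'
    return 'Neutral'

print(label_sentiment(-1.85e-17))  # Neutral
print(label_sentiment(0.35))       # Positive
```

Under this rule, the Alfred Dunner rows above would be labelled Neutral instead of Negative.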
3. Data Visualization
Data visualization was used to understand the temporal dynamics of the data and to
identify trends and patterns.
The code below utilizes matplotlib.pyplot to create various charts and graphs.
Visualization for the merged data
In [16]: # Data Visualization using the merged data
# Distribution of Price
plt.hist(merged_data['Price'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()  # Displays the plot

# Distribution of Price vs. Average Product Rating
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Price', y='Av_Score', data=merged_data)  # Creates a scatter plot
# Set labels and title
plt.xlabel('Product Price')
plt.ylabel('Average Product Rating')
plt.title('Scatter Plot of Price vs. Average Product Rating')
plt.show()  # Displays the plot
'''
The x-axis represents the product price.
The y-axis represents the average product rating.
Each point in the scatter plot represents a product.
'''
Out[16]:
'\nThe x-axis represents the product price.\nThe y-axis represents the average
product rating.\nEach point in the scatter plot represents a product.\n'
Other Visualization using the unmerged datasets
In [17]: # Data Visualization to gain an overall understanding of the datasets
# Visualize the distribution of scores in the reviews data
plt.figure(figsize=(8, 6))
sns.countplot(x='Score', data=reviews_df)
plt.title('Distribution of Scores')
plt.xlabel('Score')
plt.ylabel('Count')
# The count axis counts the occurrences of each unique score and visualizes the result
plt.show()  # Displays the plot

# Visualize the distribution of reviews in the reviews data
plt.figure(figsize=(8, 6))
sns.countplot(x='Score', data=reviews_df)
plt.title('Distribution of Reviews')
plt.xlabel('Review')
plt.ylabel('Count')
# The count axis counts the occurrences of each unique review score and visualizes the result
plt.show()  # Displays the plot

# User activity by state
plt.figure(figsize=(14, 12))
user_activity_by_location = users_df['State'].value_counts().sort_values(ascending=False)
user_activity_by_location.plot(kind='bar')  # Sorted bar chart of user counts per state
plt.title('User Activity by State')
plt.xlabel('State')
plt.ylabel('User Count')
plt.show()  # Displays the plot
4. Data Analysis
Data analysis was conducted by extracting meaningful insights from the data through statistical
techniques and computational methods. In the context of this assignment, the data analysis
aims to answer specific questions and gain a deeper understanding of the data. These
questions are listed below:
- Average Review Rating by Product Category: Identifying the product categories with the
highest average review ratings can help JC Penney focus on promoting and improving
products in those categories.
- Top 10 Most Reviewed Products: Analyzing the most reviewed products can reveal
popular items and areas where JC Penney could potentially expand its product offerings.
- User Influence on Product Reviews: Understanding the influence of individual users on
product reviews can help JC Penney identify influential customers and potentially leverage
their opinions for marketing purposes.
- Top 5 most common states among users: Analyzing the most common states among users.
- How many unique users wrote a review: Identifying the total number of users who wrote
a review.
This analysis was done using the ('merged_data') dataframe.
QUESTION 1: What is the average Review Rating by Product?
In [18]: # Merge the already merged dataset (merged_data) and the reviews data on the common column
merged_data2 = pd.merge(reviews_df, merged_data, on='uniq_id', how='inner')
average_rating_by_product = merged_data2.groupby('Name')['Score'].mean().reset_index()
print("Average Review Rating by Product:")
print(average_rating_by_product)
Average Review Rating by Product:
                                                   Name  Score
0                1 CT. Certified Diamond Solitaire Ring    ...
1     1 CT. T.W. Certified Diamond 14K White Gold Br...    ...
2     1 CT. T.W. Certified Diamond 14K White Gold Pr...    ...
3     1 CT. T.W. Certified Diamond 14K Yellow Gold B...    ...
4        1 CT. T.W. Diamond 10K White Gold Cluster Ring   ...
...                                                 ...    ...
5996  ¼ CT. T.W. White & Color-Enhanced Black Diamon...    ...
5997    ½ CT. Princess Certified Diamond Solitaire Ring    ...
5998  ½ CT. T.W. Diamond 10K Yellow Gold Contoured A...    ...
5999                      ½ CT. T.W. Diamond Bridal Set    ...
6000            ⅓ CT. T.W. Diamond 3-Stone Promise Ring    ...

[6001 rows x 2 columns]
QUESTION 2: What is the top 10 Most Reviewed product?
In [19]: print("\nMost Reviewed Product:")
# Count the number of reviews for each product, then get the top 10 most reviewed
top_10_most_reviewed = merged_data2['Name'].value_counts().head(10).reset_index()
top_10_most_reviewed.columns = ['Product', 'Review Count']
print("Top 10 Most Reviewed Products:")
print(top_10_most_reviewed)
Most Reviewed Product:
Top 10 Most Reviewed Products:
                                            Product  Review Count
0       Stafford® Gunner Mens Cap Toe Leather Boots           ...
1               Clarks® Leisa Grove Leather Sandals           ...
2       Xersion™ Quick-Dri Performance Bootcut Pant           ...
3  Clarks® Leisa Grove Leather Sandals - Wide Width           ...
4  St. Johns Bay® Secretly Slender Straight-Leg J...           ...
5             Xersion™ Quick-Dri Performance Capris           ...
6                         Arizona Harbor Boat Shoes           ...
7      Liz Claiborne® Rockele Stretch Wedge Sandals           ...
8  Liz Claiborne® Essential Original-Fit Straight...           ...
9            Arizona Raglan-Sleeve Thermal Pullover           ...
QUESTION 3: What is the User influence on product reviews?
In [20]: print("\nUser influence on product reviews:")
# Merge reviews and users data on the common column 'Username'
merged_data3 = pd.merge(reviews_df, users_df, on='Username', how='inner')
# Calculate average review score by user
average_score_by_user = merged_data3.groupby('Username')['Score'].mean().reset_index()
average_score_by_user.columns = ['Username', 'Average_Score']
# Calculate total number of reviews by user
total_reviews_by_user = merged_data3['Username'].value_counts().reset_index()
total_reviews_by_user.columns = ['Username', 'Total_Reviews']
# Merge the calculated metrics
user_influence_metrics = pd.merge(average_score_by_user, total_reviews_by_user, on='Username')
print(user_influence_metrics.head())
User influence on product reviews:
  Username  Average_Score  Total_Reviews
0     aaez            ...            ...
1     aage            ...            ...
2     aagf            ...            ...
3     aahc            ...            ...
4     aajh            ...            ...
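The two per-user metrics can also be computed in a single groupby().agg() pass, which avoids the second merge. A sketch on toy data (hypothetical usernames and scores):

```python
import pandas as pd

toy = pd.DataFrame({
    "Username": ["ann", "ann", "bob"],
    "Score": [4, 2, 5],
})

# Named aggregation: mean and count in one pass instead of two frames merged afterwards
metrics = (toy.groupby("Username")["Score"]
              .agg(Average_Score="mean", Total_Reviews="count")
              .reset_index())
print(metrics)
```

Both approaches give the same table; the single-pass version is simply less code and one fewer intermediate DataFrame.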
QUESTION 4: What are the top 5 most common states among users?
In [21]: print("\n Top 5 most common states among users:")
# Count the number of users from each state, then get the top 5 most common sta
top_states = users_df['State'].value_counts().head(5).reset_index()
top_states.columns = ['State', 'User Count']
print(top_states)
 Top 5 most common states among users:
                      State  User Count
0             Massachusetts         107
1                  Delaware         106
2                   Vermont         103
3  Northern Mariana Islands         102
4                New Jersey         101
QUESTION 5: How many unique users wrote a review?
In [22]: print("\nUnique users that wrote a review:")
# Count the number of unique users who have written reviews
unique_users_count = reviews_df['Username'].nunique()
print("Number of Unique Users who have Written Reviews:", unique_users_count)
Unique users that wrote a review:
Number of Unique Users who have Written Reviews: 4993
5. Data Augmentation
Data augmentation combines information from different sources to enhance the already merged dataset (merged_data.csv). Here, the merged dataset is augmented with information from a new CSV file
(myntra_products_catalog_2.csv). For the new dataset the following steps are carried out:
- Data Exploration
- Data Validation
In [23]: # First, load the new CSV file
new_data = pd.read_csv('myntra_products_catalog 2.csv')
new_data.head() # Displays the first few rows of the new data
new_data.info() # Displays information about the data types and missing values
new_data.columns # Displays the column names of the (new_data) dataframe
RangeIndex: 12491 entries, 0 to 12490
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ProductID     12491 non-null  int64
 1   ProductName   12491 non-null  object
 2   ProductBrand  12491 non-null  object
 3   Gender        12491 non-null  object
 4   Price (INR)   12491 non-null  int64
 5   NumImages     12491 non-null  int64
 6   Description   12491 non-null  object
 7   PrimaryColor  11597 non-null  object
dtypes: int64(3), object(5)
memory usage: 780.8+ KB
Out[23]: Index(['ProductID', 'ProductName', 'ProductBrand', 'Gender', 'Price (INR)',
       'NumImages', 'Description', 'PrimaryColor'],
      dtype='object')
Summary statistics for the new data
In [24]: new_data.describe() # Displays summary statistics for numerical columns in the new data
Out[24]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for ProductID, Price (INR) and NumImages
Data Validation for the (new_data) dataframe
In [25]: # Checking for missing values
print(new_data.isnull().sum()) # Displays each column and its missing-value count
ProductID         0
ProductName       0
ProductBrand      0
Gender            0
Price (INR)       0
NumImages         0
Description       0
PrimaryColor    894
dtype: int64
Fixing the missing values in the (new_data) dataframe
In [26]: # PrimaryColor is the only column with missing values, so it is dropped to keep the data clean for analysis
column_to_drop = 'PrimaryColor'
new_data.drop(columns=column_to_drop, inplace=True)
print(new_data.isnull().sum())
ProductID       0
ProductName     0
ProductBrand    0
Gender          0
Price (INR)     0
NumImages       0
Description     0
dtype: int64
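Dropping PrimaryColor discards a column that is mostly complete (11597 of 12491 rows). An alternative, sketched here on a toy frame, is to keep the column and fill the gaps with a sentinel value instead:

```python
import pandas as pd

# Toy frame with one missing colour (hypothetical values)
toy = pd.DataFrame({"PrimaryColor": ["Black", None, "Blue"]})

# Keep the column but mark missing colours explicitly
toy["PrimaryColor"] = toy["PrimaryColor"].fillna("Unknown")
print(toy)
```

Which choice is right depends on the analysis: filling preserves the colour information for the 93% of rows that have it, while dropping simplifies the frame.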
Data integration for the (new_data) dataframe
The common columns in merged_data.csv ('uniq_id', 'description', 'Price') and in myntra_products_catalog_2.csv, the (new_data) dataframe ('ProductID', 'Description', 'Price (INR)'), do not share names, so the columns of (new_data) are renamed to match.
In [27]: # Define a dictionary to map old column names to new column names
column_mapping = {'ProductID': 'uniq_id', 'Description': 'description_x', 'Price (INR)': 'Price'}
new_data.rename(columns=column_mapping, inplace=True) # Uses the rename method
# Now 'new_data' has the updated column names
new_data.head(10) # Displays the first 10 rows of the dataset and its columns
Out[27]:
  uniq_id                                        ProductName ProductBrand  Gender  Price  NumImages                                      description_x
0     ...  DKNY Unisex Black & Grey Printed Medium Trolle...         DKNY  Unisex  11745          7  Black and grey printed medium trolley bag, sec...
1     ...  EthnoVogue Women Beige & Grey Made to Measure ...   EthnoVogue   Women   5810          7  Beige & Grey made to measure kurta with churid...
2     ...  SPYKAR Women Pink Alexa Super Skinny Fit High-...       SPYKAR   Women    899          7  Pink coloured wash 5-pocket high-rise cropped ...
3     ...  Raymond Men Blue Self-Design Single-Breasted B...      Raymond     Men   5599          5  Blue self-design bandhgala suitBlue self-desig...
4     ...  Parx Men Brown & Off-White Slim Fit Printed Ca...         Parx     Men    759          5  Brown and off-white printed casual shirt, has ...
5     ...    SHOWOFF Men Brown Solid Slim Fit Regular Shorts      SHOWOFF     Men    791          5  Brown solid low-rise regular shorts, has four ...
6     ...        Parx Men Blue Slim Fit Checked Casual Shirt         Parx     Men    719          5  Blue checked casual shirt, has a spread collar...
7     ...  SPYKAR Women Burgundy Alexa Super Skinny Fit H...       SPYKAR   Women    899          7  Burgundy coloured wash 5-pocket high-rise jean...
8     ...  Parx Men Brown Tapered Fit Solid Regular Trousers         Parx     Men    664          5                       Brown solid regular trousers
9     ...                DKNY Unisex Black Large Trolley Bag         DKNY  Unisex  17360          5  Black solid large trolley bag, secured with a ...
From the output above, the 'Price' column of the (new_data) dataframe is clearly not in a float (decimal) format, so data cleaning is carried out on the price column.
In [28]: # Convert the 'Price' column to a numeric data type (float), matching the (merged_data) dataframe
new_data['Price'] = new_data['Price'].astype(float).round(2)
print(new_data)
new_data.head()
       uniq_id                                        ProductName ProductBrand  Gender    Price  NumImages                                      description_x
0          ...  DKNY Unisex Black & Grey Printed Medium Trolle...         DKNY  Unisex  11745.0          7  Black and grey printed medium trolley bag, sec...
1          ...  EthnoVogue Women Beige & Grey Made to Measure ...   EthnoVogue   Women   5810.0          7  Beige & Grey made to measure kurta with churid...
2          ...  SPYKAR Women Pink Alexa Super Skinny Fit High-...       SPYKAR   Women    899.0          7  Pink coloured wash 5-pocket high-rise cropped ...
3          ...  Raymond Men Blue Self-Design Single-Breasted B...      Raymond     Men   5599.0          5  Blue self-design bandhgala suitBlue self-desig...
4          ...  Parx Men Brown & Off-White Slim Fit Printed Ca...         Parx     Men    759.0          5  Brown and off-white printed casual shirt, has ...
...        ...                                                ...          ...     ...      ...        ...                                                ...
12486      ...  Pepe Jeans Men Black Hammock Slim Fit Low-Rise...   Pepe Jeans     Men      ...          7  Black dark wash 5-pocket low-rise jeans, clean...
12487      ...                 Mochi Women Gold-Toned Solid Heels        Mochi   Women      ...          5  A pair of gold-toned open toe heels, has regul...
12488      ...  612 league Girls Navy Blue & White Printed Reg...   612 league   Girls      ...          4  Navy Blue and White printed mid-rise denim sho...
12489      ...  Bvlgari Men Aqva Pour Homme Marine Eau de Toil...      Bvlgari     Men      ...          2  Bvlgari Men Aqva Pour Homme Marine Eau de Toil...
12490      ...  Pepe Jeans Men Black & Grey Striped Polo Colla...   Pepe Jeans     Men      ...          5  Black and grey striped T-shirt, has a polo col...

[12491 rows x 7 columns]
Out[28]: the first five rows above, with 'Price' now a float column
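astype(float) raises an error if any entry is not numeric. A more defensive conversion, should the exported Price column ever arrive as strings with stray non-numeric entries, is pd.to_numeric with errors='coerce' (a sketch with hypothetical values):

```python
import pandas as pd

# Hypothetical raw price strings, including one non-numeric entry
prices = pd.Series(["11745", "5810", "n/a"])

# Non-numeric entries become NaN instead of raising, then round as before
clean = pd.to_numeric(prices, errors="coerce").round(2)
print(clean)
```

The resulting NaNs can then be handled with the same missing-value checks used in the validation step.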
Merging (merged_data) and (new_data) on their common column 'Price'
In [29]: # Common key to merge on
common_key = 'Price'
# Merge the (merged_data) DataFrame with the (new_data) DataFrame based on the common key
augmented_data = pd.merge(merged_data, new_data, on=common_key, how='inner')
augmented_data.head() # Displays the first few rows of the merged data
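Price is not a unique key, so this merge is many-to-many: a price shared by several products multiplies rows. pandas can surface this with the indicator and validate parameters of merge, sketched here on hypothetical frames:

```python
import pandas as pd

# Hypothetical frames where the key 'Price' is duplicated on the right
left = pd.DataFrame({"Price": [899.0, 759.0], "Name": ["A", "B"]})
right = pd.DataFrame({"Price": [899.0, 899.0], "ProductName": ["X", "Y"]})

# indicator=True adds a _merge column; the duplicated key doubles the matches
merged = pd.merge(left, right, on="Price", how="inner", indicator=True)
print(merged)

# validate raises MergeError if the relationship is not what we claim
try:
    pd.merge(left, right, on="Price", validate="one_to_one")
except pd.errors.MergeError as e:
    print("not one-to-one:", e)
```

This explains why the augmented data below has columns uniq_id_x and uniq_id_y: both frames carried a uniq_id column, and since the merge key was Price, pandas kept both copies with suffixes.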
Out[29]:
                          uniq_id_x                                               Name                                    description_x_x  Price  Av_Score  ...
0  d18bc79994ff35a2bca40977f81c4bc7     St. Johns Bay® Jamie Womens Suede Slouch Boots     Our Jamie suede slouch boots are soft, supple ...    ...       ...
1  d2a5ad88091a614e2a45987ad967250f  Samsung 7.5 Cu. Ft. Electric Dryer with Steam Dry  Skip the dry cleaners with this powerful ...         ...       ...
2  aae6a185a8dde153e40d8a6ed271998d        GE® ENERGY STAR® 7.8 Cu. ft. Electric Dryer  Meeting or exceeding federal guidelines f...         ...       ...
3  b31700da42b1a5363e7243a498e1ebd3                      Samsung 7.5 Cu. Ft. Gas Dryer  Get all the power you need with this elec...         ...       ...
4  a00bbd447d31c7c687e580c79bbf354d  LG ENERGY STAR® 7.4 Cu. Ft. Ultra Large Capaci...  Getting your laundry fresh and dry has ne...         ...       ...

5 rows × 27 columns
Further Exploration of the Augmented data ('augmented_data')
In [30]: augmented_data.info() # Display information about the data types and missing values
augmented_data.describe() # Display summary statistics for numerical columns in the augmented data
Int64Index: 11 entries, 0 to 10
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   uniq_id_x               11 non-null     object
 1   Name                    11 non-null     object
 2   description_x_x         11 non-null     object
 3   Price                   11 non-null     float64
 4   Av_Score                11 non-null     float64
 5   sku                     11 non-null     object
 6   name_title              11 non-null     object
 7   description_y           11 non-null     object
 8   list_price              11 non-null     object
 9   sale_price              11 non-null     object
 10  category                11 non-null     object
 11  category_tree           11 non-null     object
 12  average_product_rating  11 non-null     float64
 13  product_url             11 non-null     object
 14  product_image_urls      11 non-null     object
 15  brand                   11 non-null     object
 16  total_number_reviews    11 non-null     int64
 17  Reviews                 11 non-null     object
 18  Bought With             11 non-null     object
 19  Sentiment               11 non-null     float64
 20  Sentiment_Label         11 non-null     object
 21  uniq_id_y               11 non-null     int64
 22  ProductName             11 non-null     object
 23  ProductBrand            11 non-null     object
 24  Gender                  11 non-null     object
 25  NumImages               11 non-null     int64
 26  description_x_y         11 non-null     object
dtypes: float64(4), int64(3), object(20)
memory usage: 2.4+ KB
Out[30]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for Price, Av_Score, average_product_rating, total_number_reviews, Sentiment, uniq_id_y and NumImages
Data Visualization for the Augmented data ('augmented_data')
Data visualisation was carried out to gain an overall understanding of the augmented data and to identify trends and patterns.
The code below uses matplotlib.pyplot and seaborn to create the charts.
In [31]: # Create a bar chart for the total number of reviews for each product
sns.set(style="whitegrid") # Sets the style of seaborn
plt.figure(figsize=(12, 8))
sns.barplot(x='total_number_reviews', y='ProductName', data=augmented_data, hue='Gender')
# The below sets labels and title
plt.xlabel('Total Number of Reviews')
plt.ylabel('Product Name')
plt.title('Total Number of Reviews for Each Product')
plt.show() # Show the plot
'''
The x-axis represents the total number of reviews.
The y-axis represents the product names.
The hue parameter is used to differentiate between products for different genders.
'''
# Create a scatter plot for the relationship between product prices and average
sns.set(style="whitegrid") # Sets the style of seaborn
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Price', y='Av_Score', data=augmented_data, hue='Gender')
# Set labels and title
plt.xlabel('Product Price')
plt.ylabel('Average Score')
plt.title('Relationship Between Product Prices and Average Scores')
plt.show() # Show the plot
'''
The x-axis represents the product prices.
The y-axis represents the average scores.
Different colors are used to differentiate between products for different genders.
'''
Recommendation
Based on the average product reviews and the analysis of the top 10 products, here are possible recommendations for JCPenney:
-Quality Improvement for Low-Rated Products:
Identify products with consistently low average review ratings. A low average rating can be caused by a variety of factors, including poor product quality, functionality issues, inadequate customer support and unmet expectations. These can be addressed by improving the manufacturing process, the materials used or the product design. Continuous quality control ensures that products meet or exceed customer expectations.
-Promote highly rated products:
Highlight and promote products with a high average review score.
Use positive customer reviews in marketing materials and product descriptions to build trust and attract more customers.
-Adjust marketing strategy:
Adjust the marketing strategy based on the 10 most reviewed products, and include these products in marketing campaigns and promotions.
-Customer-centric approach:
Prioritize products that receive high levels of customer engagement and positive
feedback.
Align product development strategies with customer needs and wants.