12/4/23, 8:21 PM
-__BD2
University of Stirling
ITNPBD2 Representing and Manipulating Data
Assignment Autumn 2023
A Consultancy Job for JC Penney
Structure
You may structure the project how you wish, but here is a suggested guideline to help you
organise your work:
1. Data Exploration - Explore the data and show you understand its structure and relations
2. Data Validation - Check the quality of the data. Is it complete? Are there obvious errors?
3. Data Visualisation - Gain an overall understanding of the data with visualisations
4. Data Analysis - Set some questions and use the data to answer them
5. Data Augmentation - Add new data from another source to bring new insights to the
data you already have
1. Data Exploration
Data exploration was carried out to understand the dataset, including its structure, content,
and overall characteristics, and to gain insight into the types of variables, their data types,
and the relationships between them. The CSV and JSON files are then loaded into
DataFrames using pandas' read_csv() and JSON decoding methods.
The required libraries to be used for this consultancy analysis are imported below.
In [1]: # We import our libraries needed for this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from textblob import TextBlob
In [2]: # Data Exploration showing the structures and relations of the data
# Load the CSV files into dataframes
products_df = pd.read_csv('products.csv')
reviews_df = pd.read_csv('reviews.csv')
users_df = pd.read_csv('users.csv')

# Displaying basic information about the structure of the Products data
print("Products Data:")
print(products_df.head())    # Displays the first few rows of the Products dataframe
print(products_df.info())    # Displays information about the data types and missing values
print(products_df.columns)   # Displays the column names of the Products dataframe

# Displaying basic information about the structure of the Reviews data
print("\nReviews Data:")
print(reviews_df.head())     # Displays the first few rows of the Reviews dataframe
print(reviews_df.info())     # Displays information about the data types and missing values
print(reviews_df.columns)    # Displays the column names of the Reviews dataframe

# Displaying basic information about the structure of the Users data
print("\nUsers Data:")
print(users_df.head())       # Displays the first few rows of the Users dataframe
print(users_df.info())       # Displays information about the data types and missing values
print(users_df.columns)      # Displays the column names of the Users dataframe
Products Data:
                            Uniq_id  SKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                         Description  Price  Av_Score
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125

RangeIndex: 7982 entries, 0 to 7981
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Uniq_id      7982 non-null   object
 1   SKU          7915 non-null   object
 2   Name         7982 non-null   object
 3   Description  7439 non-null   object
 4   Price        5816 non-null   float64
 5   Av_Score     7982 non-null   float64
dtypes: float64(2), object(4)
memory usage: 374.3+ KB
None
Index(['Uniq_id', 'SKU', 'Name', 'Description', 'Price', 'Av_Score'], dtype='object')
Reviews Data:
                            Uniq_id  Username  Score  \
0  b6c0b6bea69c-baeac73c13d  fsdv4141      2
1  b6c0b6bea69c-baeac73c13d  krpz1113      1
2  b6c0b6bea69c-baeac73c13d  mbmg3241      2
3  b6c0b6bea69c-baeac73c13d  zeqg1222      0
4  b6c0b6bea69c-baeac73c13d  nvfn3212      3

                                              Review
0  You never have to worry about the fit...Alfred...
1  Good quality fabric. Perfect fit. Washed very ...
2  I do not normally wear pants or capris that ha...
3  I love these capris! They fit true to size and...
4  This product is very comfortable and the fabri...

RangeIndex: 39063 entries, 0 to 39062
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Uniq_id   39063 non-null  object
 1   Username  39063 non-null  object
 2   Score     39063 non-null  int64
 3   Review    39063 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.2+ MB
None
Index(['Uniq_id', 'Username', 'Score', 'Review'], dtype='object')
Users Data:
   Username DOB          State
0  bkpn1412   -         Oregon
1  gqjs4414   -  Massachusetts
2  eehe1434   -          Idaho
3     hkxj-   -        Florida
4     jjbd-   -        Georgia

RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Username  5000 non-null   object
 1   DOB       5000 non-null   object
 2   State     5000 non-null   object
dtypes: object(3)
memory usage: 117.3+ KB
None
Index(['Username', 'DOB', 'State'], dtype='object')
Computing summary statistics for the numerical columns of the datasets
In [3]: # Summary Statistics
print(products_df.describe())  # Displays summary statistics for numerical columns in the Products dataframe
print("\nReviews Data:")
print(reviews_df.describe())   # Displays summary statistics for numerical columns in the Reviews dataframe
print("\nUsers Data:")
users_df.describe()            # Displays summary statistics for numerical columns in the Users dataframe
[describe() output: count, mean, std, min, 25%, 50%, 75% and max for Price and Av_Score
in the products data and for Score in the reviews data; Out[3] shows count, unique, top
and freq for Username, DOB and State in the users data (top state: Massachusetts);
the numeric values were not preserved in this export]
Data Size of the datasets
In [4]: # Data size
print("Data Size:")
print("Products Dataset Size:", products_df.shape)  # Displays the data size of the products dataframe
print("Reviews Dataset Size:", reviews_df.shape)    # Displays the data size of the reviews dataframe
print("Users Dataset Size:", users_df.shape)        # Displays the data size of the users dataframe
Data Size:
Products Dataset Size: (7982, 6)
Reviews Dataset Size: (39063, 4)
Users Dataset Size: (5000, 3)
Loading the JSON files for reviewers and products
In [5]: # Specify the path to the JSON file
json_file_path = "jcpenney_reviewers.json"
data_list = []  # Initializes an empty list to store JSON objects

# Read the JSON file line by line
with open(json_file_path, "r") as json_file:
    for line_number, line in enumerate(json_file, start=1):
        try:
            # Load each line as a JSON object and append it to the list
            json_data = json.loads(line)
            data_list.append(json_data)
            # Display each JSON object in the list
            print(json_data)
            if len(data_list) == 3:  # Break out of the loop after the first 3 objects
                break
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON at line {line_number}: {e}")  # Handles malformed lines
{'Username': 'bkpn1412', 'DOB': '-', 'State': 'Oregon', 'Reviewed':
['cea76118f6a9110a893de2b-c0']}
{'Username': 'gqjs4414', 'DOB': '-', 'State': 'Massachusetts', 'Revie
wed': ['fa04fe6c0dd5189f54fe600838da43d3']}
{'Username': 'eehe1434', 'DOB': '-', 'State': 'Idaho', 'Reviewed':
[]}
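The same line-delimited files can also be read in a single call with pandas, which avoids the manual loop. A minimal sketch on an in-memory payload (the two records below are illustrative, copied from the printed output above, not the full file):

```python
import io
import pandas as pd

# Two records in the same line-delimited shape as jcpenney_reviewers.json
jsonl = io.StringIO(
    '{"Username": "bkpn1412", "State": "Oregon", "Reviewed": []}\n'
    '{"Username": "gqjs4414", "State": "Massachusetts", "Reviewed": []}\n'
)

# lines=True tells pandas to parse one JSON object per line
reviewers = pd.read_json(jsonl, lines=True)
print(reviewers.shape)  # (2, 3)
```

With the real file, `pd.read_json("jcpenney_reviewers.json", lines=True)` would produce the same DataFrame that the loop builds via `data_list`.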
In [6]: json_file_path = "jcpenney_products.json"  # Specify the path to the JSON file
data_list = []  # Initializes an empty list to store JSON objects

# Read the JSON file line by line
with open(json_file_path, "r") as json_file:
    for line_number, line in enumerate(json_file, start=1):
        try:
            # Load each line as a JSON object and append it to the list
            json_data = json.loads(line)
            data_list.append(json_data)
            # Display the first JSON object only
            if len(data_list) <= 1:
                print(json_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON at line {line_number}: {e}")  # Handles malformed lines
{'uniq_id': 'b6c0b6bea69c-baeac73c13d', 'sku': 'pp-', 'name_title': 'Alfred Dunner® Essential Pull On Capri Pant', 'description': 'You\'ll return to our Alfred Dunner pull-on capris again and again when you want an updated, casual look and all the comfort you love. \xa0 elastic waistband approx. 19-21" inseam slash pockets polyester washable imported \xa0 \xa0 \xa0', 'list_price': '41.09', 'sale_price': '24.16', 'category': 'alfred dunner', 'category_tree': 'jcpenney|women|alfred dunner', 'average_product_rating': 2.625, 'product_url': 'http://www.jcpenney.com/alfred-dunner-essential-pull-on-capri-pant/prod.jump?ppId=pp-&catId=cat-&&_dyncharset=UTF-8&urlState=/women/shop-brands/alfred-dunner/yellow/_/N-gkmp33Z132/cat.jump', 'product_image_urls': 'http://s7d9.scene7.com/is/image/JCPenney/DP-M.tif?hei=380&wid=380&op_usm=.4,.8,0,0&resmode=sharp2&op_usm=1.5,.8,0,0&resmode=sharp', 'brand': 'Alfred Dunner', 'total_number_reviews': 8, 'Reviews': [{'User': 'fsdv4141', 'Review': 'You never have to worry about the fit...Alfred Dunner clothing sizes are true to size and fits perfectly. Great value for the money.', 'Score': 2}, {'User': 'krpz1113', 'Review': 'Good quality fabric. Perfect fit. Washed very well no iron.', 'Score': 4}, {'User': 'mbmg3241', 'Review': 'I do not normally wear pants or capris that have an elastic waist, but I decided to try these since they were on sale and I loved the color. I was very surprised at how comfortable they are and wear really well even wearing all day. I will buy this style again!', 'Score': 4}, {'User': 'zeqg1222', 'Review': 'I love these capris! They fit true to size and are so comfortable to wear. I am planning to order more of them.', 'Score': 1}, {'User': 'nvfn3212', 'Review': 'This product is very comfortable and the fabric launders very well', 'Score': 1}, {'User': 'aajh3423', 'Review': 'I did not like the fabric. It is 100% polyester I thought it was different.I bought one at the store apprx two monts ago, and I thought it was just like it', 'Score': 5}, {'User': 'usvp2142', 'Review': 'What a great deal. Beautiful Pants. Its more than I expected.', 'Score': 3}, {'User': 'yemw3321', 'Review': 'Alfred Dunner has great pants, good fit and very comfortable', 'Score': 1}], 'Bought With': ['898e42fe937a33e8ce5e900ca7a4d924', '8c02c262567a2267cd207e35637feb1c', 'b62dd54545cdc1a05d8aaa2d25aed996', '0da4c2dcc8cfa0e-b00d22b30', '90c46b841e2eeece992c-c']}
1.b Data Integration
Data integration was carried out to combine the information from the products_df dataset
and jcpenney_products.json to create a comprehensive, unified view using their
common key.
The column names in products_df ('Uniq_id', 'SKU', 'Description') and in
jcpenney_products.json ('uniq_id', 'sKU', 'description') did not correspond, so the
columns of products_df had to be renamed.
In [7]: # Define a dictionary to map old column names to new column names
column_mapping = {'Uniq_id': 'uniq_id', 'SKU': 'sKU', 'Description': 'description'}
# Use the rename method to replace column names
products_df.rename(columns=column_mapping, inplace=True)
# Now, 'products_df' has updated column names
products_df.head()
Out[7]:
                            uniq_id  sKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                         description  Price  Av_Score
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125
In [8]: # Renaming the unique id column for the reviews dataset
# Define a dictionary to map old column names to new column names
column_mapping = {'Uniq_id': 'uniq_id'}
# Use the rename method to replace column names
reviews_df.rename(columns=column_mapping, inplace=True)
# Now, 'reviews_df' has updated column names
reviews_df.head()
Out[8]:
                            uniq_id  Username  Score  \
0  b6c0b6bea69c-baeac73c13d  fsdv4141      2
1  b6c0b6bea69c-baeac73c13d  krpz1113      1
2  b6c0b6bea69c-baeac73c13d  mbmg3241      2
3  b6c0b6bea69c-baeac73c13d  zeqg1222      0
4  b6c0b6bea69c-baeac73c13d  nvfn3212      3

                                              Review
0  You never have to worry about the fit...Alfred...
1  Good quality fabric. Perfect fit. Washed very ...
2  I do not normally wear pants or capris that ha...
3  I love these capris! They fit true to size and...
4  This product is very comfortable and the fabri...
Merging the products_df dataset with the products JSON file using their common column ('uniq_id')
In [9]: json_file = pd.DataFrame(data_list)  # Converts the list of JSON objects to a DataFrame
# Common key to merge on
common_key = 'uniq_id'
# Merge the JSON DataFrame with the CSV DataFrame based on the common key
merged_data = pd.merge(products_df, json_file, on='uniq_id')
print(merged_data.head())  # Displays the first few rows of the merged data
merged_data.to_csv('merged_data.csv', index=False)  # Writes the merged data to a CSV file
                            uniq_id  sKU                                         Name  \
0         b6c0b6bea69c-baeac73c13d  pp-  Alfred Dunner® Essential Pull On Capri Pant
1  93e5272c51d8cce02597e3ce67b7ad0a  pp-  Alfred Dunner® Essential Pull On Capri Pant
2  013e320f2f2ec0cf5b3ff5418d688528  pp-  Alfred Dunner® Essential Pull On Capri Pant
3  505e6633d81f2cb7400c0cfa0394c427  pp-  Alfred Dunner® Essential Pull On Capri Pant
4       d969a-e1331e304b09f81a83f6  pp-  Alfred Dunner® Essential Pull On Capri Pant

                                       description_x  Price  Av_Score  sku  \
0  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625  pp-
1  Youll return to our Alfred Dunner pull-on capr...  41.09     3.000  pp-
2  Youll return to our Alfred Dunner pull-on capr...  41.09     2.625  pp-
3  Youll return to our Alfred Dunner pull-on capr...  41.09     3.500  pp-
4  Youll return to our Alfred Dunner pull-on capr...  41.09     3.125  pp-

                                    name_title  \
0  Alfred Dunner® Essential Pull On Capri Pant
1  Alfred Dunner® Essential Pull On Capri Pant
2  Alfred Dunner® Essential Pull On Capri Pant
3  Alfred Dunner® Essential Pull On Capri Pant
4  Alfred Dunner® Essential Pull On Capri Pant

                                       description_y list_price sale_price  \
0  You'll return to our Alfred Dunner pull-on cap...        ...        ...
1  You'll return to our Alfred Dunner pull-on cap...        ...        ...
2  You'll return to our Alfred Dunner pull-on cap...        ...        ...
3  You'll return to our Alfred Dunner pull-on cap...        ...        ...
4  You'll return to our Alfred Dunner pull-on cap...        ...        ...

        category                 category_tree  average_product_rating  \
0  alfred dunner  jcpenney|women|alfred dunner                     ...
1  alfred dunner  jcpenney|women|alfred dunner                     ...
2       view all       jcpenney|women|view all                     ...
3       view all       jcpenney|women|view all                     ...
4       view all       jcpenney|women|view all                     ...

                                         product_url  \
0  http://www.jcpenney.com/alfred-dunner-essentia...
1  http://www.jcpenney.com/alfred-dunner-essentia...
2  http://www.jcpenney.com/alfred-dunner-essentia...
3  http://www.jcpenney.com/alfred-dunner-essentia...
4  http://www.jcpenney.com/alfred-dunner-essentia...

                                  product_image_urls          brand  \
0  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
1  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
2  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
3  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner
4  http://s7d9.scene7.com/is/image/JCPenney/DP122...  Alfred Dunner

   total_number_reviews                                            Reviews  \
0                     8  [{'User': 'fsdv4141', 'Review': 'You never hav...
1                     8  [{'User': 'tpcu2211', 'Review': 'You never hav...
2                     8  [{'User': 'pcfg3234', 'Review': 'You never hav...
3                     8  [{'User': 'ngrq4411', 'Review': 'You never hav...
4                     8  [{'User': 'nbmi2334', 'Review': 'You never hav...

                                         Bought With
0  [898e42fe937a33e8ce5e900ca7a4d924, 8c02c262567...
1  [bc9ab3406dcaa84a123b9da862e6367d, 18eb69e8fc2...
2  [3ce70f519a9cfdd85cdbdecd358e5347, b0295c96d2b...
3  [efcd811edccbeb5e67eaa8ef0d991f7c, 7b2cc00171e...
4  [0ca5ad2a218f59eb83eec1e248a0782d, 9869fc8da14...
Conducting a data exploration for the merged data ('merged_data')
In [10]: # Displaying basic information about the structure of the 'merged_data'
print(merged_data.head())    # Displays the first few rows of the merged dataframe
print(merged_data.info())    # Displays information about the data types and missing values
print(merged_data.columns)   # Displays the column names of the merged dataframe
[merged_data.head() output: the same five Alfred Dunner rows already shown under In [9] above]
Int64Index: 7982 entries, 0 to 7981
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   uniq_id                 7982 non-null   object
 1   sKU                     7915 non-null   object
 2   Name                    7982 non-null   object
 3   description_x           7439 non-null   object
 4   Price                   5816 non-null   float64
 5   Av_Score                7982 non-null   float64
 6   sku                     7982 non-null   object
 7   name_title              7982 non-null   object
 8   description_y           7982 non-null   object
 9   list_price              7982 non-null   object
 10  sale_price              7982 non-null   object
 11  category                7982 non-null   object
 12  category_tree           7982 non-null   object
 13  average_product_rating  7982 non-null   float64
 14  product_url             7982 non-null   object
 15  product_image_urls      7982 non-null   object
 16  brand                   7982 non-null   object
 17  total_number_reviews    7982 non-null   int64
 18  Reviews                 7982 non-null   object
 19  Bought With             7982 non-null   object
dtypes: float64(3), int64(1), object(16)
memory usage: 1.3+ MB
None
Index(['uniq_id', 'sKU', 'Name', 'description_x', 'Price', 'Av_Score', 'sku',
'name_title', 'description_y', 'list_price', 'sale_price', 'category',
'category_tree', 'average_product_rating', 'product_url',
'product_image_urls', 'brand', 'total_number_reviews', 'Reviews',
'Bought With'],
dtype='object')
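Since the CSV and JSON product tables are expected to describe the same 7982 products, the merge can be checked rather than assumed. A small sketch on hypothetical miniature tables (the three `a1`..`a3` rows are invented), using pandas' `indicator` and `validate` options:

```python
import pandas as pd

# Hypothetical miniatures of the CSV and JSON product tables
csv_side = pd.DataFrame({"uniq_id": ["a1", "a2", "a3"], "Price": [41.09, 12.50, 9.99]})
json_side = pd.DataFrame({"uniq_id": ["a1", "a2", "a3"], "brand": ["Alfred Dunner"] * 3})

# indicator=True adds a _merge column recording where each row came from;
# validate="one_to_one" raises MergeError if uniq_id repeats on either side
checked = pd.merge(csv_side, json_side, on="uniq_id", how="outer",
                   indicator=True, validate="one_to_one")

# Every row should come from both sides if the tables cover the same products
unmatched = checked[checked["_merge"] != "both"]
print(len(unmatched))  # 0
```

Applied to `products_df` and `json_file`, a non-empty `unmatched` frame would reveal product ids present in only one of the two sources.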
2. Data Validation
Data validation was carried out to reveal any missing values in the datasets and to ensure
that the data is accurate and free from errors. In addition, it verifies that relationships
between variables or columns are consistent and align with expectations.
In [11]: # Data Validation to check the quality of the data and possible errors
# Checking for missing values in the Products DataFrame
print("\nMissing values in Products Data:")
print(products_df.isnull().sum())  # Displays the sum of missing values for each column

# Checking for missing values in the Reviews DataFrame
print("\nMissing values in Reviews Data:")
print(reviews_df.isnull().sum())   # Displays the sum of missing values for each column

# Checking for missing values in the Users DataFrame
print("\nMissing values in Users Data:")
print(users_df.isnull().sum())     # Displays the sum of missing values for each column

# Checking data types consistency in the Products DataFrame
print("\nData types in Products Data:")
print(products_df.dtypes)  # Displays the data types of each column in the Products DataFrame

# Checking data types consistency in the Reviews DataFrame
print("\nData types in Reviews Data:")
print(reviews_df.dtypes)   # Displays the data types of each column in the Reviews DataFrame

# Checking data types consistency in the Users DataFrame
print("\nData types in Users Data:")
print(users_df.dtypes)     # Displays the data types of each column in the Users DataFrame
Missing values in Products Data:
uniq_id           0
sKU              67
Name              0
description     543
Price          2166
Av_Score          0
dtype: int64

Missing values in Reviews Data:
uniq_id     0
Username    0
Score       0
Review      0
dtype: int64

Missing values in Users Data:
Username    0
DOB         0
State       0
dtype: int64

Data types in Products Data:
uniq_id         object
sKU             object
Name            object
description     object
Price          float64
Av_Score       float64
dtype: object

Data types in Reviews Data:
uniq_id      object
Username     object
Score         int64
Review       object
dtype: object

Data types in Users Data:
Username    object
DOB         object
State       object
dtype: object
Data validation for the merged data (merged_data)
In [12]: print("\nMissing values in Merged Data:")
print(merged_data.isnull().sum())
Missing values in Merged Data:
uniq_id                      0
sKU                         67
Name                         0
description_x              543
Price                     2166
Av_Score                     0
sku                          0
name_title                   0
description_y                0
list_price                   0
sale_price                   0
category                     0
category_tree                0
average_product_rating       0
product_url                  0
product_image_urls           0
brand                        0
total_number_reviews         0
Reviews                      0
Bought With                  0
dtype: int64
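Beyond missing values, the stated validation goal of checking that relationships between tables are consistent can be made concrete: every Username in the reviews data should ideally exist in the users table. A sketch on hypothetical miniature frames (the 'ghost001' record is invented to show a violation):

```python
import pandas as pd

# Hypothetical miniatures of the reviews and users tables
reviews = pd.DataFrame({"Username": ["fsdv4141", "krpz1113", "ghost001"]})
users = pd.DataFrame({"Username": ["fsdv4141", "krpz1113"]})

# Usernames that appear in reviews but have no matching user record
orphans = set(reviews["Username"]) - set(users["Username"])
print(sorted(orphans))  # ['ghost001']
```

Run against `reviews_df` and `users_df`, a non-empty `orphans` set would flag reviewers who cannot be joined to a user profile.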
Fixing the missing values in the columns (Price and Description) for ('products_df')
In [13]: # For Price
products_df['Price'].fillna(0, inplace=True)
# For Description
products_df['description'].fillna('No Description', inplace=True)
# For SKU, it's best to drop it to keep the analysis smooth
column_to_drop = 'sKU'
products_df.drop(columns=column_to_drop, inplace=True)
print(products_df.isnull().sum())
uniq_id        0
Name           0
description    0
Price          0
Av_Score       0
dtype: int64
Fixing the missing values in the columns (Price, Description and sKU) for the merged data
In [14]: # For Price
merged_data['Price'].fillna(0, inplace=True)
# For Description
merged_data['description_x'].fillna('No Description', inplace=True)
# For SKU, it's best to drop it to keep the analysis smooth
column_to_drop = 'sKU'
merged_data.drop(columns=column_to_drop, inplace=True)
print(products_df.isnull().sum())
uniq_id        0
Name           0
description    0
Price          0
Av_Score       0
dtype: int64
Performing a Sentiment Analysis on Product Description for the Merged Data
For this part, a natural language processing (NLP) technique is used: sentiment polarity is
extracted from the description text to determine the positive, negative or neutral
sentiment associated with specific products.
In [15]: # Function to analyze sentiment using TextBlob
def analyze_sentiment(description):
    if pd.isna(description):  # Check for NaN values
        return 0  # or any default value based on your preference
    blob = TextBlob(description)
    return blob.sentiment.polarity

# Apply sentiment analysis to the "Description" column
merged_data['Sentiment'] = merged_data['description_x'].apply(analyze_sentiment)

# Creating a new column for sentiment labels (positive, negative, neutral)
merged_data['Sentiment_Label'] = merged_data['Sentiment'].apply(
    lambda score: 'Positive' if score > 0 else ('Negative' if score < 0 else 'Neutral')
)
print(merged_data[['description_x', 'Sentiment', 'Sentiment_Label']])  # Displays the results
                                          description_x  Sentiment Sentiment_Label
0     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
1     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
2     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
3     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
4     Youll return to our Alfred Dunner pull-on capr...   -...e-17        Negative
...                                                 ...        ...             ...
7977  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7978  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7979  This Hoover® vacuum features dual-stage cyclon...        ...        Positive
7980                                     No Description   0.000000         Neutral
7981                                     No Description   0.000000         Neutral

[7982 rows x 3 columns]
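The tiny negative polarities of order 1e-17 in the first rows are almost certainly floating-point rounding noise from TextBlob's averaging rather than genuinely negative sentiment. If desired, the labelling rule above could apply a small tolerance around zero; a sketch (the tol value is an arbitrary choice, not from the assignment):

```python
def label_sentiment(score, tol=1e-9):
    # Scores within ±tol of zero are treated as Neutral to absorb float noise
    if score > tol:
        return 'Positive'
    if score < -tol:
        return 'Negative'
    return 'Neutral'

print(label_sentiment(-1.85e-17))  # Neutral
print(label_sentiment(0.35))       # Positive
```

Under this rule, the Alfred Dunner rows above would be labelled Neutral instead of Negative.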
3. Data Visualization
Data visualization was used to understand the temporal dynamics of the data and to
identify trends and patterns.
The code below utilizes matplotlib.pyplot to create various charts and graphs.
Visualization for the merged data
In [16]: # Data Visualization using the merged data
# Distribution of Price
plt.hist(merged_data['Price'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()  # Displays the plot

# Distribution of Price vs. Average Product Rating
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Price', y='Av_Score', data=merged_data)  # Creates a scatter plot
# Set labels and title
plt.xlabel('Product Price')
plt.ylabel('Average Product Rating')
plt.title('Scatter Plot of Price vs. Average Product Rating')
plt.show()  # Displays the plot
'''
The x-axis represents the product price.
The y-axis represents the average product rating.
Each point in the scatter plot represents a product.
'''
Out[16]:
'\nThe x-axis represents the product price.\nThe y-axis represents the average
product rating.\nEach point in the scatter plot represents a product.\n'
Other Visualization using the unmerged datasets
In [17]: # Data Visualization to gain an overall understanding of the datasets
# Visualize the distribution of scores in the reviews data
plt.figure(figsize=(8, 6))
sns.countplot(x='Score', data=reviews_df)
plt.title('Distribution of Scores')
plt.xlabel('Score')
plt.ylabel('Count')
# The count axis counts the occurrences of each unique score and visualizes the result
plt.show()  # Displays the plot

# Visualize the distribution of reviews in the reviews data
plt.figure(figsize=(8, 6))
sns.countplot(x='Score', data=reviews_df)
plt.title('Distribution of Reviews')
plt.xlabel('Review')
plt.ylabel('Count')
# The count axis counts the occurrences of each unique review score and visualizes the result
plt.show()  # Displays the plot

# User activity by state
plt.figure(figsize=(14, 12))
user_activity_by_location = users_df['State'].value_counts().sort_values(ascending=False)
user_activity_by_location.plot(kind='bar')  # Sorted bar chart of user counts per state
plt.title('User Activity by State')
plt.xlabel('State')
plt.ylabel('User Count')
plt.show()  # Displays the plot
4. Data Analysis
Data analysis was conducted by extracting meaningful insights from the data through statistical
techniques and computational methods. In the context of this assignment, the data analysis
aims to answer specific questions and gain a deeper understanding of the data. These
questions are listed below:
- Average Review Rating by Product Category: Identifying the product categories with the
highest average review ratings can help JC Penney focus on promoting and improving
products in those categories.
- Top 10 Most Reviewed Products: Analyzing the most reviewed products can reveal
popular items and areas where JC Penney could potentially expand its product offerings.
- User Influence on Product Reviews: Understanding the influence of individual users on
product reviews can help JC Penney identify influential customers and potentially leverage
their opinions for marketing purposes.
- Top 5 most common states among users: Analyzing the most common states among users.
- How many unique users wrote a review: Identifying the total number of users who wrote
a review.
This analysis was done using the ('merged_data') dataframe.
QUESTION 1: What is the average Review Rating by Product?
In [18]: # Merge the already merged dataset (merged_data) and the reviews data on the common column
merged_data2 = pd.merge(reviews_df, merged_data, on='uniq_id', how='inner')
average_rating_by_product = merged_data2.groupby('Name')['Score'].mean().reset_index()
print("Average Review Rating by Product:")
print(average_rating_by_product)
Average Review Rating by Product:
                                                   Name  Score
0                1 CT. Certified Diamond Solitaire Ring    ...
1     1 CT. T.W. Certified Diamond 14K White Gold Br...    ...
2     1 CT. T.W. Certified Diamond 14K White Gold Pr...    ...
3     1 CT. T.W. Certified Diamond 14K Yellow Gold B...    ...
4        1 CT. T.W. Diamond 10K White Gold Cluster Ring   ...
...                                                 ...    ...
5996  ¼ CT. T.W. White & Color-Enhanced Black Diamon...    ...
5997    ½ CT. Princess Certified Diamond Solitaire Ring    ...
5998  ½ CT. T.W. Diamond 10K Yellow Gold Contoured A...    ...
5999                      ½ CT. T.W. Diamond Bridal Set    ...
6000            ⅓ CT. T.W. Diamond 3-Stone Promise Ring    ...

[6001 rows x 2 columns]
QUESTION 2: What is the top 10 Most Reviewed product?
In [19]: print("\nMost Reviewed Product:")
# Count the number of reviews for each product, then get the top 10 most reviewed
top_10_most_reviewed = merged_data2['Name'].value_counts().head(10).reset_index()
top_10_most_reviewed.columns = ['Product', 'Review Count']
print("Top 10 Most Reviewed Products:")
print(top_10_most_reviewed)
Most Reviewed Product:
Top 10 Most Reviewed Products:
                                            Product  Review Count
0       Stafford® Gunner Mens Cap Toe Leather Boots           ...
1               Clarks® Leisa Grove Leather Sandals           ...
2       Xersion™ Quick-Dri Performance Bootcut Pant           ...
3  Clarks® Leisa Grove Leather Sandals - Wide Width           ...
4  St. Johns Bay® Secretly Slender Straight-Leg J...           ...
5             Xersion™ Quick-Dri Performance Capris           ...
6                         Arizona Harbor Boat Shoes           ...
7      Liz Claiborne® Rockele Stretch Wedge Sandals           ...
8  Liz Claiborne® Essential Original-Fit Straight...           ...
9            Arizona Raglan-Sleeve Thermal Pullover           ...
QUESTION 3: What is the User influence on product reviews?
In [20]: print("\nUser influence on product reviews:")
# Merge reviews and users data on the common column 'Username'
merged_data3 = pd.merge(reviews_df, users_df, on='Username', how='inner')
# Calculate average review score by user
average_score_by_user = merged_data3.groupby('Username')['Score'].mean().reset_index()
average_score_by_user.columns = ['Username', 'Average_Score']
# Calculate total number of reviews by user
total_reviews_by_user = merged_data3['Username'].value_counts().reset_index()
total_reviews_by_user.columns = ['Username', 'Total_Reviews']
# Merge the calculated metrics
user_influence_metrics = pd.merge(average_score_by_user, total_reviews_by_user, on='Username')
print(user_influence_metrics.head())
User influence on product reviews:
  Username  Average_Score  Total_Reviews
0     aaez            ...            ...
1     aage            ...            ...
2     aagf            ...            ...
3     aahc            ...            ...
4     aajh            ...            ...
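The two per-user metrics can also be computed in a single groupby().agg() pass, which avoids the second merge. A sketch on toy data (hypothetical usernames and scores):

```python
import pandas as pd

toy = pd.DataFrame({
    "Username": ["ann", "ann", "bob"],
    "Score": [4, 2, 5],
})

# Named aggregation: mean and count in one pass instead of two frames merged afterwards
metrics = (toy.groupby("Username")["Score"]
              .agg(Average_Score="mean", Total_Reviews="count")
              .reset_index())
print(metrics)
```

Both approaches give the same table; the single-pass version is simply less code and one fewer intermediate DataFrame.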
QUESTION 4: What are the top 5 most common states among users?
In [21]: print("\n Top 5 most common states among users:")
# Count the number of users from each state, then get the top 5 most common sta
top_states = users_df['State'].value_counts().head(5).reset_index()
top_states.columns = ['State', 'User Count']
print(top_states)
 Top 5 most common states among users:
                      State  User Count
0             Massachusetts         107
1                  Delaware         106
2                   Vermont         103
3  Northern Mariana Islands         102
4                New Jersey         101
QUESTION 5: How many unique users wrote a review?
In [22]: print("\nUnique users that wrote a review:")
# Count the number of unique users who have written reviews
unique_users_count = reviews_df['Username'].nunique()
print("Number of Unique Users who have Written Reviews:", unique_users_count)
Unique users that wrote a review:
Number of Unique Users who have Written Reviews: 4993
5. Data Augmentation
Data augmentation combines information from different sources to enhance the already merged dataset (merged_data.csv). Here, the merged dataset is augmented with information from a new CSV file
(myntra_products_catalog_2.csv). For the new dataset the following steps are carried out:
- Data Exploration
- Data Validation
In [23]: # First, load the new CSV file
new_data = pd.read_csv('myntra_products_catalog 2.csv')
new_data.head() # Displays the first few rows of the new data
new_data.info() # Displays information about the data types and missing values
new_data.columns # Displays the column names of the (new_data) dataframe
RangeIndex: 12491 entries, 0 to 12490
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ProductID     12491 non-null  int64
 1   ProductName   12491 non-null  object
 2   ProductBrand  12491 non-null  object
 3   Gender        12491 non-null  object
 4   Price (INR)   12491 non-null  int64
 5   NumImages     12491 non-null  int64
 6   Description   12491 non-null  object
 7   PrimaryColor  11597 non-null  object
dtypes: int64(3), object(5)
memory usage: 780.8+ KB
Out[23]: Index(['ProductID', 'ProductName', 'ProductBrand', 'Gender', 'Price (INR)',
       'NumImages', 'Description', 'PrimaryColor'],
      dtype='object')
Summary statistics for the new data
In [24]: new_data.describe() # Displays summary statistics for numerical columns in the new data
Out[24]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for ProductID, Price (INR) and NumImages
Data Validation for the (new_data) dataframe
In [25]: # Checking for missing values
print(new_data.isnull().sum()) # Displays each column and its missing-value count
ProductID         0
ProductName       0
ProductBrand      0
Gender            0
Price (INR)       0
NumImages         0
Description       0
PrimaryColor    894
dtype: int64
Fixing the missing values in the (new_data) dataframe
In [26]: # PrimaryColor is the only column with missing values, so it is dropped to keep the data clean for analysis
column_to_drop = 'PrimaryColor'
new_data.drop(columns=column_to_drop, inplace=True)
print(new_data.isnull().sum())
ProductID       0
ProductName     0
ProductBrand    0
Gender          0
Price (INR)     0
NumImages       0
Description     0
dtype: int64
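Dropping PrimaryColor discards a column that is mostly complete (11597 of 12491 rows). An alternative, sketched here on a toy frame, is to keep the column and fill the gaps with a sentinel value instead:

```python
import pandas as pd

# Toy frame with one missing colour (hypothetical values)
toy = pd.DataFrame({"PrimaryColor": ["Black", None, "Blue"]})

# Keep the column but mark missing colours explicitly
toy["PrimaryColor"] = toy["PrimaryColor"].fillna("Unknown")
print(toy)
```

Which choice is right depends on the analysis: filling preserves the colour information for the 93% of rows that have it, while dropping simplifies the frame.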
Data integration for the (new_data) dataframe
The common columns in merged_data.csv ('uniq_id', 'description', 'Price') and in myntra_products_catalog_2.csv, the (new_data) dataframe ('ProductID', 'Description', 'Price (INR)'), do not share names, so the columns of (new_data) are renamed to match.
In [27]: # Define a dictionary to map old column names to new column names
column_mapping = {'ProductID': 'uniq_id', 'Description': 'description_x', 'Price (INR)': 'Price'}
new_data.rename(columns=column_mapping, inplace=True) # Uses the rename method
# Now 'new_data' has the updated column names
new_data.head(10) # Displays the first 10 rows of the dataset and its columns
Out[27]:
  uniq_id                                        ProductName ProductBrand  Gender  Price  NumImages                                      description_x
0     ...  DKNY Unisex Black & Grey Printed Medium Trolle...         DKNY  Unisex  11745          7  Black and grey printed medium trolley bag, sec...
1     ...  EthnoVogue Women Beige & Grey Made to Measure ...   EthnoVogue   Women   5810          7  Beige & Grey made to measure kurta with churid...
2     ...  SPYKAR Women Pink Alexa Super Skinny Fit High-...       SPYKAR   Women    899          7  Pink coloured wash 5-pocket high-rise cropped ...
3     ...  Raymond Men Blue Self-Design Single-Breasted B...      Raymond     Men   5599          5  Blue self-design bandhgala suitBlue self-desig...
4     ...  Parx Men Brown & Off-White Slim Fit Printed Ca...         Parx     Men    759          5  Brown and off-white printed casual shirt, has ...
5     ...    SHOWOFF Men Brown Solid Slim Fit Regular Shorts      SHOWOFF     Men    791          5  Brown solid low-rise regular shorts, has four ...
6     ...        Parx Men Blue Slim Fit Checked Casual Shirt         Parx     Men    719          5  Blue checked casual shirt, has a spread collar...
7     ...  SPYKAR Women Burgundy Alexa Super Skinny Fit H...       SPYKAR   Women    899          7  Burgundy coloured wash 5-pocket high-rise jean...
8     ...  Parx Men Brown Tapered Fit Solid Regular Trousers         Parx     Men    664          5                       Brown solid regular trousers
9     ...                DKNY Unisex Black Large Trolley Bag         DKNY  Unisex  17360          5  Black solid large trolley bag, secured with a ...
From the output above, the 'Price' column of the (new_data) dataframe is clearly not in a float (decimal) format, so data cleaning is carried out on the price column.
In [28]: # Convert the 'Price' column to a numeric data type (float), matching the (merged_data) dataframe
new_data['Price'] = new_data['Price'].astype(float).round(2)
print(new_data)
new_data.head()
       uniq_id                                        ProductName ProductBrand  Gender    Price  NumImages                                      description_x
0          ...  DKNY Unisex Black & Grey Printed Medium Trolle...         DKNY  Unisex  11745.0          7  Black and grey printed medium trolley bag, sec...
1          ...  EthnoVogue Women Beige & Grey Made to Measure ...   EthnoVogue   Women   5810.0          7  Beige & Grey made to measure kurta with churid...
2          ...  SPYKAR Women Pink Alexa Super Skinny Fit High-...       SPYKAR   Women    899.0          7  Pink coloured wash 5-pocket high-rise cropped ...
3          ...  Raymond Men Blue Self-Design Single-Breasted B...      Raymond     Men   5599.0          5  Blue self-design bandhgala suitBlue self-desig...
4          ...  Parx Men Brown & Off-White Slim Fit Printed Ca...         Parx     Men    759.0          5  Brown and off-white printed casual shirt, has ...
...        ...                                                ...          ...     ...      ...        ...                                                ...
12486      ...  Pepe Jeans Men Black Hammock Slim Fit Low-Rise...   Pepe Jeans     Men      ...          7  Black dark wash 5-pocket low-rise jeans, clean...
12487      ...                 Mochi Women Gold-Toned Solid Heels        Mochi   Women      ...          5  A pair of gold-toned open toe heels, has regul...
12488      ...  612 league Girls Navy Blue & White Printed Reg...   612 league   Girls      ...          4  Navy Blue and White printed mid-rise denim sho...
12489      ...  Bvlgari Men Aqva Pour Homme Marine Eau de Toil...      Bvlgari     Men      ...          2  Bvlgari Men Aqva Pour Homme Marine Eau de Toil...
12490      ...  Pepe Jeans Men Black & Grey Striped Polo Colla...   Pepe Jeans     Men      ...          5  Black and grey striped T-shirt, has a polo col...

[12491 rows x 7 columns]
Out[28]: the first five rows above, with 'Price' now a float column
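astype(float) raises an error if any entry is not numeric. A more defensive conversion, should the exported Price column ever arrive as strings with stray non-numeric entries, is pd.to_numeric with errors='coerce' (a sketch with hypothetical values):

```python
import pandas as pd

# Hypothetical raw price strings, including one non-numeric entry
prices = pd.Series(["11745", "5810", "n/a"])

# Non-numeric entries become NaN instead of raising, then round as before
clean = pd.to_numeric(prices, errors="coerce").round(2)
print(clean)
```

The resulting NaNs can then be handled with the same missing-value checks used in the validation step.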
Merging (merged_data) and (new_data) on their common column 'Price'
In [29]: # Common key to merge on
common_key = 'Price'
# Merge the (merged_data) DataFrame with the (new_data) DataFrame based on the common key
augmented_data = pd.merge(merged_data, new_data, on=common_key, how='inner')
augmented_data.head() # Displays the first few rows of the merged data
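Price is not a unique key, so this merge is many-to-many: a price shared by several products multiplies rows. pandas can surface this with the indicator and validate parameters of merge, sketched here on hypothetical frames:

```python
import pandas as pd

# Hypothetical frames where the key 'Price' is duplicated on the right
left = pd.DataFrame({"Price": [899.0, 759.0], "Name": ["A", "B"]})
right = pd.DataFrame({"Price": [899.0, 899.0], "ProductName": ["X", "Y"]})

# indicator=True adds a _merge column; the duplicated key doubles the matches
merged = pd.merge(left, right, on="Price", how="inner", indicator=True)
print(merged)

# validate raises MergeError if the relationship is not what we claim
try:
    pd.merge(left, right, on="Price", validate="one_to_one")
except pd.errors.MergeError as e:
    print("not one-to-one:", e)
```

This explains why the augmented data below has columns uniq_id_x and uniq_id_y: both frames carried a uniq_id column, and since the merge key was Price, pandas kept both copies with suffixes.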
Out[29]:
                          uniq_id_x                                               Name                                    description_x_x  Price  Av_Score  ...
0  d18bc79994ff35a2bca40977f81c4bc7     St. Johns Bay® Jamie Womens Suede Slouch Boots     Our Jamie suede slouch boots are soft, supple ...    ...       ...
1  d2a5ad88091a614e2a45987ad967250f  Samsung 7.5 Cu. Ft. Electric Dryer with Steam Dry  Skip the dry cleaners with this powerful ...         ...       ...
2  aae6a185a8dde153e40d8a6ed271998d        GE® ENERGY STAR® 7.8 Cu. ft. Electric Dryer  Meeting or exceeding federal guidelines f...         ...       ...
3  b31700da42b1a5363e7243a498e1ebd3                      Samsung 7.5 Cu. Ft. Gas Dryer  Get all the power you need with this elec...         ...       ...
4  a00bbd447d31c7c687e580c79bbf354d  LG ENERGY STAR® 7.4 Cu. Ft. Ultra Large Capaci...  Getting your laundry fresh and dry has ne...         ...       ...

5 rows × 27 columns
Further Exploration of the Augmented data ('augmented_data')
In [30]: augmented_data.info() # Display information about the data types and missing values
augmented_data.describe() # Display summary statistics for numerical columns in the augmented data
Int64Index: 11 entries, 0 to 10
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   uniq_id_x               11 non-null     object
 1   Name                    11 non-null     object
 2   description_x_x         11 non-null     object
 3   Price                   11 non-null     float64
 4   Av_Score                11 non-null     float64
 5   sku                     11 non-null     object
 6   name_title              11 non-null     object
 7   description_y           11 non-null     object
 8   list_price              11 non-null     object
 9   sale_price              11 non-null     object
 10  category                11 non-null     object
 11  category_tree           11 non-null     object
 12  average_product_rating  11 non-null     float64
 13  product_url             11 non-null     object
 14  product_image_urls      11 non-null     object
 15  brand                   11 non-null     object
 16  total_number_reviews    11 non-null     int64
 17  Reviews                 11 non-null     object
 18  Bought With             11 non-null     object
 19  Sentiment               11 non-null     float64
 20  Sentiment_Label         11 non-null     object
 21  uniq_id_y               11 non-null     int64
 22  ProductName             11 non-null     object
 23  ProductBrand            11 non-null     object
 24  Gender                  11 non-null     object
 25  NumImages               11 non-null     int64
 26  description_x_y         11 non-null     object
dtypes: float64(4), int64(3), object(20)
memory usage: 2.4+ KB
Out[30]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for Price, Av_Score, average_product_rating, total_number_reviews, Sentiment, uniq_id_y and NumImages
Data Visualization for the Augmented data ('augmented_data')
Data visualisation was carried out to gain an overall understanding of the augmented data and to identify trends and patterns.
The code below uses matplotlib.pyplot and seaborn to create the charts.
In [31]: # Create a bar chart for the total number of reviews for each product
sns.set(style="whitegrid") # Sets the style of seaborn
plt.figure(figsize=(12, 8))
sns.barplot(x='total_number_reviews', y='ProductName', data=augmented_data, hue='Gender')
# The below sets labels and title
plt.xlabel('Total Number of Reviews')
plt.ylabel('Product Name')
plt.title('Total Number of Reviews for Each Product')
plt.show() # Show the plot
'''
The x-axis represents the total number of reviews.
The y-axis represents the product names.
The hue parameter is used to differentiate between products for different genders.
'''
# Create a scatter plot for the relationship between product prices and average
sns.set(style="whitegrid") # Sets the style of seaborn
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Price', y='Av_Score', data=augmented_data, hue='Gender')
# Set labels and title
plt.xlabel('Product Price')
plt.ylabel('Average Score')
plt.title('Relationship Between Product Prices and Average Scores')
plt.show() # Show the plot
'''
The x-axis represents the product prices.
The y-axis represents the average scores.
Different colors are used to differentiate between products for different genders.
'''
Recommendation
Based on the average product reviews and the analysis of the top 10 products, here are possible recommendations for JCPenney:
-Quality Improvement for Low-Rated Products:
Identify products with consistently low average review ratings. A low average rating can be caused by a variety of factors, including poor product quality, functionality issues, inadequate customer support and unmet expectations. These can be addressed by improving the manufacturing process, the materials used or the product design. Continuous quality control ensures that products meet or exceed customer expectations.
-Promote highly rated products:
Highlight and promote products with a high average review score.
Use positive customer reviews in marketing materials and product descriptions to build trust and attract more customers.
-Adjust marketing strategy:
Adjust the marketing strategy based on the 10 most reviewed products, and include these products in marketing campaigns and promotions.
-Customer-centric approach:
Prioritize products that receive high levels of customer engagement and positive
feedback.
Align product development strategies with customer needs and wants.