Customer Segmentation (Cluster Analysis)
Customer Segmentation (Unsupervised Machine Learning for Cluster Analysis)
This project employs unsupervised machine learning for cluster analysis and customer segmentation. The dataset
used here comprises thousands of records of customer purchases and shopping habits. The goal of this project is,
first, to analyze and understand in depth the customer base present in the dataset, and, second, to use machine
learning algorithms for cluster analysis in order to break down the customer base into distinct customer groups.
Cluster analysis can be very useful for understanding a customer base and helping businesses tailor targeted marketing
strategies, optimize their product offerings, meet their customers' needs, and enhance the shopping
experience. As such, after segmenting the customer base into separate groups, the groups will be analyzed and
compared to develop a thorough understanding of their characteristics, preferences,
shopping habits, and needs. This will subsequently guide efforts to curate targeted marketing campaigns, develop
customer retention strategies, and enhance overall customer satisfaction and loyalty.
The present dataset was taken from Kaggle.com, a popular platform for finding and publishing datasets. The dataset
consists of 3,900 records of customer purchases. For each entry, a customer is assigned a unique identifier, and
their purchase, preferences, and other relevant details are recorded. The dataset encompasses a wide variety of
variables, including demographic information about the customers, their shopping frequency and purchase history,
their product preferences, and their overall satisfaction with the product purchased. This makes the dataset well
suited to analyzing and understanding consumer behavior and decision-making, and to cluster analysis and customer
segmentation.
You can view each column and its description in the table below:
| Variable | Description |
|---|---|
| Customer ID | Unique identifier for each customer |
| Age | Age of the customer |
| Gender | Gender of the customer |
| Item Purchased | Item or product purchased |
| Category | Category of the item purchased (e.g., clothing, accessory, etc.) |
| Purchase Amount (USD) | Amount spent (in USD) in a given transaction |
| Location | Location from which a purchase was made |
| Size | Size of the purchased item (if applicable) |
| Color | Color of the purchased item or product |
| Season | Season in which the item was purchased (e.g., winter, spring, etc.) |
| Review Rating | Rating score given by a customer for the item purchased (on a 5-point rating scale) |
| Subscription Status | Indicates whether or not a customer is subscribed to the brand or shop service |
| Shipping Type | Method of delivery or shipping type (e.g., standard shipping, express, store pickup, etc.) |
| Discount Applied | Indicates whether or not a discount was applied to the purchase |
| Promo Code Used | Indicates whether or not a promo code or coupon was used during purchase |
| Previous Purchases | Number of prior purchases made by the same customer |
| Payment Method | Method of payment for the purchase (e.g., cash, credit card, PayPal, etc.) |
| Frequency of Purchases | Frequency of engagement of a customer in purchasing activities (e.g., weekly, monthly, annually, etc.) |
To segment customers into different clusters, the data was first inspected,
engineered, and processed in preparation for analysis and modeling. After preparing the data, four clustering algorithms
were developed and evaluated in order to find the most suitable algorithm for the task. Having obtained the best clustering
model for the data, customers were segmented into groups and the resultant customer groups were analyzed in depth. A
report was written describing the findings and identifying the main characteristics or unique features of each customer
group, as well as the overall similarities and dissimilarities between the groups. On the basis of this
report, a subsequent section lays out the key insights and takeaways of the cluster analysis and
provides recommendations to improve sales or curate better marketing campaigns tailored to each customer group.
Overall, this project is broken down into 7 sections:
1. Reading and Inspecting the Data
2. Updating the Data
3. Exploratory Data Analysis
4. Data Preprocessing
5. Model Development and Evaluation (Cluster Analysis)
6. Model Interpretation
7. Key Insights and Recommendations
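The model development and evaluation step outlined above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual modeling code: it assumes synthetic data from `make_blobs` as a stand-in for the preprocessed customer features, compares only two of the candidate algorithms, and fixes the number of clusters at three for simplicity. It scores each model with the silhouette score (higher is better) and the Davies-Bouldin index (lower is better).

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed customer features (assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

# Candidate models; the cluster count of 3 is an illustrative choice
candidates = {
    'KMeans': KMeans(n_clusters=3, n_init=10, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
}

# Fit each model and record both internal validation metrics
scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X_scaled)
    scores[name] = {
        'silhouette': silhouette_score(X_scaled, labels),
        'davies_bouldin': davies_bouldin_score(X_scaled, labels),
    }

for name, s in scores.items():
    print(f"{name}: silhouette={s['silhouette']:.3f}, davies_bouldin={s['davies_bouldin']:.3f}")
```

Density-based models such as DBSCAN or MeanShift can be added to the same loop, but both metrics require at least two distinct labels, so a guard is needed when a model returns a single cluster.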
In [ ]: #If you're using the executable notebook version, please run this cell first
# to install the necessary Python libraries for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
In [ ]: #Import the modules for use
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from IPython.display import display, Markdown
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.cluster import KMeans, MeanShift, AgglomerativeClustering, DBSCAN, estimate_bandwidth
from sklearn.metrics import silhouette_score, davies_bouldin_score
import warnings
warnings.simplefilter("ignore")
#Adjust data display options
pd.set_option('display.max_columns', None)
#Set context for plotting
sns.set_theme(context='paper', style='darkgrid')
Defining Custom Functions
In [ ]: #Defining Custom Functions for later use
#Define function to get color palette for visualization
def get_colors(var, colors):
    if type(colors) == dict:
        return colors.get(var)
    elif type(colors) in (str, list, tuple):
        return colors
    else:
        return colors.colors

#Define function to create and return a scatter plot
def get_scatterplot(x_var: str, y_var: str, clusters_var: str = None, colors: any = None):
    ax=sns.scatterplot(data=df, x=x_var, y=y_var, hue=clusters_var, palette=colors, s=15, alpha=.75)
    ax.set_title(f'Relationship between {y_var} and {x_var}', fontsize=15)
    ax.set_xlabel(x_var, fontsize=12.5)
    ax.set_ylabel(y_var, fontsize=12.5)
    ax.legend(title=clusters_var, loc='upper right', alignment='left')

#Define function to create and return a boxen plot
def get_boxenplot(x_var: str, y_var: str, title_x: str, title_y: str, clusters_var: str = None, boxen: bool = True, colors
    if boxen:
        ax=sns.boxenplot(data=df, x=x_var, y=y_var, hue=clusters_var if clusters_var != None else x_var, palette=colors, s
                         order=order.get(x_var) if order!= None else None, alpha=.8 if len(np.unique(df[x_var]))<15 else
                         showfliers=False, width=.5, gap=.25 if len(np.unique(df[x_var])) < 10 else 0)
    else:
        ax=sns.boxplot(data=df, x=x_var, y=y_var, hue=clusters_var if clusters_var != None else x_var,
                       palette=colors, order=order.get(x_var) if order!= None else None,
                       width=.8, gap=.25 if len(np.unique(df[x_var])) < 10 else 0)
        for artist in ax.artists:
            r, g, b, _ = artist.get_facecolor()
            artist.set_facecolor((r, g, b, 0.8))
    if len(np.unique(df[x_var])) > 8:
        ax.set_aspect(aspect=(.4 if y_var != 'Purchase Amount (USD)' else .25))
    ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15)
    ax.set_xlabel(x_var, fontsize=12.5)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=(90 if (len(ax.get_xticklabels()) > 5 and x_var !='Age Group') else N
    ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
    ax.set_ylabel(y_var, fontsize=12.5)
    ax.legend(title=clusters_var, loc='upper right', alignment='left')

#Define a function to create and return a heatmap
def get_heatmap(x_var: str, y_var: str, title_x: str, title_y: str, clusters_var: str = None, colors: any = None, order: d
    xy_crosstab = pd.crosstab(index=df[y_var], columns=[df[x_var], df[clusters_var]]).reindex((order.get(y_var)[::-1] if (o
    if len(xy_crosstab.columns) <= 4:
        xy_crosstab = xy_crosstab.reorder_levels([1, 0], axis=1).sort_index(axis=1, level=[0, 1])
        xy_crosstab = xy_crosstab.reorder_levels([1, 0], axis=1)
    else:
        xy_crosstab=xy_crosstab.reindex(order.get(x_var,None) if order!= None else None, axis=1, level=0)
        for category in xy_crosstab.columns.levels[0][1:]:
            xy_crosstab[(category, '')] = 0
        xy_crosstab = xy_crosstab.sort_index(axis=1, level=[0, 1]).reindex(order.get(x_var,None) if order!= None else None
    ax=sns.heatmap(xy_crosstab, ax=plt.gca(), cmap='gray_r', annot=True, fmt='.0f', alpha=.65, linewidths=.8, annot_kws={
    ax.set_aspect(2 if (len(xy_crosstab.columns) > 4 and len(xy_crosstab.columns) >= len(xy_crosstab.index) and title_x!='
    ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=(12 if (ax.get_aspect()!='auto' and len(xy_cros
    ax.set_xlabel(' — '.join(ax.get_xlabel().split('-')), fontsize=11)
    ax.set_ylabel(ax.get_ylabel(), fontsize=12.5)
    xticklabels = ['' if item.get_text().endswith('-') else item for item in ax.get_xticklabels()]
    ax.set_xticklabels(xticklabels, fontsize=(6 if len(ax.get_xticklabels())>10 else 8))
    ax.set_yticklabels(ax.get_yticklabels(), rotation=(90 if 10 < len(ax.get_yticklabels()) <= 4 else None), fontsize=8)
    colors_dict={'Cluster 1': colors[0], 'Cluster 2': colors[1], 'Cluster 3': colors[2], '': 'w'}
    col_colors = [colors_dict[i] for i in xy_crosstab.columns.get_level_values(1)]
    for i, color in enumerate(col_colors):
        ax.fill_betweenx(y=[0,len(xy_crosstab.index)+1], x1=i, x2=i+1, color=color, alpha=.45)

#Define a function to create and return a pie plot
def get_pieplot(x_var: str, y_var: str, title_x: str, title_y: str, colors: any = None):
    xy_crosstab = pd.crosstab(index=df[x_var], columns=df[y_var])
    flat_crosstab = xy_crosstab.stack().reset_index()
    flat_crosstab.columns = [x_var, y_var, 'Count']
    aggregated_data = flat_crosstab.groupby([y_var, x_var])['Count'].sum().reset_index()
    sizes = aggregated_data['Count']
    labels = [f"{row[x_var]} - {row[y_var]}" for _, row in aggregated_data.iterrows()]
    ax=plt.gca()
    wedges, texts, autotexts = ax.pie(sizes, labels=labels, colors=colors[::-1], autopct='%1.1f%%', startangle=100, labeld
    for i,wedge in enumerate(wedges):
        if i % 2 != 0:
            continue
        angle = math.radians((wedge.theta1) % 360)
        x, y = math.cos(angle), math.sin(angle)
        line = plt.Line2D([0, 1.01*x], [0, 1.01*y], transform=ax.transData, color='w', linestyle='-', linewidth=3)
        plt.gcf().add_artist(line)
    ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15, pad=13)
    ax.patch.set_facecolor((ax.get_facecolor(),0.77))
    ax.set(xticks=[], yticks=[])
    ax.axis('equal')

#Define a function to create and return a bar plot
def get_barplot(x_var: str, y_var: str, title_x: str, title_y: str, clusters_var: str = None, colors: any = None, order: d
    if y_var == 'Review Rating':
        ax=sns.barplot(data=df, x=x_var, y=y_var, hue=clusters_var if clusters_var != None else x_var, palette=colors, alp
                       width=(.8 if len(np.unique(df[y_var]))>2 else .5), order=order.get(x_var) if order != None else None
        ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15)
        ax.set_xlabel(x_var, fontsize=12.5)
        ax.set_ylabel(y_var, fontsize=12.5)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=(90 if (len(ax.get_xticklabels()) > 5 and x_var !='Age Group') e
        ax.set_ylim(0, 4.5)
    else:
        if clusters_var is None:
            color, cmap = None if len(colors) != 2 else colors, colors if len(colors) != 2 else None
            xy_crosstab = pd.crosstab(index=df[y_var], columns=df[x_var]).reindex(order.get(y_var,df[y_var].unique()) if ty
            xy_crosstab = xy_crosstab.reindex(order.get(x_var, df[x_var].unique()) if type(order) == dict else df[x_var].u
            ax=xy_crosstab.plot(kind='bar', ax=plt.gca(), color=color, cmap=cmap, alpha=.8 if len(np.unique(df[x_var])) <
            ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15)
            ax.set_xlabel(y_var, fontsize=12.5)
            ax.set_xticklabels(ax.get_xticklabels(), rotation=(90 if (len(ax.get_xticklabels()) > 4 and y_var != 'Age Grou
            ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
            ax.set_ylabel('Count', fontsize=12.5)
        else:
            if (title_y == 'Gender' and len(np.unique(df[title_x]))>2) or (title_x=='Gender' and len(np.unique(df[title_y]
                xy_crosstab = pd.crosstab(index=df[y_var], columns=[df[x_var], df[clusters_var]]).reindex(order.get(y_var,d
                xy_crosstab = xy_crosstab.sort_index(axis=1, level=[0,0]).reindex(order.get(x_var,df[x_var].unique()) if ty
                colors_dict={'Cluster 1': colors[0], 'Cluster 2': colors[1], 'Cluster 3': colors[2]}
                colors = [colors_dict[i] for i in xy_crosstab.columns.get_level_values(1)]
                ax=xy_crosstab.plot(kind='bar', ax=plt.gca(), color=colors, alpha=.8 if len(np.unique(df[x_var])) < 15 else
                ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15)
                ax.set_xlabel(y_var, fontsize=12.5)
                ax.set_xticklabels(ax.get_xticklabels(), rotation=(90 if (len(ax.get_xticklabels()) > 4 and y_var != 'Age G
                ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
                ax.set_ylabel('Count', fontsize=12.5)
            else:
                xy_crosstab = pd.crosstab(index=df[y_var], columns=[df[x_var], df[clusters_var]], dropna=False).sort_index
                xy_crosstab = xy_crosstab.sort_index(axis=1, level=[1,1]).reindex(order.get(x_var,df[x_var].unique()) if o
                colors_dict={'Cluster 1': colors[0], 'Cluster 2': colors[1], 'Cluster 3': colors[2]}
                ax=xy_crosstab.plot(kind='bar', ax=plt.gca(), color=[colors_dict[i] for i in xy_crosstab.columns.get_level_
                ax.set_title(f'Relationship between {title_y} and {title_x}', fontsize=15)
                ax.set_xlabel(y_var, fontsize=12.5)
                ax.set_xticklabels(ax.get_xticklabels(), rotation=(90 if (len(np.unique(df[y_var])) > 4 and y_var != 'Age G
                ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
                ax.set_ylabel('Count', fontsize=12.5)
                #Add hatching pattern to mark empty bars
                for cluster_idx, hatch_value in enumerate(xy_crosstab.columns.get_level_values(0)):
                    for bar in ax.containers[cluster_idx].patches:
                        bar.set_hatch('x') if hatch_value == np.unique(df[x_var])[0] else bar.set_hatch('')
                #Add a horizontal line to indicate empty bar
                for cluster_idx in range(len(ax.containers)):
                    for bar in ax.containers[cluster_idx]:
                        if bar.get_height() == 0:
                            x = bar.get_x() + bar.get_width() / 2
                            y = bar.get_y()
                            ax.plot([x - bar.get_width() / 2, x + bar.get_width() / 2], [y, y], color=[colors*2][0][cluste
                #adjust the legend's handles
                legend_handles = []
                for cluster,color in colors_dict.items():
                    for hatch_value in np.unique(df[x_var])[::-1]:
                        legend_handles.append(Patch(facecolor=color, edgecolor='lightgray', hatch='xxx' if hatch_value ==
                ax.legend(handles=legend_handles, title=f'Clusters | {x_var}', loc='upper right', alignment='left')

#Define helper function to analyze data and visualize the results
def Get_Plots(x_vars: str | list, y_vars : str | list, clusters_var: str = None, colors: any = plt.get_cmap('tab10'), orde
    for y_var in pd.Index(y_vars):
        fig = plt.figure(facecolor='ghostwhite', dpi=150)
        if clusters_var is not None:
            n_cols, n_rows = 4, 4
            fig.set_size_inches(40, 35)
            plt.subplots_adjust(wspace=.28, hspace=.28, top=.94)
            plt.suptitle(f'Customer Segmentation by {y_var}', fontsize=33.5)
        else:
            n_cols = kwargs.get('n_cols', 4 if len(x_vars) - 1 > 5 else len(x_vars))
            n_rows = kwargs.get('n_rows', math.ceil(len(x_vars) / n_cols))
            fig.set_size_inches(12*n_cols, 10*n_rows)
            plt.subplots_adjust(wspace=.28, hspace=.28, top=.92)
            plt.suptitle(f'Bivariate Analysis by {y_var}', fontsize=33.5)
        for i,x_var in enumerate(pd.Index(x_vars).drop(labels=y_var, errors='ignore')):
            #Create subplot for current variables
            plt.subplot(n_rows, n_cols, i+1)
            title_x, title_y = x_var, y_var
            Num_x_categories, Num_y_categories = len(np.unique(df[x_var])), len(np.unique(df[y_var]))
            #Adjust type of plot based on data types of the variables
            if df[x_var].dtype != 'object' and df[y_var].dtype != 'object':
                #visualize data using scatter plot
                color_palette = get_colors(title_y, colors)
                get_scatterplot(x_var, y_var, clusters_var, color_palette)
            elif df[x_var].dtype == 'object' and df[y_var].dtype != 'object':
                #visualize data using boxen plot
                color_palette = get_colors(title_y, colors)
                boxen = kwargs.get('boxen', True) if len(df[x_var].unique())>2 else True
                get_boxenplot(x_var, y_var, title_x, title_y, clusters_var, boxen, color_palette, order)
            elif df[x_var].dtype != 'object' and df[y_var].dtype == 'object':
                #switch variables on the xy axes
                title_x, title_y = x_var, y_var
                x_var, y_var, Num_x_categories, Num_y_categories = y_var, x_var, Num_y_categories, Num_x_categories
                switched = True
                if title_x == 'Review Rating':
                    #visualize data using bar plot
                    color_palette = get_colors(title_y, colors)
                    get_barplot(x_var, y_var, title_x, title_y, clusters_var, color_palette, order)
                else:
                    #visualize data using boxen plot
                    color_palette = get_colors(title_y, colors)
                    boxen = kwargs.get('boxen', True) if len(df[x_var].unique())>2 else True
                    get_boxenplot(x_var, y_var, title_x, title_y, clusters_var, boxen, color_palette, order)
            elif df[x_var].dtype == 'object' and df[y_var].dtype == 'object':
                title_x, title_y = x_var, y_var
                if Num_x_categories > Num_y_categories:
                    #switch variables on the xy axes
                    x_var, y_var, Num_x_categories, Num_y_categories = y_var, x_var, Num_y_categories, Num_x_categories
                    switched = True
                heatmap_conditions = ((clusters_var is not None) and ((Num_x_categories > 2) or (Num_x_categories==2 and N
                if heatmap_conditions:
                    #visualize data using heatmap
                    color_palette = get_colors(title_y, colors)
                    get_heatmap(x_var, y_var, title_x, title_y, clusters_var, color_palette, order)
                else:
                    if kwargs.get('pie', False) and Num_y_categories==4:
                        #visualize data using pie plot
                        color_palette = get_colors(title_y, colors)
                        get_pieplot(x_var, y_var, title_x, title_y, color_palette)
                    else:
                        #visualize data using bar plot
                        color_palette = get_colors(title_y, colors)
                        get_barplot(x_var, y_var, title_x, title_y, clusters_var, color_palette, order)
            #return variables back to original for next iteration
            try:
                if switched==True:
                    x_var, y_var, Num_x_categories, Num_y_categories = y_var, x_var, Num_y_categories, Num_x_categories
                    switched = False
            except:
                continue
        plt.show()
        if len(y_vars)>1:
            display(Markdown('<
Part One: Reading and Inspecting the Data
In this section, I will access and load the data file, inspect its shape and data types, and look for missing entries or duplicates in
the data before proceeding with the necessary data cleaning or updating.
Loading and reading the dataset
In [ ]: #Access and read data into dataframe
df = pd.read_csv('shopping customers dataset.csv').drop('Customer ID',axis=1)
#Preview the first 10 entries
df.head(10)
Out[ ]:
| | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Shipping Type | Discount Applied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes |
| 1 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes |
| 2 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes |
| 3 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes |
| 4 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes |
| 5 | 46 | Male | Sneakers | Footwear | 20 | Wyoming | M | White | Summer | 2.9 | Yes | Standard | Yes |
| 6 | 63 | Male | Shirt | Clothing | 85 | Montana | M | Gray | Fall | 3.2 | Yes | Free Shipping | Yes |
| 7 | 27 | Male | Shorts | Clothing | 34 | Louisiana | L | Charcoal | Winter | 3.2 | Yes | Free Shipping | Yes |
| 8 | 26 | Male | Coat | Outerwear | 97 | West Virginia | L | Silver | Summer | 2.6 | Yes | Express | Yes |
| 9 | 57 | Male | Handbag | Accessories | 31 | Missouri | M | Pink | Spring | 4.8 | Yes | 2-Day Shipping | Yes |
Inspecting the data
Inspecting the data shape
In [ ]: #Report the shape of the dataframe
shape = df.shape
print('Number of columns:', shape[1])
print('Number of rows:', shape[0])
Number of columns: 17
Number of rows: 3900
Checking the data type and number of entries
In [ ]: #Inspect column headers, data type, and number of entries
print(df.info())
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Age                     3900 non-null   int64
 1   Gender                  3900 non-null   object
 2   Item Purchased          3900 non-null   object
 3   Category                3900 non-null   object
 4   Purchase Amount (USD)   3900 non-null   int64
 5   Location                3900 non-null   object
 6   Size                    3900 non-null   object
 7   Color                   3900 non-null   object
 8   Season                  3900 non-null   object
 9   Review Rating           3900 non-null   float64
 10  Subscription Status     3900 non-null   object
 11  Shipping Type           3900 non-null   object
 12  Discount Applied        3900 non-null   object
 13  Promo Code Used         3900 non-null   object
 14  Previous Purchases      3900 non-null   int64
 15  Payment Method          3900 non-null   object
 16  Frequency of Purchases  3900 non-null   object
dtypes: float64(1), int64(3), object(13)
memory usage: 518.1+ KB
None
Checking for missing entries
In [ ]: #Report number of missing values per column
print('Number of missing values per column:')
print(df.isna().sum())
Number of missing values per column:
Age                       0
Gender                    0
Item Purchased            0
Category                  0
Purchase Amount (USD)     0
Location                  0
Size                      0
Color                     0
Season                    0
Review Rating             0
Subscription Status       0
Shipping Type             0
Discount Applied          0
Promo Code Used           0
Previous Purchases        0
Payment Method            0
Frequency of Purchases    0
dtype: int64
Checking for data duplicates
In [ ]: #Report number of duplicates
print('Number of duplicate values: ', df.duplicated().sum())
Number of duplicate values:  0
Based on present data inspection, it seems that there are no missing or NaN (not a number) entries in the data, no duplicates, and
all the data are in the correct data format. Next I will update and enrich the data by creating two new columns to represent age
group and region by state.
Part Two: Updating the Data
In this section, I will create two new columns, age group and region, to allow a broader view of the data by age group and
local region and to support better generalizations. The age groups follow commonly used age cohorts:
young adults (18-25 years), adults (26-35 years), middle-aged adults (36-45 years), older adults (46-55 years), seniors (56-65
years), and elderly (over 65 years). For the region column, I will aggregate the American states in the dataset into 8 distinct
regions following the Bureau of Economic Analysis' lead: Far West, Great Lakes, Mideast, New England, Plains,
Rocky Mountain, Southeast, and Southwest.
Creating a column for age group
In [ ]: #Specify age bins
age_bins = [17, 25, 35, 45, 55, 65, float('inf')]
#Specify labels for the age bins
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
#i.e., young adults, adults, middle-aged adults, older
#Perform data binning on the Age column to get new Age Groups column
age_group_col = pd.cut(df['Age'], bins=age_bins, labels=age_labels, ordered=True).astype('object')
df.insert(1, 'Age Group', age_group_col)
#preview obtained age groups
df[['Age', 'Age Group']].sample(10)
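Out[ ]:
| | Age | Age Group |
|---|---|---|
| 945 | 38 | 36-45 |
| 3001 | 29 | 26-35 |
| 2859 | 31 | 26-35 |
| 3623 | 39 | 36-45 |
| 3048 | 69 | 65+ |
| 3321 | 32 | 26-35 |
| 550 | 62 | 56-65 |
| 997 | 64 | 56-65 |
| 582 | 40 | 36-45 |
| 772 | 18 | 18-25 |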
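To make the bin boundaries explicit, the short check below reuses the same `age_bins` and `age_labels` on a handful of boundary ages. The `boundary_ages` series is made up for illustration. Because `pd.cut` uses right-closed intervals by default, an age of 25 falls in '18-25' and an age of 65 falls in '56-65', so the '65+' label in practice covers ages 66 and above.

```python
import pandas as pd

# Same bin edges and labels as in the cell above
age_bins = [17, 25, 35, 45, 55, 65, float('inf')]
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']

# Hypothetical ages chosen to sit on or next to the bin edges
boundary_ages = pd.Series([18, 25, 26, 65, 66, 70])

# Intervals are right-closed by default: (17, 25], (25, 35], ..., (65, inf)
groups = pd.cut(boundary_ages, bins=age_bins, labels=age_labels, ordered=True).astype('object')

for age, group in zip(boundary_ages.tolist(), groups.tolist()):
    print(age, '->', group)
```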
Creating a column for region by state
In [ ]: #specify the regions by states
regions_dict = {
'Far West': ['Alaska', 'California', 'Hawaii', 'Nevada', 'Oregon', 'Washington'],
'Great Lakes': ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin'],
'Mideast': ['Delaware', 'District of Columbia', 'Maryland', 'New Jersey', 'New York', 'Pennsylvania'],
'New England': ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont'],
'Plains': ['Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota'],
'Rocky Mountains': ['Colorado', 'Idaho', 'Montana', 'Utah', 'Wyoming'],
'Southeast': ['Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky', 'Louisiana', 'Mississippi', 'North Carolina', 'S
'Southwest': ['Arizona', 'New Mexico', 'Oklahoma', 'Texas'] }
#Create a state to region dictionary
state_to_region = {state: region for region, states in regions_dict.items() for state in states}
#Create new regions column
region_col = df['Location'].map(lambda state: state_to_region.get(state))
df.insert(7, 'Region', region_col)
#preview the obtained regions
df[['Location', 'Region']].sample(10)
Out[ ]:
| | Location | Region |
|---|---|---|
| 3215 | Rhode Island | New England |
| 3742 | Maryland | Mideast |
| 10 | Arkansas | Southeast |
| 3177 | Missouri | Plains |
| 3838 | Rhode Island | New England |
| 3331 | California | Far West |
| 418 | Illinois | Great Lakes |
| 2490 | Montana | Rocky Mountains |
| 2204 | Oregon | Far West |
| 665 | New Mexico | Southwest |
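A random sample of ten rows cannot confirm that every state was mapped. One possible sanity check, sketched here with a deliberately abbreviated and hypothetical lookup rather than the full `regions_dict`, is to list any location whose region came back missing. A state absent from the lookup maps to NaN, so it shows up in `unmapped`.

```python
import pandas as pd

# Abbreviated, hypothetical lookup for illustration only
regions_dict = {
    'Far West': ['California', 'Oregon'],
    'Southwest': ['Texas'],
}
state_to_region = {state: region for region, states in regions_dict.items() for state in states}

# Toy location column; 'Ohio' is intentionally missing from the lookup
locations = pd.Series(['California', 'Texas', 'Oregon', 'Ohio'], name='Location')
regions = locations.map(state_to_region)

# Any state without a region appears here
unmapped = sorted(locations[regions.isna()].unique())
print('Unmapped states:', unmapped)
```

On the real data, an empty list would indicate that all 50 locations were assigned a region.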
Part Three: Exploratory Data Analysis
In this section, I will explore the data in more detail, obtaining descriptive statistical summaries, performing univariate and
bivariate analyses based on the data type, and examining the frequency distribution of each variable in the data in order to get a
better overview of the current dataset.
Descriptive Statistics
Numerical Data
In [ ]: #Get statistical summary of the numerical data
display(df.describe().round(2).T, Markdown(''))
#Show frequency distribution of numerical data using histogram
plt.figure(figsize=(12,9), facecolor='ghostwhite')
plt.suptitle('Frequency Distribution for Numerical Variables', fontsize=14.5)
plt.subplots_adjust(hspace=.25, wspace=.25, top=.94)
for i, col in enumerate(df.select_dtypes(exclude='object')):
    plt.subplot(2,2,i+1)
    ax=sns.histplot(data=df, x=col, bins=10, color='#4C72B0')
    ax.set_xlabel(str(col), fontsize=11)
    ax.set_ylabel('Total Count', fontsize=11)
plt.show()
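| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 3900.0 | 44.07 | 15.21 | 18.0 | 31.0 | 44.0 | 57.0 | 70.0 |
| Purchase Amount (USD) | 3900.0 | 59.76 | 23.69 | 20.0 | 39.0 | 60.0 | 81.0 | 100.0 |
| Review Rating | 3900.0 | 3.75 | 0.72 | 2.5 | 3.1 | 3.7 | 4.4 | 5.0 |
| Previous Purchases | 3900.0 | 25.35 | 14.45 | 1.0 | 13.0 | 25.0 | 38.0 | 50.0 |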
As depicted here, most of the numerical variables are approximately uniformly distributed, with no visible skew in either direction.
Based on the statistical summary table, the ages of customers in the dataset range from 18 to 70 years, with a mean of 44 years.
Purchase amounts range from 20 USD to 100 USD with a mean of about 60 USD. Lastly, the observed ratings range from 2.5 to 5.0
on a 5-point scale, with a mean of 3.75 and a median of 3.7. Now, I will review non-numeric data.
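The absence of skew can also be checked numerically rather than by eye. The sketch below uses `DataFrame.skew()` on synthetic, uniformly drawn columns that only mimic the ranges reported above; the `demo` frame is an assumption and does not contain the real data. Values near zero indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic columns drawn uniformly over the observed ranges (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=3900),
    'Purchase Amount (USD)': rng.integers(20, 101, size=3900),
})

# Sample skewness per column; close to 0 for symmetric distributions
skewness = demo.skew()
print(skewness.round(3))
```

On the notebook's dataframe, the same call would be `df.select_dtypes(exclude='object').skew()`.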
Categorical Data
In [ ]: #Get statistical summary of non-numeric (categorical) data
display(df.describe(include='object').T, Markdown(''))
#Show distribution of categorical data
n_rows, n_cols = 3,5
plt.figure(figsize=(46,30), facecolor='ghostwhite', dpi=150)
plt.suptitle('Count Distribution for Categorical Variables', fontsize=40)
plt.subplots_adjust(hspace=.3, top=.94)
for i,col in enumerate(df.select_dtypes(include='object').columns):
    order_dict = {'Age Group': age_labels, 'Size': ['S', 'M', 'L', 'XL'], 'Season': ['Winter', 'Spring', 'Summer', 'Fall']
    plt.subplot(n_rows, n_cols, i+1)
    if len(df[col].unique())==2 or col=='Category':
        ax=plt.gca()
        ax.pie(df[col].value_counts(), labels=df[col].value_counts().index, autopct='%1.0f%%', labeldistance=1.05, startang
        ax.patch.set_facecolor((ax.get_facecolor(),0.95))
        ax.set_xlabel(str(col), fontsize=14, labelpad=12)
        ax.set_ylabel('%', fontsize=15, labelpad=12)
        ax.set_xlim(xmin=-1.115, xmax=1.115)
        ax.set_ylim(ymin=-1.1, ymax=1.1)
        ax.set(xticks=[], yticks=[])
    else:
        ax=sns.countplot(x=df[col], color='#4C72B0', order=order_dict.get(col, None))
        ax.set_xticklabels(ax.get_xticklabels(), rotation=(0 if (len(np.unique(df[col])) <= 4 or col=='Age Group') else 60
        ax.set_xlabel(str(col), fontsize=14, labelpad=12)
        ax.set_ylabel('Total Count', fontsize=13)
plt.show()
| | count | unique | top | freq |
|---|---|---|---|---|
| Age Group | 3900 | 6 | 46-55 | 753 |
| Gender | 3900 | 2 | Male | 2652 |
| Item Purchased | 3900 | 25 | Blouse | 171 |
| Category | 3900 | 4 | Clothing | 1737 |
| Location | 3900 | 50 | Montana | 96 |
| Region | 3900 | 8 | Southeast | 947 |
| Size | 3900 | 4 | M | 1755 |
| Color | 3900 | 25 | Olive | 177 |
| Season | 3900 | 4 | Spring | 999 |
| Subscription Status | 3900 | 2 | No | 2847 |
| Shipping Type | 3900 | 6 | Free Shipping | 675 |
| Discount Applied | 3900 | 2 | No | 2223 |
| Promo Code Used | 3900 | 2 | No | 2223 |
| Payment Method | 3900 | 6 | PayPal | 677 |
| Frequency of Purchases | 3900 | 7 | Every 3 Months | 584 |
As both the statistical summary table and the graphs show, the most common customer profile is Southeastern, aged 46-55, and male,
with males making up around two-thirds of the sample (2,652 of 3,900). Clothing is the most purchased category (1,737 purchases),
and most customers (2,847) are not subscribers of the shop or brand in question. Most of the remaining variables once again
exhibit a roughly uniform distribution with little variation between their categories. Now I will explore the data more
closely by performing bivariate analysis to uncover the relationships between the features in the data.
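Shares such as "around two-thirds male" can be read directly from `value_counts(normalize=True)`. The sketch below rebuilds a gender column from the counts in the summary table (2,652 males out of 3,900, which leaves 1,248 for the only other category); the `gender` series is a reconstruction for illustration, not the notebook's dataframe.

```python
import pandas as pd

# Gender column rebuilt from the summary counts (illustrative reconstruction)
gender = pd.Series(['Male'] * 2652 + ['Female'] * 1248, name='Gender')

# Relative frequency of each category
shares = gender.value_counts(normalize=True)
print(shares.round(2))
```

The same call on any categorical column of `df`, for example `df['Region'].value_counts(normalize=True)`, gives the proportion of customers in each category.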
Bivariate Analysis
For this part of the analysis, I will pick the features that seem most important or relevant for the current task and analyze them in
relation to each other and to other features that may be relevant. Accordingly, I will analyze the dataset by gender, age group,
purchase amount, purchase frequency, and region.
Bivariate Analysis by Gender
In [ ]: #Define target variable and features to compare it to
target = ['Gender']
features = ['Age', 'Age Group', 'Purchase Amount (USD)', 'Previous Purchases', 'Season',
'Frequency of Purchases', 'Category', 'Region', 'Review Rating', 'Subscription Status']
#Analyze data and plot results
Get_Plots(features, target, colors=['#3b5998', '#b92b27'], order={'Age Group': age_labels}, pie=True, n_rows=2, n_cols=5)
Performing bivariate analysis by gender, we once again observe a higher number of male customers than female customers
across all measures. Beyond the number of customers, however, there are no notable differences between male and female
customers across almost all of the variables analyzed, including age, spending (purchase amounts), number of previous
purchases, average rating scores, and region. Most customers, whether male or female, are adults to seniors; they tend to spend
around 40 to 80 dollars per purchase; they shop consistently across the year, mostly for clothes followed by accessories;
they have equivalent satisfaction levels, judging by review rating scores; and the highest number of customers of both genders
is concentrated in the Southeast. Where male and female customers differ most drastically is in
subscription status: most, or perhaps all, subscribers to the brand or shop services are male.
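One way to quantify the gender split among subscribers is a row-normalized cross-tabulation. The sketch below uses a tiny made-up `toy` frame, not the real data, so the printed shares are purely illustrative; with `normalize='index'` each row sums to 1, giving the gender composition within each subscription status.

```python
import pandas as pd

# Tiny made-up sample for illustration (not the project's data)
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'Subscription Status': ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

# Share of each gender within each subscription status (rows sum to 1)
share = pd.crosstab(toy['Subscription Status'], toy['Gender'], normalize='index')
print(share)
```

Applied to `df`, a 'Yes' row with a male share of 1.0 would confirm that every subscriber in the dataset is male.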
Bivariate Analysis by Age Group
In [ ]: #Define target and relevant features
target = ['Age Group']
features = ['Gender', 'Purchase Amount (USD)', 'Previous Purchases', 'Season',
'Frequency of Purchases', 'Category', 'Region', 'Review Rating', 'Subscription Status']
#Analyze data and plot results
Get_Plots(features, target, colors='vlag', order={'Age Group': age_labels}, boxen=False, n_rows=2, n_cols=5)
Moving to bivariate analysis by age group, we don't find much variation between the different age cohorts in relation to most of the
variables considered, at least nothing that univariate analysis of age groups alone did not reveal. Where age group seems to be
most relevant is in relation to shopping frequency, region, and, to a lesser degree, subscription status.
First, at one extreme, those who shop most frequently (bi-weekly) tend to be predominantly adults aged 26 to 35 years,
followed by older adults aged 46 to 55 years, whereas, at the other extreme, those who shop least frequently
(annually) tend to be mostly seniors aged 56 to 65 years, followed by middle-aged adults (36-45 years). Second, age cohort
seems to be relevant to shopping behavior in relation to region. Most notably, customers in the Southeast, the region with the highest
number of customers, tend to be predominantly seniors aged 56 to 65 years, followed by adults aged 26 to 35. This is also the case,
though to a lesser degree, in the Mideast. We also find notable differences in the Rocky Mountains region, New
England, and the Plains. Across these 3 regions, older adults (46-55 years) tend to be overrepresented (except in the
Plains, where they compete for the top spot with adults aged 26 to 35). More generally, the age group least engaged across all
regions is the elderly (over 65 years), followed by young adults (18-25 years). Lastly, age may play a role in subscription
rates. In particular, most subscribers to the brand or shop services tend to be middle-aged adults, and the smallest group of subscribers
tends to be the elderly.
Bivariate Analysis by Purchase Amount
In [ ]: #Define target and relevant features
target = ['Purchase Amount (USD)']
features = ['Gender', 'Age', 'Age Group', 'Previous Purchases', 'Season', 'Frequency of Purchases',
'Category', 'Region', 'Review Rating', 'Subscription Status', 'Discount Applied', 'Promo Code Used']
#Analyze data and plot results
Get_Plots(features, target, colors='vlag', order={'Age Group': age_labels}, boxen=False)
Analyzing by purchase amount (per purchase), we find general uniformity across most variables. Customers do not differ much in
spending based on gender, age, previous purchases, frequency of purchase, season, or region, among others. Perhaps most
surprising is that customers' spending also does not differ with subscription status, the application of a discount, or the use of
promo codes, which seems to suggest that these benefits are largely symbolic rather than a real advantage. All in all, across the
different dimensions, customers tend to spend mostly between 40 and 80 dollars on a given purchase, with a median purchase
amount of 60 dollars.
Bivariate Analysis by Frequency of Purchases
In [ ]: #Define target and relevant features
target = ['Frequency of Purchases']
features = ['Gender', 'Age', 'Age Group', 'Purchase Amount (USD)', 'Previous Purchases', 'Season', 'Frequency of Purchases',
'Category', 'Region', 'Review Rating', 'Subscription Status', 'Discount Applied', 'Promo Code Used']
#Analyze data and plot results
order_dict = {'Age Group': age_labels, 'Frequency of Purchases': ['Bi-Weekly', 'Weekly', 'Fortnightly', 'Monthly', 'Every
Get_Plots(features, target, colors='vlag', order=order_dict, boxen=False)
Analyzing by frequency of purchase, we once again find general uniformity across the different variables. Males are
overrepresented across all purchasing frequency groups compared to females; frequency of purchase doesn't seem to vary much
with purchase amount or purchase history, season, or the category of the item purchased. Oddly enough, the frequency of
purchase also does not seem to be influenced much by subscription status or the presence of discounts and promo codes, which
again may reflect their symbolic nature.
Frequency of purchase varies somewhat more in relation to age group and, to a lesser degree, region. Although examining by age
in absolute terms does not reveal much difference, analyzing by age cohorts, as discussed earlier, gives a more nuanced picture.
First, customers who shop most frequently, i.e. twice a week, are mostly adults aged 26 to 35, and they remain a prominent
customer group across all other shopping frequencies. Conversely, customers who shop least frequently, i.e. annually, are mostly
seniors aged 56 to 65. More generally, younger adults shop mostly every 3 months or annually, followed by twice a week or once
every two weeks (fortnightly); however, they make up a small part of the customer base overall, the second smallest after the
elderly group aged 65 and above.
The second, slightly notable finding has to do with region (note that the dataset records 'Quarterly' and 'Every 3 Months' as
separate categories). Customers in the Southeast, the region with the highest number of customers, tend to shop either weekly or
quarterly, and less so monthly or every 3 months. In the Plains, customers mostly shop quarterly or every 3 months, and less so
weekly or bi-weekly. Mideast customers tend to shop either very frequently, twice a week, or once every 3 months. Customers in
the Rocky Mountains region also mostly shop once every 3 months, but they are the least engaged in frequent shopping. Lastly,
customers in the Great Lakes tend to shop mostly once every 2 weeks. There are no notable differences in shopping frequency
for the remaining regions.
Bivariate Analysis by Region
In [ ]: #Define target and relevant features
target = ['Region']
features = ['Gender', 'Age', 'Age Group', 'Purchase Amount (USD)', 'Previous Purchases',
'Season', 'Frequency of Purchases', 'Category', 'Review Rating', 'Subscription Status']
#Analyze data and plot results
order_dict = {'Age Group': age_labels, 'Frequency of Purchases': ['Bi-Weekly', 'Weekly', 'Fortnightly', 'Monthly', 'Every
Get_Plots(features, target, colors='vlag', order=order_dict, boxen=False)
Finally, bivariate analysis by region, the last variable considered here, reveals a number of interesting relationships between
shopping habits and region. The first is between region and age group, which I discussed at length earlier. As mentioned,
customers in the Southeast and the Mideast tend to be mostly seniors or adults. In both New England and the Rocky Mountains,
most customers tend to be middle-aged adults. We find the same in the Plains; however, middle-aged adults compete there with
adults for the most represented age cohort. Consistently across all regions, the least engaged customers are the elderly,
followed by young adults.
We also find some relationship between region and frequency of purchase. Again as discussed earlier, customers in the Southeast
tend to shop either very frequently, i.e. weekly, or quarterly; thus the Southeast seems to be divided into a group of very frequent
buyers and a group of more seasonal buyers. We see something similar in the Mideast and the Rocky Mountains, with most
customers being either very frequent buyers, shopping as often as twice a week, or less frequent buyers, shopping mostly
once every 3 months. Some regions, such as the Plains, feature mostly seasonal shoppers, who shop mostly quarterly or every 3
months; others, particularly the Great Lakes, feature mostly frequent buyers, who shop mostly once every 2 weeks. No significant
differences are observable for the remaining three regions.
Lastly, we find a relationship between some of the regions and shopping season. Customers in the Southeast tend to shop
slightly more in the winter and least in the summer. Customers in New England shop significantly more in the winter than in all
other seasons. Customers in the Far West shop most in the spring. And, finishing with the last region that seems to have a notable
relationship with season, customers in the Plains tend to shop most in the fall and least in the winter.
Now that we have familiarized ourselves with the data, establishing a better overview and developing a rough idea about what to
expect, I will proceed with the necessary data preparations and subsequent model development for clustering and customer
segmentation.
Part Four: Data Preprocessing
In preparation for model development to perform cluster analysis and segment customers, in this section I will apply the
data transformations needed to make the data suitable for modeling and numerical analysis. As most of the variables in the
data are categorical, I first have to convert them to numeric data to be able to analyze them. Accordingly, I will first perform
One-Hot Encoding on the categorical variables. Second, I will perform feature scaling so that all the variables in the dataset
are represented on the same range, from 0 to 1, and no variable dominates the distance computations simply because of its
scale.
Dealing with Categorical Variables: One-Hot Encoding
First, I will identify the categorical variables in the data and then perform one-hot encoding on them to convert their categorical values to
numeric values suitable for numerical analysis and modeling. This method creates a new binary column for each of the unique values
in a given categorical variable, assigning 1 to signify its presence or 0 to signify its absence.
In [ ]: #Identify categorical variables
categorical_cols = df.select_dtypes(include='object').columns
#Now we can perform one-hot encoding on the identified columns
#Create encoder object
OHE_encoder = OneHotEncoder(handle_unknown='ignore')
#Perform One-Hot encoding and return new dataframe with the variables encoded
df_encoded_vars = pd.DataFrame(OHE_encoder.fit_transform(df[categorical_cols]).toarray())
df_encoded_vars.columns = OHE_encoder.get_feature_names_out(categorical_cols)
#Create new dataframe joining the new encoded categories with the earlier numerical variables
df_encoded = pd.concat([df.drop(categorical_cols,axis=1), df_encoded_vars], axis=1)
#Examine dataframe shape after encoding
print('Number of columns:', df_encoded.shape[1])
print('Number of rows:', df_encoded.shape[0])
print()
#preview head of the new dataframe
df_encoded.head()
Number of columns: 157
Number of rows: 3900
Out[ ]:
   Age  Purchase Amount (USD)  Review Rating  Previous Purchases  Age Group_18-25  Age Group_26-35  Age Group_36-45  Age Group_46-55  Age Group_56-65  Age Group_65+  Gender_Female
0   55                     53            3.1                  14              0.0              0.0              0.0              1.0              0.0            0.0            0.0
1   19                     64            3.1                   2              1.0              0.0              0.0              0.0              0.0            0.0            0.0
2   50                     73            3.1                  23              0.0              0.0              0.0              1.0              0.0            0.0            0.0
3   21                     90            3.5                  49              1.0              0.0              0.0              0.0              0.0            0.0            0.0
4   45                     49            2.7                  31              0.0              0.0              1.0              0.0              0.0            0.0            0.0
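To make the encoding step above concrete, here is a minimal standard-library sketch of what one-hot encoding does to a few rows. The records and the helper one_hot_encode are hypothetical and only illustrate the idea; the notebook itself uses scikit-learn's OneHotEncoder, whose column names follow the same "<variable>_<value>" pattern seen in the preview.

```python
# Hypothetical toy records; the values are illustrative, not rows from the dataset.
records = [
    {"Gender": "Male", "Season": "Winter"},
    {"Gender": "Female", "Season": "Fall"},
    {"Gender": "Male", "Season": "Fall"},
]


def one_hot_encode(rows, columns):
    """Return (feature_names, encoded_rows) with one 0/1 column per category."""
    # Collect the sorted unique values of each categorical column.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    # Name each new binary column "<column>_<value>".
    feature_names = [f"{col}_{val}" for col in columns for val in categories[col]]
    # A cell is 1.0 when the row holds that value and 0.0 otherwise.
    encoded = [
        [1.0 if row[col] == val else 0.0 for col in columns for val in categories[col]]
        for row in rows
    ]
    return feature_names, encoded


names, encoded = one_hot_encode(records, ["Gender", "Season"])
print(names)
for row in encoded:
    print(row)
```

Each encoded row contains exactly one 1.0 per original categorical variable, which is why the number of columns grows with the number of distinct values per variable.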
Feature Scaling
Given that the features in the data are on different scales, I will perform feature normalization to
rescale the features' values onto the same range of 0 to 1.
In [ ]: #Create scaler object
scaler = MinMaxScaler()
#Perform feature normalization
df_encoded = scaler.fit_transform(df_encoded)
#Now we can look at the value distribution of data after rescaling
stats_table = pd.DataFrame(df_encoded, columns=scaler.get_feature_names_out()).describe(percentiles=[]).round(1).T
stats_table
Out[ ]:
                                        count  mean  std  min  50%  max
Age                                    3900.0   0.5  0.3  0.0  0.5  1.0
Purchase Amount (USD)                  3900.0   0.5  0.3  0.0  0.5  1.0
Review Rating                          3900.0   0.5  0.3  0.0  0.5  1.0
Previous Purchases                     3900.0   0.5  0.3  0.0  0.5  1.0
Age Group_18-25                        3900.0   0.1  0.4  0.0  0.0  1.0
...                                       ...   ...  ...  ...  ...  ...
Frequency of Purchases_Every 3 Months  3900.0   0.1  0.4  0.0  0.0  1.0
Frequency of Purchases_Fortnightly     3900.0   0.1  0.3  0.0  0.0  1.0
Frequency of Purchases_Monthly         3900.0   0.1  0.3  0.0  0.0  1.0
Frequency of Purchases_Quarterly       3900.0   0.1  0.4  0.0  0.0  1.0
Frequency of Purchases_Weekly          3900.0   0.1  0.3  0.0  0.0  1.0
157 rows × 6 columns
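The rescaling applied above follows the usual min-max formula, x' = (x - min) / (max - min), computed per column. The sketch below spells it out in plain Python on hypothetical values (the helper min_max_scale and the sample columns are assumptions for illustration, not the notebook's code).

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical columns on very different scales (illustrative values only).
ages = [18, 44, 70]
ratings = [2.5, 3.75, 5.0]
one_hot = [0.0, 1.0, 0.0]

scaled_ages = min_max_scale(ages)
scaled_ratings = min_max_scale(ratings)
scaled_one_hot = min_max_scale(one_hot)

print(scaled_ages)
print(scaled_ratings)
print(scaled_one_hot)
```

Note that a one-hot column that already contains both 0 and 1 is left unchanged by this transformation, so after scaling the numeric and encoded columns all share the same 0 to 1 range.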
Part Five: Model Development and Evaluation (Cluster Analysis)
In this section, I will develop different clustering algorithms, tune each, and compare them to obtain the model with the
best clustering performance. I will test and compare four clustering models: K-Means, Hierarchical Agglomerative Clustering
(HAC), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Mean Shift (MS). For each of these four
models, I will pick the most relevant parameter for tuning to obtain the best clustering performance out of each. For K-Means,
since it's the only one of the four that requires specifying the number of clusters in advance, I will test different numbers of
clusters K (1 to 10); for the HAC model, I will tune the distance threshold parameter, the linkage distance at or above which
clusters will not be merged, which determines the final number of clusters; for the DBSCAN model, I will tune the epsilon
distance parameter (eps), the maximum distance at which two points count as neighbors, which determines which points
become core points and how the remaining points are grouped around them; and finally, for the Mean Shift model, I will tune
the kernel bandwidth, which determines the radius of the kernel used to estimate data point densities, by estimating the
bandwidth with sklearn's bandwidth estimator and setting a range of values around it. To assess and compare the different
algorithms, I will use two metrics: the silhouette score and the Davies-Bouldin index (DBI) score. Both metrics provide useful
insights into the quality of clustering and can be used to compare clustering algorithms or to choose the number of clusters.
The first metric, the silhouette score, measures how similar each data point is to its own cluster compared with the nearest
other cluster; it ranges from -1 to 1, with higher values indicating better-defined clusters and values near 0 indicating
overlapping clusters. The second metric, the DBI score, evaluates the quality of separation by averaging the similarity of
each cluster with its most similar cluster; its minimum is 0, and lower values indicate better separation. The best performing
clustering algorithm will be chosen for clustering and analyzing the data.
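As a rough illustration of how the two metrics behave, the sketch below computes both from their standard definitions on four hypothetical 2-D points, once with a sensible labelling and once with a poor one. The functions silhouette and davies_bouldin are simplified stand-ins written for this example (Euclidean distance, singleton clusters scored as 0); the notebook itself relies on scikit-learn's silhouette_score and davies_bouldin_score.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            d = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_dist[c] = mean(d) if d else None
        a = mean_dist[labels[i]]          # cohesion: own cluster
        if a is None:                     # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        b = min(v for c, v in mean_dist.items() if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return mean(scores)


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = sorted(set(labels))
    members = {c: [p for p, l in zip(points, labels) if l == c] for c in clusters}
    centroids = {c: tuple(mean(axis) for axis in zip(*pts)) for c, pts in members.items()}
    scatter = {c: mean(dist(p, centroids[c]) for p in members[c]) for c in clusters}
    worst = [
        max((scatter[c] + scatter[o]) / dist(centroids[c], centroids[o])
            for o in clusters if o != c)
        for c in clusters
    ]
    return mean(worst)


# Hypothetical points forming two tight, well-separated pairs.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]   # each pair is its own cluster
bad_labels = [0, 1, 0, 1]    # each cluster mixes points from both pairs

good_sil, bad_sil = silhouette(points, good_labels), silhouette(points, bad_labels)
good_dbi, bad_dbi = davies_bouldin(points, good_labels), davies_bouldin(points, bad_labels)
print("silhouette:", round(good_sil, 3), "vs", round(bad_sil, 3))
print("DBI:", round(good_dbi, 3), "vs", round(bad_dbi, 3))
```

The sensible labelling gets the higher silhouette score and the lower DBI score, which is the direction to look for when comparing the models below.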
Model development, tuning, and evaluation
In [ ]: #Define clustering algorithms to test
estimators_lst = [('K-Means', KMeans(init='k-means++', n_init=10, random_state=42)),
                  ('HAC', AgglomerativeClustering(n_clusters=None, metric='euclidean', linkage='ward', compute_full_tree=True)),
                  ('DBSCAN', DBSCAN(min_samples=50, n_jobs=-1)),
                  ('MS', MeanShift(cluster_all=False, n_jobs=-1))]
#Define parameters to tune for each separate algorithm
bandw = estimate_bandwidth(df_encoded, quantile=.3)
params_lst = [('n_clusters', np.arange(1,11)),
              ('distance_threshold', np.linspace(20,100,9)),
              ('eps', np.arange(0.1, 1.1, 0.1)),
              ('bandwidth', np.linspace(bandw-1, bandw+1, 10))]
#Create empty list to store tuning results per model
models_results = []
#Loop over each model, optimize and evaluate it and store results
for estimator, params in zip(estimators_lst,params_lst):
    for param in params[1]:
        #Set current parameter value and fit the model
        estimator[1].set_params(**{params[0]: param})
        estimator[1].fit(df_encoded)
        #get model clusters
        clusters = estimator[1].labels_
        #compute Silhouette and DBI scores
        try:
            silhouette = round(silhouette_score(df_encoded, clusters),3)
            davies_bouldin = round(davies_bouldin_score(df_encoded, clusters),3)
        except ValueError:
            #both scores are undefined when fewer than 2 distinct labels are found
            silhouette, davies_bouldin = np.nan, np.nan
        #store model results (n_clusters counts every distinct label, including the noise label -1)
        models_results.append({'Model': estimator[0], 'n_clusters': len(np.unique(clusters)),
                               params[0]: round(param,2), 'silhouette score': silhouette, 'DBI score': davies_bouldin })
#Convert results to dataframe
results_df = pd.DataFrame(models_results).sort_values(['Model','n_clusters']).set_index(keys=['Model', 'n_clusters'])
#report evaluation results by silhouette and DBI score
display(results_df[['silhouette score','DBI score']].drop_duplicates(keep='last'))
                    silhouette score  DBI score
Model   n_clusters
HAC     2                      0.112      2.732
        3                      0.058      5.042
        9                      0.018      5.372
K-Means 2                      0.113      2.718
        3                      0.068      3.851
        4                      0.049      4.344
        5                      0.041      4.434
        6                      0.043      4.604
        7                      0.037      4.639
        8                      0.040      4.513
        9                      0.038      4.520
        10                     0.035      4.465
MS      1                        NaN        NaN
        2                      0.070      3.258
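The ranking implied by this table can be checked mechanically. The sketch below copies the reported scores into a dictionary and looks up the best configuration under each metric, taking a higher silhouette score and a lower DBI score as better; the variable names are mine, and rows with NaN scores are left out.

```python
# Scores copied from the results table above: (model, n_clusters) -> (silhouette, DBI).
reported = {
    ("HAC", 2): (0.112, 2.732),
    ("HAC", 3): (0.058, 5.042),
    ("HAC", 9): (0.018, 5.372),
    ("K-Means", 2): (0.113, 2.718),
    ("K-Means", 3): (0.068, 3.851),
    ("K-Means", 4): (0.049, 4.344),
    ("K-Means", 5): (0.041, 4.434),
    ("K-Means", 6): (0.043, 4.604),
    ("K-Means", 7): (0.037, 4.639),
    ("K-Means", 8): (0.040, 4.513),
    ("K-Means", 9): (0.038, 4.520),
    ("K-Means", 10): (0.035, 4.465),
    ("MS", 2): (0.070, 3.258),
}

# Best configuration overall: highest silhouette, lowest DBI.
best_by_silhouette = max(reported, key=lambda k: reported[k][0])
best_by_dbi = min(reported, key=lambda k: reported[k][1])

# Best configuration among solutions with more than 2 clusters.
beyond_two = {k: v for k, v in reported.items() if k[1] > 2}
best_beyond_two_sil = max(beyond_two, key=lambda k: beyond_two[k][0])
best_beyond_two_dbi = min(beyond_two, key=lambda k: beyond_two[k][1])

print("Best overall:", best_by_silhouette, best_by_dbi)
print("Best with more than 2 clusters:", best_beyond_two_sil, best_beyond_two_dbi)
```

On these numbers, both metrics favor K-Means, with 2 clusters scoring best overall and 3 clusters scoring best among the solutions with more than 2 clusters.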
Model Comparison
In [ ]: #Report and plot evaluation results for each model
fig, axes = plt.subplots(1,2, figsize=(12,6), facecolor='ghostwhite')
model_labels = ['K-Means', 'HAC', 'DBSCAN', 'MS']
for model,params in zip(model_labels,params_lst):
    cols = [params[0], 'silhouette score', 'DBI score']
    cols = [col for col in cols if col != 'n_clusters']
    model_res_df = results_df.iloc[results_df.index.get_level_values(0) == model][cols]
    #Report results table per model
    print(f'\nParameter evaluation results for {model} model:')
    display(model_res_df, Markdown(''))
    #plot Silhouette scores per model
    ax1=sns.lineplot(data=model_res_df, x=model_res_df.index.get_level_values(1), y='silhouette score', ax=axes[0], label=model)
    sns.scatterplot(data=model_res_df, x=model_res_df.index.get_level_values(1), y='silhouette score', marker='s', ax=ax1)
    ax1.set_title('Number of clusters and Silhouette score',fontsize=12)
    ax1.set(xlabel='Number of clusters', ylabel='Silhouette Score')
    ax1.set_ylim(0, results_df['silhouette score'].max()+.02)
    ax1.legend(loc='upper right', title='Models')
    #plot DBI scores per model
    ax2=sns.lineplot(data=model_res_df, x=model_res_df.index.get_level_values(1), y='DBI score', ax=axes[1], label=model)
    sns.scatterplot(data=model_res_df, x=model_res_df.index.get_level_values(1), y='DBI score', marker='s', ax=ax2)
    ax2.set_title('Number of clusters and Davies-Bouldin score', fontsize=12)
    ax2.set(xlabel='Number of clusters', ylabel='Davies-Bouldin Score')
    ax2.set_ylim(0, results_df['DBI score'].max()+2)
    ax2.legend(loc='upper right', title='Models')
Parameter evaluation results for K-Means model:
                    silhouette score  DBI score
Model   n_clusters
K-Means 1                        NaN        NaN
        2                      0.113      2.718
        3                      0.068      3.851
        4                      0.049      4.344
        5                      0.041      4.434
        6                      0.043      4.604
        7                      0.037      4.639
        8                      0.040      4.513
        9                      0.038      4.520
        10                     0.035      4.465
Parameter evaluation results for HAC model:
                  distance_threshold  silhouette score  DBI score
Model n_clusters
HAC   2                         40.0             0.112      2.732
      2                         50.0             0.112      2.732
      2                         60.0             0.112      2.732
      2                         70.0             0.112      2.732
      2                         80.0             0.112      2.732
      2                         90.0             0.112      2.732
      2                        100.0             0.112      2.732
      3                         30.0             0.058      5.042
      9                         20.0             0.018      5.372
Parameter evaluation results for DBSCAN model:
                   eps  silhouette score  DBI score
Model  n_clusters
DBSCAN 1           0.1               NaN        NaN
       1           0.2               NaN        NaN
       1           0.3               NaN        NaN
       1           0.4               NaN        NaN
       1           0.5               NaN        NaN
       1           0.6               NaN        NaN
       1           0.7               NaN        NaN
       1           0.8               NaN        NaN
       1           0.9               NaN        NaN
       1           1.0               NaN        NaN
Parameter evaluation results for MS model:
                  bandwidth  silhouette score  DBI score
Model n_clusters
MS    1                3.76               NaN        NaN
      1                3.98               NaN        NaN
      1                4.20               NaN        NaN
      1                4.43               NaN        NaN
      1                4.65               NaN        NaN
      1                4.87               NaN        NaN
      1                5.09               NaN        NaN
      1                5.31               NaN        NaN
      1                5.54               NaN        NaN
      2                3.54              0.07      3.258
As the result tables and graphs above show, except for the DBSCAN model, which failed to segment customers or obtain any
clusters, the three other models found at least 2 clusters in the data. The Mean Shift model found only 2 clusters, whereas the
K-Means and HAC models obtained two or more customer clusters. Looking at the silhouette scores, which as mentioned compare
each data point's similarity to its own cluster with its similarity to the nearest other cluster, the scores for all models are low
(much closer to 0 than to 1), around 0.1 and lower, which indicates that the clusters overlap considerably and are only weakly
separated. At 2 clusters, K-Means and HAC score about the same (0.113 and 0.112), and the Mean Shift model scores lowest
(silhouette score = 0.07); on both metrics the 2-cluster solutions score best. However, if we are to go further than 2 clusters, we
see a sharp decrease in the silhouette scores from 2 clusters to 3 clusters for the two models that obtained more than 2,
K-Means and HAC. Both start at around 0.11 at 2 clusters and drop to around 0.06 at 3 clusters, with K-Means retaining the slightly
higher score (0.068 versus 0.058). No sharper decrease is observed with greater numbers of clusters. Looking at the Davies-Bouldin
scores at 3 clusters, where lower is better, we again find K-Means to be the better performing model, achieving a DBI score of 3.85
relative to 5.04 for the HAC model. Thus, combining the two metrics, the best clustering algorithm for the data is K-Means, and
among the solutions with more than 2 clusters, K-Means with 3 clusters scores best on both metrics. As such, I will now train a
K-Means model with 3 clusters to be the final model for the data before proceeding with the analysis in relation to the three
obtained clusters.
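One plausible reading of the DBSCAN result (an interpretation on my part, not something the notebook verifies) is that the tested eps range is too small for one-hot encoded data. Two customers who differ in a single categorical variable disagree in two binary columns, so their Euclidean distance is at least sqrt(2), about 1.41, which is already larger than every eps value tried (0.1 to 1.0). The sketch below shows this on hypothetical encoded rows.

```python
from math import dist, sqrt

# Hypothetical one-hot rows; columns are
# [Gender_Female, Gender_Male, Season_Fall, Season_Winter].
male_winter = [0.0, 1.0, 0.0, 1.0]
male_fall = [0.0, 1.0, 1.0, 0.0]
female_fall = [1.0, 0.0, 1.0, 0.0]

largest_eps_tested = 1.0  # upper end of the eps grid used above

# One differing categorical variable flips two binary columns.
d_one = dist(male_winter, male_fall)
# Two differing categorical variables flip four binary columns.
d_two = dist(male_winter, female_fall)

print("one variable differs:", round(d_one, 3))
print("two variables differ:", round(d_two, 3))
print("within largest eps?", d_one <= largest_eps_tested)
```

Under this reading, a point could only become a core point if at least min_samples=50 customers matched it on every categorical variable, which is unlikely with this many encoded variables. Every point would then be labeled as noise, which would show up as a single label and undefined scores, as in the DBSCAN table. Widening the eps grid would be the natural thing to try.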
Final Model Selection: K-Means Clustering (n_clusters = 3)
In [ ]: #Create k-means object with 3 clusters
Kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
#Fit the K-means model
Kmeans_model = Kmeans.fit(df_encoded)
#Obtain cluster labels and add them to dataframe
df['KM Clusters'] = Kmeans_model.labels_
df['KM Clusters'] = pd.Categorical(df['KM Clusters'].map({0: 'Cluster 1', 1: 'Cluster 2', 2: 'Cluster 3'}))
#Preview sample of the dataframe
df.sample(5)
Out[ ]:
      Age  Age Group  Gender  Item Purchased     Category  Purchase Amount (USD)       Location       Region  Size     Color  Season  Review Rating  Subscription Status
2090   51      46-55    Male           Socks     Clothing                     71  Massachusetts  New England     S      Teal  Winter            3.4                   No
1507   66        65+    Male           Dress     Clothing                     86   Pennsylvania      Mideast     S    Maroon  Winter            3.7                   No
810    52      46-55    Male           Scarf  Accessories                     83   South Dakota       Plains     S  Lavender    Fall            4.9                  Yes
3751   36      36-45  Female         Jewelry  Accessories                     97        Alabama    Southeast     S    Silver    Fall            4.2                   No
991    20      18-25    Male          Shorts     Clothing                     97      Wisconsin  Great Lakes    XL     White  Spring            2.7                  Yes
Part Six: Model Interpretation
In this section, I will concentrate on analyzing the obtained clusters in relation to the rest of the dataset. First, I will perform
bivariate analysis to break down the clusters by each of the relevant features in the dataset and thereby understand their
characteristics and implications in depth. Then, to add more nuance to our understanding of the clusters, I will compare pairs
of variables together and in relation to the customer clusters (for instance, examining the spending of each age group and how
these interactions map onto the 3 customer clusters). This effectively amounts to multivariate analysis: examining pairs of
variables in relation to a third variable, customer clusters, for many pairs of variables. Based on the results, I will write up a
report describing the 3 clusters in detail, identifying their main characteristics or distinguishing features and delineating their
main differences across the dataset.
Cluster Analysis: Bivariate Analysis
In order to understand the 3 customer clusters obtained in detail, I will select and focus on the features in the data that seem most
relevant or most informative. As such, I will examine the differences between the different customer groups across each of the
following variables: gender, age, region, purchasing capacity, shopping frequency, previous purchases, season of the year, item
category, subscription status, discount applied and promo code used.
In [ ]: #Get colors for each cluster
cluster_colors = plt.get_cmap('Set1_r').colors[-3:]
#green, blue, red
#Define relevant variables
cols = ['Gender', 'Age', 'Age Group', 'Region', 'Purchase Amount (USD)', 'Frequency of Purchases',
        'Previous Purchases', 'Season', 'Category', 'Subscription Status', 'Discount Applied', 'Promo Code Used']
#Plot customer segmentation by variable
fig, axes = plt.subplots(nrows=2, ncols=6, figsize=(60, 20), facecolor='ghostwhite')
plt.suptitle('Customer Segmentation per Variable', fontsize=25)
plt.subplots_adjust(wspace=.2, hspace=.3, top=.91)
for i,col in enumerate(cols):
    ax = axes[i // 6, i % 6]
    if df[col].dtype != 'object':
        sns.boxplot(data=df, x='KM Clusters', y=col, palette=cluster_colors, ax=ax)
        ax.set_title(f'Relationship Between Customer Clusters and {col}', fontsize=14, pad=12)
        ax.set_xlabel('Customer Clusters', fontsize=12, labelpad=5)
        ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
        ax.set_ylabel(col, fontsize=12)
    else:
        sns.countplot(data=df, x=col, hue='KM Clusters', order=order_dict.get(col, None), palette=cluster_colors, gap=.25,
        ax.set_title(f'Relationship Between Customer Clusters and {col}', fontsize=14, pad=12)
        ax.set_xlabel(col, fontsize=12, labelpad=5)
        ax.set_xlim(-1,len(ax.get_xticklabels())+0.2)
        ax.set_xticks(ticks=ax.get_xticks(), labels=ax.get_xticklabels(), rotation=(0 if (len(ax.get_xticklabels()) <= 4 o
        ax.set_ylabel('Total Count', fontsize=12)
plt.show()
Results:
Based on the cluster analysis above, we can see 3 distinct customer groups emerging from the data:
Group 1 (Green):
This group of customers tends to be male only, with ages distributed across the board but peaking slightly in adulthood (26-35
years) and remaining consistent until seniority (56-65 years).
They tend to be the least frequent buyers, mostly shopping quarterly or annually.
They are not subscribers of the service and seldom enjoy any benefits such as discounts or promo codes.
Group 2 (Blue):
This second customer group also tends to be male-dominated and made up mostly of adults, middle-aged adults and seniors,
while varying in age as much as the former group.
They tend to be the most frequent buyers. Their shopping is consistently high across the year compared to the other customer
groups, and most customers in this group shop every 3 months, followed by every two weeks.
Consistent with this picture, they are the only customer group that subscribes to the shop and the only group that enjoys
benefits such as discounts and promo codes, presumably as a result of their subscription status. In fact, they nearly always
shop only if a discount or promo code is available. So it is likely that the benefits they reap from subscribing to the service
are what keep them avid customers across the year.
Group 3 (Red):
The last customer group is made up of females only, again with ages distributed across the board but slightly concentrated
around middle adulthood.
They are the second most frequent buyers. Their shopping is consistent across the year; they mostly shop twice a week,
monthly, every three months, or annually. Within-group analysis shows that a larger share of this group shops at the more
frequent intervals than in the two other groups, whose shopping is relatively more sparse, especially the first male group of
casual shoppers.
Like group 1, this group is almost never subscribed to the service and does not enjoy any benefits like discounts or promo
codes.
Common to all three groups:
The three groups do not differ in age or spending capacity. The finding that the 3 clusters are nearly identical in age
distribution, peaking slightly around middle adulthood, might be due to older adults having higher earnings and thus higher
spending capacity than younger adults. Arguably they also tend to be more outgoing than the elderly, which would explain
the other end of the age range.
Analyzing purchasing behavior by season, all three groups' shopping activity tends to be generally consistent across the year,
with very little variation between groups beyond the differences in frequency of purchase already described. Perhaps the only
notable difference is that males in the first customer group and females in the third customer group tend to shop slightly more
in the fall season, whereas the second group of frequent male customers shops more across the three other seasons. This could
arguably be because the loyal male customers group is the only group that is subscribed to the brand services and presumably
enjoys discounts and offers across the year as a result, whereas the other two groups don't, in which case they'd be shopping
relatively more during the fall season, which typically involves many discounts and shopping opportunities, especially in
November with annual retail events like Black Friday.
There are also no drastic differences between the 3 clusters in terms of the category of the items being purchased, with most
customers across the 3 clusters shopping most often for clothing followed by accessories, and least often for outerwear and
footwear. However, this is likely because the number of items making up these two latter categories is much lower in
comparison.
Finally, most customers across all 3 clusters tend to be from the Southeast followed by the Plains. Conversely, the fewest
customers, again across all 3 clusters, tend to be in the Southwest, Rocky Mountains, and the Mideast.
Summary:
Overall, we seem to have 3 distinct groups of customers. The first to note (Group 2 above) is predominantly middle-aged males
who are the most loyal customers to the brand, buying and engaging with its services most often. We can call this group the loyal
male customers group. They are also the only one of the three groups who subscribe to the shop or brand services and enjoy all
the benefits in return, including discounts and promo codes. In fact, they seem to shop exclusively through offers, discounts and
promo codes. The second (Group 1 above) is also predominantly middle-aged males; however, they are less loyal customers who
shop less frequently, are mostly seasonal buyers, have no subscriptions and enjoy no benefits from the brand. We can call them
the casual male customers group. Lastly, the third group (Group 3 above) is predominantly middle-aged females who shop more
frequently but who also are not subscribers and do not enjoy any benefits from the brand. We can call this group the female
customers group for simplicity.
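Since the count plots above show raw counts, which also depend on how large each cluster is, statements such as "the only group that subscribes" are easier to check as within-cluster shares. Below is a minimal sketch of that calculation on hypothetical (cluster, subscription status) pairs; the data and the helper shares_within are made up for illustration and do not reproduce the notebook's results.

```python
from collections import Counter

# Hypothetical (cluster, subscription status) pairs, for illustration only.
rows = [
    ("Cluster 1", "No"), ("Cluster 1", "No"), ("Cluster 1", "No"),
    ("Cluster 2", "Yes"), ("Cluster 2", "Yes"), ("Cluster 2", "Yes"), ("Cluster 2", "No"),
    ("Cluster 3", "No"), ("Cluster 3", "No"),
]


def shares_within(pairs):
    """Share of each value inside its cluster: count(cluster, value) / count(cluster)."""
    totals = Counter(cluster for cluster, _ in pairs)
    counts = Counter(pairs)
    return {(cluster, value): n / totals[cluster] for (cluster, value), n in counts.items()}


shares = shares_within(rows)
for (cluster, value), share in sorted(shares.items()):
    print(f"{cluster}: {value} = {share:.0%}")
```

With the actual dataframe, the same table could be obtained by cross-tabulating the cluster column against each categorical variable and normalizing each cluster's row to sum to 1.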
Next, I will analyze the data further by picking pairs of variables together and examining both in relation to our 3 customer groups.
Cluster Analysis: Multivariate Analysis
As mentioned, here I will pick pairs of variables and examine them together and in relation to the obtained customer clusters. This
time, I will focus on each of the key variables, Gender, Age, Purchase Amount, Frequency of Purchases, Previous Purchases, and
Subscription Status, comparing each of them to each of the other relevant variables in the data, segmented by our customer
groups. This effectively results in multivariate analysis with 3 variables at play: variable 1, variable 2, and the customer
clusters variable.
In [ ]: #Get independent variables, x_vars
x_vars = df.columns.drop(labels=['Location', 'Color', 'KM Clusters'])
#Get dependent variables, y_vars
y_vars = ['Gender', 'Age', 'Purchase Amount (USD)', 'Frequency of Purchases', 'Previous Purchases', 'Subscription Status']
#Create order dictionary for better presentation of the variables
order_dict = {'Age Group': age_labels,
'Size': ['S', 'M', 'L', 'XL'],
'Season': ['Winter', 'Spring', 'Summer', 'Fall'],
'Frequency of Purchases': ['Bi-Weekly', 'Weekly', 'Fortnightly', 'Monthly', 'Every 3 Months', 'Quarterly', 'A
#Perform multivariate analysis and report results
Get_Plots(x_vars=x_vars, y_vars=y_vars, clusters_var='KM Clusters', colors=cluster_colors, order=order_dict)
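As a rough sketch of the kind of three-variable comparison these plots make, the example below groups a numeric variable by two keys at once (cluster and season) and takes the median per group. The records and the helper median_by_group are hypothetical and use only the standard library; they do not reproduce Get_Plots or the notebook's figures.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (cluster, season, purchase amount in USD) records, for illustration only.
purchases = [
    ("Cluster 2", "Summer", 40), ("Cluster 2", "Summer", 50), ("Cluster 2", "Summer", 45),
    ("Cluster 2", "Winter", 70), ("Cluster 2", "Winter", 80),
    ("Cluster 1", "Summer", 65),
]


def median_by_group(records):
    """Median amount for every (cluster, season) combination present in the records."""
    groups = defaultdict(list)
    for cluster, season, amount in records:
        groups[(cluster, season)].append(amount)
    return {key: median(values) for key, values in sorted(groups.items())}


medians = median_by_group(purchases)
for (cluster, season), value in medians.items():
    print(cluster, season, value)
```

Comparing such per-group medians across clusters is one way to quantify observations like a cluster spending less in a given season than the other clusters do.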
Results:
Multivariate analysis in relation to the customer clusters consolidates and adds more to nuance to the picture. I have
done my best to render the data visualizable. Here are some of the better notable results and main takeaways:
There are no notable differences between the three customer groups across the different variables when analyzing in relation
to gender and age, except the Southwestern customers in the loyal male group tend to be younger in age relative to the casual
male customers and female customers group.
Consistent with the picture drawn so far, analyzing customer clusters by purchase amounts across the different variables
reveals a relationship between customer group, purchase amount, and season. Particularly, it appears that the loyal male
group tend to spend least during the summer season, which paradoxically is the season in which they engage in shopping
activity most! This again gives credence to the observation that the loyal male customers group tend to shop for and buy items
most when they are provided discounts and offers, which explains how the season they shop during the most is also the
season their overall purchasing amounts are least relative to the other customer groups.
Second, there's a relationship between customer group and purchase amount in relation to local region: the loyal male
customers spend more than the other two groups particularly in the Mideast and spend slightly less in the Plains region; the casual
male group spend more than the other two groups in the Plains and spend less in the Great Lakes region; and finally, the
female customers group spend slightly more than the two other groups in the Southeast in particular. Thus, the loyal male
customers drive more sales in the Mideast, less in the Plains; the casual male customers drive more sales in the Plains, less in
the Great Lakes; and the female customers drive more sales in the Southeast while remaining more or less consistent with the
other two groups across the other regions. Now, it's worth stopping and noting here that since the Mideast is one of the lowest
regions in sales overall, special attention needs to be paid to the casual male customers and female customers
in this region in particular.
Analyzing customer clusters by age and frequency of purchase, we find that for frequent shopping, particularly bi-weekly and
weekly shopping, the casual customers who engage in this frequent shopping tend to be slightly younger than the customers
in the two other groups. On the other hand, for the least frequent shopping, i.e. annually, we find that the female customers
who shop annually tend to be slightly older than customers in the other groups.
Homing in on age groups rather than absolute age values, we additionally find that young adults and middle-aged
customers in the loyal male customers group tend to engage in less bi-weekly, weekly and twice-a-week shopping
compared to loyal male customers in the other age groups (except the elderly). Further, adult females (aged 26 to 35) tend
to engage in the most frequent shopping, twice a week, compared to the other female age groups. This is also the case for
adult males within the same age group in addition to middle-aged adult males (aged 46 to 55) in the loyal male customers
group. Those who shop the least frequently (annually) across the 3 customer groups tend to be seniors
aged 56 to 65 years (in addition to middle-aged adult males between 36 and 45 years in the loyal male customer group).
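Moving from absolute ages to age groups amounts to binning the age column and then tabulating shopping frequency within each cluster and age group. The sketch below shows one way this could be done with pandas. The bin edges and labels, the 'Age' and 'Frequency of Purchases' column names, and the records themselves are all assumptions made for illustration; the project's own cut points may differ.

```python
import pandas as pd

# Toy records invented for illustration only; NOT the project's data.
# 'Age' and 'Frequency of Purchases' are assumed column names.
df = pd.DataFrame({
    "KM Clusters": ["Female", "Female", "Loyal Male", "Loyal Male", "Casual Male"],
    "Age": [22, 30, 50, 60, 41],
    "Frequency of Purchases": ["Weekly", "Bi-Weekly", "Bi-Weekly", "Annually", "Quarterly"],
})

# Assumed bin edges and labels (right-inclusive intervals, e.g. 26-35 covers ages 26..35).
bins = [17, 25, 35, 45, 55, 65, 120]
labels = ["18-25", "26-35", "36-45", "46-55", "56-65", "66+"]

# Cast to plain strings so the crosstab only lists age groups that actually occur.
df["Age Group"] = pd.cut(df["Age"], bins=bins, labels=labels).astype(str)

# Share of each shopping frequency within every (cluster, age group) pair; rows sum to 1.
freq_share = pd.crosstab(
    index=[df["KM Clusters"], df["Age Group"]],
    columns=df["Frequency of Purchases"],
    normalize="index",
)
print(freq_share)
```

Normalizing by row makes age groups of different sizes comparable, since each row reports proportions rather than raw counts.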
Also, consistent with the picture portrayed so far, shopping increases for seasonal shoppers in the casual male group
particularly in the fall season. This again is likely because they are offered discounts or offers that are otherwise unavailable for
them during the rest of the year, in contrast to the loyal male group.
Interestingly, analyzing customer clusters by frequency of purchase and subscription status, we find that subscribed males in the
loyal customers group tend to shop at starkly higher rates across the different shopping frequency groups compared to their
unsubscribed counterparts within the same customer group. That is, these loyal subscribed male customers tend to shop weekly
and every two weeks more often than loyal male customers who are unsubscribed. Conversely, unsubscribed customers in
the loyal male group tend to shop every 3 months and quarterly the most often, and shop weekly the least often, which is
in fact consistent with the behavior of the customers in the two other customer groups, both of which are unsubscribed as well.
The only notable exception is that female customers shop at the highest frequency (i.e., twice a week) more often
than the other groups, although, as revealed by the customer group x shopping frequency x amounts paid analysis, those
most frequent female shoppers tend to pay less compared to females who shop less frequently, presumably because frequent female
customers shop for less expensive items when shopping that frequently whereas the more seasonal female customers shop for
more expensive ones. Consistent with their overall behavior, unsubscribed male customers in the casual male group tend to
shop quarterly the most often.
Analyzing customer clusters in relation to previous purchases and items purchased, we find that loyal male customers who had
a lot of previous experience with the shop (as indicated by the number of previous purchases) were more likely to opt for
t-shirts and sneakers than the other customer groups; casual male customers with lots of previous experience were more likely
to opt for jewelry, hoodies, pants, jeans and handbags than the other customer groups (a lot of these purchases in this group
are also influenced by age, so we could expect an interaction between previous experience and age in the casual male group
when it comes to their current choices of items to buy); and finally females in the female customers group with lots of previous
experience were more likely to opt for shoes, boots, or shirts (with the choices of the first two items interacting with age as
well), and were least likely to opt for outerwear (coats or jackets) compared to male customers in either male group.
Additionally, we find a relationship between customer group, previous purchases and local region. Particularly, we find that
female customers in New England and the Plains had the least purchasing history compared to the male customers in the
other two groups. Thus, we can infer that female customers in these two regions have less experience with the shop's
services, and so more attention or marketing effort should be directed at female customers in New England and the Plains in
particular.
Analyzing the customer groups by subscription status and age group, we find that those subscribed in the loyal male group
tend to be mostly middle-aged adults in the 46-55 age group. This is particularly interesting because this age group drives the
most sales generally and the loyal male customer group accounts for the highest proportion of sales more particularly. This
again gives credence to the observation that male customers in the loyal male group tend to be most motivated when given
offers and discounts, which presumably come with being subscribed, so the most sales are being driven by middle-aged adult
males who are subscribed to the service. This becomes clear when looking at variables like discount applied and promo code
used: subscribed males tend to have almost twice as many discounts and promo codes as unsubscribed males within that
same customer group. The only exception here is the elderly group, whose behavior doesn't seem to differ much with their
subscription status, presumably because they shop very infrequently. Nonetheless, we can generally assert that when
subscription to the service confers certain benefits like discounts and promo codes, customer engagement greatly increases.
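The comparison of discounts between subscribed and unsubscribed customers can be expressed as the share of purchases with a discount applied, computed separately for each subscription status within a cluster. The following is a minimal sketch of that calculation, assuming 'Subscription Status' and 'Discount Applied' are columns holding "Yes"/"No" values; the records are invented and are not meant to reproduce the figures reported above.

```python
import pandas as pd

# Toy records invented for illustration only; NOT the project's data.
# 'Subscription Status' and 'Discount Applied' are assumed "Yes"/"No" columns.
df = pd.DataFrame({
    "KM Clusters": ["Loyal Male"] * 8,
    "Subscription Status": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Discount Applied": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Share of purchases with a discount, per cluster and subscription status.
discount_rate = (
    df.assign(discounted=df["Discount Applied"].eq("Yes"))
      .groupby(["KM Clusters", "Subscription Status"])["discounted"]
      .mean()
      .unstack("Subscription Status")
)

# How many times more often subscribers receive a discount than non-subscribers.
ratio = discount_rate["Yes"] / discount_rate["No"]
print(discount_rate)
print(ratio)
```

The same pattern would apply to a promo-code column by swapping the column name.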
Finally, the last notable interaction effect to report is the interaction between customer group, subscription status and region:
based on the last graph here, subscription doesn't seem to affect purchasing behavior much in the Great Lakes region,
followed by the Southwest, the latter of which is, interestingly and as seen earlier, the region with the least customer engagement
and sales overall.
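One way to put a number on how much subscription matters in each region is to compare subscribers and non-subscribers region by region and rank regions by the size of the gap. The sketch below uses mean purchase amount as a stand-in for "purchasing behavior", which is an assumption on my part, as are the 'Region' column name and the invented records.

```python
import pandas as pd

# Toy records invented for illustration only; NOT the project's data.
# 'Region' is a hypothetical column name for the local region variable.
df = pd.DataFrame({
    "KM Clusters": ["Loyal Male"] * 8,
    "Region": ["Great Lakes"] * 4 + ["Southeast"] * 4,
    "Subscription Status": ["Yes", "Yes", "No", "No"] * 2,
    "Purchase Amount (USD)": [60, 62, 61, 59, 80, 90, 50, 60],
})

# Mean purchase amount for subscribers vs non-subscribers in each (cluster, region) pair.
by_status = (
    df.groupby(["KM Clusters", "Region", "Subscription Status"])["Purchase Amount (USD)"]
      .mean()
      .unstack("Subscription Status")
)

# Gap between subscribers and non-subscribers; a gap near zero suggests
# subscription makes little difference in that region.
by_status["gap"] = by_status["Yes"] - by_status["No"]
ranked = by_status.sort_values("gap", key=lambda s: s.abs())
print(ranked)
```

Regions at the top of the ranked table are those where subscription appears to change the least.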
Part Seven: Key Insights and Recommendations
As described at length above, clustering analysis yielded three separate clusters, two of which are male dominated and
the other female dominated. What seems to distinguish these three groups most, aside from gender, is loyalty to the
brand, purchasing quantities, shopping frequency and subscription to the brand's services and related benefits such as
discounts and promo codes. One of the male groups is comprised of frequent adult customers, with consistent shopping
behavior across the year, a lot of whom are subscribed to the brand and enjoy many benefits in return. The second group
of males is comprised of less frequent or seasonal shoppers, mostly shopping in the fall, arguably the shopping season
of the year, and who enjoy no subscriptions or related benefits. Lastly, the third customer group obtained consists of
mostly adult female customers whose shopping often tends to be either very frequent (twice a week) or very infrequent
(annually). This group also has no subscriptions to the brand and enjoys no benefits. Further analysis has been
conducted to examine the subtle interrelations between the important variables in the data and in relation to
customer groups. Considering all that has been discussed, here are some key insights and recommendations to increase
sales or curate better marketing campaigns based on the analysis results obtained:
Generally, the current brand would benefit most from advertising more to younger adults and to female customers, catering to
their particular needs and shopping habits, from promoting subscriptions to their services, and by increasing advertisement
efforts and/or opening more branches in the regions with least sales, most notably in the Southwest, Rocky Mountains, and the
Mideast.
First off, it's clear from the data that the three customer groups do not differ much in their spending capacity. Instead, as
illustrated time and again, customer engagement increases most when customers are subscribed to the brand and
are provided certain benefits like discounts and promo codes in return. It thus seems that the strongest driving force behind
sales here is subscription and its related benefits. In fact, the presence of discounts and promo codes seems to increase the
number of loyal customers and the sales coming from this group without really affecting overall spending. That is to say, such
benefits seem to lure customers independent of what they are actually worth in absolute monetary terms. Indeed, this is the
case across the three groups as well as within the loyal male customers group: the overall number of loyal customers without
subscriptions or benefits is generally lower than the number of subscribed ones. Discounts and other attractive offers seem to reel
people in independent of their spending capabilities, and perhaps also independent of their satisfaction as indicated by their
review rating scores (although unsatisfied customers generally do not seem to give any rating, so caution should be taken when
interpreting this piece of data). Accordingly, it seems particularly important to address young adults and female customers,
tailoring an advertisement campaign just for these populations and/or offering more discounts and benefits and facilitating the
acquisition of memberships or subscriptions to the service. Subscriptions, discounts and other benefits seem especially
important for sales here given that the most loyal group of customers, which brings in the highest amount of sales overall, seems to
shop exclusively through offers, discounts and promo codes, presumably as a result of their subscription status. Thus,
facilitating subscriptions to the service is highly predictive of increased sales, and whatever marketing campaign is launched
must promote subscriptions and their benefits to customers, especially for the female group and the less frequent male buyers
group.
Second, as customers in the Southwest, Rocky Mountains, and the Mideast tend to shop least compared to customers in other
regions, the current brand may want to open more branches or increase advertisement for these regions to ensure better sales.
At any rate, each of these regions has its own determinants influencing its sales. One curious aspect about the
Southwest in particular, which is associated with the least sales, is that subscription status doesn't seem to affect sales much in
this region. Now, as we know, subscription status seems to be highly predictive of purchasing behavior, presumably because of
the benefits subscription to the brand confers on customers, like discounts and promo codes. However, it seems that there's
little benefit to subscription in the Southwest, as there's little to no difference between the purchasing behaviors of subscribers
and non-subscribers in that region. Thus, efforts have to be spent to ensure that subscriptions actually confer benefits on the
customer.
Relatedly, as with the Southwest, analysis revealed that subscription status doesn't seem to influence sales much in the Great
Lakes, also one of the less well performing regions. Now, again, since subscription status and related benefits seem to be most
predictive of sales, the Great Lakes branches could improve their sales and benefit most from ensuring that subscribed
customers in this region are given reasonable subscription benefits in return to motivate them to engage more. This is
particularly pressing here since looking at the relationship between customer group and purchase amounts in the Great Lakes
in particular reveals that customers generally, and casual male customers especially, tend to spend lower amounts on
purchases overall in the Great Lakes branches. So, even when the number of customers is medium or acceptable, they're still
driving lower sales overall. As such, based on all these interacting findings, the Great Lakes branches could greatly improve
their sales by, firstly, attending to their male customers better, and secondly, by improving their subscription services.
Moving to the Mideast in particular, one of the three regions with the least sales, we find that overall sales in this region, as
indicated by total purchase amounts, are mostly driven by the loyal male customers group, and less so by either of the other two
customer groups. Now since the Mideast is one of the lowest regions in sales, it seems particularly imperative to address
casual male customers and female customers in this region to bring it on par with the better performing ones.
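The statement that one group drives most of a region's sales can be made concrete by computing each cluster's share of the total purchase amount within every region. The following is a minimal sketch under assumed column names ('Region', 'Purchase Amount (USD)') and with invented records, so the printed shares are illustrative only.

```python
import pandas as pd

# Toy records invented for illustration only; NOT the project's data.
# 'Region' and 'Purchase Amount (USD)' are assumed column names.
df = pd.DataFrame({
    "Region": ["Mideast"] * 4 + ["Plains"] * 3,
    "KM Clusters": ["Loyal Male", "Loyal Male", "Casual Male", "Female",
                    "Loyal Male", "Casual Male", "Female"],
    "Purchase Amount (USD)": [70, 50, 40, 40, 30, 90, 30],
})

# Total purchase amount per (region, cluster) pair.
sales = df.groupby(["Region", "KM Clusters"])["Purchase Amount (USD)"].sum()

# Each cluster's share of its region's total sales; shares sum to 1 per region.
share = sales / sales.groupby(level="Region").transform("sum")
print(share)
```

A cluster whose share falls well below the others in a low-selling region is a natural candidate for targeted marketing there.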
Further, shopping behavior in some of the mentioned regions is also predicted by age. Once again, in the Southwest for
instance, young adults in the casual male customers group and young adults in the female customers group are the least
engaged. Thus, more efforts seem to be required in addressing younger adults in these two groups, and, as explicated earlier,
in ensuring that subscribed customers actually enjoy good benefits in return, which would likely ramp up both subscription rates
and overall sales in this region.
Finally, footwear and outerwear generally sell less than other types of clothes. While this might simply
reflect there being fewer items for sale under these two categories, concentrating advertisement and marketing efforts more on
these two categories might increase their sales. This is especially true for female customers as they mostly opt for
boots and shoes from this brand.