Overview of this project
This project involves careful analysis of two datasets: customer data and transaction data.
In this project we will be performing data cleaning, wrangling, and data exploration, as well as statistical analysis among the variables.
import pandas as pd
import numpy as np
Importing the datasets
# for purchase behaviour
qp = pd.read_csv('C:/Users/User/Documents/Quatum Intenship/purchaseBehaviour.csv')
# for transaction data
qt = pd.read_excel('C:/Users/User/Documents/Quatum Intenship/datEQVI_transaction_data.xlsx')
qp.head()
   LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0            1000   YOUNG SINGLES/COUPLES          Premium
1            1002   YOUNG SINGLES/COUPLES       Mainstream
2            1003          YOUNG FAMILIES           Budget
3            1004   OLDER SINGLES/COUPLES       Mainstream
4            1005  MIDAGE SINGLES/COUPLES       Mainstream
qt.head()
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0    -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1    -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2    -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3    -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4    -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
Performing Exploratory Data Analysis
qt.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.describe(include='all')
        DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                             PROD_NAME  PROD_QTY  TOT_SALES
count      -          -               -       -         -                                     -         -          -
unique   364        NaN             NaN     NaN       NaN                                   114       NaN        NaN
top        -        NaN             NaN     NaN       NaN  Kettle Mozzarella Basil & Pesto 175g       NaN        NaN
freq     939        NaN             NaN     NaN       NaN                                  3304       NaN        NaN
first      -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
last       -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
mean     NaN          -               -       -         -                                   NaN         -          -
std      NaN          -               -       -         -                                   NaN         -          -
min      NaN    1.00000               -       -         -                                   NaN         -          -
25%      NaN          -               -       -         -                                   NaN         -          -
50%      NaN          -               -       -         -                                   NaN         -          -
75%      NaN          -               -       -         -                                   NaN         -          -
max      NaN          -               -       -         -                                   NaN         -          -
qp.describe(include='all')
        LYLTY_CARD_NBR  LIFESTAGE PREMIUM_CUSTOMER
count                -      72637            72637
unique             NaN          7                3
top                NaN   RETIREES       Mainstream
freq               NaN      14805            29245
mean                 -        NaN              NaN
std                  -        NaN              NaN
min                  -        NaN              NaN
25%                  -        NaN              NaN
50%                  -        NaN              NaN
75%                  -        NaN              NaN
max                  -        NaN              NaN
Checking for null values in the datasets
# For the transaction dataset
qt.isnull().sum()
DATE              0
STORE_NBR         0
LYLTY_CARD_NBR    0
TXN_ID            0
PROD_NBR          0
PROD_NAME         0
PROD_QTY          0
TOT_SALES         0
dtype: int64
# For the purchase behaviour dataset
qp.isnull().sum()
LYLTY_CARD_NBR      0
LIFESTAGE           0
PREMIUM_CUSTOMER    0
dtype: int64
Result of the null (NaN) check
We found that there are no null values in any column of either dataset.
Checking the datatypes and converting to the necessary format
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

264836 rows × 8 columns
For the qt dataset
qt.info()
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
Result
We found that there are 8 columns and 264,836 rows in this dataset.
We only need to make sure the DATE column is of datetime type.
# Converting the DATE column (left commented out because read_excel already
# parsed DATE as datetime64)
#qt['DATE'] = pd.to_datetime(qt['DATE'], format='%d%m%y')
#qt['DATE'] = pd.to_datetime(qt['DATE']).dt.date
# Let's confirm whether DATE has been converted
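# Sketch (assumption): if DATE had instead come through as an Excel integer
# date serial, as raw .xlsx extracts sometimes do, it could be converted using
# Excel's 1900 date system origin:
# qt['DATE'] = pd.to_datetime(qt['DATE'], unit='D', origin='1899-12-30')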
qt.info()
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
For the qp dataset
qp
       LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0                1000   YOUNG SINGLES/COUPLES          Premium
1                1002   YOUNG SINGLES/COUPLES       Mainstream
2                1003          YOUNG FAMILIES           Budget
3                1004   OLDER SINGLES/COUPLES       Mainstream
4                1005  MIDAGE SINGLES/COUPLES       Mainstream
...               ...                     ...              ...
72632               -  MIDAGE SINGLES/COUPLES       Mainstream
72633               -          YOUNG FAMILIES       Mainstream
72634               -          YOUNG FAMILIES          Premium
72635               -          OLDER FAMILIES           Budget
72636               -   YOUNG SINGLES/COUPLES       Mainstream

72637 rows × 3 columns
# checking the necessary information
qp.info()
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   LYLTY_CARD_NBR    72637 non-null  int64
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
The datatypes in this dataset are already appropriate; no conversions are needed.
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

264836 rows × 8 columns
Removing unwanted characters from the qt PROD_NAME column
# Removing non-word characters from PROD_NAME. Note: the pattern below should
# be r'\W' to match punctuation; as written, r'\'W' matches nothing, so the
# names in the preview below come back unchanged (and the result is not
# assigned back to qt).
qt['PROD_NAME'].str.replace(r'\'W','')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: The default value of regex will c
hange from True to False in a future version.
qt['PROD_NAME'].str.replace(r'\'W','')
0                  Natural Chip Compny SeaSalt175g
1                            CCs Nacho Cheese 175g
2            Smiths Crinkle Cut Chips Chicken 170g
3            Smiths Chip Thinly S/Cream&Onion 175g
4         Kettle Tortilla ChpsHny&Jlpno Chili 150g
                            ...
264831     Kettle Sweet Chilli And Sour Cream 175g
264832                Tostitos Splash Of Lime 175g
264833                       Doritos Mexicana 170g
264834     Doritos Corn Chip Mexican Jalapeno 150g
264835                Tostitos Splash Of Lime 175g
Name: PROD_NAME, Length: 264836, dtype: object
Let's use descriptive statistics to find the most common values in each column.
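Before the describe call below, a direct word-frequency sketch (hypothetical helper, not run in this notebook): split the product names into words, drop digits and punctuation, and count the terms.
words = (qt['PROD_NAME']
         .str.replace(r'[^A-Za-z ]', ' ', regex=True)  # keep letters only
         .str.split()
         .explode()
         .value_counts())
words.head(10)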
qt.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.describe(include='all')
        DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                             PROD_NAME  PROD_QTY  TOT_SALES
count      -          -               -       -         -                                     -         -          -
unique   364        NaN             NaN     NaN       NaN                                   114       NaN        NaN
top        -        NaN             NaN     NaN       NaN  Kettle Mozzarella Basil & Pesto 175g       NaN        NaN
freq     939        NaN             NaN     NaN       NaN                                  3304       NaN        NaN
first      -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
last       -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
mean     NaN          -               -       -         -                                   NaN         -          -
std      NaN          -               -       -         -                                   NaN         -          -
min      NaN    1.00000               -       -         -                                   NaN         -          -
25%      NaN          -               -       -         -                                   NaN         -          -
50%      NaN          -               -       -         -                                   NaN         -          -
75%      NaN          -               -       -         -                                   NaN         -          -
max      NaN          -               -       -         -                                   NaN         -          -
Finding the maximum value of each column in the qt dataset
qt.max()
Sorting the dataset by the PROD_NAME column in descending order
qt.sort_values(by='PROD_NAME',ascending=False)
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                PROD_NAME  PROD_QTY  TOT_SALES
     -        102          102179  102608        35               Woolworths Mild Salsa 300g         2        3.0
     -         89           89511   88678        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62011   57829        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62011   57830        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62062   58155        35               Woolworths Mild Salsa 300g         2        3.0
   ...        ...             ...     ...       ...                                      ...       ...        ...
     -        104          104070  104150        94                        Burger Rings 220g         2        4.6
     -        205          205018  204075        94                        Burger Rings 220g         1        2.3
     -        271          271121  268911        94                        Burger Rings 220g         2        4.6
     -        104          104187  104863        94                        Burger Rings 220g         2        4.6
     -         28           28106   25157        94                        Burger Rings 220g         2        4.6

264836 rows × 8 columns
Removing salsa products from the dataset
There are salsa products in the dataset but we are only interested in the chips category, so let's remove them
qt = qt[~qt['PROD_NAME'].str.contains('Salsa')]
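A variant worth noting (sketch, commented out; the assumption is that some product names might spell 'salsa' in lowercase):
# qt = qt[~qt['PROD_NAME'].str.contains('salsa', case=False)]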
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246742 rows × 8 columns
Let's run descriptive statistics
qt.describe()
       STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR  PROD_QTY  TOT_SALES
count          -               -       -         -         -          -
mean           -               -       -         -         -          -
std            -               -       -         -         -          -
min            -               -       -         -         -          -
25%            -               -       -         -         -          -
50%            -               -       -         -         -          -
75%            -               -       -         -         -          -
max            -               -       -         -         -          -
Result of the descriptive statistics
We can see a customer purchasing 200 packets of chips in the PROD_QTY column.
Let's check this customer out.
qt.sort_values(by='PROD_QTY',ascending=False)
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                  PROD_NAME  PROD_QTY  TOT_SALES
     -        226          226000  226210         4               Dorito Corn Chp Supreme 380g       200      650.0
     -        226          226000  226201         4               Dorito Corn Chp Supreme 380g       200      650.0
     -         46           46296   42138        81              Pringles Original Crisps 134g         5       18.5
     -         71           71142   69852        96             WW Original Stacked Chips 160g         5        9.5
     -         55           55144   49328        44              Thins Chips Light& Tangy 175g         5       16.5
   ...        ...             ...     ...       ...                                        ...       ...        ...
     -        184          184055  186586       107    Smiths Crinkle Cut French OnionDip 150g         1        2.6
     -        184          184152  187227        28        Thins Potato Chips Hot & Spicy 175g         1        3.3
     -        174          174004  174974       114        Kettle Sensations Siracha Lime 150g         1        4.6
     -        184          184205  187584        13        Red Rock Deli Thai Chilli&Lime 150g         1        2.7
     -        114          114006  116914       111       Smiths Chip Thinly Cut Original 175g         1        3.0

246742 rows × 8 columns
Let's drop this customer's transactions
qt = qt.drop(qt["PROD_QTY"].loc[qt["PROD_QTY"]==200].index)
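An alternative sketch (assumption: a 200-packet purchase indicates a commercial buyer rather than a retail customer): identify the loyalty card and remove all of its transactions, not just the two bulk rows. This is commented out because it would have to run before the drop above.
# bulk_cards = qt.loc[qt['PROD_QTY'] == 200, 'LYLTY_CARD_NBR'].unique()
# qt = qt[~qt['LYLTY_CARD_NBR'].isin(bulk_cards)]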
qt.max()
DATE                                     -
STORE_NBR                                -
LYLTY_CARD_NBR                           -
TXN_ID                                   -
PROD_NBR                                 -
PROD_NAME     Woolworths Cheese Rings 190g
PROD_QTY                                 5
TOT_SALES                             29.5
dtype: object
Let's count the number of transactions by date
qt.DATE.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.DATE.describe(include='all')
count             246740
unique               364
top       - 00:00:00
freq                 865
first     - 00:00:00
last      - 00:00:00
Name: DATE, dtype: object
Creating a column of dates that includes every day from 1 Jul 2018 to 30 Jun 2019, so we can check for missing transaction dates (only 364 unique dates appear above).
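A minimal sketch for finding the gap (assuming the DATE values are normalized to midnight, as the describe output above suggests): build the full date range and take the set difference. The December plot further down suggests the missing day is Christmas Day.
full_range = pd.date_range('2018-07-01', '2019-06-30', freq='D')
missing_days = full_range.difference(qt['DATE'].unique())
print(missing_days)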
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246740 rows × 8 columns
Checking insights for the date range '-':'-'
import matplotlib.pyplot as plt
date_index = qt.set_index('DATE')
date_index['-':'-']
      STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
DATE
-             1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
-             1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
-             1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
-             2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
-             2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...         ...             ...     ...       ...                                       ...       ...        ...
-           272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
-           272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
-           272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
-           272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
-           272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246740 rows × 7 columns
date_index['TOT_SALES'].resample('D').count().plot(kind='line')
plt.xlabel('Month')
plt.ylabel('Number of transactions')
plt.title('Summary of Sales between july-')
Text(0.5, 1.0, 'Summary of Sales between july-')
We can see a spike in activity toward the end of December.
Let's check it out.
dec_data = date_index['-':'-']  # renamed from `id` to avoid shadowing the builtin
# Let's plot the daily transaction counts for December
dec_data['TOT_SALES'].resample('D').count().plot(kind='line', marker='*')
plt.xlabel('Days')
plt.ylabel('Number of transactions')
plt.title('Summary of Sales in December 2018')
Text(0.5, 1.0, 'Summary of Sales in December 2018')
We can see that the increase in sales occurs in the lead-up to Christmas and that
there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.
Let's explore the PROD_NAME column
Let's check the types of chips in the column and their counts.
qt.PROD_NAME.value_counts()
Kettle Mozzarella Basil & Pesto 175g        -
Kettle Tortilla ChpsHny&Jlpno Chili 150g    -
Cobs Popd Swt/Chlli &Sr/Cream Chips 110g    -
Tyrrells Crisps Ched & Chives 165g          -
Cobs Popd Sea Salt Chips 110g               -
                                           ..
Sunbites Whlegrn Crisps Frch/Onin 90g       -
RRD Pc Sea Salt 165g                        -
NCC Sour Cream & Garden Chives 175g         -
French Fries Potato Chips 175g              -
WW Crinkle Cut Original 175g                -
Name: PROD_NAME, Length: 105, dtype: int64
Let's create a Pack_Size column from the pack weight that appears in each PROD_NAME.
# Extract the digits immediately before the trailing 'g' as the pack size
qt['Pack_Size'] = qt['PROD_NAME'].str.extract(r'(\d+)\s*[gG]', expand=False).astype(int)
# Let's explore the Pack_Size column
qt['Pack_Size'].describe(include='all')
count    -
mean     -
std      -
min      -
25%      -
50%      -
75%      -
max      -
Name: Pack_Size, dtype: float64
Exploration of the Pack_Size column
We can see that the minimum pack size is 70g and the maximum is 380g.
# Let's plot a histogram to view the distribution
plt.hist(x=qt['Pack_Size'].values)
plt.title('Summary of pack sizes of chips')
plt.xlabel('Pack size (g)')
Text(0.5, 0, 'Pack size (g)')
Let's create a brand column using the first word of each product name.
qt['BRAND'] = qt['PROD_NAME'].str.split(' ').str[0]
# Let's print out the value counts
qt['BRAND'].value_counts()
Kettle        41288
Smiths        27390
Pringles      25102
Doritos       22041
Thins         14075
RRD           11894
Infuzions     11057
WW            10320
Cobs           9693
Tostitos       9471
Twisties       9454
Tyrrells       6442
Grain          6272
Natural        6050
Cheezels       4603
CCs            4551
Red            4427
Dorito         3183
Infzns         3144
Smith          2963
Cheetos        2927
Snbts          1576
Burger         1564
Woolworths     1516
GrnWves        1468
Sunbites       1432
NCC            1419
French         1418
Name: BRAND, dtype: int64
Note: some of the brand names look like they belong to the same brand, such as Red and RRD, which are both Red Rock Deli chips. Let's combine these together.
qt['BRAND']=qt['BRAND'].replace({'Red':'RRD'})
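A fuller mapping is sketched below, commented out because the later outputs assume only the Red/RRD merge; the groupings are assumptions based on the abbreviated names in the counts above.
# qt['BRAND'] = qt['BRAND'].replace({
#     'Red': 'RRD', 'Dorito': 'Doritos', 'Smith': 'Smiths',
#     'Infzns': 'Infuzions', 'Snbts': 'Sunbites', 'WW': 'Woolworths',
#     'GrnWves': 'Grain', 'NCC': 'Natural',
# })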
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size     BRAND
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0        175   Natural
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3        175       CCs
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170    Smiths
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0        175    Smiths
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8        150    Kettle
...       ...        ...             ...     ...       ...                                       ...       ...        ...        ...       ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8        175    Kettle
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4        175  Tostitos
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8        170   Doritos
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8        150   Doritos
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8        175  Tostitos

246740 rows × 10 columns
Let's perform exploratory analysis on the qp dataset (which contains LIFESTAGE and PREMIUM_CUSTOMER)
qp
       LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0                1000   YOUNG SINGLES/COUPLES          Premium
1                1002   YOUNG SINGLES/COUPLES       Mainstream
2                1003          YOUNG FAMILIES           Budget
3                1004   OLDER SINGLES/COUPLES       Mainstream
4                1005  MIDAGE SINGLES/COUPLES       Mainstream
...               ...                     ...              ...
72632               -  MIDAGE SINGLES/COUPLES       Mainstream
72633               -          YOUNG FAMILIES       Mainstream
72634               -          YOUNG FAMILIES          Premium
72635               -          OLDER FAMILIES           Budget
72636               -   YOUNG SINGLES/COUPLES       Mainstream

72637 rows × 3 columns
# Doing some descriptive statistics
qp.describe(include='all')
        LYLTY_CARD_NBR  LIFESTAGE PREMIUM_CUSTOMER
count                -      72637            72637
unique             NaN          7                3
top                NaN   RETIREES       Mainstream
freq               NaN      14805            29245
mean                 -        NaN              NaN
std                  -        NaN              NaN
min                  -        NaN              NaN
25%                  -        NaN              NaN
50%                  -        NaN              NaN
75%                  -        NaN              NaN
max                  -        NaN              NaN
Result: In this dataset we have 72637 rows and 3 columns
Exploring the customer premium segments with a histogram
plt.figure(figsize=(6, 4))  # a sane figure size; the original (4000, 4000) would fail to render
plt.hist(qp['PREMIUM_CUSTOMER'])
We can see that Mainstream customers are the largest segment in our store.
Lifestage of the customer
qp['LIFESTAGE'].value_counts()
RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES             9780
YOUNG FAMILIES             9178
MIDAGE SINGLES/COUPLES     7275
NEW FAMILIES               2549
Name: LIFESTAGE, dtype: int64
We can see that retirees are the largest lifestage group among our customers.
Merging the qp and qt dataframes on LYLTY_CARD_NBR
customer_data = pd.merge(qt,qp,on='LYLTY_CARD_NBR')
customer_data.head()
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                              PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size    BRAND               LIFESTAGE
0    -          1            1000       1         5        Natural Chip Compny SeaSalt175g         2        6.0        175  Natural   YOUNG SINGLES/COUPLES
1    -          1            1307     348        66                  CCs Nacho Cheese 175g         3        6.3        175      CCs  MIDAGE SINGLES/COUPLES
2    -          1            1307     346        96         WW Original Stacked Chips 160g         2        3.8        160       WW  MIDAGE SINGLES/COUPLES
3    -          1            1307     347        54                      CCs Original 175g         1        2.1        175      CCs  MIDAGE SINGLES/COUPLES
4    -          1            1343     383        61  Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170   Smiths  MIDAGE SINGLES/COUPLES
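Since the merged frame keeps the same 246,740 rows as qt, the inner join lost no transactions; a quick sanity-check sketch (a hypothetical assertion, not in the original run):
# Every loyalty card in qt should also exist in qp
assert len(customer_data) == len(qt)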
Save the customer data to csv
customer_data.to_csv('Customer_Data.csv')
Let's perform exploratory analysis on our new dataset
Exploring the customer premium segments with a histogram
plt.figure(figsize=(6, 4))
plt.hist(customer_data['PREMIUM_CUSTOMER'])
Plotting premium customer segments against total sales using a pie chart
import plotly.express as px
px.pie(customer_data, values='TOT_SALES',names='PREMIUM_CUSTOMER')
Let's explore the lifestage of the customers
px.pie(customer_data, values='TOT_SALES',names='LIFESTAGE')
Creating a dataframe to explore the relationship between lifestage and premium customer
met = customer_data[['TOT_SALES','LIFESTAGE','PREMIUM_CUSTOMER']]
met1 = met.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'],as_index=False).sum()
met2 = met1.pivot(index='LIFESTAGE', columns='PREMIUM_CUSTOMER')
met2
                        TOT_SALES
PREMIUM_CUSTOMER           Budget Mainstream Premium
LIFESTAGE
MIDAGE SINGLES/COUPLES          -          -       -
NEW FAMILIES                    -          -       -
OLDER FAMILIES                  -          -       -
OLDER SINGLES/COUPLES           -          -       -
RETIREES                        -          -       -
YOUNG FAMILIES                  -          -       -
YOUNG SINGLES/COUPLES           -          -       -
met2.plot(kind='bar',title='Lifestage and Premium customer relationship to Total Sales')
We can see from the visual that:
For Budget, Older families have the most purchases.
For Mainstream, Young singles/couples have the most purchases.
For Premium, Older families have the most purchases.
Also, there are more Mainstream young singles/couples and Mainstream retirees who buy chips. This contributes to there being more sales in these customer segments, but it is not a major driver for the Budget - Older families segment.
Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
met = customer_data[['TOT_SALES', 'LIFESTAGE', 'PREMIUM_CUSTOMER']]
met10 = met.groupby(['LIFESTAGE', 'PREMIUM_CUSTOMER'], as_index=False).mean()
met20 = met10.pivot(index='PREMIUM_CUSTOMER', columns='LIFESTAGE')  # pivot the means, not the sums in met1
met20
                                                                    TOT_SALES
LIFESTAGE        MIDAGE SINGLES/COUPLES NEW FAMILIES OLDER FAMILIES OLDER SINGLES/COUPLES RETIREES YOUNG FAMILIES YOUNG SINGLES/COUPLES
PREMIUM_CUSTOMER
Budget                                -            -              -                     -        -              -                     -
Mainstream                            -            -              -                     -        -              -                     -
Premium                               -            -              -                     -        -              -                     -
Older families and young families in general buy more chips per customer
# Let's create a plot
met20.plot(kind='barh', figsize=(10, 6),
           title='Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER')
plt.legend(prop={'size': 7})
plt.show()
Mainstream midage and young singles and couples are more willing to pay more per packet of chips compared to their budget and premium
counterparts. This may be due to premium shoppers being more likely to buy healthy snacks and when they buy chips, this is mainly for
entertainment purposes rather than their own consumption. This is also supported by there being fewer premium midage and young singles
and couples buying chips compared to their mainstream counterparts.
Checking for differences among the variables
Performing an independent t-test between Mainstream vs Premium, and between Budget midage and young singles and couples
# Let show the customer data
customer_data
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size     BRAND               LIFESTAGE
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0        175   Natural   YOUNG SINGLES/COUPLES
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3        175       CCs  MIDAGE SINGLES/COUPLES
2           -          1            1307     346        96            WW Original Stacked Chips 160g         2        3.8        160        WW  MIDAGE SINGLES/COUPLES
3           -          1            1307     347        54                         CCs Original 175g         1        2.1        175       CCs  MIDAGE SINGLES/COUPLES
4           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170    Smiths  MIDAGE SINGLES/COUPLES
...       ...        ...             ...     ...       ...                                       ...       ...        ...        ...       ...                     ...
246735      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8        175    Kettle   YOUNG SINGLES/COUPLES
246736      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4        175  Tostitos   YOUNG SINGLES/COUPLES
246737      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8        170   Doritos   YOUNG SINGLES/COUPLES
246738      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8        150   Doritos   YOUNG SINGLES/COUPLES
246739      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8        175  Tostitos   YOUNG SINGLES/COUPLES

246740 rows × 12 columns
# Importing the scipy library
import scipy.stats as stats
# Performing an independent t-test between mainstream vs premium
pr = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['PREMIUM_CUSTOMER'] == 'Mainstream'],
customer_data['TOT_SALES'][customer_data['PREMIUM_CUSTOMER'] =='Premium'])
pr
Ttest_indResult(statistic=-, pvalue=-e-13)
The result shows a statistically significant difference between Mainstream and Premium customers, since the p-value is less than 0.05 (t = 7.28).
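A variant worth sketching (not part of the original analysis): Welch's t-test drops the equal-variance assumption that scipy's default makes, which is safer when the two segments differ in size and spread.
stats.ttest_ind(
    customer_data.loc[customer_data['PREMIUM_CUSTOMER'] == 'Mainstream', 'TOT_SALES'],
    customer_data.loc[customer_data['PREMIUM_CUSTOMER'] == 'Premium', 'TOT_SALES'],
    equal_var=False,  # Welch's t-test
)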
Let's compare the LIFESTAGE groups of customers
# Exploring the differences between Midage singles/couples and young singles/couples
pr1 = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'MIDAGE SINGLES/COUPLES'],
customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'YOUNG SINGLES/COUPLES'])
pr1
Ttest_indResult(statistic=-, pvalue=-e-20)
The result shows a statistically significant difference between Midage singles/couples and Young singles/couples customers, since the p-value is less than 0.05 (t = 9.158).
customer_data['LIFESTAGE'].value_counts()
OLDER SINGLES/COUPLES     50793
RETIREES                  46431
OLDER FAMILIES            45158
YOUNG FAMILIES            40494
YOUNG SINGLES/COUPLES     33969
MIDAGE SINGLES/COUPLES    23398
NEW FAMILIES               6497
Name: LIFESTAGE, dtype: int64
# Exploring the differences between RETIREES and OLDER FAMILIES
pr2 = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'RETIREES'],
customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'OLDER FAMILIES'])
pr2
Ttest_indResult(statistic=-, pvalue=-e-09)
The result shows a statistically significant difference between RETIREES and OLDER FAMILIES customers, since the p-value is less than 0.05 (t = 6.106).
Performing apriori analysis on the buying patterns of different brands in our dataset
We use this technique to find out whether there are brands that these customer segments prefer more than others, or whose buying patterns go together.
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Work on a copy to avoid SettingWithCopyWarning on the assignments below
df = customer_data[['LIFESTAGE', 'TXN_ID', 'BRAND', 'TOT_SALES']].copy()
df['PROD_NAME'] = df['BRAND'].str.strip()
df.dropna(axis=0, subset=['TXN_ID'], inplace=True)
df['TXN_ID'] = df['TXN_ID'].astype('str')
# Build a TXN_ID x BRAND basket of summed sales per transaction
basket = (df.groupby(['TXN_ID', 'BRAND'])['TOT_SALES'].sum()
            .unstack().reset_index().fillna(0).set_index('TXN_ID'))

# Encode sales totals as 0/1 flags for the apriori algorithm
def encode_0_1(x):
    return 1 if x > 0 else 0

basket_sets = basket.applymap(encode_0_1)
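# An equivalent vectorized encoding (sketch): flag any positive sales total as 1.
# basket_sets = (basket > 0).astype(int)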
frequent_itemsets = apriori(basket_sets,min_support=0.01, use_colnames=True)
frequent_itemsets
    support     itemsets
0         -        (CCs)
1         -    (Cheetos)
2         -   (Cheezels)
3         -       (Cobs)
4         -     (Dorito)
5         -    (Doritos)
6         -      (Grain)
7         -  (Infuzions)
8         -     (Infzns)
9         -     (Kettle)
10        -    (Natural)
11        -   (Pringles)
12        -        (RRD)
13        -      (Smith)
14        -     (Smiths)
15        -      (Thins)
16        -   (Tostitos)
17        -   (Twisties)
18        -   (Tyrrells)
19        -         (WW)
From these frequent itemsets we can see which brands meet the minimum support threshold across transactions.
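As a follow-up sketch (the association_rules import above is otherwise unused; with only single-brand itemsets at this support threshold the rules table comes back empty, so a lower min_support would be needed to surface brand pairs):
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)
rules.head()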