Overview of this project
This project involves careful analysis of two datasets: customer data and transaction data.
In this project we will be performing data cleaning, wrangling, and data exploration, as well as statistical analysis among the variables.
import pandas as pd
import numpy as np
Importing the datasets
# for purchase behaviour
qp = pd.read_csv('C:/Users/User/Documents/Quatum Intenship/purchaseBehaviour.csv')
# for transaction data
qt = pd.read_excel('C:/Users/User/Documents/Quatum Intenship/datEQVI_transaction_data.xlsx')
qp.head()
   LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0            1000   YOUNG SINGLES/COUPLES          Premium
1            1002   YOUNG SINGLES/COUPLES       Mainstream
2            1003          YOUNG FAMILIES           Budget
3            1004   OLDER SINGLES/COUPLES       Mainstream
4            1005  MIDAGE SINGLES/COUPLES       Mainstream
qt.head()
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0    -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1    -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2    -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3    -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4    -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
Performing Exploratory Data Analysis
qt.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.describe(include='all')
        DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                             PROD_NAME  PROD_QTY  TOT_SALES
count      -          -               -       -         -                                     -         -          -
unique   364        NaN             NaN     NaN       NaN                                   114       NaN        NaN
top        -        NaN             NaN     NaN       NaN  Kettle Mozzarella Basil & Pesto 175g       NaN        NaN
freq     939        NaN             NaN     NaN       NaN                                  3304       NaN        NaN
first      -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
last       -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
mean     NaN          -               -       -         -                                   NaN         -          -
std      NaN          -               -       -         -                                   NaN         -          -
min      NaN    1.00000               -       -         -                                   NaN         -          -
25%      NaN          -               -       -         -                                   NaN         -          -
50%      NaN          -               -       -         -                                   NaN         -          -
75%      NaN          -               -       -         -                                   NaN         -          -
max      NaN          -               -       -         -                                   NaN         -          -
qp.describe(include='all')
        LYLTY_CARD_NBR  LIFESTAGE PREMIUM_CUSTOMER
count                -      72637            72637
unique             NaN          7                3
top                NaN   RETIREES       Mainstream
freq               NaN      14805            29245
mean                 -        NaN              NaN
std                  -        NaN              NaN
min                  -        NaN              NaN
25%                  -        NaN              NaN
50%                  -        NaN              NaN
75%                  -        NaN              NaN
max                  -        NaN              NaN
Checking for null values in the datasets
# For the transaction dataset
qt.isnull().sum()
DATE              0
STORE_NBR         0
LYLTY_CARD_NBR    0
TXN_ID            0
PROD_NBR          0
PROD_NAME         0
PROD_QTY          0
TOT_SALES         0
dtype: int64
# For the purchase behaviour dataset
qp.isnull().sum()
LYLTY_CARD_NBR      0
LIFESTAGE           0
PREMIUM_CUSTOMER    0
dtype: int64
Result of the null (NaN) check
We found that there are no null values in any column of either dataset.
Checking the datatypes and converting to the necessary format
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

264836 rows × 8 columns
For the qt dataset
qt.info()
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
Result
We found that there are 8 columns and 264,836 rows in this dataset.
We only need to make sure the DATE column is of datetime type.
# Converting the DATE column (left commented out because read_excel already
# parsed DATE as datetime64)
#qt['DATE'] = pd.to_datetime(qt['DATE'], format='%d%m%y')
#qt['DATE'] = pd.to_datetime(qt['DATE']).dt.date
# Let's confirm whether DATE has been converted
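# Sketch (assumption): if DATE had instead come through as an Excel integer
# date serial, as raw .xlsx extracts sometimes do, it could be converted using
# Excel's 1900 date system origin:
# qt['DATE'] = pd.to_datetime(qt['DATE'], unit='D', origin='1899-12-30')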
qt.info()
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
For the qp dataset
qp
       LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0                1000   YOUNG SINGLES/COUPLES          Premium
1                1002   YOUNG SINGLES/COUPLES       Mainstream
2                1003          YOUNG FAMILIES           Budget
3                1004   OLDER SINGLES/COUPLES       Mainstream
4                1005  MIDAGE SINGLES/COUPLES       Mainstream
...               ...                     ...              ...
72632               -  MIDAGE SINGLES/COUPLES       Mainstream
72633               -          YOUNG FAMILIES       Mainstream
72634               -          YOUNG FAMILIES          Premium
72635               -          OLDER FAMILIES           Budget
72636               -   YOUNG SINGLES/COUPLES       Mainstream

72637 rows × 3 columns
# checking the necessary information
qp.info()
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   LYLTY_CARD_NBR    72637 non-null  int64
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
The datatypes in this dataset are already appropriate; no conversions are needed.
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

264836 rows × 8 columns
Removing unwanted characters from the qt PROD_NAME column
# Removing non-word characters from PROD_NAME. Note: the pattern below should
# be r'\W' to match punctuation; as written, r'\'W' matches nothing, so the
# names in the preview below come back unchanged (and the result is not
# assigned back to qt).
qt['PROD_NAME'].str.replace(r'\'W','')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: The default value of regex will c
hange from True to False in a future version.
qt['PROD_NAME'].str.replace(r'\'W','')
0                  Natural Chip Compny SeaSalt175g
1                            CCs Nacho Cheese 175g
2            Smiths Crinkle Cut Chips Chicken 170g
3            Smiths Chip Thinly S/Cream&Onion 175g
4         Kettle Tortilla ChpsHny&Jlpno Chili 150g
                            ...
264831     Kettle Sweet Chilli And Sour Cream 175g
264832                Tostitos Splash Of Lime 175g
264833                       Doritos Mexicana 170g
264834     Doritos Corn Chip Mexican Jalapeno 150g
264835                Tostitos Splash Of Lime 175g
Name: PROD_NAME, Length: 264836, dtype: object
Let's use descriptive statistics to find the most common values in each column.
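Before the describe call below, a direct word-frequency sketch (hypothetical helper, not run in this notebook): split the product names into words, drop digits and punctuation, and count the terms.
words = (qt['PROD_NAME']
         .str.replace(r'[^A-Za-z ]', ' ', regex=True)  # keep letters only
         .str.split()
         .explode()
         .value_counts())
words.head(10)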
qt.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.describe(include='all')
        DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                             PROD_NAME  PROD_QTY  TOT_SALES
count      -          -               -       -         -                                     -         -          -
unique   364        NaN             NaN     NaN       NaN                                   114       NaN        NaN
top        -        NaN             NaN     NaN       NaN  Kettle Mozzarella Basil & Pesto 175g       NaN        NaN
freq     939        NaN             NaN     NaN       NaN                                  3304       NaN        NaN
first      -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
last       -        NaN             NaN     NaN       NaN                                   NaN       NaN        NaN
mean     NaN          -               -       -         -                                   NaN         -          -
std      NaN          -               -       -         -                                   NaN         -          -
min      NaN    1.00000               -       -         -                                   NaN         -          -
25%      NaN          -               -       -         -                                   NaN         -          -
50%      NaN          -               -       -         -                                   NaN         -          -
75%      NaN          -               -       -         -                                   NaN         -          -
max      NaN          -               -       -         -                                   NaN         -          -
Finding the maximum value of each column in the qt dataset
qt.max()
Sorting the dataset by the PROD_NAME column in descending order
qt.sort_values(by='PROD_NAME',ascending=False)
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                PROD_NAME  PROD_QTY  TOT_SALES
     -        102          102179  102608        35               Woolworths Mild Salsa 300g         2        3.0
     -         89           89511   88678        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62011   57829        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62011   57830        35               Woolworths Mild Salsa 300g         2        3.0
     -         62           62062   58155        35               Woolworths Mild Salsa 300g         2        3.0
   ...        ...             ...     ...       ...                                      ...       ...        ...
     -        104          104070  104150        94                        Burger Rings 220g         2        4.6
     -        205          205018  204075        94                        Burger Rings 220g         1        2.3
     -        271          271121  268911        94                        Burger Rings 220g         2        4.6
     -        104          104187  104863        94                        Burger Rings 220g         2        4.6
     -         28           28106   25157        94                        Burger Rings 220g         2        4.6

264836 rows × 8 columns
Removing salsa products from the dataset
There are salsa products in the dataset but we are only interested in the chips category, so let's remove them
qt = qt[~qt['PROD_NAME'].str.contains('Salsa')]
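A variant worth noting (sketch, commented out; the assumption is that some product names might spell 'salsa' in lowercase):
# qt = qt[~qt['PROD_NAME'].str.contains('salsa', case=False)]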
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246742 rows × 8 columns
Let's run descriptive statistics
qt.describe()
       STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR  PROD_QTY  TOT_SALES
count          -               -       -         -         -          -
mean           -               -       -         -         -          -
std            -               -       -         -         -          -
min            -               -       -         -         -          -
25%            -               -       -         -         -          -
50%            -               -       -         -         -          -
75%            -               -       -         -         -          -
max            -               -       -         -         -          -
Result of the descriptive statistics
We can see a customer purchasing 200 packets of chips in the PROD_QTY column.
Let's check this customer out.
qt.sort_values(by='PROD_QTY',ascending=False)
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                  PROD_NAME  PROD_QTY  TOT_SALES
     -        226          226000  226210         4               Dorito Corn Chp Supreme 380g       200      650.0
     -        226          226000  226201         4               Dorito Corn Chp Supreme 380g       200      650.0
     -         46           46296   42138        81              Pringles Original Crisps 134g         5       18.5
     -         71           71142   69852        96             WW Original Stacked Chips 160g         5        9.5
     -         55           55144   49328        44              Thins Chips Light& Tangy 175g         5       16.5
   ...        ...             ...     ...       ...                                        ...       ...        ...
     -        184          184055  186586       107    Smiths Crinkle Cut French OnionDip 150g         1        2.6
     -        184          184152  187227        28        Thins Potato Chips Hot & Spicy 175g         1        3.3
     -        174          174004  174974       114        Kettle Sensations Siracha Lime 150g         1        4.6
     -        184          184205  187584        13        Red Rock Deli Thai Chilli&Lime 150g         1        2.7
     -        114          114006  116914       111       Smiths Chip Thinly Cut Original 175g         1        3.0

246742 rows × 8 columns
Let's drop this customer's transactions
qt = qt.drop(qt["PROD_QTY"].loc[qt["PROD_QTY"]==200].index)
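An alternative sketch (assumption: a 200-packet purchase indicates a commercial buyer rather than a retail customer): identify the loyalty card and remove all of its transactions, not just the two bulk rows. This is commented out because it would have to run before the drop above.
# bulk_cards = qt.loc[qt['PROD_QTY'] == 200, 'LYLTY_CARD_NBR'].unique()
# qt = qt[~qt['LYLTY_CARD_NBR'].isin(bulk_cards)]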
qt.max()
DATE                                     -
STORE_NBR                                -
LYLTY_CARD_NBR                           -
TXN_ID                                   -
PROD_NBR                                 -
PROD_NAME     Woolworths Cheese Rings 190g
PROD_QTY                                 5
TOT_SALES                             29.5
dtype: object
Let's count the number of transactions by date
qt.DATE.describe(include='all')
C:\Users\User\AppData\Local\Temp/ipykernel_5028/-.py:1: FutureWarning: Treating datetime data as categor
ical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify
`datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
qt.DATE.describe(include='all')
count             246740
unique               364
top       - 00:00:00
freq                 865
first     - 00:00:00
last      - 00:00:00
Name: DATE, dtype: object
Creating a column of dates that includes every day from 1 Jul 2018 to 30 Jun 2019, so we can check for missing transaction dates (only 364 unique dates appear above).
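A minimal sketch for finding the gap (assuming the DATE values are normalized to midnight, as the describe output above suggests): build the full date range and take the set difference. The December plot further down suggests the missing day is Christmas Day.
full_range = pd.date_range('2018-07-01', '2019-06-30', freq='D')
missing_days = full_range.difference(qt['DATE'].unique())
print(missing_days)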
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...       ...        ...             ...     ...       ...                                       ...       ...        ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246740 rows × 8 columns
Checking insights for the date range '-':'-'
import matplotlib.pyplot as plt
date_index = qt.set_index('DATE')
date_index['-':'-']
      STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES
DATE
-             1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0
-             1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3
-             1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9
-             2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0
-             2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8
...         ...             ...     ...       ...                                       ...       ...        ...
-           272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8
-           272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4
-           272          272379  270187        51                     Doritos Mexicana 170g         2        8.8
-           272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8
-           272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8

246740 rows × 7 columns
date_index['TOT_SALES'].resample('D').count().plot(kind='line')
plt.xlabel('Month')
plt.ylabel('Number of transactions')
plt.title('Summary of Sales between july-')
Text(0.5, 1.0, 'Summary of Sales between july-')
We can see a spike in activity toward the end of December.
Let's check it out.
dec_data = date_index['-':'-']  # renamed from `id` to avoid shadowing the builtin
# Let's plot the daily transaction counts for December
dec_data['TOT_SALES'].resample('D').count().plot(kind='line', marker='*')
plt.xlabel('Days')
plt.ylabel('Number of transactions')
plt.title('Summary of Sales in December 2018')
Text(0.5, 1.0, 'Summary of Sales in December 2018')
We can see that the increase in sales occurs in the lead-up to Christmas and that
there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.
Let's explore the PROD_NAME column
Let's check the types of chips in the column and their counts.
qt.PROD_NAME.value_counts()
Kettle Mozzarella Basil & Pesto 175g        -
Kettle Tortilla ChpsHny&Jlpno Chili 150g    -
Cobs Popd Swt/Chlli &Sr/Cream Chips 110g    -
Tyrrells Crisps Ched & Chives 165g          -
Cobs Popd Sea Salt Chips 110g               -
                                           ..
Sunbites Whlegrn Crisps Frch/Onin 90g       -
RRD Pc Sea Salt 165g                        -
NCC Sour Cream & Garden Chives 175g         -
French Fries Potato Chips 175g              -
WW Crinkle Cut Original 175g                -
Name: PROD_NAME, Length: 105, dtype: int64
Let's create a Pack_Size column from the pack weight that appears in each PROD_NAME.
# Extract the digits immediately before the trailing 'g' as the pack size
qt['Pack_Size'] = qt['PROD_NAME'].str.extract(r'(\d+)\s*[gG]', expand=False).astype(int)
# Let's explore the Pack_Size column
qt['Pack_Size'].describe(include='all')
count    -
mean     -
std      -
min      -
25%      -
50%      -
75%      -
max      -
Name: Pack_Size, dtype: float64
Exploration of the Pack_Size column
We can see that the minimum pack size is 70g and the maximum is 380g.
# Let's plot a histogram to view the distribution
plt.hist(x=qt['Pack_Size'].values)
plt.title('Summary of pack sizes of chips')
plt.xlabel('Pack size (g)')
Text(0.5, 0, 'Pack size (g)')
Let's create a brand column using the first word of each product name.
qt['BRAND'] = qt['PROD_NAME'].str.split(' ').str[0]
# Let's print out the value counts
qt['BRAND'].value_counts()
Kettle        41288
Smiths        27390
Pringles      25102
Doritos       22041
Thins         14075
RRD           11894
Infuzions     11057
WW            10320
Cobs           9693
Tostitos       9471
Twisties       9454
Tyrrells       6442
Grain          6272
Natural        6050
Cheezels       4603
CCs            4551
Red            4427
Dorito         3183
Infzns         3144
Smith          2963
Cheetos        2927
Snbts          1576
Burger         1564
Woolworths     1516
GrnWves        1468
Sunbites       1432
NCC            1419
French         1418
Name: BRAND, dtype: int64
Note: some of the brand names look like they belong to the same brand, such as Red and RRD, which are both Red Rock Deli chips. Let's combine these together.
qt['BRAND']=qt['BRAND'].replace({'Red':'RRD'})
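A fuller mapping is sketched below, commented out because the later outputs assume only the Red/RRD merge; the groupings are assumptions based on the abbreviated names in the counts above.
# qt['BRAND'] = qt['BRAND'].replace({
#     'Red': 'RRD', 'Dorito': 'Doritos', 'Smith': 'Smiths',
#     'Infzns': 'Infuzions', 'Snbts': 'Sunbites', 'WW': 'Woolworths',
#     'GrnWves': 'Grain', 'NCC': 'Natural',
# })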
qt
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size     BRAND
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0        175   Natural
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3        175       CCs
2           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170    Smiths
3           -          2            2373     974        69     Smiths Chip Thinly S/Cream&Onion 175g         5       15.0        175    Smiths
4           -          2            2426    1038       108  Kettle Tortilla ChpsHny&Jlpno Chili 150g         3       13.8        150    Kettle
...       ...        ...             ...     ...       ...                                       ...       ...        ...        ...       ...
264831      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8        175    Kettle
264832      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4        175  Tostitos
264833      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8        170   Doritos
264834      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8        150   Doritos
264835      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8        175  Tostitos

246740 rows × 10 columns
Let's perform exploratory analysis on the qp dataset (which contains LIFESTAGE and PREMIUM_CUSTOMER)
qp
       LYLTY_CARD_NBR               LIFESTAGE PREMIUM_CUSTOMER
0                1000   YOUNG SINGLES/COUPLES          Premium
1                1002   YOUNG SINGLES/COUPLES       Mainstream
2                1003          YOUNG FAMILIES           Budget
3                1004   OLDER SINGLES/COUPLES       Mainstream
4                1005  MIDAGE SINGLES/COUPLES       Mainstream
...               ...                     ...              ...
72632               -  MIDAGE SINGLES/COUPLES       Mainstream
72633               -          YOUNG FAMILIES       Mainstream
72634               -          YOUNG FAMILIES          Premium
72635               -          OLDER FAMILIES           Budget
72636               -   YOUNG SINGLES/COUPLES       Mainstream

72637 rows × 3 columns
# Doing some descriptive statistics
qp.describe(include='all')
        LYLTY_CARD_NBR  LIFESTAGE PREMIUM_CUSTOMER
count                -      72637            72637
unique             NaN          7                3
top                NaN   RETIREES       Mainstream
freq               NaN      14805            29245
mean                 -        NaN              NaN
std                  -        NaN              NaN
min                  -        NaN              NaN
25%                  -        NaN              NaN
50%                  -        NaN              NaN
75%                  -        NaN              NaN
max                  -        NaN              NaN
Result: In this dataset we have 72637 rows and 3 columns
Exploring the customer premium segments with a histogram
plt.figure(figsize=(6, 4))  # a sane figure size; the original (4000, 4000) would fail to render
plt.hist(qp['PREMIUM_CUSTOMER'])
We can see that Mainstream customers are the largest segment in our store.
Lifestage of the customer
qp['LIFESTAGE'].value_counts()
RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES             9780
YOUNG FAMILIES             9178
MIDAGE SINGLES/COUPLES     7275
NEW FAMILIES               2549
Name: LIFESTAGE, dtype: int64
We can see that retirees are the largest lifestage group among our customers.
Merging the qp and qt dataframes on LYLTY_CARD_NBR
customer_data = pd.merge(qt,qp,on='LYLTY_CARD_NBR')
customer_data.head()
  DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                              PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size    BRAND               LIFESTAGE
0    -          1            1000       1         5        Natural Chip Compny SeaSalt175g         2        6.0        175  Natural   YOUNG SINGLES/COUPLES
1    -          1            1307     348        66                  CCs Nacho Cheese 175g         3        6.3        175      CCs  MIDAGE SINGLES/COUPLES
2    -          1            1307     346        96         WW Original Stacked Chips 160g         2        3.8        160       WW  MIDAGE SINGLES/COUPLES
3    -          1            1307     347        54                      CCs Original 175g         1        2.1        175      CCs  MIDAGE SINGLES/COUPLES
4    -          1            1343     383        61  Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170   Smiths  MIDAGE SINGLES/COUPLES
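Since the merged frame keeps the same 246,740 rows as qt, the inner join lost no transactions; a quick sanity-check sketch (a hypothetical assertion, not in the original run):
# Every loyalty card in qt should also exist in qp
assert len(customer_data) == len(qt)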
Save the customer data to csv
customer_data.to_csv('Customer_Data.csv')
Let's perform exploratory analysis on our new dataset
Exploring the customer premium segments with a histogram
plt.figure(figsize=(6, 4))
plt.hist(customer_data['PREMIUM_CUSTOMER'])
Plotting premium customer segments against total sales using a pie chart
import plotly.express as px
px.pie(customer_data, values='TOT_SALES',names='PREMIUM_CUSTOMER')
Let's explore the lifestage of the customers
px.pie(customer_data, values='TOT_SALES',names='LIFESTAGE')
Creating a dataframe to explore the relationship between lifestage and premium customer
met = customer_data[['TOT_SALES','LIFESTAGE','PREMIUM_CUSTOMER']]
met1 = met.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'],as_index=False).sum()
met2 = met1.pivot(index='LIFESTAGE', columns='PREMIUM_CUSTOMER')
met2
                        TOT_SALES
PREMIUM_CUSTOMER           Budget Mainstream Premium
LIFESTAGE
MIDAGE SINGLES/COUPLES          -          -       -
NEW FAMILIES                    -          -       -
OLDER FAMILIES                  -          -       -
OLDER SINGLES/COUPLES           -          -       -
RETIREES                        -          -       -
YOUNG FAMILIES                  -          -       -
YOUNG SINGLES/COUPLES           -          -       -
met2.plot(kind='bar',title='Lifestage and Premium customer relationship to Total Sales')
We can see from the visual that:
For Budget, Older families have the most purchases.
For Mainstream, Young singles/couples have the most purchases.
For Premium, Older families have the most purchases.
Also, there are more Mainstream young singles/couples and Mainstream retirees who buy chips. This contributes to there being more sales in these customer segments, but it is not a major driver for the Budget - Older families segment.
Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
met = customer_data[['TOT_SALES', 'LIFESTAGE', 'PREMIUM_CUSTOMER']]
met10 = met.groupby(['LIFESTAGE', 'PREMIUM_CUSTOMER'], as_index=False).mean()
met20 = met10.pivot(index='PREMIUM_CUSTOMER', columns='LIFESTAGE')  # pivot the means, not the sums in met1
met20
                                                                    TOT_SALES
LIFESTAGE        MIDAGE SINGLES/COUPLES NEW FAMILIES OLDER FAMILIES OLDER SINGLES/COUPLES RETIREES YOUNG FAMILIES YOUNG SINGLES/COUPLES
PREMIUM_CUSTOMER
Budget                                -            -              -                     -        -              -                     -
Mainstream                            -            -              -                     -        -              -                     -
Premium                               -            -              -                     -        -              -                     -
Older families and young families in general buy more chips per customer
# Let's create a plot
met20.plot(kind='barh', figsize=(10, 6),
           title='Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER')
plt.legend(prop={'size': 7})
plt.show()
Mainstream midage and young singles and couples are more willing to pay more per packet of chips compared to their budget and premium
counterparts. This may be due to premium shoppers being more likely to buy healthy snacks and when they buy chips, this is mainly for
entertainment purposes rather than their own consumption. This is also supported by there being fewer premium midage and young singles
and couples buying chips compared to their mainstream counterparts.
Checking for differences among the variables
Performing an independent t-test between Mainstream vs Premium, and between Budget midage and young singles and couples
# Let show the customer data
customer_data
         DATE  STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR                                 PROD_NAME  PROD_QTY  TOT_SALES  Pack_Size     BRAND               LIFESTAGE
0           -          1            1000       1         5           Natural Chip Compny SeaSalt175g         2        6.0        175   Natural   YOUNG SINGLES/COUPLES
1           -          1            1307     348        66                     CCs Nacho Cheese 175g         3        6.3        175       CCs  MIDAGE SINGLES/COUPLES
2           -          1            1307     346        96            WW Original Stacked Chips 160g         2        3.8        160        WW  MIDAGE SINGLES/COUPLES
3           -          1            1307     347        54                         CCs Original 175g         1        2.1        175       CCs  MIDAGE SINGLES/COUPLES
4           -          1            1343     383        61     Smiths Crinkle Cut Chips Chicken 170g         2        2.9        170    Smiths  MIDAGE SINGLES/COUPLES
...       ...        ...             ...     ...       ...                                       ...       ...        ...        ...       ...                     ...
246735      -        272          272319  270088        89   Kettle Sweet Chilli And Sour Cream 175g         2       10.8        175    Kettle   YOUNG SINGLES/COUPLES
246736      -        272          272358  270154        74              Tostitos Splash Of Lime 175g         1        4.4        175  Tostitos   YOUNG SINGLES/COUPLES
246737      -        272          272379  270187        51                     Doritos Mexicana 170g         2        8.8        170   Doritos   YOUNG SINGLES/COUPLES
246738      -        272          272379  270188        42   Doritos Corn Chip Mexican Jalapeno 150g         2        7.8        150   Doritos   YOUNG SINGLES/COUPLES
246739      -        272          272380  270189        74              Tostitos Splash Of Lime 175g         2        8.8        175  Tostitos   YOUNG SINGLES/COUPLES

246740 rows × 12 columns
# Importing the scipy library
import scipy.stats as stats
# Performing an independent t-test between mainstream vs premium
pr = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['PREMIUM_CUSTOMER'] == 'Mainstream'],
customer_data['TOT_SALES'][customer_data['PREMIUM_CUSTOMER'] =='Premium'])
pr
Ttest_indResult(statistic=-, pvalue=-e-13)
The result shows a statistically significant difference between Mainstream and Premium customers, since the p-value is less than 0.05 (t = 7.28).
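A variant worth sketching (not part of the original analysis): Welch's t-test drops the equal-variance assumption that scipy's default makes, which is safer when the two segments differ in size and spread.
stats.ttest_ind(
    customer_data.loc[customer_data['PREMIUM_CUSTOMER'] == 'Mainstream', 'TOT_SALES'],
    customer_data.loc[customer_data['PREMIUM_CUSTOMER'] == 'Premium', 'TOT_SALES'],
    equal_var=False,  # Welch's t-test
)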
Let's compare the LIFESTAGE groups of customers
# Exploring the differences between Midage singles/couples and young singles/couples
pr1 = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'MIDAGE SINGLES/COUPLES'],
customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'YOUNG SINGLES/COUPLES'])
pr1
Ttest_indResult(statistic=-, pvalue=-e-20)
The result shows a statistically significant difference between Midage singles/couples and Young singles/couples customers, since the p-value is less than 0.05 (t = 9.158).
customer_data['LIFESTAGE'].value_counts()
OLDER SINGLES/COUPLES     50793
RETIREES                  46431
OLDER FAMILIES            45158
YOUNG FAMILIES            40494
YOUNG SINGLES/COUPLES     33969
MIDAGE SINGLES/COUPLES    23398
NEW FAMILIES               6497
Name: LIFESTAGE, dtype: int64
# Exploring the differences between RETIREES and OLDER FAMILIES
pr2 = stats.ttest_ind(customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'RETIREES'],
customer_data['TOT_SALES'][customer_data['LIFESTAGE'] == 'OLDER FAMILIES'])
pr2
Ttest_indResult(statistic=-, pvalue=-e-09)
The result shows a statistically significant difference between RETIREES and OLDER FAMILIES customers, since the p-value is less than 0.05 (t = 6.106).
Performing apriori analysis on the buying patterns of different brands in our dataset
We use this technique to find out whether there are brands that these customer segments prefer more than others, or whose buying patterns go together.
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Work on a copy to avoid SettingWithCopyWarning on the assignments below
df = customer_data[['LIFESTAGE', 'TXN_ID', 'BRAND', 'TOT_SALES']].copy()
df['PROD_NAME'] = df['BRAND'].str.strip()
df.dropna(axis=0, subset=['TXN_ID'], inplace=True)
df['TXN_ID'] = df['TXN_ID'].astype('str')
# Build a TXN_ID x BRAND basket of summed sales per transaction
basket = (df.groupby(['TXN_ID', 'BRAND'])['TOT_SALES'].sum()
            .unstack().reset_index().fillna(0).set_index('TXN_ID'))

# Encode sales totals as 0/1 flags for the apriori algorithm
def encode_0_1(x):
    return 1 if x > 0 else 0

basket_sets = basket.applymap(encode_0_1)
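# An equivalent vectorized encoding (sketch): flag any positive sales total as 1.
# basket_sets = (basket > 0).astype(int)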
frequent_itemsets = apriori(basket_sets,min_support=0.01, use_colnames=True)
frequent_itemsets
    support     itemsets
0         -        (CCs)
1         -    (Cheetos)
2         -   (Cheezels)
3         -       (Cobs)
4         -     (Dorito)
5         -    (Doritos)
6         -      (Grain)
7         -  (Infuzions)
8         -     (Infzns)
9         -     (Kettle)
10        -    (Natural)
11        -   (Pringles)
12        -        (RRD)
13        -      (Smith)
14        -     (Smiths)
15        -      (Thins)
16        -   (Tostitos)
17        -   (Twisties)
18        -   (Tyrrells)
19        -         (WW)
From these frequent itemsets we can see which brands meet the minimum support threshold across transactions.
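As a follow-up sketch (the association_rules import above is otherwise unused; with only single-brand itemsets at this support threshold the rules table comes back empty, so a lower min_support would be needed to surface brand pairs):
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)
rules.head()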