Apurv Jain | Freelancer Customer Churn Prediction Using Rfm Analysis

customer churn prediction using RFM analysis

rn-analysis-using-rfm-segmentation August 8, 2024 [1]: import math import numpy as np import pandas as pd import matplotlib.pyplot as plt from datetime import datetime from matplotlib.dates import date2num import matplotlib.dates as mdates import datetime as dt import seaborn as sns import itertools import imageio import os import plotly.express as px import plotly.graph_objects as go C:\Users\Lenovo\anaconda3\lib\sitepackages\pandas\core\computation\expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed). from pandas.core.computation.check import NUMEXPR_INSTALLED [4]: # Import date class from datetime module from datetime import date # Returns the current local date today = date.today() print("Today date is: ", today) type(today) Today date is: - [4]: datetime.date [2]: df = pd.read_csv(r"E:\IIT Delhi\HotelCustomersDataset.tsv", sep = '\t') [3]: df.head() [3]: 0 ID Nationality 1 PRT Age 51.0 DaysSinceCreation 150 1 \ 1 2 3 4 2 3 4 5 PRT DEU FRA FRA NaN- - NameHash 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375… 0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE… 0x31C5E4B74E-FDB724AD578C02C4A723F4BA2B… 0xFF534C83C0EF23D1CE516BC80A65D-D27937D4… 0x9C1DEF02C9BE242842C1C1ABF2C5AA249A1EEB4763B4… \ 0 1 2 3 4 0 1 2 3 4 DocIDHash 0x-B729F7A7ABBED6C781A84CA4274D571003AC… 0x5FA1E0098A-C5A6B9FE9D49FD6DD47CCE7C26… 0xC7CF344F5B-B1337AC905CA188F1B5B3A5… 0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E82… 0xE175754CF77247B202DD0820F49407C762C14A603B3A… AverageLeadTime- 0 1 2 3 4 LodgingRevenue- SRMediumFloor 0 0 0 0 0 0 1 2 3 4 SRBathtub 0 0 0 0 0 0 1 2 3 4 OtherRevenue- SRShower 0 0 0 0 0 SRAwayFromElevator 0 0 0 0 0 SRCrib 0 0 0 0 0 BookingsCanceled 1 0 0 0 0 SRKingSizeBed 0 0 0 0 0 SRNoAlcoholInMiniBar 0 0 0 0 0 [4]: round(df.isnull().sum()/len(df.index)*100,- 2 \ SRTwinBed SRNearElevator- SRQuietRoom 0 0 0 0 0 [5 rows x 31 columns] [4]: ID Nationality Age … … … … … … \ \ DaysSinceCreation NameHash DocIDHash AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay DaysSinceFirstStay DistributionChannel MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom dtype: float64 - [5]: df.drop('NameHash', axis=1, inplace=True) df.drop('DocIDHash', axis=1, inplace=True) 0.1 Data Cleaning [6]: df.shape [6]: (83590, 29) [7]: df.nunique() [7]: ID Nationality Age DaysSinceCreation AverageLeadTime LodgingRevenue - OtherRevenue BookingsCanceled BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay DaysSinceFirstStay DistributionChannel MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom dtype: int64 - [8]: df.info() RangeIndex: 83590 entries, 0 to 83589 Data columns (total 29 columns): # Column Non-Null Count --- ------------------0 ID 83590 non-null 1 Nationality 83590 non-null 2 Age 79811 non-null 3 DaysSinceCreation 83590 non-null 83590 non-null 4 AverageLeadTime 5 LodgingRevenue 83590 non-null 6 OtherRevenue 83590 non-null 7 BookingsCanceled 83590 non-null 8 BookingsNoShowed 83590 non-null 9 BookingsCheckedIn 83590 non-null 10 PersonsNights 83590 non-null 11 RoomNights 83590 non-null 12 DaysSinceLastStay 83590 non-null 13 DaysSinceFirstStay 83590 non-null 14 DistributionChannel 83590 non-null 4 Dtype ----int64 object float64 int64 int64 float64 float64 int64 int64 int64 int64 int64 int64 int64 object 15 MarketSegment 83590 non-null 16 SRHighFloor 83590 non-null 17 SRLowFloor 83590 non-null 83590 non-null 18 SRAccessibleRoom 19 SRMediumFloor 83590 non-null 20 SRBathtub 83590 non-null 21 SRShower 83590 non-null 22 SRCrib 83590 non-null 23 SRKingSizeBed 83590 non-null 24 SRTwinBed 83590 non-null 25 SRNearElevator 83590 non-null 26 SRAwayFromElevator 83590 non-null 27 SRNoAlcoholInMiniBar 83590 non-null 28 SRQuietRoom 83590 non-null dtypes: float64(3), int64(23), object(3) memory usage: 18.5+ MB object int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 [9]: df.describe() [9]: count mean std min 25% 50% 75% max ID- Age- - count mean std min 25% 50% 75% max LodgingRevenue- count mean std min 25% 50% 75% BookingsCheckedIn- DaysSinceCreation- OtherRevenue- BookingsCanceled- PersonsNights- … … … … … … … … 5 AverageLeadTime- - \ BookingsNoShowed- SRMediumFloor- SRBathtub- \ \ max - count mean std min 25% 50% 75% max SRShower- count mean std min 25% 50% 75% max SRNearElevator- - SRCrib- … - SRKingSizeBed- SRAwayFromElevator- - SRTwinBed- SRNoAlcoholInMiniBar- \ SRQuietRoom- [8 rows x 26 columns] 0.2 Outlier Treatment [10]: q1 = df['DaysSinceLastStay'].quantile(0.25) q3 = df['DaysSinceLastStay'].quantile(0.75) iqr = q3-q1 outer_bound = q3+1.5*iqr lower_bound = q1-1.5*iqr df = df[(df['DaysSinceLastStay']>lower_bound) & (df['DaysSinceLastStay'] <␣ ↪outer_bound)] df.shape [10]: (83590, 29) [11]: q1 = df['Age'].quantile(0.25) q3 = df['Age'].quantile(0.75) iqr = q3-q1 outer_bound = q3+1.5*iqr lower_bound = q1-1.5*iqr df = df[(df['Age']>lower_bound) & (df['Age'] < outer_bound)] df.shape [11]: (79782, 29) 6 0.3 Univariate and Bivariate analysis [12]: fig, ax = plt.subplots(figsize=(9, 6)) sns.kdeplot(df['Age'], shade=True, color='#AB49A2', alpha=0.8, linewidth=3,␣ ↪edgecolor='black') sns.despine() fig.set_facecolor('#F5F5F5') ax.set_facecolor('#F5F5F5') ax.set_ylabel('') ax.set_xlabel('Age') ax.grid(False) # ax.yaxis.set_ticklabels([]) ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10)) ax.text(60, 0.025, 'Age Distribution.', fontsize=23, fontweight='bold',␣ ↪fontfamily='monospace') ax.text(60, 0.0222, 'Most of the customers is \nbetween 30 and 60 years old.',␣ ↪fontsize=13, fontfamily='monospace') plt.show() [13]: df = df[df['Age'] > 0] # Remove negative & null values df = df[df['Age'] <= 100] df = df[df['AverageLeadTime'] >= 0] #df = df[df['DaysSinceLastStay'] >= 0] 7 df["Total_revenue"]=df["LodgingRevenue"]+df["OtherRevenue"]## getting total␣ ↪revenue # Creating column age group for further analysis df['Age_group']=pd.cut(df. ↪Age,bins=[0,15,35,50,65,80,100],labels=["0-15","16-35","36-50","51-65","65-80","80+"]) # Add Canceled Booking Percentage Column df['Percent Canceled'] = df['BookingsCanceled']/ ↪(df['BookingsCanceled']+df['BookingsNoShowed']+ ␣ ↪df['BookingsCheckedIn'])*100 # Add Successfull bookings Column df['Total Successful Bookings'] = df['BookingsNoShowed']+df['BookingsCheckedIn'] df['extra_Amenities'] =␣ ↪df['SRAccessibleRoom']+df['SRAwayFromElevator']+df['SRBathtub']+df['SRCrib']␣ ↪+ df['SRHighFloor']+\ ␣ ↪df['SRKingSizeBed']+df['SRLowFloor']+df['SRMediumFloor']+df['SRNearElevator']+df['SRNoAlcoho df['SRQuietRoom']+df['SRShower']+df['SRTwinBed'] df.head() [13]: 0 2 3 4 5 ID Nationality 1 PRT 3 DEU 4 FRA 5 FRA 6 JPN 0 2 3 4 5 OtherRevenue- 0 2 3 4 5 SRTwinBed 0 0 0 0 0 Age- DaysSinceCreation- BookingsCanceled 1 0 0 0 0 SRNearElevator 0 0 0 0 0 SRQuietRoom Total_revenue- AverageLeadTime- BookingsNoShowed 0 0 0 0 0 SRAwayFromElevator 0 0 0 0 0 Age_group- BookingsCheckedIn 3 0 1 0 1 … … … … … … SRNoAlcoholInMiniBar 0 0 0 0 0 \ Percent Canceled 25.0 NaN 0.0 8 LodgingRevenue- \ \ \ 4 5 0 2 3 4 5 0 0 0.0 254.0 Total Successful Bookings 3 0 1 0 1 - NaN 0.0 extra_Amenities 0 0 0 0 0 [5 rows x 34 columns] [14]: # count plot on two categorical variable plt.figure(figsize=(12, 10)) fig = sns.countplot(x ='Nationality', data = df, order=df['Nationality']. ↪value_counts(sort=True).index[:20] , facecolor = 'lightseagreen') for i in fig.patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.1, \ str(round((i.get_height()), 2)), fontsize=10, color='black', rotation=0) plt.title('# of Visitors from diff Countries(Top20)', fontsize =15, fontweight␣ ↪= 'bold') # Show the plot plt.show() 9 [15]: top_nations = (df.groupby('Nationality').size()>500) ## Filetring on top ␣ ↪countries with no of visitors >500 top_nations = top_nations.index[top_nations == True] top_nations = list(top_nations) df1 = df[df['Nationality'].isin(top_nations)] df1.shape [15]: (73812, 34) [16]: q1 = df1['Total_revenue'].quantile(0.25) q3 = df1['Total_revenue'].quantile(0.75) iqr = q3-q1 outer_bound = q3+1.5*iqr lower_bound = q1-1.5*iqr df1 = df1[(df1['Total_revenue']>lower_bound) & (df1['Total_revenue'] <␣ ↪outer_bound)] df1.shape 10 [16]: (70511, 34) [17]: df1.Nationality.nunique() [17]: 23 [18]: avg_by_age = df1.groupby('Age_group').mean() sum_by_age = df1.groupby('Age_group').sum() avg_by_nat = df1.groupby('Nationality').mean() [19]: nation_age_group = pd.crosstab( df1['Nationality'] , df1['Age_group'], margins=␣ ↪True, margins_name='Total') nation_age_group [19]: Age_group Nationality AUS AUT BEL BRA CAN CHE CHN DEU DNK ESP FIN FRA GBR IRL ISR ITA NLD NOR POL PRT RUS SWE USA Total 0-15 16-35 36-50 51-65 65-80 80+ Total - - - - - - - [20]: df_age_group = df1.groupby(['Age_group'])['Total_revenue']. ↪agg(['count','mean','median']) df_age_group [20]: count mean median Age_group 11 -+ - - - [21]: df_age_group['Age_group'] = df_age_group.index df_age_group.index = list(range(0,6)) df_age_group [21]: 0 1 2 3 4 5 count- mean- median Age_group-+ [22]: fig, ax1 = plt.subplots(figsize = (10,8)) ax1.plot( df_age_group['Age_group'], df_age_group['count'], color = 'blue',␣ ↪linewidth = 3) ax2 = ax1.twinx() ax2.plot(df_age_group['Age_group'], df_age_group['mean'], color = 'red',␣ ↪linewidth = 3) ax3 = ax1.twinx() ax3.plot(df_age_group['Age_group'], df_age_group['median'], color = 'green',␣ ↪linewidth = 3) ax3.spines['right'].set_position(('outward', 60)) ax.set_xlabel('# of visitors', color = 'blue') ax.set_ylabel('Mean' ,color = 'red') ax2.set_ylabel('Median', color = 'green') plt.show() 12 [ ]: [23]: df_mkt_segment = df1.groupby(['MarketSegment'])['Total_revenue']. ↪agg(['count','mean','median']) df_mkt_segment [23]: MarketSegment Aviation Complementary Corporate Direct Groups Other Travel Agent/Operator count mean median - - - [24]: df1.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 34 columns): # Column Non-Null Count --- ------------------- 13 Dtype ----- 0 ID 70511 non-null int64 1 Nationality 70511 non-null object 2 Age 70511 non-null float- non-null int64 3 DaysSinceCreation 4 AverageLeadTime 70511 non-null int64 5 LodgingRevenue 70511 non-null float64 6 OtherRevenue 70511 non-null float64 7 BookingsCanceled 70511 non-null int64 8 BookingsNoShowed 70511 non-null int64 9 BookingsCheckedIn 70511 non-null int64 10 PersonsNights 70511 non-null int64 11 RoomNights 70511 non-null int64 12 DaysSinceLastStay 70511 non-null int64 13 DaysSinceFirstStay 70511 non-null int64 14 DistributionChannel 70511 non-null object 15 MarketSegment 70511 non-null object 16 SRHighFloor 70511 non-null int64 17 SRLowFloor 70511 non-null int64 18 SRAccessibleRoom 70511 non-null int64 19 SRMediumFloor 70511 non-null int64 20 SRBathtub 70511 non-null int64 21 SRShower 70511 non-null int64 22 SRCrib 70511 non-null int64 23 SRKingSizeBed 70511 non-null int64 24 SRTwinBed 70511 non-null int64 25 SRNearElevator 70511 non-null int64 26 SRAwayFromElevator 70511 non-null int64 27 SRNoAlcoholInMiniBar 70511 non-null int64 28 SRQuietRoom 70511 non-null int64 29 Total_revenue 70511 non-null float64 30 Age_group 70511 non-null category 31 Percent Canceled 52913 non-null float64 32 Total Successful Bookings 70511 non-null int64 33 extra_Amenities 70511 non-null int64 dtypes: category(1), float64(5), int64(25), object(3) memory usage: 20.9+ MB [25]: fig, ax = plt.subplots(figsize=(8,6)) sns.distplot(df1['Total Successful Bookings'],color = 'orange' ,hist = True) sns.despine() ax.set_ylabel('') ax.set_xlabel('Bookings') ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10)) ax.text(2, 1, 'Distribution of Bookings', fontsize=15, fontweight='bold',␣ ↪fontfamily='monospace') plt.show() 14 C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) [26]: fig, ax = plt.subplots(figsize=(10, 8)) sns.distplot(df1['AverageLeadTime'], color='#AB49A2', hist = True) sns.despine() fig.set_facecolor('#F5F5F5') ax.set_facecolor('#F5F5F5') ax.set_ylabel('') ax.set_xlabel('Days') ax.grid(False) ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10)) ax.text(250, 0.0150, 'Average lead time.', fontsize=23, fontweight='bold',␣ ↪fontfamily='monospace') plt.show() C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: 15 FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) [27]: fig, ax = plt.subplots(figsize=(10,8)) sns.distplot(df1['DaysSinceLastStay'], hist = True) sns.despine() fig.set_facecolor('ivory') ax.set_facecolor('ivory') ax.set_ylabel('') ax.set_xlabel('Days') ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10)) ax.text(200, 0.0085, 'Distribution of Days Since Last Stayed', fontsize=15,␣ ↪fontweight='bold', fontfamily='monospace') plt.show() C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a 16 future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) [ ]: [28]: import plotly.express as px [29]: df.nunique() [29]: ID Nationality Age DaysSinceCreation AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled - 17 BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay DaysSinceFirstStay DistributionChannel MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom Total_revenue Age_group Percent Canceled Total Successful Bookings extra_Amenities dtype: int64 - [30]: plt.figure(figsize=(20, 18)) plt.subplot(2,4,1) px.box( df1, y = 'AverageLeadTime') plt.xlabel("AverageLeadTime",fontsize=15) plt.subplot(2,4,2) sns.boxplot( y = 'LodgingRevenue', data =df1) plt.xlabel("LodgingRevenue",fontsize=15) plt.subplot(2,4,3) sns.boxplot( y = 'OtherRevenue', data =df1) plt.xlabel("OtherRevenue",fontsize=15) plt.subplot(2,4,4) sns.boxplot( y = 'DaysSinceCreation', data =df1) plt.xlabel("DaysSinceCreation",fontsize=15) 18 plt.subplot(2,4,5) sns.boxplot( y = 'BookingsCheckedIn', data =df1) plt.xlabel("BookingsCheckedIn",fontsize=15) plt.subplot(2,4,6) sns.boxplot( y = 'RoomNights', data =df1) plt.xlabel("RoomNights",fontsize=15) plt.subplot(2,4,7) sns.boxplot( y = 'DaysSinceFirstStay', data =df1) plt.xlabel("DaysSinceFirstStay",fontsize=15) plt.subplot(2,4,8) sns.boxplot( y = 'Age', data =df1) plt.xlabel("Age",fontsize=15) plt.show() 19 [ ]: [31]: ## Top 10 countries with highest customers df1['Nationality'].value_counts().sort_values(ascending=False) [31]: FRA DEU GBR PRT ESP USA ITA BEL BRA NLD CHE IRL - CAN 1481 AUT 1411 SWE 1153 ISR 864 CHN 854 NOR 732 POL 731 AUS 690 FIN 613 DNK 588 RUS 552 Name: Nationality, dtype: int64 [32]: df1.MarketSegment.value_counts() [32]: Other 40562 Travel Agent/Operator 10591 Direct 9497 Groups 7666 Corporate 1658 Complementary 397 Aviation 140 Name: MarketSegment, dtype: int64 [33]: df1.groupby(['Nationality'])['Total_revenue'].agg('mean').head() [33]: Nationality AUS- AUT- BEL- BRA- CAN- Name: Total_revenue, dtype: float64 [34]: Countries_df= df1. ↪pivot_table(values=['Total_revenue'],index=['Nationality'],aggfunc=np.mean) Countries_df = Countries_df.sort_values(by='Total_revenue',ascending=False) Countries_df [34]: Total_revenue Nationality SWE NOR BEL DNK POL NLD - 21 AUT GBR CHE FRA DEU ITA IRL RUS FIN ESP AUS BRA CHN ISR USA PRT CAN - [35]: fig = Countries_df[:10].plot(kind='bar',figsize=(12,8), width=0.8,edgecolor='g') for i in fig .patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.3, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("Avg Spendings of Highest Speding countries (Top 10)",fontsize= 15) plt.show() 22 [36]: freq_visitors = df1['Nationality'].value_counts().head(10) fig = px.pie( values = freq_visitors.values,names = freq_visitors.keys(),␣ ↪title='Most Frequent Visitors') fig.show() [37]: freq_vistors_age_group = df1['Age_group'].value_counts() fig = px.pie( values = freq_vistors_age_group.values,names =␣ ↪freq_vistors_age_group.keys(), title='Most Frequent Visitors') fig.show() [38]: Age_df= df1.pivot_table(values=['Total_revenue'],index=['Age_group'],aggfunc=np. ↪mean) Age_df = Age_df.sort_values(by='Total_revenue',ascending=False) Age_df [38]: Total_revenue Age_group-+ - 23 - - [39]: fig = Age_df.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g') for i in fig .patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.3, \ str(round((i.get_height()), 2)), fontsize=12, color='black', rotation=0) plt.title("Avg Spendings of diff Age Groups",fontsize= 15, fontweight = 'bold') plt.show() [40]: df1.info() 24 Int64Index: 70511 entries, 0 to 83589 Data columns (total 34 columns): Non-Null Count Dtype # Column --- ------------------- ----0 ID 70511 non-null int64 1 Nationality 70511 non-null object 2 Age 70511 non-null float64 3 DaysSinceCreation 70511 non-null int64 4 AverageLeadTime 70511 non-null int64 5 LodgingRevenue 70511 non-null float64 6 OtherRevenue 70511 non-null float64 7 BookingsCanceled 70511 non-null int64 8 BookingsNoShowed 70511 non-null int64 9 BookingsCheckedIn 70511 non-null int64 10 PersonsNights 70511 non-null int64 11 RoomNights 70511 non-null int64 12 DaysSinceLastStay 70511 non-null int64 13 DaysSinceFirstStay 70511 non-null int64 14 DistributionChannel 70511 non-null object 15 MarketSegment 70511 non-null object 16 SRHighFloor 70511 non-null int64 17 SRLowFloor 70511 non-null int64 18 SRAccessibleRoom 70511 non-null int64 19 SRMediumFloor 70511 non-null int64 20 SRBathtub 70511 non-null int64 21 SRShower 70511 non-null int64 22 SRCrib 70511 non-null int64 23 SRKingSizeBed 70511 non-null int64 24 SRTwinBed 70511 non-null int64 25 SRNearElevator 70511 non-null int64 26 SRAwayFromElevator 70511 non-null int64 27 SRNoAlcoholInMiniBar 70511 non-null int64 28 SRQuietRoom 70511 non-null int64 29 Total_revenue 70511 non-null float64 30 Age_group 70511 non-null category 31 Percent Canceled 52913 non-null float64 32 Total Successful Bookings 70511 non-null int64 33 extra_Amenities 70511 non-null int64 dtypes: category(1), float64(5), int64(25), object(3) memory usage: 20.9+ MB [41]: df1.extra_Amenities.describe() [41]: count mean std - 25 min-%-%-%- max- Name: extra_Amenities, dtype: float64 [42]: plt.figure(figsize=(8,6)) figx=sns.barplot(x='Age_group', y='extra_Amenities',estimator=np.mean, data=df1) plt.xlabel("Age Groups",fontsize=18,color='black') plt.ylabel("Ameneties",fontsize=18,color='black') plt.title("Age Groups vs avg Ameneties",fontsize=20,color='black') # for i in figx.patches: # # get_x pulls left or right; get_height pushes up or down # figx.text(i.get_x()+.01, i.get_height()+0.008, \ # str(round((i.get_height()), 2)), fontsize=18, color='black', # rotation=0) plt.show() 26 [43]: Avg_lead_time= df1. ↪pivot_table(values=['AverageLeadTime'],index=['Age_group'],aggfunc=np.mean) Avg_lead_time = Avg_lead_time.sort_values(by='AverageLeadTime',ascending=False) Avg_lead_time [43]: AverageLeadTime Age_group- - [44]: fig = Avg_lead_time.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g') for i in fig .patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.3, \ str(round((i.get_height()), 2)), fontsize=12, color='black', rotation=0) plt.title("Highest Avg Lead Time for diff Age Groups (Top 10)",fontsize= 12) plt.show() 27 [45]: # Percent Canceled Booking_Cancelled= df1.pivot_table(values=['Percent␣ ↪Canceled'],index=['Age_group'],aggfunc=np.mean) Booking_Cancelled = Booking_Cancelled.sort_values(by='Percent Canceled',␣ ↪ascending = False) Booking_Cancelled [45]: Percent Canceled Age_group- - 28 0-15 80+ - [46]: df1.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 34 columns): # Column Non-Null Count Dtype --- ------------------- ----0 ID 70511 non-null int64 1 Nationality 70511 non-null object 2 Age 70511 non-null float64 3 DaysSinceCreation 70511 non-null int64 4 AverageLeadTime 70511 non-null int64 5 LodgingRevenue 70511 non-null float64 6 OtherRevenue 70511 non-null float64 7 BookingsCanceled 70511 non-null int64 8 BookingsNoShowed 70511 non-null int64 9 BookingsCheckedIn 70511 non-null int64 10 PersonsNights 70511 non-null int64 11 RoomNights 70511 non-null int64 12 DaysSinceLastStay 70511 non-null int64 13 DaysSinceFirstStay 70511 non-null int64 14 DistributionChannel 70511 non-null object 15 MarketSegment 70511 non-null object 16 SRHighFloor 70511 non-null int64 17 SRLowFloor 70511 non-null int64 18 SRAccessibleRoom 70511 non-null int64 19 SRMediumFloor 70511 non-null int64 20 SRBathtub 70511 non-null int64 21 SRShower 70511 non-null int64 22 SRCrib 70511 non-null int64 23 SRKingSizeBed 70511 non-null int64 24 SRTwinBed 70511 non-null int64 25 SRNearElevator 70511 non-null int64 26 SRAwayFromElevator 70511 non-null int64 27 SRNoAlcoholInMiniBar 70511 non-null int64 28 SRQuietRoom 70511 non-null int64 29 Total_revenue 70511 non-null float64 30 Age_group 70511 non-null category 52913 non-null float64 31 Percent Canceled 32 Total Successful Bookings 70511 non-null int64 33 extra_Amenities 70511 non-null int64 dtypes: category(1), float64(5), int64(25), object(3) memory usage: 20.9+ MB 29 [47]: plt.figure(figsize=(15,12),dpi=100) fig2 = sns.barplot(hue='MarketSegment', y='AverageLeadTime',␣ ↪x='Age_group',data= df1, estimator=np.mean) # for i in fig2 .patches: # #get_x pulls left or right; get_height pushes up or down # fig2.text(i.get_x()+.01, i.get_height()+0.3, \ # str(round((i.get_height()), 0)), fontsize=10, color='black', # rotation=0) plt.title("Avg lead time according to Market Segment and diff Age Groups",␣ ↪size= 15) plt.show() [48]: plt.figure(figsize=(8,6),dpi=100) fig2 = sns.barplot(x='MarketSegment', y='ID',data= df1, estimator=len) for i in fig2 .patches: # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.3, \ 30 str(round((i.get_height()), 0)), fontsize=12, color='black', rotation=0) plt.xticks(rotation =90) plt.title("frequency of request by market segment") plt.show() [49]: no_show =df1['BookingsNoShowed'].sum() cancel = df1['BookingsCanceled'].sum() done = df1['BookingsCheckedIn'].sum() status = ['Did not show', 'Cancelled', 'Checked_in'] values = [no_show, cancel, done] 31 [50]: plt.figure(figsize=(15,12),dpi=100) fig2 = sns.barplot(x='BookingsNoShowed', y='ID', hue='Age_group',data= df1,␣ ↪estimator=len) for i in fig2 .patches: #get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.3, \ str(round((i.get_height()), 0)), fontsize=12, color='black', rotation=0) plt.title("frequency of BookingsNoShowed of diff Age Groups") plt.show() posx posx posx posx posx posx posx posx posx posx and and and and and and and and and and posy posy posy posy posy posy posy posy posy posy should should should should should should should should should should be be be be be be be be be be finite finite finite finite finite finite finite finite finite finite values values values values values values values values values values 32 [51]: plt.figure(figsize=(10,8)) figx = sns.barplot(data=avg_by_age, x=avg_by_age.index, y='RoomNights', color='darkcyan', alpha=0.85, linewidth=2, edgecolor='black',␣ ↪estimator=np.mean) plt.xlabel("Age Groups",fontsize=18,color='black') plt.ylabel("Rooms per night",fontsize=18,color='black') plt.title("Avg Rooms per night for diff Age Groups",fontsize=20,color='black') for i in figx.patches: # get_x pulls left or right; get_height pushes up or down figx.text(i.get_x()+.01, i.get_height()+0.003, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.show() 33 [52]: plt.figure(figsize=(10,8)) figx = sns.barplot(data=avg_by_age, x=avg_by_age.index, y='PersonsNights', color='darkorchid', alpha=0.85, linewidth=2, edgecolor='black',␣ ↪estimator=np.mean) plt.xlabel("Age Groups",fontsize=18,color='black') plt.ylabel("No of People",fontsize=18,color='black') plt.title("Average number of people per night by age␣ ↪groups",fontsize=20,color='black') for i in figx.patches: # get_x pulls left or right; get_height pushes up or down figx.text(i.get_x()+.05, i.get_height()+0.003, \ str(round((i.get_height()), 2)), fontsize=18, color='black', rotation=0) plt.show() 34 [53]: dist_channel = df1['DistributionChannel'].value_counts() fig = px.pie( values = dist_channel.values,names = dist_channel.keys(),␣ ↪title='Distribution Channel', hole = 0.4) fig.show() [54]: mkt_segment = df1['MarketSegment'].value_counts() fig = px.pie( values = mkt_segment.values,names = mkt_segment.keys(),␣ ↪title='Market Segment', hole = 0.4) fig.show() [55]: df.nunique() [55]: ID Nationality Age DaysSinceCreation AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay DaysSinceFirstStay DistributionChannel MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom Total_revenue Age_group Percent Canceled Total Successful Bookings extra_Amenities dtype: int64 - 35 [ ]: [56]: binary_cols = ['SRHighFloor', 'SRLowFloor', 'SRAccessibleRoom', 'SRMediumFloor', 'SRBathtub', 'SRShower', 'SRCrib', 'SRKingSizeBed',␣ ↪'SRTwinBed', 'SRNearElevator', 'SRAwayFromElevator',␣ ↪'SRNoAlcoholInMiniBar', 'SRQuietRoom'] [57]: fig, axes = plt.subplots(3, 4, figsize=(15, 12), sharex=True, sharey=True,␣ ↪squeeze=False) # fig.set_facecolor('#F5F5F5') for i, ax in enumerate(axes.flatten()): col = binary_cols[i] counts = df1[col].value_counts() g = ax.bar(counts.index, counts.values, linewidth=2) for p in ax.patches: ax.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., \ p.get_height()), ha = 'center',␣ ↪va = 'top', \ xytext = (0, 10), textcoords =␣ ↪'offset points') ax.set_title(f'{col}') # # # # plt.xticks([0, 1], ['No', 'Yes']) ax.xaxis.grid(False, which='major') ax.yaxis.grid(True, linestyle=':', color='black', dashes=(3, 8)) ax.spines['right'].set_visible(False) ax.spines['top'].set_visible(False) plt.tight_layout() plt.show() 36 [58]: ['SRHighFloor', 'SRLowFloor', 'SRAccessibleRoom', 'SRMediumFloor', 'SRBathtub', 'SRShower', 'SRCrib', 'SRKingSizeBed',␣ ↪'SRTwinBed', 'SRNearElevator', 'SRAwayFromElevator',␣ ↪'SRNoAlcoholInMiniBar', 'SRQuietRoom'] [58]: ['SRHighFloor', 'SRLowFloor', 'SRAccessibleRoom', 'SRMediumFloor', 'SRBathtub', 'SRShower', 'SRCrib', 'SRKingSizeBed', 'SRTwinBed', 'SRNearElevator', 'SRAwayFromElevator', 'SRNoAlcoholInMiniBar', 'SRQuietRoom'] 37 [ ]: [59]: df1['BookingsCanceled'].value_counts() [59]:- Name: BookingsCanceled, dtype: int64 0.4 0.4.1 RFM Analysis for Hotel Customers Segmentation For RFM analysis (Recency, Frequency, Monetary). Columns for references–: For Recency score - DaysSinceLastStay For Frequency Score - Total Successful Bookings For Monetary Score - Total Revenue For Monetary Score [60]: df_rfm = df1.copy() [61]: sns.histplot(x = df1['DaysSinceLastStay']) [61]: 38 Making DaysSinceLastStay = -1 to max. so that it falls in the lowest category and the most recent visitors get the max score [62]: # Change -1 value df_rfm.loc[df_rfm['DaysSinceLastStay'] == -1, 'DaysSinceLastStay'] =␣ ↪df_rfm['DaysSinceLastStay'].max() df_rfm.head() [62]: 0 2 3 4 7 ID Nationality 1 PRT 3 DEU 4 FRA 5 FRA 8 FRA 0 2 3 4 7 OtherRevenue- 0 2 3 4 7 SRTwinBed 0 0 0 0 0 0 2 3 4 7 0 2 3 4 7 Age- DaysSinceCreation- BookingsCanceled 1 0 0 0 0 SRNearElevator 0 0 0 0 0 SRQuietRoom Total_revenue- Total Successful Bookings 3 0 1 0 1 AverageLeadTime- BookingsNoShowed 0 0 0 0 0 SRAwayFromElevator 0 0 0 0 0 Age_group- LodgingRevenue- BookingsCheckedIn 3 0 1 0 1 … … … … … … SRNoAlcoholInMiniBar 0 0 0 0 0 \ Percent Canceled 25.0 NaN 0.0 NaN 0.0 \ \ \ extra_Amenities 0 0 0 0 1 [5 rows x 34 columns] [ ]: Creating 5 bins to categorize the customers with the most recent vistors getting highest number 39 [63]: df_rfm['Recency_score'] = pd.qcut(df_rfm['DaysSinceLastStay'],6,labels =␣ ↪[5,4,3,2,1], duplicates= 'drop') print(df_rfm[['Recency_score', 'DaysSinceLastStay']].head(5),'\n') print(df_rfm['Recency_score'].value_counts()) 0 2 3 4 7 Recency_score 5 1 1 1 1 DaysSinceLastStay- - Name: Recency_score, dtype: int64 Creating 5 bins to categorize the customers with the highest spenders getting highest number [64]: df_rfm['Monetary_score'] = pd.qcut(df_rfm['Total_revenue'],6,labels =␣ ↪[1,2,3,4,5], duplicates='drop') print(df_rfm[['Monetary_score', 'Total_revenue']].head(), '\n') print(df_rfm['Monetary_score'].value_counts()) 0 2 3 4 7 Monetary_score 4 1 3 1 5 Total_revenue- - Name: Monetary_score, dtype: int64 [65]: df_rfm['frequency_score'] = [1 if x<2 else 2 if 2 <=x < 5 else 3 for x in␣ ↪df_rfm['Total Successful Bookings']] df_rfm['frequency_score'] = df_rfm['frequency_score'].astype('category') df_rfm[['frequency_score', 'Total Successful Bookings']].head() 40 [65]: 0 2 3 4 7 frequency_score 2 1 1 1 1 Total Successful Bookings 3 0 1 0 1 [66]: df_rfm['frequency_score'].value_counts() [66]:- Name: frequency_score, dtype: int64 [67]: df_rfm.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 37 columns): # Column Non-Null Count --- ------------------0 ID 70511 non-null 1 Nationality 70511 non-null 2 Age 70511 non-null 3 DaysSinceCreation 70511 non-null 4 AverageLeadTime 70511 non-null 5 LodgingRevenue 70511 non-null 6 OtherRevenue 70511 non-null 7 BookingsCanceled 70511 non-null 8 BookingsNoShowed 70511 non-null 9 BookingsCheckedIn 70511 non-null 10 PersonsNights 70511 non-null 11 RoomNights 70511 non-null 12 DaysSinceLastStay 70511 non-null 13 DaysSinceFirstStay 70511 non-null 14 DistributionChannel 70511 non-null 15 MarketSegment 70511 non-null 16 SRHighFloor 70511 non-null 17 SRLowFloor 70511 non-null 18 SRAccessibleRoom 70511 non-null 19 SRMediumFloor 70511 non-null 20 SRBathtub 70511 non-null 21 SRShower 70511 non-null 22 SRCrib 70511 non-null 23 SRKingSizeBed 70511 non-null 24 SRTwinBed 70511 non-null 25 SRNearElevator 70511 non-null 41 Dtype ----int64 object float64 int64 int64 float64 float64 int64 int64 int64 int64 int64 int64 int64 object object int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 26 SRAwayFromElevator 70511 non-null int64 27 SRNoAlcoholInMiniBar 70511 non-null int64 28 SRQuietRoom 70511 non-null int- non-null float64 29 Total_revenue 30 Age_group 70511 non-null category 31 Percent Canceled 52913 non-null float64 32 Total Successful Bookings 70511 non-null int64 33 extra_Amenities 70511 non-null int64 34 Recency_score 70511 non-null category 35 Monetary_score 70511 non-null category 36 frequency_score 70511 non-null category dtypes: category(4), float64(5), int64(25), object(3) memory usage: 21.1+ MB [68]: df_rfm_final =␣ ↪df_rfm[['ID','Nationality','Age_group','DistributionChannel','DaysSinceLastStay','AverageLea 'Percent Canceled', 'Total Successful␣ ↪Bookings','Total_revenue','MarketSegment','frequency_score', 'Age', 'Monetary_score','Recency_score']] df_rfm_final.head() [68]: 0 2 3 4 7 ID Nationality Age_group 1 PRT 51-65 3 DEU 16-35 4 FRA 51-65 5 FRA 51-65 8 FRA 16-35 0 2 3 4 7 AverageLeadTime- 0 2 3 4 7 Total_revenue- 0 2 3 4 DistributionChannel Corporate Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Percent Canceled 25.0 NaN 0.0 NaN 0.0 DaysSinceLastStay- Total Successful Bookings 3 0 1 0 1 MarketSegment frequency_score Corporate 2 Travel Agent/Operator 1 Travel Agent/Operator 1 Travel Agent/Operator 1 Other 1 Recency_score 5 1 1 1 42 \ \ Age Monetary_score- \ 7 1 [69]: df_rfm_final['Recency_rank'] = df_rfm_final['Recency_score']. ↪rank(ascending=False) df_rfm_final['Frequency_rank'] = df_rfm_final['frequency_score']. ↪rank(ascending=True) df_rfm_final['Monetary_rank'] = df_rfm_final['Monetary_score']. ↪rank(ascending=True) # df.sort_values(by = ['Rank', 'Age'], ascending = [True, False], na_position =␣ ↪'first') # df_rfm_final.sort_values(by = ['frequency_score',␣ ↪'Monetary_score','Recency_score'], ascending= [False, False, False]) df_rfm_final.head() :1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy :2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy :3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [69]: 0 2 3 ID Nationality Age_group 1 PRT 51-65 3 DEU 16-35 4 FRA 51-65 DistributionChannel Corporate Travel Agent/Operator Travel Agent/Operator 43 DaysSinceLastStay- \ 4 7 5 8 FRA FRA - 0 2 3 4 7 AverageLeadTime- 0 2 3 4 7 Total_revenue- 0 2 3 4 7 Recency_score 5 1 1 1 1 Travel Agent/Operator Travel Agent/Operator Percent Canceled 25.0 NaN 0.0 NaN 0.0 Total Successful Bookings 3 0 1 0 1 MarketSegment frequency_score Corporate 2 Travel Agent/Operator 1 Travel Agent/Operator 1 Travel Agent/Operator 1 Other 1 Recency_rank- - Frequency_rank- \ Age Monetary_score- \ Monetary_rank- [70]: # # normalizing the rank of the customers df_rfm_final['Recency_rank_norm'] = round((df_rfm_final['Recency_rank']/ ↪df_rfm_final['Recency_rank'].max())*100,2) df_rfm_final['Frequency_rank_norm'] = round((df_rfm_final['Frequency_rank']/ ↪df_rfm_final['Frequency_rank'].max())*100,2) df_rfm_final['Monetary_rank_norm'] = round((df_rfm_final['Monetary_rank']/ ↪df_rfm_final['Monetary_rank'].max())*100,2) df_rfm_final.drop(['Recency_rank', ↪1, inplace = True) 'Frequency_rank', 'Monetary_rank'], axis =␣ :3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy :4: SettingWithCopyWarning: 44 A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy :5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\Lenovo\anaconda3\lib\site-packages\pandas\core\frame.py:4163: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy [71]: df_rfm_final.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 17 columns): # Column Non-Null Count --- ------------------0 ID 70511 non-null 1 Nationality 70511 non-null 2 Age_group 70511 non-null 3 DistributionChannel 70511 non-null 4 DaysSinceLastStay 70511 non-null 5 AverageLeadTime 70511 non-null 6 Percent Canceled 52913 non-null 7 Total Successful Bookings 70511 non-null 8 Total_revenue 70511 non-null 9 MarketSegment 70511 non-null 10 frequency_score 70511 non-null 11 Age 70511 non-null 12 Monetary_score 70511 non-null 13 Recency_score 70511 non-null 14 Recency_rank_norm 70511 non-null 45 Dtype ----int64 object category object int64 int64 float64 int64 float64 object category float64 category category float64 15 Frequency_rank_norm 70511 non-null float64 16 Monetary_rank_norm 70511 non-null float64 dtypes: category(4), float64(6), int64(4), object(3) memory usage: 10.3+ MB [72]: colnames = ['DaysSinceLastStay','Total Successful Bookings','Total_revenue'] for col in colnames: fig, ax = plt.subplots(figsize=(12,3)) sns.distplot(df_rfm_final[col]) ax.set_title('Distribution of %s' % col) plt.show() C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). 46 C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). [73]: df_rfm_final.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 17 columns): # Column Non-Null Count --- ------------------0 ID 70511 non-null 1 Nationality 70511 non-null 2 Age_group 70511 non-null 3 DistributionChannel 70511 non-null 4 DaysSinceLastStay 70511 non-null 5 AverageLeadTime 70511 non-null 6 Percent Canceled 52913 non-null 47 Dtype ----int64 object category object int64 int64 float64 7 Total Successful Bookings 70511 non-null int64 8 Total_revenue 70511 non-null float64 9 MarketSegment 70511 non-null object 70511 non-null category 10 frequency_score 11 Age 70511 non-null float64 12 Monetary_score 70511 non-null category 13 Recency_score 70511 non-null category 14 Recency_rank_norm 70511 non-null float64 15 Frequency_rank_norm 70511 non-null float64 16 Monetary_rank_norm 70511 non-null float64 dtypes: category(4), float64(6), int64(4), object(3) memory usage: 10.3+ MB [ ]: 0.5 Calculating RFM score RFM score is calculated based upon recency, frequency, monetary value normalize ranks. Based upon this score we divide our customers. Here we rate them on a scale of 5. Formula used for calculating rfm score is : 0.15Recency score + 0.28Frequency score + 0.57 *Monetary score [74]: df_rfm_final['RFM_Score'] = 0.15*df_rfm_final['Recency_rank_norm']+0.28 * \ df_rfm_final['Frequency_rank_norm']+0.57*df_rfm_final['Monetary_rank_norm'] df_rfm_final['RFM_Score'] *= 0.05 df_rfm_final = df_rfm_final.round(2) df_rfm_final[['ID','Nationality', 'Age_group',␣ ↪'DistributionChannel','MarketSegment', 'RFM_Score']].head(15) :1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy :3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy 48 [74]:- ID Nationality Age_group 1 PRT 51-65 3 DEU 16-35 4 FRA 51-65 5 FRA 51-65 8 FRA 16-35 9 FRA 36-50 10 IRL 16-35 11 IRL 16-35 12 FRA 51-65 13 FRA 51-65 14 ESP 36-50 15 ESP 51-65 16 FRA 65-80 17 FRA 65-80 FRA 65-80 18 - RFM_Score- DistributionChannel Corporate Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Direct Direct Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Travel Travel Travel Travel Travel Travel Travel Travel Travel MarketSegment Corporate Agent/Operator Agent/Operator Agent/Operator Other Other Agent/Operator Agent/Operator Agent/Operator Agent/Operator Direct Direct Agent/Operator Agent/Operator Agent/Operator \ [75]: df_rfm_final.describe(percentiles=[0.25,0.45,0.75,.85,.90,0.99]) [75]: count mean std min 25% 45% 50% 75% 85% 90% ID- DaysSinceLastStay- AverageLeadTime- 49 Percent Canceled- \ 99% max - - count mean std min 25% 45% 50% 75% 85% 90% 99% max Total Successful Bookings- count mean std min 25% 45% 50% 75% 85% 90% 99% max Recency_rank_norm- count mean std min 25% 45% 50% 75% 85% 90% 99% max RFM_Score- - Total_revenue- Frequency_rank_norm- 50 Age- - \ Monetary_rank_norm- \ 0.6 Rating Customer based upon the RFM score rfm score >4 : Top Customer 4 > rfm score > 3.5 : High Value Customer 3.5>rfm score >2.5 : Medium value customer 2.5>rfm score>1.5 : Low-value customer rfm score<1.5 :Lost Customer [76]: df_rfm_final['Customer Segment']=df_rfm_final['RFM_Score'].apply(lambda x: ↪'Premium customers' if x>=4 else 'High value customers' if x<4 and x>=3.5 else 'Nuetral customer'␣ ↪if x<3.5 and x>=2.5 else 'Needs attention' if x <2.5 and x>1.5 else 'Likely to Churn␣ ↪out' ) [77]: df_rfm_final['RFM_Score'].describe(percentiles=[.5,.75,.9,.95,.98]) [77]: count- mean- std- min-%-%-%-%-%- max- Name: RFM_Score, dtype: float64 [78]: # count plot on two categorical variable plt.figure(figsize=(11, 7)) fig = sns.countplot(x ='Customer Segment', data = df_rfm_final,␣ ↪order=df_rfm_final['Customer Segment'].value_counts(sort=True).index[:20] , facecolor = 'lightseagreen') for i in fig.patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.1, \ str(round((i.get_height()), 2)), fontsize=15, color='black',␣ ↪fontweight = 'bold', rotation=0) plt.title('Customer Segment', fontsize =20, fontweight = 'bold') plt.xticks(rotation = 90 ,size = 15) plt.xlabel('Customer Segment', fontsize =15) [78]: Text(0.5, 0, 'Customer Segment') 51 [79]: df_rfm_final.head() [79]: 0 2 3 4 7 ID Nationality Age_group 1 PRT 51-65 3 DEU 16-35 4 FRA 51-65 5 FRA 51-65 8 FRA 16-35 0 2 3 4 7 AverageLeadTime- DistributionChannel Corporate Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Percent Canceled 25.0 NaN 0.0 NaN 0.0 DaysSinceLastStay- Total Successful Bookings 3 0 1 0 1 52 \ \ 0 2 3 4 7 0 2 3 4 7 0 2 3 4 7 Total_revenue- Recency_score 5 1 1 1 1 RFM_Score- MarketSegment frequency_score Corporate 2 Travel Agent/Operator 1 Travel Agent/Operator 1 Travel Agent/Operator 1 Other 1 Recency_rank_norm- Frequency_rank_norm- Age Monetary_score- Monetary_rank_norm- \ \ Customer Segment Premium customers Likely to Churn out Nuetral customer Likely to Churn out High value customers [80]: segment = df_rfm_final['Customer Segment'].value_counts() fig = px.pie( values = segment.values,names = segment.keys(), title='Customer␣ ↪Segment', hole = 0.5) fig.show() [81]: pt_cust_seg= df_rfm_final.pivot_table(values=['Total_revenue'],index=['Customer␣ ↪Segment'],aggfunc=np.mean) pt_cust_seg = pt_cust_seg.sort_values(by='Total_revenue',ascending=False) pt_cust_seg [81]: Total_revenue Customer Segment Premium customers High value customers Nuetral customer Needs attention Likely to Churn out - [82]: fig = pt_cust_seg.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g') for i in fig .patches: # get_x pulls left or right; get_height pushes up or down fig.text(i.get_x()+.01, i.get_height()+0.9, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("Avg Spending of diff Segments",fontsize= 16) 53 plt.show() [83]: df_rfm_final['Percent Canceled'].describe(percentiles=[0.5,.75,.85,.9,.95,.99]) [83]: count mean std min 50% - 54 75%-%-%-%-%- max- Name: Percent Canceled, dtype: float64 [84]: plt.figure(figsize=(8,7)) figx=sns.barplot(x= 'Customer Segment', y='Percent Canceled',estimator=np.mean,␣ ↪data= df_rfm_final) plt.title("Percentage of Bookings canceled by diff␣ ↪segments",fontsize=20,color='black') for i in figx.patches: # get_x pulls left or right; get_height pushes up or down figx.text(i.get_x()+.01, i.get_height()+0.001, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.xticks(rotation = 90, size = 10) plt.show() 55 [85]: df_rfm_final.info() Int64Index: 70511 entries, 0 to 83589 Data columns (total 19 columns): # Column Non-Null Count --- ------------------0 ID 70511 non-null 1 Nationality 70511 non-null 2 Age_group 70511 non-null 3 DistributionChannel 70511 non-null 4 DaysSinceLastStay 70511 non-null 56 Dtype ----int64 object category object int64 5 AverageLeadTime 70511 non-null int64 6 Percent Canceled 52913 non-null float64 7 Total Successful Bookings 70511 non-null int- non-null float64 8 Total_revenue 9 MarketSegment 70511 non-null object 10 frequency_score 70511 non-null category 11 Age 70511 non-null float64 12 Monetary_score 70511 non-null category 13 Recency_score 70511 non-null category 14 Recency_rank_norm 70511 non-null float64 15 Frequency_rank_norm 70511 non-null float64 16 Monetary_rank_norm 70511 non-null float64 17 RFM_Score 70511 non-null float64 18 Customer Segment 70511 non-null object dtypes: category(4), float64(7), int64(4), object(4) memory usage: 11.4+ MB [86]: df_rfm_final['frequency_score'] = pd.to_numeric(df_rfm_final['frequency_score']␣ ↪, errors='coerce') df_rfm_final['Recency_score'] = pd.to_numeric(df_rfm_final['Recency_score'] ,␣ ↪errors='coerce') [87]: sns.lmplot(y='frequency_score', x='Recency_score', data=df_rfm_final,␣ ↪fit_reg=False, hue='Customer Segment', legend=False) plt.legend(loc='lower right') [87]: 57 [88]: cust_seg_Nationality = pd.crosstab(df_rfm_final['Nationality'] ,␣ ↪df_rfm_final['Customer Segment'] , margins= True, margins_name='Total') cust_seg_Nationality = pd.DataFrame(cust_seg_Nationality) cust_seg_Nationality['Countries'] = cust_seg_Nationality.index cust_seg_Nationality.index = list(range(0,24)) cust_seg_Nationality['Percent_Premium customers']␣ ↪=cust_seg_Nationality['Premium customers']/cust_seg_Nationality['Total'] cust_seg_Nationality['Percent_Premium customers'] =␣ ↪round(cust_seg_Nationality['Percent_Premium customers']*100,2) cust_seg_Nationality= cust_seg_Nationality[:22] [ ]: [89]: cust_seg_Nationality.sort_values(by='Percent_Premium customers', ascending=␣ ↪False, inplace=True) 58 [90]: plt.figure(figsize=(12,8)) fig2 = sns.barplot(x = 'Countries', y = 'Percent_Premium customers', data =␣ ↪cust_seg_Nationality,estimator=np.mean) for i in fig2 .patches: # # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.1, \ str(round((i.get_height()), 2)), fontsize=10, color='black', rotation=0) plt.title("Percentage of Premium Customers by Countries", size = 15, fontweight␣ ↪= 'bold') plt.show() [91]: cust_seg_age_group = pd.crosstab( df_rfm_final['Age_group'] ,␣ ↪df_rfm_final['Customer Segment'] , margins= True, margins_name='Total') cust_seg_age_group = pd.DataFrame(cust_seg_age_group ) cust_seg_age_group ['Age Group'] = cust_seg_age_group .index cust_seg_age_group .index = list(range(0,7)) 59 cust_seg_age_group ['Percent_Premium customers'] =cust_seg_age_group ['Premium␣ ↪customers']/cust_seg_age_group ['Total'] cust_seg_age_group ['Percent_Premium customers'] = round(cust_seg_age_group␣ ↪['Percent_Premium customers']*100,2) cust_seg_age_group = cust_seg_age_group [:6] cust_seg_age_group.sort_values(by='Percent_Premium customers', ascending=␣ ↪False, inplace=True) cust_seg_age_group [91]: Customer Segment 2 3 1 4 5 0 High value customers- Likely to Churn out- Customer Segment 2 3 1 4 5 0 Nuetral customer- Premium customers- Customer Segment 2 3 1 4 5 0 Percent_Premium customers- Needs attention- Total Age Group- \ [92]: plt.figure(figsize=(10,8)) fig2 = sns.barplot(x = 'Age Group', y = 'Percent_Premium customers', data =␣ ↪cust_seg_age_group,estimator=np.mean) for i in fig2 .patches: # # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.01, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("Percentage of Premium Customers by Age Groups", size = 15,␣ ↪fontweight = 'bold') plt.show() 60 \ [93]: plt.figure(figsize=(15,8),facecolor='y',edgecolor='k') fig2 = sns.barplot(x='Nationality',y='ID',hue='Customer Segment', data =␣ ↪df_rfm_final[(df_rfm_final['Customer Segment']=='Premium customers')] ,␣ ↪estimator= len) for i in fig2 .patches: # # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.001, i.get_height()+0.01, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("# of Premium Customers by Nations", size = 15, fontweight = 'bold') # plt.ylabel("Count of Request id") plt.show() 61 [94]: df_rfm_final.head() [94]: 0 2 3 4 7 ID Nationality Age_group 1 PRT 51-65 3 DEU 16-35 4 FRA 51-65 5 FRA 51-65 8 FRA 16-35 DistributionChannel Corporate Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator 0 2 3 4 7 AverageLeadTime- 0 2 3 4 7 Total_revenue- MarketSegment Corporate Travel Agent/Operator Travel Agent/Operator Travel Agent/Operator Other 0 2 3 Recency_score 5 1 1 Recency_rank_norm- Percent Canceled 25.0 NaN 0.0 NaN 0.0 DaysSinceLastStay- Total Successful Bookings 3 0 1 0 1 frequency_score 2 1 1 1 1 Frequency_rank_norm- 62 \ \ Age Monetary_score- Monetary_rank_norm- \ \ 4 7 0 2 3 4 7 1 1 RFM_Score- - - - Customer Segment Premium customers Likely to Churn out Nuetral customer Likely to Churn out High value customers [95]: plt.figure(figsize=(10,8)) fig2 = sns.barplot(x = 'Customer Segment', y = 'AverageLeadTime', data =␣ ↪df_rfm_final,estimator=np.mean) for i in fig2 .patches: # # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.01, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("AverageLeadTime for diff 'Customer Segment", size = 15, fontweight =␣ ↪'bold') plt.xticks(rotation= 90, size = 12) plt.show() 63 [96]: plt.figure(figsize=(10,8)) fig2 = sns.barplot(x = 'Customer Segment', y = 'Age', data =␣ ↪df_rfm_final,estimator=np.mean) for i in fig2 .patches: # # get_x pulls left or right; get_height pushes up or down fig2.text(i.get_x()+.01, i.get_height()+0.01, \ str(round((i.get_height()), 2)), fontsize=15, color='black', rotation=0) plt.title("Mean Age for diff 'Customer Segment", size = 15, fontweight = 'bold') plt.xticks(rotation= 90, size = 12) plt.show() 64 0.7 Conclusions Most of the customer, bookings and revenue come from European region, which same region with the Hotel location. The customer from the other region probably less frequent to stay at the Hotel. Travel Agent/ Operator is the Distribution Channel mostly used by customers while Electronic Distribution is the least. The number of loyal customer is so low at this point, while the number of Customers categorized by Others and Lost Cheap Customers is very high. Most customers only 1 times stay and the customers who stay frequently is very rare compared to customers who frequently stay at the Hotel 65 0.8 Recommendations Recomendation 1) For the better sales at the future, there are several method for sales & marketing program to be evaluate later: 2) Company can create campaign for those 1 time stay Customer to become loyal customer such as discounted price on holiday occasion or business trip campaign. 3) Since Travel Agent/ Operator is the Distribution Channel most used by customers, company should increase campaign engagement by Travel Agent/ Operator channel. 4) Furthermore, Electronic Distribution is a things people access almost every time, engagement from Electronic Distribution can be useful for gaining engagement by using social media as a platform 5) For future analysis, we need to know the period when the customer usually stay at Hotel, to know the effective timing to start campaign. [97]: #def remove_outlier(df_in, col): # q1 = df_in[col].quantile(0.25) # q3 = df_in[col].quantile(0.75) # iqr = q3-q1 # outer_bound = q3+1.5*iqr # lower_bound = q1-1.5*iqr # df_out = df_in[(df_in[col]>lower_bound) & (df_in[col] < outer_bound)] # return df_out 66