customer churn prediction using RFM analysis
rn-analysis-using-rfm-segmentation
August 8, 2024
[1]: import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib.dates import date2num
import matplotlib.dates as mdates
import datetime as dt
import seaborn as sns
import itertools
import imageio
import os
import plotly.express as px
import plotly.graph_objects as go
C:\Users\Lenovo\anaconda3\lib\sitepackages\pandas\core\computation\expressions.py:20: UserWarning: Pandas requires
version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
from pandas.core.computation.check import NUMEXPR_INSTALLED
[4]: # Import date class from datetime module
from datetime import date
# Returns the current local date
today = date.today()
print("Today date is: ", today)
type(today)
Today date is:
-
[4]: datetime.date
[2]: df = pd.read_csv(r"E:\IIT Delhi\HotelCustomersDataset.tsv", sep = '\t')
[3]: df.head()
[3]:
0
ID Nationality
1
PRT
Age
51.0
DaysSinceCreation
150
1
\
1
2
3
4
2
3
4
5
PRT
DEU
FRA
FRA
NaN-
-
NameHash
0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375…
0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE…
0x31C5E4B74E-FDB724AD578C02C4A723F4BA2B…
0xFF534C83C0EF23D1CE516BC80A65D-D27937D4…
0x9C1DEF02C9BE242842C1C1ABF2C5AA249A1EEB4763B4…
\
0
1
2
3
4
0
1
2
3
4
DocIDHash
0x-B729F7A7ABBED6C781A84CA4274D571003AC…
0x5FA1E0098A-C5A6B9FE9D49FD6DD47CCE7C26…
0xC7CF344F5B-B1337AC905CA188F1B5B3A5…
0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E82…
0xE175754CF77247B202DD0820F49407C762C14A603B3A…
AverageLeadTime-
0
1
2
3
4
LodgingRevenue-
SRMediumFloor
0
0
0
0
0
0
1
2
3
4
SRBathtub
0
0
0
0
0
0
1
2
3
4
OtherRevenue-
SRShower
0
0
0
0
0
SRAwayFromElevator
0
0
0
0
0
SRCrib
0
0
0
0
0
BookingsCanceled
1
0
0
0
0
SRKingSizeBed
0
0
0
0
0
SRNoAlcoholInMiniBar
0
0
0
0
0
[4]: round(df.isnull().sum()/len(df.index)*100,-
2
\
SRTwinBed SRNearElevator-
SRQuietRoom
0
0
0
0
0
[5 rows x 31 columns]
[4]: ID
Nationality
Age
…
…
…
…
…
…
\
\
DaysSinceCreation
NameHash
DocIDHash
AverageLeadTime
LodgingRevenue
OtherRevenue
BookingsCanceled
BookingsNoShowed
BookingsCheckedIn
PersonsNights
RoomNights
DaysSinceLastStay
DaysSinceFirstStay
DistributionChannel
MarketSegment
SRHighFloor
SRLowFloor
SRAccessibleRoom
SRMediumFloor
SRBathtub
SRShower
SRCrib
SRKingSizeBed
SRTwinBed
SRNearElevator
SRAwayFromElevator
SRNoAlcoholInMiniBar
SRQuietRoom
dtype: float64
-
[5]: df.drop('NameHash', axis=1, inplace=True)
df.drop('DocIDHash', axis=1, inplace=True)
0.1
Data Cleaning
[6]: df.shape
[6]: (83590, 29)
[7]: df.nunique()
[7]: ID
Nationality
Age
DaysSinceCreation
AverageLeadTime
LodgingRevenue
-
OtherRevenue
BookingsCanceled
BookingsNoShowed
BookingsCheckedIn
PersonsNights
RoomNights
DaysSinceLastStay
DaysSinceFirstStay
DistributionChannel
MarketSegment
SRHighFloor
SRLowFloor
SRAccessibleRoom
SRMediumFloor
SRBathtub
SRShower
SRCrib
SRKingSizeBed
SRTwinBed
SRNearElevator
SRAwayFromElevator
SRNoAlcoholInMiniBar
SRQuietRoom
dtype: int64
-
[8]: df.info()
RangeIndex: 83590 entries, 0 to 83589
Data columns (total 29 columns):
#
Column
Non-Null Count
--- ------------------0
ID
83590 non-null
1
Nationality
83590 non-null
2
Age
79811 non-null
3
DaysSinceCreation
83590 non-null
83590 non-null
4
AverageLeadTime
5
LodgingRevenue
83590 non-null
6
OtherRevenue
83590 non-null
7
BookingsCanceled
83590 non-null
8
BookingsNoShowed
83590 non-null
9
BookingsCheckedIn
83590 non-null
10 PersonsNights
83590 non-null
11 RoomNights
83590 non-null
12 DaysSinceLastStay
83590 non-null
13 DaysSinceFirstStay
83590 non-null
14 DistributionChannel
83590 non-null
4
Dtype
----int64
object
float64
int64
int64
float64
float64
int64
int64
int64
int64
int64
int64
int64
object
15 MarketSegment
83590 non-null
16 SRHighFloor
83590 non-null
17 SRLowFloor
83590 non-null
83590 non-null
18 SRAccessibleRoom
19 SRMediumFloor
83590 non-null
20 SRBathtub
83590 non-null
21 SRShower
83590 non-null
22 SRCrib
83590 non-null
23 SRKingSizeBed
83590 non-null
24 SRTwinBed
83590 non-null
25 SRNearElevator
83590 non-null
26 SRAwayFromElevator
83590 non-null
27 SRNoAlcoholInMiniBar 83590 non-null
28 SRQuietRoom
83590 non-null
dtypes: float64(3), int64(23), object(3)
memory usage: 18.5+ MB
object
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
[9]: df.describe()
[9]:
count
mean
std
min
25%
50%
75%
max
ID-
Age-
-
count
mean
std
min
25%
50%
75%
max
LodgingRevenue-
count
mean
std
min
25%
50%
75%
BookingsCheckedIn-
DaysSinceCreation-
OtherRevenue-
BookingsCanceled-
PersonsNights-
…
…
…
…
…
…
…
…
5
AverageLeadTime-
-
\
BookingsNoShowed-
SRMediumFloor-
SRBathtub-
\
\
max
-
count
mean
std
min
25%
50%
75%
max
SRShower-
count
mean
std
min
25%
50%
75%
max
SRNearElevator-
-
SRCrib-
…
-
SRKingSizeBed-
SRAwayFromElevator-
-
SRTwinBed-
SRNoAlcoholInMiniBar-
\
SRQuietRoom-
[8 rows x 26 columns]
0.2
Outlier Treatment
[10]: q1 = df['DaysSinceLastStay'].quantile(0.25)
q3 = df['DaysSinceLastStay'].quantile(0.75)
iqr = q3-q1
outer_bound = q3+1.5*iqr
lower_bound = q1-1.5*iqr
df = df[(df['DaysSinceLastStay']>lower_bound) & (df['DaysSinceLastStay'] <␣
↪outer_bound)]
df.shape
[10]: (83590, 29)
[11]: q1 = df['Age'].quantile(0.25)
q3 = df['Age'].quantile(0.75)
iqr = q3-q1
outer_bound = q3+1.5*iqr
lower_bound = q1-1.5*iqr
df = df[(df['Age']>lower_bound) & (df['Age'] < outer_bound)]
df.shape
[11]: (79782, 29)
6
0.3
Univariate and Bivariate analysis
[12]: fig, ax = plt.subplots(figsize=(9, 6))
sns.kdeplot(df['Age'], shade=True, color='#AB49A2', alpha=0.8, linewidth=3,␣
↪edgecolor='black')
sns.despine()
fig.set_facecolor('#F5F5F5')
ax.set_facecolor('#F5F5F5')
ax.set_ylabel('')
ax.set_xlabel('Age')
ax.grid(False)
# ax.yaxis.set_ticklabels([])
ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10))
ax.text(60, 0.025, 'Age Distribution.', fontsize=23, fontweight='bold',␣
↪fontfamily='monospace')
ax.text(60, 0.0222, 'Most of the customers is \nbetween 30 and 60 years old.',␣
↪fontsize=13, fontfamily='monospace')
plt.show()
[13]: df = df[df['Age'] > 0] # Remove negative & null values
df =
df[df['Age'] <= 100]
df = df[df['AverageLeadTime'] >= 0]
#df = df[df['DaysSinceLastStay'] >= 0]
7
df["Total_revenue"]=df["LodgingRevenue"]+df["OtherRevenue"]## getting total␣
↪revenue
# Creating column age group for further analysis
df['Age_group']=pd.cut(df.
↪Age,bins=[0,15,35,50,65,80,100],labels=["0-15","16-35","36-50","51-65","65-80","80+"])
# Add Canceled Booking Percentage Column
df['Percent Canceled'] = df['BookingsCanceled']/
↪(df['BookingsCanceled']+df['BookingsNoShowed']+
␣
↪df['BookingsCheckedIn'])*100
# Add Successfull bookings Column
df['Total Successful Bookings'] = df['BookingsNoShowed']+df['BookingsCheckedIn']
df['extra_Amenities'] =␣
↪df['SRAccessibleRoom']+df['SRAwayFromElevator']+df['SRBathtub']+df['SRCrib']␣
↪+ df['SRHighFloor']+\
␣
↪df['SRKingSizeBed']+df['SRLowFloor']+df['SRMediumFloor']+df['SRNearElevator']+df['SRNoAlcoho
df['SRQuietRoom']+df['SRShower']+df['SRTwinBed']
df.head()
[13]:
0
2
3
4
5
ID Nationality
1
PRT
3
DEU
4
FRA
5
FRA
6
JPN
0
2
3
4
5
OtherRevenue-
0
2
3
4
5
SRTwinBed
0
0
0
0
0
Age-
DaysSinceCreation-
BookingsCanceled
1
0
0
0
0
SRNearElevator
0
0
0
0
0
SRQuietRoom Total_revenue-
AverageLeadTime-
BookingsNoShowed
0
0
0
0
0
SRAwayFromElevator
0
0
0
0
0
Age_group-
BookingsCheckedIn
3
0
1
0
1
…
…
…
…
…
…
SRNoAlcoholInMiniBar
0
0
0
0
0
\
Percent Canceled
25.0
NaN
0.0
8
LodgingRevenue-
\
\
\
4
5
0
2
3
4
5
0
0
0.0
254.0
Total Successful Bookings
3
0
1
0
1
-
NaN
0.0
extra_Amenities
0
0
0
0
0
[5 rows x 34 columns]
[14]: # count plot on two categorical variable
plt.figure(figsize=(12, 10))
fig = sns.countplot(x ='Nationality', data = df, order=df['Nationality'].
↪value_counts(sort=True).index[:20]
, facecolor = 'lightseagreen')
for i in fig.patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.1, \
str(round((i.get_height()), 2)), fontsize=10, color='black',
rotation=0)
plt.title('# of Visitors from diff Countries(Top20)', fontsize =15, fontweight␣
↪= 'bold')
# Show the plot
plt.show()
9
[15]: top_nations = (df.groupby('Nationality').size()>500) ## Filetring on top ␣
↪countries with no of visitors >500
top_nations = top_nations.index[top_nations == True]
top_nations = list(top_nations)
df1 = df[df['Nationality'].isin(top_nations)]
df1.shape
[15]: (73812, 34)
[16]: q1 = df1['Total_revenue'].quantile(0.25)
q3 = df1['Total_revenue'].quantile(0.75)
iqr = q3-q1
outer_bound = q3+1.5*iqr
lower_bound = q1-1.5*iqr
df1 = df1[(df1['Total_revenue']>lower_bound) & (df1['Total_revenue'] <␣
↪outer_bound)]
df1.shape
10
[16]: (70511, 34)
[17]: df1.Nationality.nunique()
[17]: 23
[18]: avg_by_age = df1.groupby('Age_group').mean()
sum_by_age = df1.groupby('Age_group').sum()
avg_by_nat = df1.groupby('Nationality').mean()
[19]: nation_age_group = pd.crosstab( df1['Nationality'] , df1['Age_group'], margins=␣
↪True, margins_name='Total')
nation_age_group
[19]: Age_group
Nationality
AUS
AUT
BEL
BRA
CAN
CHE
CHN
DEU
DNK
ESP
FIN
FRA
GBR
IRL
ISR
ITA
NLD
NOR
POL
PRT
RUS
SWE
USA
Total
0-15
16-35
36-50
51-65
65-80
80+
Total
-
-
-
-
-
-
-
[20]: df_age_group = df1.groupby(['Age_group'])['Total_revenue'].
↪agg(['count','mean','median'])
df_age_group
[20]:
count
mean
median
Age_group
11
-+
-
-
-
[21]: df_age_group['Age_group'] = df_age_group.index
df_age_group.index = list(range(0,6))
df_age_group
[21]:
0
1
2
3
4
5
count-
mean-
median Age_group-+
[22]: fig, ax1 = plt.subplots(figsize = (10,8))
ax1.plot( df_age_group['Age_group'], df_age_group['count'], color = 'blue',␣
↪linewidth = 3)
ax2 = ax1.twinx()
ax2.plot(df_age_group['Age_group'], df_age_group['mean'], color = 'red',␣
↪linewidth = 3)
ax3 = ax1.twinx()
ax3.plot(df_age_group['Age_group'], df_age_group['median'], color = 'green',␣
↪linewidth = 3)
ax3.spines['right'].set_position(('outward', 60))
ax.set_xlabel('# of visitors', color = 'blue')
ax.set_ylabel('Mean' ,color = 'red')
ax2.set_ylabel('Median', color = 'green')
plt.show()
12
[ ]:
[23]: df_mkt_segment = df1.groupby(['MarketSegment'])['Total_revenue'].
↪agg(['count','mean','median'])
df_mkt_segment
[23]:
MarketSegment
Aviation
Complementary
Corporate
Direct
Groups
Other
Travel Agent/Operator
count
mean
median
-
-
-
[24]: df1.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 34 columns):
#
Column
Non-Null Count
--- -------------------
13
Dtype
-----
0
ID
70511 non-null int64
1
Nationality
70511 non-null object
2
Age
70511 non-null float- non-null int64
3
DaysSinceCreation
4
AverageLeadTime
70511 non-null int64
5
LodgingRevenue
70511 non-null float64
6
OtherRevenue
70511 non-null float64
7
BookingsCanceled
70511 non-null int64
8
BookingsNoShowed
70511 non-null int64
9
BookingsCheckedIn
70511 non-null int64
10 PersonsNights
70511 non-null int64
11 RoomNights
70511 non-null int64
12 DaysSinceLastStay
70511 non-null int64
13 DaysSinceFirstStay
70511 non-null int64
14 DistributionChannel
70511 non-null object
15 MarketSegment
70511 non-null object
16 SRHighFloor
70511 non-null int64
17 SRLowFloor
70511 non-null int64
18 SRAccessibleRoom
70511 non-null int64
19 SRMediumFloor
70511 non-null int64
20 SRBathtub
70511 non-null int64
21 SRShower
70511 non-null int64
22 SRCrib
70511 non-null int64
23 SRKingSizeBed
70511 non-null int64
24 SRTwinBed
70511 non-null int64
25 SRNearElevator
70511 non-null int64
26 SRAwayFromElevator
70511 non-null int64
27 SRNoAlcoholInMiniBar
70511 non-null int64
28 SRQuietRoom
70511 non-null int64
29 Total_revenue
70511 non-null float64
30 Age_group
70511 non-null category
31 Percent Canceled
52913 non-null float64
32 Total Successful Bookings 70511 non-null int64
33 extra_Amenities
70511 non-null int64
dtypes: category(1), float64(5), int64(25), object(3)
memory usage: 20.9+ MB
[25]: fig, ax = plt.subplots(figsize=(8,6))
sns.distplot(df1['Total Successful Bookings'],color = 'orange' ,hist = True)
sns.despine()
ax.set_ylabel('')
ax.set_xlabel('Bookings')
ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10))
ax.text(2, 1, 'Distribution of Bookings', fontsize=15, fontweight='bold',␣
↪fontfamily='monospace')
plt.show()
14
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-level
function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
[26]: fig, ax = plt.subplots(figsize=(10, 8))
sns.distplot(df1['AverageLeadTime'], color='#AB49A2', hist = True)
sns.despine()
fig.set_facecolor('#F5F5F5')
ax.set_facecolor('#F5F5F5')
ax.set_ylabel('')
ax.set_xlabel('Days')
ax.grid(False)
ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10))
ax.text(250, 0.0150, 'Average lead time.', fontsize=23, fontweight='bold',␣
↪fontfamily='monospace')
plt.show()
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
15
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-level
function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
[27]: fig, ax = plt.subplots(figsize=(10,8))
sns.distplot(df1['DaysSinceLastStay'], hist = True)
sns.despine()
fig.set_facecolor('ivory')
ax.set_facecolor('ivory')
ax.set_ylabel('')
ax.set_xlabel('Days')
ax.grid(which='both', axis='y', color='black', linestyle=':', dashes=(2,10))
ax.text(200, 0.0085, 'Distribution of Days Since Last Stayed', fontsize=15,␣
↪fontweight='bold', fontfamily='monospace')
plt.show()
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
FutureWarning: `distplot` is a deprecated function and will be removed in a
16
future version. Please adapt your code to use either `displot` (a figure-level
function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
[ ]:
[28]: import plotly.express as px
[29]: df.nunique()
[29]: ID
Nationality
Age
DaysSinceCreation
AverageLeadTime
LodgingRevenue
OtherRevenue
BookingsCanceled
-
17
BookingsNoShowed
BookingsCheckedIn
PersonsNights
RoomNights
DaysSinceLastStay
DaysSinceFirstStay
DistributionChannel
MarketSegment
SRHighFloor
SRLowFloor
SRAccessibleRoom
SRMediumFloor
SRBathtub
SRShower
SRCrib
SRKingSizeBed
SRTwinBed
SRNearElevator
SRAwayFromElevator
SRNoAlcoholInMiniBar
SRQuietRoom
Total_revenue
Age_group
Percent Canceled
Total Successful Bookings
extra_Amenities
dtype: int64
-
[30]: plt.figure(figsize=(20, 18))
plt.subplot(2,4,1)
px.box( df1, y = 'AverageLeadTime')
plt.xlabel("AverageLeadTime",fontsize=15)
plt.subplot(2,4,2)
sns.boxplot( y = 'LodgingRevenue', data =df1)
plt.xlabel("LodgingRevenue",fontsize=15)
plt.subplot(2,4,3)
sns.boxplot( y = 'OtherRevenue', data =df1)
plt.xlabel("OtherRevenue",fontsize=15)
plt.subplot(2,4,4)
sns.boxplot( y = 'DaysSinceCreation', data =df1)
plt.xlabel("DaysSinceCreation",fontsize=15)
18
plt.subplot(2,4,5)
sns.boxplot( y = 'BookingsCheckedIn', data =df1)
plt.xlabel("BookingsCheckedIn",fontsize=15)
plt.subplot(2,4,6)
sns.boxplot( y = 'RoomNights', data =df1)
plt.xlabel("RoomNights",fontsize=15)
plt.subplot(2,4,7)
sns.boxplot( y = 'DaysSinceFirstStay', data =df1)
plt.xlabel("DaysSinceFirstStay",fontsize=15)
plt.subplot(2,4,8)
sns.boxplot( y = 'Age', data =df1)
plt.xlabel("Age",fontsize=15)
plt.show()
19
[ ]:
[31]: ## Top 10 countries with highest customers
df1['Nationality'].value_counts().sort_values(ascending=False)
[31]: FRA
DEU
GBR
PRT
ESP
USA
ITA
BEL
BRA
NLD
CHE
IRL
-
CAN
1481
AUT
1411
SWE
1153
ISR
864
CHN
854
NOR
732
POL
731
AUS
690
FIN
613
DNK
588
RUS
552
Name: Nationality, dtype: int64
[32]: df1.MarketSegment.value_counts()
[32]: Other
40562
Travel Agent/Operator
10591
Direct
9497
Groups
7666
Corporate
1658
Complementary
397
Aviation
140
Name: MarketSegment, dtype: int64
[33]: df1.groupby(['Nationality'])['Total_revenue'].agg('mean').head()
[33]: Nationality
AUS-
AUT-
BEL-
BRA-
CAN-
Name: Total_revenue, dtype: float64
[34]: Countries_df= df1.
↪pivot_table(values=['Total_revenue'],index=['Nationality'],aggfunc=np.mean)
Countries_df = Countries_df.sort_values(by='Total_revenue',ascending=False)
Countries_df
[34]:
Total_revenue
Nationality
SWE
NOR
BEL
DNK
POL
NLD
-
21
AUT
GBR
CHE
FRA
DEU
ITA
IRL
RUS
FIN
ESP
AUS
BRA
CHN
ISR
USA
PRT
CAN
-
[35]: fig = Countries_df[:10].plot(kind='bar',figsize=(12,8), width=0.8,edgecolor='g')
for i in fig .patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.3, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("Avg Spendings of Highest Speding countries (Top 10)",fontsize= 15)
plt.show()
22
[36]: freq_visitors = df1['Nationality'].value_counts().head(10)
fig = px.pie( values = freq_visitors.values,names = freq_visitors.keys(),␣
↪title='Most Frequent Visitors')
fig.show()
[37]: freq_vistors_age_group = df1['Age_group'].value_counts()
fig = px.pie( values = freq_vistors_age_group.values,names =␣
↪freq_vistors_age_group.keys(), title='Most Frequent Visitors')
fig.show()
[38]: Age_df= df1.pivot_table(values=['Total_revenue'],index=['Age_group'],aggfunc=np.
↪mean)
Age_df = Age_df.sort_values(by='Total_revenue',ascending=False)
Age_df
[38]:
Total_revenue
Age_group-+
-
23
-
-
[39]: fig = Age_df.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g')
for i in fig .patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.3, \
str(round((i.get_height()), 2)), fontsize=12, color='black',
rotation=0)
plt.title("Avg Spendings of diff Age Groups",fontsize= 15, fontweight = 'bold')
plt.show()
[40]: df1.info()
24
Int64Index: 70511 entries, 0 to 83589
Data columns (total 34 columns):
Non-Null Count Dtype
#
Column
--- ------------------- ----0
ID
70511 non-null int64
1
Nationality
70511 non-null object
2
Age
70511 non-null float64
3
DaysSinceCreation
70511 non-null int64
4
AverageLeadTime
70511 non-null int64
5
LodgingRevenue
70511 non-null float64
6
OtherRevenue
70511 non-null float64
7
BookingsCanceled
70511 non-null int64
8
BookingsNoShowed
70511 non-null int64
9
BookingsCheckedIn
70511 non-null int64
10 PersonsNights
70511 non-null int64
11 RoomNights
70511 non-null int64
12 DaysSinceLastStay
70511 non-null int64
13 DaysSinceFirstStay
70511 non-null int64
14 DistributionChannel
70511 non-null object
15 MarketSegment
70511 non-null object
16 SRHighFloor
70511 non-null int64
17 SRLowFloor
70511 non-null int64
18 SRAccessibleRoom
70511 non-null int64
19 SRMediumFloor
70511 non-null int64
20 SRBathtub
70511 non-null int64
21 SRShower
70511 non-null int64
22 SRCrib
70511 non-null int64
23 SRKingSizeBed
70511 non-null int64
24 SRTwinBed
70511 non-null int64
25 SRNearElevator
70511 non-null int64
26 SRAwayFromElevator
70511 non-null int64
27 SRNoAlcoholInMiniBar
70511 non-null int64
28 SRQuietRoom
70511 non-null int64
29 Total_revenue
70511 non-null float64
30 Age_group
70511 non-null category
31 Percent Canceled
52913 non-null float64
32 Total Successful Bookings 70511 non-null int64
33 extra_Amenities
70511 non-null int64
dtypes: category(1), float64(5), int64(25), object(3)
memory usage: 20.9+ MB
[41]: df1.extra_Amenities.describe()
[41]: count
mean
std
-
25
min-%-%-%-
max-
Name: extra_Amenities, dtype: float64
[42]: plt.figure(figsize=(8,6))
figx=sns.barplot(x='Age_group', y='extra_Amenities',estimator=np.mean, data=df1)
plt.xlabel("Age Groups",fontsize=18,color='black')
plt.ylabel("Ameneties",fontsize=18,color='black')
plt.title("Age Groups vs avg Ameneties",fontsize=20,color='black')
# for i in figx.patches:
#
# get_x pulls left or right; get_height pushes up or down
#
figx.text(i.get_x()+.01, i.get_height()+0.008, \
#
str(round((i.get_height()), 2)), fontsize=18, color='black',
#
rotation=0)
plt.show()
26
[43]: Avg_lead_time= df1.
↪pivot_table(values=['AverageLeadTime'],index=['Age_group'],aggfunc=np.mean)
Avg_lead_time = Avg_lead_time.sort_values(by='AverageLeadTime',ascending=False)
Avg_lead_time
[43]:
AverageLeadTime
Age_group-
-
[44]: fig = Avg_lead_time.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g')
for i in fig .patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.3, \
str(round((i.get_height()), 2)), fontsize=12, color='black',
rotation=0)
plt.title("Highest Avg Lead Time for diff Age Groups (Top 10)",fontsize= 12)
plt.show()
27
[45]: # Percent Canceled
Booking_Cancelled= df1.pivot_table(values=['Percent␣
↪Canceled'],index=['Age_group'],aggfunc=np.mean)
Booking_Cancelled = Booking_Cancelled.sort_values(by='Percent Canceled',␣
↪ascending = False)
Booking_Cancelled
[45]:
Percent Canceled
Age_group-
-
28
0-15
80+
-
[46]: df1.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 34 columns):
#
Column
Non-Null Count Dtype
--- ------------------- ----0
ID
70511 non-null int64
1
Nationality
70511 non-null object
2
Age
70511 non-null float64
3
DaysSinceCreation
70511 non-null int64
4
AverageLeadTime
70511 non-null int64
5
LodgingRevenue
70511 non-null float64
6
OtherRevenue
70511 non-null float64
7
BookingsCanceled
70511 non-null int64
8
BookingsNoShowed
70511 non-null int64
9
BookingsCheckedIn
70511 non-null int64
10 PersonsNights
70511 non-null int64
11 RoomNights
70511 non-null int64
12 DaysSinceLastStay
70511 non-null int64
13 DaysSinceFirstStay
70511 non-null int64
14 DistributionChannel
70511 non-null object
15 MarketSegment
70511 non-null object
16 SRHighFloor
70511 non-null int64
17 SRLowFloor
70511 non-null int64
18 SRAccessibleRoom
70511 non-null int64
19 SRMediumFloor
70511 non-null int64
20 SRBathtub
70511 non-null int64
21 SRShower
70511 non-null int64
22 SRCrib
70511 non-null int64
23 SRKingSizeBed
70511 non-null int64
24 SRTwinBed
70511 non-null int64
25 SRNearElevator
70511 non-null int64
26 SRAwayFromElevator
70511 non-null int64
27 SRNoAlcoholInMiniBar
70511 non-null int64
28 SRQuietRoom
70511 non-null int64
29 Total_revenue
70511 non-null float64
30 Age_group
70511 non-null category
52913 non-null float64
31 Percent Canceled
32 Total Successful Bookings 70511 non-null int64
33 extra_Amenities
70511 non-null int64
dtypes: category(1), float64(5), int64(25), object(3)
memory usage: 20.9+ MB
29
[47]: plt.figure(figsize=(15,12),dpi=100)
fig2 = sns.barplot(hue='MarketSegment', y='AverageLeadTime',␣
↪x='Age_group',data= df1, estimator=np.mean)
# for i in fig2 .patches:
#
#get_x pulls left or right; get_height pushes up or down
#
fig2.text(i.get_x()+.01, i.get_height()+0.3, \
#
str(round((i.get_height()), 0)), fontsize=10, color='black',
#
rotation=0)
plt.title("Avg lead time according to Market Segment and diff Age Groups",␣
↪size= 15)
plt.show()
[48]: plt.figure(figsize=(8,6),dpi=100)
fig2 = sns.barplot(x='MarketSegment', y='ID',data= df1, estimator=len)
for i in fig2 .patches:
# get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.3, \
30
str(round((i.get_height()), 0)), fontsize=12, color='black',
rotation=0)
plt.xticks(rotation =90)
plt.title("frequency of request by market segment")
plt.show()
[49]: no_show =df1['BookingsNoShowed'].sum()
cancel = df1['BookingsCanceled'].sum()
done = df1['BookingsCheckedIn'].sum()
status = ['Did not show', 'Cancelled', 'Checked_in']
values = [no_show, cancel, done]
31
[50]: plt.figure(figsize=(15,12),dpi=100)
fig2 = sns.barplot(x='BookingsNoShowed', y='ID', hue='Age_group',data= df1,␣
↪estimator=len)
for i in fig2 .patches:
#get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.3, \
str(round((i.get_height()), 0)), fontsize=12, color='black',
rotation=0)
plt.title("frequency of BookingsNoShowed of diff Age Groups")
plt.show()
posx
posx
posx
posx
posx
posx
posx
posx
posx
posx
and
and
and
and
and
and
and
and
and
and
posy
posy
posy
posy
posy
posy
posy
posy
posy
posy
should
should
should
should
should
should
should
should
should
should
be
be
be
be
be
be
be
be
be
be
finite
finite
finite
finite
finite
finite
finite
finite
finite
finite
values
values
values
values
values
values
values
values
values
values
32
[51]: plt.figure(figsize=(10,8))
figx = sns.barplot(data=avg_by_age, x=avg_by_age.index, y='RoomNights',
color='darkcyan', alpha=0.85, linewidth=2, edgecolor='black',␣
↪estimator=np.mean)
plt.xlabel("Age Groups",fontsize=18,color='black')
plt.ylabel("Rooms per night",fontsize=18,color='black')
plt.title("Avg Rooms per night for diff Age Groups",fontsize=20,color='black')
for i in figx.patches:
# get_x pulls left or right; get_height pushes up or down
figx.text(i.get_x()+.01, i.get_height()+0.003, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.show()
33
[52]: plt.figure(figsize=(10,8))
figx = sns.barplot(data=avg_by_age, x=avg_by_age.index, y='PersonsNights',
color='darkorchid', alpha=0.85, linewidth=2, edgecolor='black',␣
↪estimator=np.mean)
plt.xlabel("Age Groups",fontsize=18,color='black')
plt.ylabel("No of People",fontsize=18,color='black')
plt.title("Average number of people per night by age␣
↪groups",fontsize=20,color='black')
for i in figx.patches:
# get_x pulls left or right; get_height pushes up or down
figx.text(i.get_x()+.05, i.get_height()+0.003, \
str(round((i.get_height()), 2)), fontsize=18, color='black',
rotation=0)
plt.show()
34
[53]: dist_channel = df1['DistributionChannel'].value_counts()
fig = px.pie( values = dist_channel.values,names = dist_channel.keys(),␣
↪title='Distribution Channel', hole = 0.4)
fig.show()
[54]: mkt_segment = df1['MarketSegment'].value_counts()
fig = px.pie( values = mkt_segment.values,names = mkt_segment.keys(),␣
↪title='Market Segment', hole = 0.4)
fig.show()
[55]: df.nunique()
[55]: ID
Nationality
Age
DaysSinceCreation
AverageLeadTime
LodgingRevenue
OtherRevenue
BookingsCanceled
BookingsNoShowed
BookingsCheckedIn
PersonsNights
RoomNights
DaysSinceLastStay
DaysSinceFirstStay
DistributionChannel
MarketSegment
SRHighFloor
SRLowFloor
SRAccessibleRoom
SRMediumFloor
SRBathtub
SRShower
SRCrib
SRKingSizeBed
SRTwinBed
SRNearElevator
SRAwayFromElevator
SRNoAlcoholInMiniBar
SRQuietRoom
Total_revenue
Age_group
Percent Canceled
Total Successful Bookings
extra_Amenities
dtype: int64
-
35
[ ]:
[56]: binary_cols = ['SRHighFloor', 'SRLowFloor', 'SRAccessibleRoom', 'SRMediumFloor',
'SRBathtub', 'SRShower', 'SRCrib', 'SRKingSizeBed',␣
↪'SRTwinBed',
'SRNearElevator', 'SRAwayFromElevator',␣
↪'SRNoAlcoholInMiniBar', 'SRQuietRoom']
[57]: fig, axes = plt.subplots(3, 4, figsize=(15, 12), sharex=True, sharey=True,␣
↪squeeze=False)
# fig.set_facecolor('#F5F5F5')
for i, ax in enumerate(axes.flatten()):
col = binary_cols[i]
counts = df1[col].value_counts()
g = ax.bar(counts.index, counts.values, linewidth=2)
for p in ax.patches:
ax.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., \
p.get_height()), ha = 'center',␣
↪va = 'top', \
xytext = (0, 10), textcoords =␣
↪'offset points')
ax.set_title(f'{col}')
#
#
#
#
plt.xticks([0, 1], ['No', 'Yes'])
ax.xaxis.grid(False, which='major')
ax.yaxis.grid(True, linestyle=':', color='black', dashes=(3, 8))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.tight_layout()
plt.show()
36
[58]: ['SRHighFloor', 'SRLowFloor', 'SRAccessibleRoom', 'SRMediumFloor',
'SRBathtub', 'SRShower', 'SRCrib', 'SRKingSizeBed',␣
↪'SRTwinBed',
'SRNearElevator', 'SRAwayFromElevator',␣
↪'SRNoAlcoholInMiniBar', 'SRQuietRoom']
[58]: ['SRHighFloor',
'SRLowFloor',
'SRAccessibleRoom',
'SRMediumFloor',
'SRBathtub',
'SRShower',
'SRCrib',
'SRKingSizeBed',
'SRTwinBed',
'SRNearElevator',
'SRAwayFromElevator',
'SRNoAlcoholInMiniBar',
'SRQuietRoom']
37
[ ]:
[59]: df1['BookingsCanceled'].value_counts()
[59]:-
Name: BookingsCanceled, dtype: int64
0.4
0.4.1
RFM Analysis for Hotel Customers Segmentation
For RFM analysis (Recency, Frequency, Monetary). Columns for references–:
For Recency score - DaysSinceLastStay
For Frequency Score - Total Successful Bookings
For Monetary Score - Total Revenue For Monetary Score
[60]: df_rfm = df1.copy()
[61]: sns.histplot(x = df1['DaysSinceLastStay'])
[61]:
38
Making DaysSinceLastStay = -1 to max. so that it falls in the lowest category and the most recent
visitors get the max score
[62]: # Change -1 value
df_rfm.loc[df_rfm['DaysSinceLastStay'] == -1, 'DaysSinceLastStay'] =␣
↪df_rfm['DaysSinceLastStay'].max()
df_rfm.head()
[62]:
0
2
3
4
7
ID Nationality
1
PRT
3
DEU
4
FRA
5
FRA
8
FRA
0
2
3
4
7
OtherRevenue-
0
2
3
4
7
SRTwinBed
0
0
0
0
0
0
2
3
4
7
0
2
3
4
7
Age-
DaysSinceCreation-
BookingsCanceled
1
0
0
0
0
SRNearElevator
0
0
0
0
0
SRQuietRoom Total_revenue-
Total Successful Bookings
3
0
1
0
1
AverageLeadTime-
BookingsNoShowed
0
0
0
0
0
SRAwayFromElevator
0
0
0
0
0
Age_group-
LodgingRevenue-
BookingsCheckedIn
3
0
1
0
1
…
…
…
…
…
…
SRNoAlcoholInMiniBar
0
0
0
0
0
\
Percent Canceled
25.0
NaN
0.0
NaN
0.0
\
\
\
extra_Amenities
0
0
0
0
1
[5 rows x 34 columns]
[ ]:
Creating 5 bins to categorize the customers with the most recent vistors getting highest number
39
[63]: df_rfm['Recency_score'] = pd.qcut(df_rfm['DaysSinceLastStay'],6,labels =␣
↪[5,4,3,2,1], duplicates= 'drop')
print(df_rfm[['Recency_score', 'DaysSinceLastStay']].head(5),'\n')
print(df_rfm['Recency_score'].value_counts())
0
2
3
4
7
Recency_score
5
1
1
1
1
DaysSinceLastStay-
-
Name: Recency_score, dtype: int64
Creating 5 bins to categorize the customers with the highest spenders getting highest number
[64]: df_rfm['Monetary_score'] = pd.qcut(df_rfm['Total_revenue'],6,labels =␣
↪[1,2,3,4,5], duplicates='drop')
print(df_rfm[['Monetary_score', 'Total_revenue']].head(), '\n')
print(df_rfm['Monetary_score'].value_counts())
0
2
3
4
7
Monetary_score
4
1
3
1
5
Total_revenue-
-
Name: Monetary_score, dtype: int64
[65]: df_rfm['frequency_score'] = [1 if x<2 else 2 if 2 <=x < 5 else 3 for x in␣
↪df_rfm['Total Successful Bookings']]
df_rfm['frequency_score'] = df_rfm['frequency_score'].astype('category')
df_rfm[['frequency_score', 'Total Successful Bookings']].head()
40
[65]:
0
2
3
4
7
frequency_score
2
1
1
1
1
Total Successful Bookings
3
0
1
0
1
[66]: df_rfm['frequency_score'].value_counts()
[66]:-
Name: frequency_score, dtype: int64
[67]: df_rfm.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 37 columns):
#
Column
Non-Null Count
--- ------------------0
ID
70511 non-null
1
Nationality
70511 non-null
2
Age
70511 non-null
3
DaysSinceCreation
70511 non-null
4
AverageLeadTime
70511 non-null
5
LodgingRevenue
70511 non-null
6
OtherRevenue
70511 non-null
7
BookingsCanceled
70511 non-null
8
BookingsNoShowed
70511 non-null
9
BookingsCheckedIn
70511 non-null
10 PersonsNights
70511 non-null
11 RoomNights
70511 non-null
12 DaysSinceLastStay
70511 non-null
13 DaysSinceFirstStay
70511 non-null
14 DistributionChannel
70511 non-null
15 MarketSegment
70511 non-null
16 SRHighFloor
70511 non-null
17 SRLowFloor
70511 non-null
18 SRAccessibleRoom
70511 non-null
19 SRMediumFloor
70511 non-null
20 SRBathtub
70511 non-null
21 SRShower
70511 non-null
22 SRCrib
70511 non-null
23 SRKingSizeBed
70511 non-null
24 SRTwinBed
70511 non-null
25 SRNearElevator
70511 non-null
41
Dtype
----int64
object
float64
int64
int64
float64
float64
int64
int64
int64
int64
int64
int64
int64
object
object
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
26 SRAwayFromElevator
70511 non-null int64
27 SRNoAlcoholInMiniBar
70511 non-null int64
28 SRQuietRoom
70511 non-null int- non-null float64
29 Total_revenue
30 Age_group
70511 non-null category
31 Percent Canceled
52913 non-null float64
32 Total Successful Bookings 70511 non-null int64
33 extra_Amenities
70511 non-null int64
34 Recency_score
70511 non-null category
35 Monetary_score
70511 non-null category
36 frequency_score
70511 non-null category
dtypes: category(4), float64(5), int64(25), object(3)
memory usage: 21.1+ MB
[68]: df_rfm_final =␣
↪df_rfm[['ID','Nationality','Age_group','DistributionChannel','DaysSinceLastStay','AverageLea
'Percent Canceled', 'Total Successful␣
↪Bookings','Total_revenue','MarketSegment','frequency_score', 'Age',
'Monetary_score','Recency_score']]
df_rfm_final.head()
[68]:
0
2
3
4
7
ID Nationality Age_group
1
PRT
51-65
3
DEU
16-35
4
FRA
51-65
5
FRA
51-65
8
FRA
16-35
0
2
3
4
7
AverageLeadTime-
0
2
3
4
7
Total_revenue-
0
2
3
4
DistributionChannel
Corporate
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Percent Canceled
25.0
NaN
0.0
NaN
0.0
DaysSinceLastStay-
Total Successful Bookings
3
0
1
0
1
MarketSegment frequency_score
Corporate
2
Travel Agent/Operator
1
Travel Agent/Operator
1
Travel Agent/Operator
1
Other
1
Recency_score
5
1
1
1
42
\
\
Age Monetary_score-
\
7
1
[69]: df_rfm_final['Recency_rank'] = df_rfm_final['Recency_score'].
↪rank(ascending=False)
df_rfm_final['Frequency_rank'] = df_rfm_final['frequency_score'].
↪rank(ascending=True)
df_rfm_final['Monetary_rank'] = df_rfm_final['Monetary_score'].
↪rank(ascending=True)
# df.sort_values(by = ['Rank', 'Age'], ascending = [True, False], na_position =␣
↪'first')
# df_rfm_final.sort_values(by = ['frequency_score',␣
↪'Monetary_score','Recency_score'], ascending=
[False, False, False])
df_rfm_final.head()
:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
[69]:
0
2
3
ID Nationality Age_group
1
PRT
51-65
3
DEU
16-35
4
FRA
51-65
DistributionChannel
Corporate
Travel Agent/Operator
Travel Agent/Operator
43
DaysSinceLastStay-
\
4
7
5
8
FRA
FRA
-
0
2
3
4
7
AverageLeadTime-
0
2
3
4
7
Total_revenue-
0
2
3
4
7
Recency_score
5
1
1
1
1
Travel Agent/Operator
Travel Agent/Operator
Percent Canceled
25.0
NaN
0.0
NaN
0.0
Total Successful Bookings
3
0
1
0
1
MarketSegment frequency_score
Corporate
2
Travel Agent/Operator
1
Travel Agent/Operator
1
Travel Agent/Operator
1
Other
1
Recency_rank-
-
Frequency_rank-
\
Age Monetary_score-
\
Monetary_rank-
[70]: # # normalizing the rank of the customers
df_rfm_final['Recency_rank_norm'] = round((df_rfm_final['Recency_rank']/
↪df_rfm_final['Recency_rank'].max())*100,2)
df_rfm_final['Frequency_rank_norm'] = round((df_rfm_final['Frequency_rank']/
↪df_rfm_final['Frequency_rank'].max())*100,2)
df_rfm_final['Monetary_rank_norm'] = round((df_rfm_final['Monetary_rank']/
↪df_rfm_final['Monetary_rank'].max())*100,2)
df_rfm_final.drop(['Recency_rank',
↪1, inplace = True)
'Frequency_rank', 'Monetary_rank'], axis =␣
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
:4: SettingWithCopyWarning:
44
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
C:\Users\Lenovo\anaconda3\lib\site-packages\pandas\core\frame.py:4163:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
[71]: df_rfm_final.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 17 columns):
#
Column
Non-Null Count
--- ------------------0
ID
70511 non-null
1
Nationality
70511 non-null
2
Age_group
70511 non-null
3
DistributionChannel
70511 non-null
4
DaysSinceLastStay
70511 non-null
5
AverageLeadTime
70511 non-null
6
Percent Canceled
52913 non-null
7
Total Successful Bookings 70511 non-null
8
Total_revenue
70511 non-null
9
MarketSegment
70511 non-null
10 frequency_score
70511 non-null
11 Age
70511 non-null
12 Monetary_score
70511 non-null
13 Recency_score
70511 non-null
14 Recency_rank_norm
70511 non-null
45
Dtype
----int64
object
category
object
int64
int64
float64
int64
float64
object
category
float64
category
category
float64
15 Frequency_rank_norm
70511 non-null float64
16 Monetary_rank_norm
70511 non-null float64
dtypes: category(4), float64(6), int64(4), object(3)
memory usage: 10.3+ MB
[72]: colnames = ['DaysSinceLastStay','Total Successful Bookings','Total_revenue']
for col in colnames:
fig, ax = plt.subplots(figsize=(12,3))
sns.distplot(df_rfm_final[col])
ax.set_title('Distribution of %s' % col)
plt.show()
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
FutureWarning:
`distplot` is a deprecated function and will be removed in a future version.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
FutureWarning:
`distplot` is a deprecated function and will be removed in a future version.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
46
C:\Users\Lenovo\anaconda3\lib\site-packages\seaborn\distributions.py:2551:
FutureWarning:
`distplot` is a deprecated function and will be removed in a future version.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
[73]: df_rfm_final.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 17 columns):
#
Column
Non-Null Count
--- ------------------0
ID
70511 non-null
1
Nationality
70511 non-null
2
Age_group
70511 non-null
3
DistributionChannel
70511 non-null
4
DaysSinceLastStay
70511 non-null
5
AverageLeadTime
70511 non-null
6
Percent Canceled
52913 non-null
47
Dtype
----int64
object
category
object
int64
int64
float64
7
Total Successful Bookings 70511 non-null int64
8
Total_revenue
70511 non-null float64
9
MarketSegment
70511 non-null object
70511 non-null category
10 frequency_score
11 Age
70511 non-null float64
12 Monetary_score
70511 non-null category
13 Recency_score
70511 non-null category
14 Recency_rank_norm
70511 non-null float64
15 Frequency_rank_norm
70511 non-null float64
16 Monetary_rank_norm
70511 non-null float64
dtypes: category(4), float64(6), int64(4), object(3)
memory usage: 10.3+ MB
[ ]:
0.5
Calculating RFM score
RFM score is calculated based upon recency, frequency, monetary value normalize ranks.
Based upon this score we divide our customers. Here we rate them on a scale of 5.
Formula used for calculating rfm score is : 0.15Recency score + 0.28Frequency score + 0.57 *Monetary score
[74]: df_rfm_final['RFM_Score'] = 0.15*df_rfm_final['Recency_rank_norm']+0.28 * \
df_rfm_final['Frequency_rank_norm']+0.57*df_rfm_final['Monetary_rank_norm']
df_rfm_final['RFM_Score'] *= 0.05
df_rfm_final = df_rfm_final.round(2)
df_rfm_final[['ID','Nationality', 'Age_group',␣
↪'DistributionChannel','MarketSegment', 'RFM_Score']].head(15)
:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
48
[74]:-
ID Nationality Age_group
1
PRT
51-65
3
DEU
16-35
4
FRA
51-65
5
FRA
51-65
8
FRA
16-35
9
FRA
36-50
10
IRL
16-35
11
IRL
16-35
12
FRA
51-65
13
FRA
51-65
14
ESP
36-50
15
ESP
51-65
16
FRA
65-80
17
FRA
65-80
FRA
65-80
18
-
RFM_Score-
DistributionChannel
Corporate
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Direct
Direct
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel
Travel
Travel
Travel
Travel
Travel
Travel
Travel
Travel
Travel
MarketSegment
Corporate
Agent/Operator
Agent/Operator
Agent/Operator
Other
Other
Agent/Operator
Agent/Operator
Agent/Operator
Agent/Operator
Direct
Direct
Agent/Operator
Agent/Operator
Agent/Operator
\
[75]: df_rfm_final.describe(percentiles=[0.25,0.45,0.75,.85,.90,0.99])
[75]:
count
mean
std
min
25%
45%
50%
75%
85%
90%
ID-
DaysSinceLastStay-
AverageLeadTime-
49
Percent Canceled-
\
99%
max
-
-
count
mean
std
min
25%
45%
50%
75%
85%
90%
99%
max
Total Successful Bookings-
count
mean
std
min
25%
45%
50%
75%
85%
90%
99%
max
Recency_rank_norm-
count
mean
std
min
25%
45%
50%
75%
85%
90%
99%
max
RFM_Score-
-
Total_revenue-
Frequency_rank_norm-
50
Age-
-
\
Monetary_rank_norm-
\
0.6
Rating Customer based upon the RFM score
rfm score >4 : Top Customer
4 > rfm score > 3.5 : High Value Customer
3.5>rfm score >2.5 : Medium value customer
2.5>rfm score>1.5 : Low-value customer
rfm score<1.5 :Lost Customer
[76]: df_rfm_final['Customer Segment']=df_rfm_final['RFM_Score'].apply(lambda x:
↪'Premium customers' if x>=4
else 'High value customers' if x<4 and x>=3.5 else 'Nuetral customer'␣
↪if x<3.5 and x>=2.5
else 'Needs attention' if x <2.5 and x>1.5 else 'Likely to Churn␣
↪out' )
[77]: df_rfm_final['RFM_Score'].describe(percentiles=[.5,.75,.9,.95,.98])
[77]: count-
mean-
std-
min-%-%-%-%-%-
max-
Name: RFM_Score, dtype: float64
[78]: # count plot on two categorical variable
plt.figure(figsize=(11, 7))
fig = sns.countplot(x ='Customer Segment', data = df_rfm_final,␣
↪order=df_rfm_final['Customer Segment'].value_counts(sort=True).index[:20]
, facecolor = 'lightseagreen')
for i in fig.patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.1, \
str(round((i.get_height()), 2)), fontsize=15, color='black',␣
↪fontweight = 'bold',
rotation=0)
plt.title('Customer Segment', fontsize =20, fontweight = 'bold')
plt.xticks(rotation = 90 ,size = 15)
plt.xlabel('Customer Segment', fontsize =15)
[78]: Text(0.5, 0, 'Customer Segment')
51
[79]: df_rfm_final.head()
[79]:
0
2
3
4
7
ID Nationality Age_group
1
PRT
51-65
3
DEU
16-35
4
FRA
51-65
5
FRA
51-65
8
FRA
16-35
0
2
3
4
7
AverageLeadTime-
DistributionChannel
Corporate
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Percent Canceled
25.0
NaN
0.0
NaN
0.0
DaysSinceLastStay-
Total Successful Bookings
3
0
1
0
1
52
\
\
0
2
3
4
7
0
2
3
4
7
0
2
3
4
7
Total_revenue-
Recency_score
5
1
1
1
1
RFM_Score-
MarketSegment frequency_score
Corporate
2
Travel Agent/Operator
1
Travel Agent/Operator
1
Travel Agent/Operator
1
Other
1
Recency_rank_norm-
Frequency_rank_norm-
Age Monetary_score-
Monetary_rank_norm-
\
\
Customer Segment
Premium customers
Likely to Churn out
Nuetral customer
Likely to Churn out
High value customers
[80]: segment = df_rfm_final['Customer Segment'].value_counts()
fig = px.pie( values = segment.values,names = segment.keys(), title='Customer␣
↪Segment', hole = 0.5)
fig.show()
[81]: pt_cust_seg= df_rfm_final.pivot_table(values=['Total_revenue'],index=['Customer␣
↪Segment'],aggfunc=np.mean)
pt_cust_seg = pt_cust_seg.sort_values(by='Total_revenue',ascending=False)
pt_cust_seg
[81]:
Total_revenue
Customer Segment
Premium customers
High value customers
Nuetral customer
Needs attention
Likely to Churn out
-
[82]: fig = pt_cust_seg.plot(kind='bar',figsize=(8,7), width=0.8,edgecolor='g')
for i in fig .patches:
# get_x pulls left or right; get_height pushes up or down
fig.text(i.get_x()+.01, i.get_height()+0.9, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("Avg Spending of diff Segments",fontsize= 16)
53
plt.show()
[83]: df_rfm_final['Percent Canceled'].describe(percentiles=[0.5,.75,.85,.9,.95,.99])
[83]: count
mean
std
min
50%
-
54
75%-%-%-%-%-
max-
Name: Percent Canceled, dtype: float64
[84]: plt.figure(figsize=(8,7))
figx=sns.barplot(x= 'Customer Segment', y='Percent Canceled',estimator=np.mean,␣
↪data= df_rfm_final)
plt.title("Percentage of Bookings canceled by diff␣
↪segments",fontsize=20,color='black')
for i in figx.patches:
# get_x pulls left or right; get_height pushes up or down
figx.text(i.get_x()+.01, i.get_height()+0.001, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.xticks(rotation = 90, size = 10)
plt.show()
55
[85]: df_rfm_final.info()
Int64Index: 70511 entries, 0 to 83589
Data columns (total 19 columns):
#
Column
Non-Null Count
--- ------------------0
ID
70511 non-null
1
Nationality
70511 non-null
2
Age_group
70511 non-null
3
DistributionChannel
70511 non-null
4
DaysSinceLastStay
70511 non-null
56
Dtype
----int64
object
category
object
int64
5
AverageLeadTime
70511 non-null int64
6
Percent Canceled
52913 non-null float64
7
Total Successful Bookings 70511 non-null int- non-null float64
8
Total_revenue
9
MarketSegment
70511 non-null object
10 frequency_score
70511 non-null category
11 Age
70511 non-null float64
12 Monetary_score
70511 non-null category
13 Recency_score
70511 non-null category
14 Recency_rank_norm
70511 non-null float64
15 Frequency_rank_norm
70511 non-null float64
16 Monetary_rank_norm
70511 non-null float64
17 RFM_Score
70511 non-null float64
18 Customer Segment
70511 non-null object
dtypes: category(4), float64(7), int64(4), object(4)
memory usage: 11.4+ MB
[86]: df_rfm_final['frequency_score'] = pd.to_numeric(df_rfm_final['frequency_score']␣
↪, errors='coerce')
df_rfm_final['Recency_score'] = pd.to_numeric(df_rfm_final['Recency_score'] ,␣
↪errors='coerce')
[87]: sns.lmplot(y='frequency_score', x='Recency_score', data=df_rfm_final,␣
↪fit_reg=False, hue='Customer Segment',
legend=False)
plt.legend(loc='lower right')
[87]:
57
[88]: cust_seg_Nationality = pd.crosstab(df_rfm_final['Nationality'] ,␣
↪df_rfm_final['Customer Segment']
, margins= True, margins_name='Total')
cust_seg_Nationality = pd.DataFrame(cust_seg_Nationality)
cust_seg_Nationality['Countries'] = cust_seg_Nationality.index
cust_seg_Nationality.index = list(range(0,24))
cust_seg_Nationality['Percent_Premium customers']␣
↪=cust_seg_Nationality['Premium customers']/cust_seg_Nationality['Total']
cust_seg_Nationality['Percent_Premium customers'] =␣
↪round(cust_seg_Nationality['Percent_Premium customers']*100,2)
cust_seg_Nationality= cust_seg_Nationality[:22]
[ ]:
[89]: cust_seg_Nationality.sort_values(by='Percent_Premium customers', ascending=␣
↪False, inplace=True)
58
[90]: plt.figure(figsize=(12,8))
fig2 = sns.barplot(x = 'Countries', y = 'Percent_Premium customers', data =␣
↪cust_seg_Nationality,estimator=np.mean)
for i in fig2 .patches:
# # get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.1, \
str(round((i.get_height()), 2)), fontsize=10, color='black',
rotation=0)
plt.title("Percentage of Premium Customers by Countries", size = 15, fontweight␣
↪= 'bold')
plt.show()
[91]: cust_seg_age_group = pd.crosstab( df_rfm_final['Age_group'] ,␣
↪df_rfm_final['Customer Segment']
, margins= True, margins_name='Total')
cust_seg_age_group = pd.DataFrame(cust_seg_age_group )
cust_seg_age_group ['Age Group'] = cust_seg_age_group .index
cust_seg_age_group .index = list(range(0,7))
59
cust_seg_age_group ['Percent_Premium customers'] =cust_seg_age_group ['Premium␣
↪customers']/cust_seg_age_group ['Total']
cust_seg_age_group ['Percent_Premium customers'] = round(cust_seg_age_group␣
↪['Percent_Premium customers']*100,2)
cust_seg_age_group = cust_seg_age_group [:6]
cust_seg_age_group.sort_values(by='Percent_Premium customers', ascending=␣
↪False, inplace=True)
cust_seg_age_group
[91]: Customer Segment
2
3
1
4
5
0
High value
customers-
Likely to Churn out-
Customer Segment
2
3
1
4
5
0
Nuetral customer-
Premium customers-
Customer Segment
2
3
1
4
5
0
Percent_Premium customers-
Needs attention-
Total Age Group-
\
[92]: plt.figure(figsize=(10,8))
fig2 = sns.barplot(x = 'Age Group', y = 'Percent_Premium customers', data =␣
↪cust_seg_age_group,estimator=np.mean)
for i in fig2 .patches:
# # get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.01, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("Percentage of Premium Customers by Age Groups", size = 15,␣
↪fontweight = 'bold')
plt.show()
60
\
[93]: plt.figure(figsize=(15,8),facecolor='y',edgecolor='k')
fig2 = sns.barplot(x='Nationality',y='ID',hue='Customer Segment', data =␣
↪df_rfm_final[(df_rfm_final['Customer Segment']=='Premium customers')] ,␣
↪estimator= len)
for i in fig2 .patches:
# # get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.001, i.get_height()+0.01, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("# of Premium Customers by Nations", size = 15, fontweight = 'bold')
# plt.ylabel("Count of Request id")
plt.show()
61
[94]: df_rfm_final.head()
[94]:
0
2
3
4
7
ID Nationality Age_group
1
PRT
51-65
3
DEU
16-35
4
FRA
51-65
5
FRA
51-65
8
FRA
16-35
DistributionChannel
Corporate
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
0
2
3
4
7
AverageLeadTime-
0
2
3
4
7
Total_revenue-
MarketSegment
Corporate
Travel Agent/Operator
Travel Agent/Operator
Travel Agent/Operator
Other
0
2
3
Recency_score
5
1
1
Recency_rank_norm-
Percent Canceled
25.0
NaN
0.0
NaN
0.0
DaysSinceLastStay-
Total Successful Bookings
3
0
1
0
1
frequency_score
2
1
1
1
1
Frequency_rank_norm-
62
\
\
Age Monetary_score-
Monetary_rank_norm-
\
\
4
7
0
2
3
4
7
1
1
RFM_Score-
-
-
-
Customer Segment
Premium customers
Likely to Churn out
Nuetral customer
Likely to Churn out
High value customers
[95]: plt.figure(figsize=(10,8))
fig2 = sns.barplot(x = 'Customer Segment', y = 'AverageLeadTime', data =␣
↪df_rfm_final,estimator=np.mean)
for i in fig2 .patches:
# # get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.01, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("AverageLeadTime for diff 'Customer Segment", size = 15, fontweight =␣
↪'bold')
plt.xticks(rotation= 90, size = 12)
plt.show()
63
[96]: plt.figure(figsize=(10,8))
fig2 = sns.barplot(x = 'Customer Segment', y = 'Age', data =␣
↪df_rfm_final,estimator=np.mean)
for i in fig2 .patches:
# # get_x pulls left or right; get_height pushes up or down
fig2.text(i.get_x()+.01, i.get_height()+0.01, \
str(round((i.get_height()), 2)), fontsize=15, color='black',
rotation=0)
plt.title("Mean Age for diff 'Customer Segment", size = 15, fontweight = 'bold')
plt.xticks(rotation= 90, size = 12)
plt.show()
64
0.7
Conclusions
Most of the customer, bookings and revenue come from European region, which same region with
the Hotel location.
The customer from the other region probably less frequent to stay at the Hotel.
Travel Agent/ Operator is the Distribution Channel mostly used by customers while Electronic
Distribution is the least.
The number of loyal customer is so low at this point, while the number of Customers categorized
by Others and Lost Cheap Customers is very high.
Most customers only 1 times stay and the customers who stay frequently is very rare compared to
customers who frequently stay at the Hotel
65
0.8
Recommendations
Recomendation 1) For the better sales at the future, there are several method for sales & marketing program to be evaluate later:
2) Company can create campaign for those 1 time stay Customer to become loyal customer such
as discounted price on holiday occasion or business trip campaign.
3) Since Travel Agent/ Operator is the Distribution Channel most used by customers, company
should increase campaign engagement by Travel Agent/ Operator channel.
4) Furthermore, Electronic Distribution is a things people access almost every time, engagement
from Electronic Distribution can be useful for gaining engagement by using social media as a
platform
5) For future analysis, we need to know the period when the customer usually stay at Hotel, to
know the effective timing to start campaign.
[97]: #def remove_outlier(df_in, col):
#
q1 = df_in[col].quantile(0.25)
#
q3 = df_in[col].quantile(0.75)
#
iqr = q3-q1
#
outer_bound = q3+1.5*iqr
#
lower_bound = q1-1.5*iqr
#
df_out = df_in[(df_in[col]>lower_bound) & (df_in[col] < outer_bound)]
#
return df_out
66