Machine Learning | Data science | Data analytics
Predictive Model That Predicts Customer Churn
Introduction
This project focuses on developing a predictive model to identify customers likely to churn, allowing businesses to implement targeted
retention strategies. By analyzing a dataset containing various customer attributes, we aimed to uncover the key factors driving customer
attrition. The project involved training multiple machine learning models, with the Random Forest classifier ultimately chosen for its
superior performance. Key steps included data preprocessing, handling imbalanced data, feature selection, model training, and
hyperparameter tuning. The results provided actionable insights into customer behavior, enabling businesses to tailor their efforts to
retain valuable customers and reduce churn.
I'll start by importing the necessary packages for my initial ETL (Extract, Transform, Load) process that involves data extraction, cleaning,
transformation, and loading into a suitable format for analysis. The following libraries will be used:
Import Packages
# Importing necessary packages for ETL process
import pandas as pd
# For data manipulation and analysis
import numpy as np
# For numerical operations
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns
# For statistical data visualization
from sklearn.model_selection import train_test_split # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler
# For data normalization
from sklearn.ensemble import RandomForestClassifier
# For building the predictive model
from sklearn.model_selection import GridSearchCV
# For hyperparameter tuning
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
import seaborn as sns
from datetime import datetime
# Shows plots in jupyter notebook
%matplotlib inline
# Set plot style
sns.set(color_codes=True)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)
Load the data
This initial step involved installing the openpyxl package to handle Excel files and loading the Telco customer churn data into a pandas
DataFrame for analysis.
pip install openpyxl
Requirement already satisfied: openpyxl in c:\users\collins pc\anaconda3\envs\collonel\lib\site-packages (3.1.3)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: et-xmlfile in c:\users\collins pc\anaconda3\envs\collonel\lib\site-packages (from
openpyxl) (1.1.0)
telco_data= pd.read_excel('C:/Users/Collins PC/Downloads/Telco_customer_churn/churn.xlsx')
telco_data.head()
CustomerID Count Country
State
City
Zip
Code
Lat Long
Latitude
Longitude Gender ... Contract
0
3668QPYBK
1
United
Los-,
California- -
States
Angeles
-
1
9237HQITU
1
United
Los
California
90005
States
Angeles
2
9305CDSKC
3
4
Paperless
Billing
Male ...
Monthto-month
Yes
-,- -
-
Female ...
Monthto-month
Yes
1
United
Los-,
California- -
States
Angeles
-
Female ...
Monthto-month
Yes
7892POOKP
1
United
Los-,
California- -
States
Angeles
-
Female ...
Monthto-month
Yes
0280XJGEX
1
United
Los-,
California- -
States
Angeles
-
Male ...
Monthto-month
Yes
5 rows × 33 columns
Data Wrangling
The data wrangling process ivolves loading the Telco customer churn dataset and inspecting its structure, which consists of 7043 entries
and 33 columns. Initial checks revealed no missing values for most columns except for 'Churn Reason,' which had a significant number of
missing entries (5174 out of 7043). The 'Total Charges' column, originally an object type, was converted to numeric, resulting in 11
missing values due to non-numeric entries being coerced. Further analysis showed that the dataset is imbalanced, with 5174 non-churn
entries and 1869 churn entries. These steps set the foundation for subsequent data cleaning, transformation, and modeling efforts.
telco_data.info()
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
#
Column
Non-Null Count Dtype
--- ------------------- ----0
CustomerID
7043 non-null
object
1
Count
7043 non-null
int64
2
Country
7043 non-null
object
3
State
7043 non-null
object
4
City
7043 non-null
object
5
Zip Code
7043 non-null
int64
6
Lat Long
7043 non-null
object
7
Latitude
7043 non-null
float64
8
Longitude
7043 non-null
float64
9
Gender
7043 non-null
object
10 Senior Citizen
7043 non-null
object
11 Partner
7043 non-null
object
12 Dependents
7043 non-null
object
13 Tenure Months
7043 non-null
int64
14 Phone Service
7043 non-null
object
15 Multiple Lines
7043 non-null
object
16 Internet Service
7043 non-null
object
17 Online Security
7043 non-null
object
18 Online Backup
7043 non-null
object
19 Device Protection 7043 non-null
object
20 Tech Support
7043 non-null
object
21 Streaming TV
7043 non-null
object
22 Streaming Movies
7043 non-null
object
23 Contract
7043 non-null
object
24 Paperless Billing 7043 non-null
object
25 Payment Method
7043 non-null
object
26 Monthly Charges
7043 non-null
float64
27 Total Charges
7043 non-null
object
28 Churn Label
7043 non-null
object
29 Churn Value
7043 non-null
int64
30 Churn Score
7043 non-null
int64
31 CLTV
7043 non-null
int64
32 Churn Reason
1869 non-null
object
dtypes: float64(3), int64(6), object(24)
memory usage: 1.8+ MB
telco_data.isnull().sum()
Paym
Meth
Electro
Electro
Electro
trans
(automa
CustomerID
Count
Country
State
City
Zip Code
Lat Long
Latitude
Longitude
Gender
Senior Citizen
Partner
Dependents
Tenure Months
Phone Service
Multiple Lines
Internet Service
Online Security
Online Backup
Device Protection
Tech Support
Streaming TV
Streaming Movies
Contract
Paperless Billing
Payment Method
Monthly Charges
Total Charges
Churn Label
Churn Value
Churn Score
CLTV
Churn Reason
dtype: int64
-
telco_data['Churn Reason'].head()
0
Competitor made better offer
1
Moved
2
Moved
3
Moved
4
Competitor had better devices
Name: Churn Reason, dtype: object
# Convert 'Total Charges' to numeric, using errors='coerce' to handle spaces and non-numeric values.
telco_data['Total Charges'] = pd.to_numeric(telco_data['Total Charges'], errors='coerce')
telco_data.info()
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
#
Column
Non-Null Count Dtype
--- ------------------- ----0
CustomerID
7043 non-null
object
1
Count
7043 non-null
int64
2
Country
7043 non-null
object
3
State
7043 non-null
object
4
City
7043 non-null
object
5
Zip Code
7043 non-null
int64
6
Lat Long
7043 non-null
object
7
Latitude
7043 non-null
float64
8
Longitude
7043 non-null
float64
9
Gender
7043 non-null
object
10 Senior Citizen
7043 non-null
object
11 Partner
7043 non-null
object
12 Dependents
7043 non-null
object
13 Tenure Months
7043 non-null
int64
14 Phone Service
7043 non-null
object
15 Multiple Lines
7043 non-null
object
16 Internet Service
7043 non-null
object
17 Online Security
7043 non-null
object
18 Online Backup
7043 non-null
object
19 Device Protection 7043 non-null
object
20 Tech Support
7043 non-null
object
21 Streaming TV
7043 non-null
object
22 Streaming Movies
7043 non-null
object
23 Contract
7043 non-null
object
24 Paperless Billing 7043 non-null
object
25 Payment Method
7043 non-null
object
26 Monthly Charges
7043 non-null
float64
27 Total Charges
7032 non-null
float64
28 Churn Label
7043 non-null
object
29 Churn Value
7043 non-null
int64
30 Churn Score
7043 non-null
int64
31 CLTV
7043 non-null
int64
32 Churn Reason
1869 non-null
object
dtypes: float64(4), int64(6), object(23)
memory usage: 1.8+ MB
telco_data['Churn Value'].value_counts()
Churn Value-
Name: count, dtype: int64
Summary Statistics
telco_data.describe()
Count
count 7043.0
mean
Zip Code
Latitude
Longitude
Tenure
Months
Monthly
Charges
Total
Charges
Churn
Value
Churn
Score
CLTV
-
-
-
-
-
-
-
-
-
-
-
std
0.0
-
-
-
-
min
-
-
-
-
-
-
-
-
25%
-
-
-
-
-
-
-
-
50%
-
-
-
-
-
-
-
75%
-
-
-
-
-
-
-
max
-
-
-
-
-
-
-
The descriptive statistics of the Telco customer churn dataset provide key insights into its numerical features. The dataset includes 7043
entries for most columns, with the exception of 'Total Charges,' which has 7032 entries. The mean values for key features are as follows:
'Tenure Months' is 32.37, 'Monthly Charges' is 64.76, and 'CLTV' (Customer Lifetime Value) is 4400.30. The dataset shows variability,
with 'Tenure Months' ranging from 0 to 72, 'Monthly Charges' from 18.25 to 118.75, and 'Total Charges' from 18.80 to 8684.80. The
'Churn Value' indicates that approximately 26.5% of the customers have churned, as reflected in the mean value of 0.265. The 'Churn
Score' ranges from 5 to 100, with a mean of 58.70, indicating a broad distribution of customer churn risk scores. These statistics highlight
the diversity and range of customer behaviors captured in the dataset, which is essential for developing a robust churn prediction model.
Understanding the Data
col = list(telco_data.columns)
categorical_features = []
numerical_features = []
for i in col:
unique_values = telco_data[i].unique()
if len(unique_values) > 6:
numerical_features.append(i)
else:
categorical_features.append(i)
print(f"Column: {i}")
print(f"Unique Values ({len(unique_values)}): {unique_values}")
print("-" * 50)
print('Categorical Features:', *categorical_features)
print('Numerical Features:', *numerical_features)
Column: CustomerID
Unique Values (7043): ['3668-QPYBK' '9237-HQITU' '9305-CDSKC' ... '2234-XADUH' '4801-JZAZL'
'3186-AJIEK']
-------------------------------------------------Column: Count
Unique Values (1): [1]
-------------------------------------------------Column: Country
Unique Values (1): ['United States']
-------------------------------------------------Column: State
Unique Values (1): ['California']
-------------------------------------------------Column: City
Unique Values (1129): ['Los Angeles' 'Beverly Hills' 'Huntington Park' ... 'Standish' 'Tulelake'
'Olympic Valley']
-------------------------------------------------Column: Zip Code
Unique Values (1652): [- ..-]
-------------------------------------------------Column: Lat Long
Unique Values (1652): ['-, -' '-, -' '-, -'
... '-, -' '-, -'
'-, -']
-------------------------------------------------Column: Latitude
Unique Values (1652): [- ..-]
-------------------------------------------------Column: Longitude
Unique Values (1651): [- - - ... - -
-]
-------------------------------------------------Column: Gender
Unique Values (2): ['Male' 'Female']
-------------------------------------------------Column: Senior Citizen
Unique Values (2): ['No' 'Yes']
-------------------------------------------------Column: Partner
Unique Values (2): ['No' 'Yes']
-------------------------------------------------Column: Dependents
Unique Values (2): ['No' 'Yes']
-------------------------------------------------Column: Tenure Months
Unique Values (73): [-]
-------------------------------------------------Column: Phone Service
Unique Values (2): ['Yes' 'No']
-------------------------------------------------Column: Multiple Lines
Unique Values (3): ['No' 'Yes' 'No phone service']
-------------------------------------------------Column: Internet Service
Unique Values (3): ['DSL' 'Fiber optic' 'No']
-------------------------------------------------Column: Online Security
Unique Values (3): ['Yes' 'No' 'No internet service']
-------------------------------------------------Column: Online Backup
Unique Values (3): ['Yes' 'No' 'No internet service']
-------------------------------------------------Column: Device Protection
Unique Values (3): ['No' 'Yes' 'No internet service']
-------------------------------------------------Column: Tech Support
Unique Values (3): ['No' 'Yes' 'No internet service']
-------------------------------------------------Column: Streaming TV
Unique Values (3): ['No' 'Yes' 'No internet service']
-------------------------------------------------Column: Streaming Movies
Unique Values (3): ['No' 'Yes' 'No internet service']
-------------------------------------------------Column: Contract
Unique Values (3): ['Month-to-month' 'Two year' 'One year']
-------------------------------------------------Column: Paperless Billing
Unique Values (2): ['Yes' 'No']
-------------------------------------------------Column: Payment Method
Unique Values (4): ['Mailed check' 'Electronic check' 'Bank transfer (automatic)'
'Credit card (automatic)']
-------------------------------------------------Column: Monthly Charges
Unique Values (1585): [- ..- ]
-------------------------------------------------Column: Total Charges
Unique Values (6531): [- ..- ]
-------------------------------------------------Column: Churn Label
Unique Values (2): ['Yes' 'No']
-------------------------------------------------Column: Churn Value
Unique Values (2): [1 0]
-------------------------------------------------Column: Churn Score
Unique Values (85): [-]
-------------------------------------------------Column: CLTV
Unique Values (3438): [- ..-]
-------------------------------------------------Column: Churn Reason
Unique Values (21): ['Competitor made better offer' 'Moved' 'Competitor had better devices'
'Competitor offered higher download speeds'
'Competitor offered more data' 'Price too high' 'Product dissatisfaction'
'Service dissatisfaction' 'Lack of self-service on Website'
'Network reliability' 'Limited range of services'
'Lack of affordable download/upload speed' 'Long distance charges'
'Extra data charges' "Don't know" 'Poor expertise of online support'
'Poor expertise of phone support' 'Attitude of service provider'
'Attitude of support person' 'Deceased' nan]
-------------------------------------------------Categorical Features: Count Country State Gender Senior Citizen Partner Dependents Phone Service Multiple Lines
Internet Service Online Security Online Backup Device Protection Tech Support Streaming TV Streaming Movies Cont
ract Paperless Billing Payment Method Churn Label Churn Value
Numerical Features: CustomerID City Zip Code Lat Long Latitude Longitude Tenure Months Monthly Charges Total Cha
rges Churn Score CLTV Churn Reason
Summary of Findings:
The dataset contains a mix of categorical and numerical features, with some initial observations needing clarification. Below are the key
points from the analysis:
Categorical Features:
Correctly Identified: Country, State, Gender, Senior Citizen, Partner, Dependents, Phone Service, Multiple Lines, Internet Service,
Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies, Contract, Paperless Billing,
Payment Method, Churn Label. Incorrectly Identified: Churn Reason: Despite having 21 unique values, it is a categorical variable
describing reasons for customer churn. City: Despite having 1129 unique values, it represents the names of cities and should be
classified as a categorical feature.
Numerical Features:
Correctly Identified: Churn Value,Zip Code, Lat Long, Latitude, Longitude, Tenure Months, Monthly Charges, Total Charges, Churn
Score, CLTV.
Incorrectly Identified:
CustomerID: Despite having 7043 unique values, it is an identifier and should not be treated as a numerical feature.
Geographical Insight:
The data indicates that all customers are from the United States, specifically from the state of California. This is evident from the unique
values in the Country and State columns, which are both singularly populated with 'United States' and 'California', respectively.
Detailed Insights:
Numerical Variables:
Variables such as Zip Code, Lat Long, Latitude, and Longitude provide precise geographic information about the customers. Variables
like Tenure Months, Monthly Charges, and Total Charges offer crucial data for understanding customer behavior and financial aspects.
Categorical Variables:
Variables such as Gender, Senior Citizen, Partner, and Dependents provide demographic information about the customers. Variables like
Phone Service, Internet Service, Contract, and Payment Method give insights into the services customers use and their preferences.
Churn Analysis:
The Churn Reason column is essential for understanding why customers leave, despite being misclassified initially as numerical. Churn
Label and Churn Value correctly identify customers who have churned, aiding in churn prediction and prevention strategies.
Conclusion:
The dataset predominantly consists of well-categorized variables with some needing reclassification for accurate analysis. The
geographical scope limited to California can help tailor state-specific strategies. Overall, this dataset offers valuable insights for
demographic analysis, service usage patterns, and churn prediction. Proper classification and understanding of each variable are crucial
for effective data analysis and subsequent decision-making.
Data Visualization
The provided code defines two functions: plot_stacked_bars and annotate_stacked_bars. The plot_stacked_bars function aims to create
a stacked bar plot from the given DataFrame, with annotations added to each bar representing the respective values. It allows for
customization of the plot's title, size, rotation of x-axis labels, and legend placement. The annotate_stacked_bars function is a helper
function used internally to add value annotations to the bars of the stacked plot. It iterates over the plotted bars, calculates the annotation
value, and annotates non-zero values onto the bars. Together, these functions facilitate the visualization of stacked bar plots, particularly
useful for comparing proportions of different categories across multiple groups, such as retention and churn rates in business contexts.
def plot_stacked_bars(dataframe, title_, size_=(18, 10), rot_=0, legend_="upper right"):
"""
Plot stacked bars with annotations
"""
ax = dataframe.plot(
kind="bar",
stacked=True,
figsize=size_,
rot=rot_,
title=title_
)
# Annotate bars
annotate_stacked_bars(ax, textsize=14)
# Rename legend
plt.legend(["Retention", "Churn"], loc=legend_)
# Labels
plt.ylabel("Company base (%)")
plt.show()
def annotate_stacked_bars(ax, pad=0.99, colour="white", textsize=13):
"""
Add value annotations to the bars
"""
# Iterate over the plotted rectanges/bars
for p in ax.patches:
# Calculate annotation
value = str(round(p.get_height(),1))
# If value is 0 do not annotate
if value == '0.0':
continue
ax.annotate(
value,
((p.get_x()+ p.get_width()/2)*pad-0.05, (p.get_y()+p.get_height()/2)*pad),
color=colour,
size=textsize
)
churn = telco_data[['CustomerID', 'Churn Value']]
churn.columns = ['CustomerID', 'Churn Value']
churn_total = churn.groupby(churn['Churn Value']).count()
churn_percentage = churn_total / churn_total.sum() * 100
The code below plot_stacked_bars(churn_percentage.transpose(), "Churning status", (5, 5), legend_="lower right") utilizes the
plot_stacked_bars function to generate a stacked bar plot representing the churning status. The churn rate, calculated to be 26.5% based
on the descriptive statistics analysis, is depicted in this visualization. The plot size is set to (5, 5), and the legend is positioned in the
lower-right corner for improved readability. The purpose of this plot is to visually convey the distribution of churn and retention rates
across different categories or groups, providing valuable insights into customer behavior within the dataset.
plot_stacked_bars(churn_percentage.transpose(), "Churning status", (5, 5), legend_="lower right")
Location
location = telco_data[['State', 'City', 'Churn Value','CustomerID']]
location = location.groupby([location['State'], location['Churn Value']])['CustomerID'].count().unstack(level=1).fill
location_churn = (location.div(location.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(location_churn, 'Location churn', rot_=30)
# divide categorical columns in list to easily plot them
customer_info = telco_data[["Gender", "Senior Citizen", "Partner", "Dependents","Churn Value","CustomerID"]]
services = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method"]]
Customers Segmentation
The provided code segment aggregates customer information based on two key variables, 'Senior Citizen' status and 'Churn Value,'
which indicates whether a customer has churned or not. By grouping the data and counting the occurrences of customer IDs for each
combination of 'Senior Citizen' and 'Churn Value,' the code constructs a contingency table. This table is then transformed to represent the
percentage of churned customers relative to the total number of customers within each 'Senior Citizen' group. The resulting DataFrame,
customers_churn, showcases the churn rates for both senior and non-senior citizens, sorted in descending order based on the churn rate
for churned customers ('Churn Value' equals 1). This analysis provides insights into how churn rates vary between senior and non-senior
citizen customers, aiding in understanding the impact of age on customer attrition.
customers = customer_info.groupby([customer_info['Senior Citizen'],customer_info['Churn Value']])['CustomerID'].count
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
customers = customer_info.groupby([customer_info['Partner'],customer_info['Churn Value']])['CustomerID'].count().unst
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
customers = customer_info.groupby([customer_info['Dependents'],customer_info['Churn Value']])['CustomerID'].count()
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
Customer Sumary Report
Several customer factors influence the churn rate as I explored below
Senior Citizen Status: Customers who are not senior citizens have a notably higher churn rate of 41.7%, compared to senior citizens,
who exhibit a lower churn rate of 23.6%.
Partner Status: Customers without a partner experience a churn rate of 33%, while those with a partner have a lower churn rate of
19.7%.
Dependents: Customers without dependents have a higher churn rate of 32.6%, whereas customers with dependents demonstrate a
significantly lower churn rate of 6.5%.
Proposed Improvements
Senior Citizen Status:
Implement targeted retention programs specifically tailored to non-senior citizen customers to address their needs and concerns,
potentially offering personalized incentives or discounts to encourage loyalty.
Partner Status:
Focus on enhancing the customer experience for those without partners by offering bundled services or loyalty rewards. Additionally,
consider implementing referral programs to incentivize existing customers to refer their partners, thereby potentially reducing churn
among this group.
Dependents:
Develop family-oriented packages or services aimed at attracting and retaining customers with dependents. Highlight the value of family
plans or discounts to emphasize the benefits of staying with the company for the long term.
Overall, prioritize personalized communication and offerings tailored to each customer segment's unique characteristics and preferences
to improve retention rates and foster long-term customer loyalty.
Services
services_info = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies","Churn Value","CustomerID"]]
The provided code segment organizes service-related data based on two criteria: 'Phone Service' status and 'Churn Value,' denoting
customer churn. By grouping the data accordingly and tallying the occurrences of customer IDs for each combination of 'Phone Service'
and 'Churn Value,' the code constructs a contingency table. This table is then transformed to represent the percentage of churned
customers relative to the total number of customers within each 'Phone Service' category. The resulting DataFrame, services_churn,
displays the churn rates for customers with and without phone service, sorted in descending order based on the churn rate for churned
customers ('Churn Value' equals 1). This analysis offers insights into how churn rates vary based on phone service subscription status,
aiding in understanding the impact of this service on customer retention.
services = services_info.groupby(['Phone Service', 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(services_churn, 'services_churn', rot_=30)
The provided code iterates over each service in the dataset, excluding 'Churn Value' and 'CustomerID,' and calculates the churn
percentages for each service. It groups the data by each service and churn value, constructs a contingency table, and then transforms it
to represent the percentage of churned customers relative to the total number of customers for each service category. The resulting
churn percentages are then visualized using stacked bar plots, with each plot illustrating the retention and churn rates for a specific
service. This iterative process allows for a comprehensive examination of churn behavior across different services offered by the
company, facilitating insights into service-specific customer retention challenges.
import matplotlib.pyplot as plt
# Grouping by each service and calculating churn percentages for each service
for service in services_info.columns[:-2]: # Exclude 'Churn Value' and 'CustomerID'
services = services_info.groupby([service, 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(services_churn, f'Retention and Churn Rate for {service}')
services_info
Phone
Service
Multiple
Lines
Internet
Service
Online
Security
Online
Backup
Device
Protection
Tech
Support
Streaming
TV
Streaming
Movies
Churn
CustomerID
Value
0
Yes
No
DSL
Yes
Yes
No
No
No
No
1
3668QPYBK
1
Yes
No
Fiber
optic
No
No
No
No
No
No
1
9237HQITU
2
Yes
Yes
Fiber
optic
No
No
Yes
No
Yes
Yes
1
9305CDSKC
3
Yes
Yes
Fiber
optic
No
No
Yes
Yes
Yes
Yes
1
7892POOKP
4
Yes
Yes
Fiber
optic
No
Yes
Yes
No
Yes
Yes
1
0280XJGEX
...
...
...
...
...
...
...
...
...
...
...
...
7038
Yes
No
No
No
internet
service
No
internet
service
No internet
service
No
internet
service
No internet
service
No internet
service
0
2569WGERO
7039
Yes
Yes
DSL
Yes
No
Yes
Yes
Yes
Yes
0
6840RESVB
7040
Yes
Yes
Fiber
optic
No
Yes
Yes
No
Yes
Yes
0
2234XADUH
7041
No
No phone
service
DSL
Yes
No
No
No
No
No
0
4801-JZAZL
7042
Yes
No
Fiber
optic
Yes
No
Yes
Yes
Yes
Yes
0
3186-AJIEK
7043 rows × 11 columns
Services churn report
Multiple Lines and Internet Service Availability:
Customers with multiple lines and access to internet service are less likely to churn, with a churn rate reduction of approximately 24.93%
compared to those without these services.
Type of Internet Service:
The type of internet service also significantly impacts churn rate. Customers with fiber optic internet service exhibit a higher churn rate of
41.89% compared to DSL users, who have a churn rate of 18.96%.
Online Security, Online Backup, Device Protection, and Tech Support:
Customers who do not have online security, online backup, device protection, or tech support services are more likely to churn. The
absence of these services correlates with higher churn rates: online security (41.77% without, 14.61% with), online backup (39.93%
without, 21.53% with), device protection (39.13% without, 22.50% with), and tech support (41.64% without, 15.17% with).
Streaming Services:
The availability of streaming services also affects churn rate. Customers with access to streaming TV and streaming movies are less
likely to churn compared to those without these services: streaming TV (33.52% without, 30.07% with) and streaming movies (33.68%
without, 29.84% with).
Understanding these factors provides valuable insights for strategizing customer retention efforts. Telco can focus on improving and
promoting services such as online security, device protection, and streaming options to mitigate churn and enhance customer
satisfaction. Additionally, offering reliable and diverse internet service options, such as DSL alongside fiber optic, can contribute to
retaining more customers.
Proposed Improvements
Improving Service Offerings:
Enhance the availability and quality of multiple lines and internet service, as customers with access to these services are less likely to
churn. Consider upgrading infrastructure and expanding coverage to ensure reliable and high-speed internet access for all customers.
Internet Service Type:
Evaluate the performance and reliability of fiber optic internet service to address the high churn rate associated with it. Consider offering
incentives or discounts to DSL users to encourage them to remain loyal to the service.
Enhanced Security and Support Services:
Invest in robust online security, online backup, device protection, and tech support services to reduce churn rates. Emphasize the
importance of these services in protecting customers' data and providing timely assistance to address any issues they may encounter.
Streaming Services:
Enhance the availability and diversity of streaming TV and streaming movie options to retain customers. Consider partnering with content
providers to offer exclusive content or bundles that appeal to a wide range of preferences and interests.
By addressing these key factors and improving service offerings, Telco can effectively reduce churn rates and foster long-term customer
satisfaction and loyalty.
Billing Information
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method","Churn Value","CustomerID"]]
The provided code loops through each billing information category, excluding 'Churn Value' and 'CustomerID,' and computes the churn
percentages for each category. It groups the data by each billing category and churn value, constructs a contingency table, and then
transforms it to represent the percentage of churned customers relative to the total number of customers for each billing category. The
resulting churn percentages are visualized using stacked bar plots, with each plot illustrating the retention and churn rates for a specific
billing category. This iterative process enables a detailed examination of churn behavior across various billing-related factors, facilitating
insights into their impact on customer retention.
# Grouping by each billing information and calculating churn percentages for each category
for service in billing_info.columns[:-2]: # Exclude 'Churn Value' and 'CustomerID'
services = billing_info.groupby([service, 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(services_churn, f'Retention and Churn Rate for {service}')
Billing Information Report
Contract Type:
Customers with a month-to-month contract exhibit a significantly higher churn rate of 42.71% compared to those with one-year contracts
(11.27%) and two-year contracts (2.83%). This suggests that offering longer-term contracts may help reduce churn rates.
Paperless Billing:
Customers who opt for paperless billing have a higher churn rate of 33.57% compared to those who prefer traditional billing methods
(16.33%). Including billing receipts in a more appealing and convinient manner could encourage customers and build trust in the payment
systems.
Payment Method:
Customers who use electronic checks as their payment method have the highest churn rate at 45.29%, followed by those who use mailed
checks (19.11%). In contrast, customers who use bank transfer (automatic) and credit card (automatic) have lower churn rates of 16.71%
and 15.24%, respectively. Offering incentives or discounts for customers to switch to automatic payment methods could help reduce
churn rates associated with electronic checks.
Proposed Improvements
Contract Type:
To mitigate the high churn rate associated with month-to-month contracts, businesses could introduce incentives for customers to sign up
for longer-term contracts, such as discounted rates or additional benefits. Emphasizing the advantages of stability and predictability that
come with longer commitments may also help retain customers.
Paperless Billing:
To address the higher churn rate among customers using paperless billing, companies could focus on improving the user experience and
providing clearer billing information online. Implementing user-friendly interfaces, providing detailed billing breakdowns, and offering
personalized billing notifications could enhance customer satisfaction and trust in the paperless billing process.
Payment Method:
Since electronic checks have the highest churn rate, companies could incentivize customers to switch to more reliable payment methods
such as bank transfers or credit card payments. Offering discounts, rewards, or exclusive deals for customers who opt for automatic
payment methods could encourage the transition and reduce churn associated with electronic checks. Additionally, ensuring seamless
and secure payment processes can help build trust and confidence in the chosen payment methods.
Top Churn Features
Here is a list of the highest churn rates based on my analysis so far let me do a little feature engineering and dive deeper into the
feature analysis
Payment Method:
Electronic Check: 45.29% Mailed Check: 19.11% Bank Transfer (Automatic): 16.71% Credit Card (Automatic): 15.24%
Contract Type:
Month-to-Month: 42.71% Paperless Billing: 33.57% One Year: 11.27% Two Year: 2.83%
Internet Service:
Fiber Optic: 41.89% DSL: 18.96% Online Security:
No: 41.77% Yes: 14.61%
Tech Support:
No: 41.64% Yes: 15.17%
Streaming TV:
No: 33.52% Yes: 30.07%
Streaming Movies:
No: 33.68% Yes: 29.84%
Device Protection:
No: 39.13% Yes: 22.50%
Online Backup:
No: 39.93% Yes: 21.53%
Partner:
No: 33.00% Yes: 19.70%
Senior Citizen:
No: 41.70% Yes: 23.60%
Paperless Billing:
Yes: 33.57% No: 16.33%
Dependents:
No: 32.60% Yes: 6.50%
Feature Engineering
Numerical Analysis
Tenure
The provided code groups the Telco data by 'Tenure Months' and calculates the mean of the 'Churn Value' within each group. The results
are sorted in descending order based on the churn value, indicating the average churn rate across different tenure periods. This analysis
offers insights into how customer churn varies depending on the length of their tenure with the company.
telco_data.groupby(['Tenure Months']).agg({'Churn Value': 'mean'}).sort_values(by='Churn Value', ascending=False)
Churn Value
Tenure Months
1
-
2
-
5
-
4
-
3
-
...
...
63
-
64
-
71
-
72
-
0
-
73 rows × 1 columns
The data indicates that customers who have been clients for just 1 month have a high churn rate of approximately 62%. This likelihood
decreases as tenure increases. Notably, customers at the 2-month mark show a significant drop to around 52%, and the trend continues
with a more gradual decrease. Interestingly, at 5 months, the churn rate is about 48%, while at 4 months, it’s approximately 47%,
showing a small but meaningful difference. This suggests that moving past the 4-month mark could be a critical milestone for customer
retention. As clients stay longer, the likelihood of churn significantly drops, reaching as low as around 5.6% at 63 months, and further
down to about 1.7% at 72 months. Clients who have been with the company for 72 months show the lowest churn rate, highlighting the
importance of long-term client relationships in reducing churn.
telco_data.describe(include='number')
Count
count 7043.0
mean
Zip Code
Latitude
Longitude
Tenure
Months
Monthly
Charges
Total
Charges
Churn
Value
Churn
Score
CLTV
-
-
-
-
-
-
-
-
-
-
-
std
0.0
-
-
-
-
min
-
-
-
-
-
-
-
-
25%
-
-
-
-
-
-
-
-
50%
-
-
-
-
-
-
-
75%
-
-
-
-
-
-
-
max
-
-
-
-
-
-
-
numeric=telco_data[['Zip Code','Latitude','Longitude','Tenure Months','Monthly Charges','Total Charges','Churn Value'
numeric
Zip Code
Latitude
Longitude Tenure Months Monthly Charges Total Charges
Churn Value Churn Score CLTV
0
- -
2
53.85
108.15
1
86
3239
1
- -
2
70.70
151.65
1
67
2701
2
- -
8
99.65
820.50
1
86
5372
3
- -
28
104.80
3046.05
1
84
5003
4
- -
49
103.70
5036.30
1
89
5340
...
...
...
...
...
...
...
7038
- -
72
21.15
1419.40
0
45
5306
7039
- -
24
84.80
1990.50
0
59
2140
7040
- -
72
103.20
7362.90
0
71
5560
7041
- -
11
29.60
346.45
0
59
2793
7042
- -
66
105.65
6844.50
0
38
5097
...
...
7043 rows × 9 columns
...
Data Distribution
from matplotlib.colors import ListedColormap
colors = ["windows blue", "amber", "coral", "faded green"]
# plot them as a palette
sns.palplot(sns.xkcd_palette(colors))
numeric_features =['Tenure Months','Monthly Charges','Total Charges','CLTV']
The provided code utilizes Seaborn to create a histogram plot for each numeric feature in the dataset. It sets the color palette using
XKCD colors and creates a figure with a size of 15x5 inches. For each numeric feature, it generates a subplot within a row, ensuring that
each feature has its histogram. The histograms display the distribution of the respective numeric feature values, overlaid with kernel
density estimation (KDE) curves for visual representation. The titles of the subplots indicate the name of the feature being visualized.
Finally, the plots are displayed with tight layout to avoid overlap using plt.tight_layout(), followed by plt.show() to render the plots. This
visualization provides an overview of the distributions of numeric features in the dataset, aiding in understanding their characteristics and
potential relationships.
cmap = sns.xkcd_palette(colors)
plt.figure(figsize=(15, 5))
for i in range(len(numeric_features)):
plt.subplot(1, 4, i + 1)
sns.histplot(numeric[numeric_features[i]], color=cmap[i % len(colors)], kde=True)
title = 'Distribution: ' + numeric_features[i]
plt.title(title)
plt.tight_layout()
plt.show()
The variables "tenure" and "MonthlyCharges" exhibit bimodal distributions, with notable peaks in the ranges of 0 to 70 for "tenure" and 20
to 80 for "MonthlyCharges". This suggests that there are two distinct groups within these variables.
In contrast, the "TotalCharges" variable demonstrates a positively skewed distribution, indicating that most of the data points are
clustered towards the lower end, with a long tail extending towards higher values.
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
# using plotly to plot the boxplot
numeric[numeric_features].iplot(kind='box', title="Boxplots of Numeric Features")
Boxplots of Numeric Features
Tenure Months
Monthly Charges
8000
Total Charges
CLTV
6000
4000
2000
0
Tenure Months
Monthly Charges
Total Charges
CLTV
Export to plot.ly »
The box plots of the features "Tenure Months", "Monthly Charges", "Total Charges", and "CLTV" reveal uneven distributions. The
variability and range of values across these features suggest that the data is not uniformly scaled. This unevenness can obscure the true
relationships between these features and the churn rate. To ensure that the features are on a comparable scale and to provide a clearer
representation of their relationships, it is recommended to apply scaling techniques, such as standardization or normalization. Scaling will
help in mitigating the differences in magnitude and variance, leading to more accurate and interpretable results in subsequent analyses.
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
import plotly.express as px
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Create individual box plots for each numeric feature versus churn rate
for feature in numeric_features:
fig = px.box(numeric, x='Churn Value', y=feature, color='Churn Value', title=f'{feature} vs Churn Rate')
fig.show()
Tenure Months vs Churn Rate
Churn Value
70
1
0
60
Tenure Months
-
1
Churn Value
Monthly Charges vs Churn Rate
Churn Value
120
1
0
Monthly Charges
100
80
60
40
20
0
1
Churn Value
Total Charges vs Churn Rate
Churn Value
1
8000
0
Total Charges
6000
4000
2000
0
0
1
Churn Value
CLTV vs Churn Rate
Churn Value
1
6000
0
CLTV
5000
4000
3000
2000
0
1
Churn Value
Logarithmic Scalling to standardise the features
Log-scaled box plots provide insights into data with wide ranges or skewed distributions by highlighting relative differences and
multiplicative factors. When interpreting, focus on the log-transformed nature of the axes and understand that differences in log values
correspond to multiplicative differences in the original data.Interpreting log-scaled values on a box plot involves understanding how the
transformation affects the data and how to read the modified axes.
Understanding Log Scaling:
Purpose: Log scaling is applied to reduce skewness and handle a wide range of values, making patterns in the data more apparent. It is
particularly useful when data spans several orders of magnitude.
Transformation: The log transformation compresses larger values more than smaller ones, which helps in visualizing data that would
otherwise be dominated by large outliers as witnessed above with Total Charges Vs Churn and Tenure Months Vs Churn boxplots.
Reading the Box Plot:
Axes Interpretation: The y-axis (or x-axis if horizontal) represents the log-transformed values. If you used np.log1p for transformation, it
means the axis represents log(1 + value). Logarithmic Nature: The spacing on the axis is not linear. For instance, the difference
between log(1) and log(10) is the same as the difference between log(10) and log(100). Comparing Medians and Quartiles: The box
plot elements (median, quartiles, and whiskers) should be interpreted similarly to non-transformed data, but keep in mind the axis scale is
logarithmic. Median (Central Line): Indicates the central tendency of the log-transformed data. Interquartile Range (Box): Shows the
spread of the middle 50% of the data on the log scale. Whiskers: Extend to the minimum and maximum values within 1.5 times the
interquartile range from the quartiles. Outliers: Data points outside the whiskers are plotted individually and represent extreme values.
Comparing Groups:
When comparing groups (such as "Retention" vs. "Churn"), observe the following:
Central Tendencies: Compare the medians of the groups. Differences in medians indicate differences in central tendencies on the log
scale.
Spread and Variability: Compare the IQRs and whiskers. Differences in these indicate differences in variability and spread on the log
scale. Relative Differences: Since the axis is logarithmic, differences between values should be interpreted in terms of ratios or
multiplicative factors rather than absolute differences.
# Apply log scaling to the numeric features
log_scaled_numeric = numeric[numeric_features].apply(lambda x: np.log1p(x))
# Add the Churn Value column back to the log-scaled dataframe
log_scaled_numeric['Churn Value'] = numeric['Churn Value']
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Plotting the log-scaled boxplots
for feature in numeric_features:
log_scaled_numeric.iplot(kind='box', y=feature, title=f'Log-Scaled {feature} Boxplot', asFigure=True).show()
Log-Scaled Tenure Months Boxplot
Tenure Months
Monthly Charges
8
Total Charges
CLTV
Churn Value
6
4
2
0
Te
nu
re
Mo
nt
hs
Mo
nt
hly
Ch
ar
ge
s
To
tal
Ch
ar
ge
s
CL
TV
Ch
ur
nV
alu
e
Log-Scaled Monthly Charges Boxplot
Tenure Months
Monthly Charges
8
Total Charges
CLTV
Churn Value
6
4
2
0
Te
nu
re
Mo
nt
hs
Mo
nt
hly
Ch
ar
ge
s
To
tal
Ch
ar
ge
s
CL
TV
Ch
ur
nV
alu
e
Log-Scaled Total Charges Boxplot
Tenure Months
Monthly Charges
8
Total Charges
CLTV
Churn Value
6
4
2
0
Te
nu
re
Mo
nt
hs
Mo
nt
hly
Ch
ar
ge
s
To
tal
Ch
ar
ge
s
CL
TV
Ch
ur
nV
alu
e
Log-Scaled CLTV Boxplot
Tenure Months
Monthly Charges
8
Total Charges
CLTV
Churn Value
6
4
2
0
Te
nu
re
Mo
nt
hs
Mo
nt
hly
Ch
ar
ge
s
To
tal
Ch
ar
ge
s
CL
TV
Ch
ur
nV
alu
e
# Apply log scaling to the numeric features
log_scaled_numeric = numeric[numeric_features].apply(lambda x: np.log1p(x))
# Add the Churn Value column back to the log-scaled dataframe
log_scaled_numeric['Churn Value'] = numeric['Churn Value']
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Create individual box plots for each log-scaled feature versus churn value
for feature in numeric_features:
fig = px.box(log_scaled_numeric, x='Churn Value', y=feature, color='Churn Value', title=f'Log-Scaled {feature} vs
fig.show()
Log-Scaled Tenure Months vs Churn Value
4.5
Churn Value
1
4
0
3.5
Tenure Months
-
1
Churn Value
Log-Scaled Monthly Charges vs Churn Value
Churn Value
1
0
Monthly Charges
4.5
4
3.5
3
0
1
Churn Value
Log-Scaled Total Charges vs Churn Value
Churn Value
9
1
0
Total Charges
8
7
6
5
4
3
0
1
Churn Value
Log-Scaled CLTV vs Churn Value
8.8
Churn Value
1
0
8.6
CLTV
8.4
8.2
8
7.8
7.6
0
1
Churn Value
The analysis shows as the monthly charges increase so does the churn rate increase as seen in the mean logarithmic rate of 4.390
exhibiting higher churn rates compared to lower rates. There are some outliers within the data that indicate that further investigation
should be conducted on the monthly charges data.Tenure Months also has outliers towards the lower part of the whiskers indicating a
further need for analysis as well.
Analysis reveals that as 'Tenure Months' and 'CLTV' increase, 'Churn Value' tends to decrease, indicating longer customer tenure and
higher CLTV correlate with lower churn rates. Conversely, higher 'Monthly Charges' and 'Total Charges' do not exhibit a clear trend with
churn, suggesting that price alone may not be a significant factor in customer retention. This insight suggests that strategies aimed at
increasing tenure and CLTV may be effective in reducing churn rates, while further investigation is warranted to understand the
relationship between pricing and churn more comprehensively.
log_scaled_numeric
Tenure Months Monthly Charges Total Charges
CLTV Churn Value
0
-
-
-
1
1
-
-
-
1
2
-
-
-
1
3
-
-
-
1
4
-
-
-
1
...
...
...
7038
-
7039
...
...
...
-
-
0
-
-
-
0
7040
-
-
-
0
7041
-
-
-
0
7042
-
-
-
0
7043 rows × 5 columns
Data Preparation and Preprocessing
# divide categorical columns in list to easily plot them
customer_info = telco_data[["Gender", "Senior Citizen", "Partner", "Dependents","Churn Value","CustomerID"]]
services = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method"]]
customers=telco_data[["Gender", "Senior Citizen", "Partner"]]
services_data=telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing= telco_data[["Contract", "Paperless Billing", "Payment Method"]]
billing
Contract Paperless Billing
Payment Method
0 Month-to-month
Yes
Mailed check
1 Month-to-month
Yes
Electronic check
2 Month-to-month
Yes
Electronic check
3 Month-to-month
Yes
Electronic check
4 Month-to-month
Yes
Bank transfer (automatic)
...
...
...
...
7038
Two year
Yes
Bank transfer (automatic)
7039
One year
Yes
Mailed check
7040
One year
Yes
Credit card (automatic)
7041 Month-to-month
Yes
Electronic check
7042
Yes
Bank transfer (automatic)
Two year
7043 rows × 3 columns
categorical_features=["Gender", "Senior Citizen", "Partner","Dependents","Phone Service", "Multiple Lines", "Internet
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies","Contract", "Paperless Billing", "Payment Method"]
The code below utilizes the LabelEncoder from the scikit-learn library to transform categorical features within the Telco dataset. It first
creates a deep copy of the original DataFrame to preserve the data integrity. Then, it initializes the LabelEncoder and iterates through
each categorical feature in the dataset. For each feature, it applies the label encoding transformation using the fit_transform method,
which assigns numerical labels to the categories. It also prints the unique values before and after encoding, providing insight into the
transformation. Finally, it displays the transformed DataFrame, showcasing the encoded categorical features. This process is useful for
preparing categorical data for machine learning algorithms that require numerical inputs.
from sklearn.preprocessing import LabelEncoder
# Create a deep copy of the DataFrame
cat = telco_data.copy(deep=True)
# Initialize LabelEncoder
le = LabelEncoder()
print('Label Encoder Transformation')
for feature in categorical_features:
cat[feature] = le.fit_transform(cat[feature])
print(f"{feature} : {cat[feature].unique()} = {le.inverse_transform(cat[feature].unique())}")
# Display the transformed DataFrame
print(cat.head())
Label Encoder Transformation
Gender : [1 0] = ['Male' 'Female']
Senior Citizen : [0 1] = ['No' 'Yes']
Partner : [0 1] = ['No' 'Yes']
Dependents : [0 1] = ['No' 'Yes']
Phone Service : [1 0] = ['Yes' 'No']
Multiple Lines : [0 2 1] = ['No' 'Yes' 'No phone service']
Internet Service : [0 1 2] = ['DSL' 'Fiber optic' 'No']
Online Security : [2 0 1] = ['Yes' 'No' 'No internet service']
Online Backup : [2 0 1] = ['Yes' 'No' 'No internet service']
Device Protection : [0 2 1] = ['No' 'Yes' 'No internet service']
Tech Support : [0 2 1] = ['No' 'Yes' 'No internet service']
Streaming TV : [0 2 1] = ['No' 'Yes' 'No internet service']
Streaming Movies : [0 2 1] = ['No' 'Yes' 'No internet service']
Contract : [0 2 1] = ['Month-to-month' 'Two year' 'One year']
Paperless Billing : [1 0] = ['Yes' 'No']
Payment Method : [3 2 0 1] = ['Mailed check' 'Electronic check' 'Bank transfer (automatic)'
'Credit card (automatic)']
CustomerID Count
Country
State
City Zip Code \
0 3668-QPYBK
1 United States California Los Angeles--HQITU
1 United States California Los Angeles--CDSKC
1 United States California Los Angeles--POOKP
1 United States California Los Angeles--XJGEX
1 United States California Los Angeles
90015
0
1
2
3
4
Lat Long-, -, -, -, -, -
0
1
2
3
4
Paperless Billing
1
1
1
1
1
0
1
2
3
4
Churn Label
Yes
Yes
Yes
Yes
Yes
Latitude-
Payment Method
3
2
2
2
0
Churn Value
1
1
1
1
1
Longitude
-
-
-
-
-
Gender
1
0
0
0
1
Monthly Charges-
Churn Score-
CLTV-
...
...
...
...
...
...
Contract
0
0
0
0
0
Total Charges-
\
\
Churn Reason
Competitor made better offer
Moved
Moved
Moved
Competitor had better devices
[5 rows x 33 columns]
cat.columns
Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
'Multiple Lines', 'Internet Service', 'Online Security',
'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
'Churn Score', 'CLTV', 'Churn Reason'],
dtype='object')
cat.drop(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
'Lat Long', 'Latitude', 'Longitude','Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
'Churn Score', 'CLTV', 'Churn Reason'], axis=1,inplace=True)
cat
Gender
Senior
Tenure Phone Multiple Internet
Online Online
Device
Tech Streaming Strea
Partner Dependents
Citizen
Months Service
Lines Service Security Backup Protection Support
TV
0
1
0
0
0
2
1
0
0
2
2
0
0
0
1
0
0
0
1
2
1
0
1
0
0
0
0
0
2
0
0
0
1
8
1
2
1
0
0
2
0
2
3
0
0
1
1
28
1
2
1
0
0
2
2
2
4
1
0
0
1
49
1
2
1
0
2
2
0
2
...
...
...
...
...
...
...
...
...
...
...
...
...
...
7038
0
0
0
0
72
1
0
2
1
1
1
1
1
7039
1
0
1
1
24
1
2
0
2
0
2
2
2
7040
0
0
1
1
72
1
2
1
0
2
2
0
2
7041
0
0
1
1
11
0
1
0
2
0
0
0
0
7042
1
0
0
0
66
1
0
1
2
0
2
2
2
7043 rows × 17 columns
The provided code concatenates two clean and analyzed dataframes into one unified dataframe, facilitating easier interpretation by
machine learning algorithms. By combining the dataframes, the algorithm gains access to a comprehensive set of features, potentially
enhancing its predictive performance. Additionally, scaling is applied to the data to ensure even distribution across all features. Scaling is
crucial for algorithms sensitive to feature magnitudes, such as those based on distance metrics like k-nearest neighbors or those
employing gradient descent optimization. By scaling the data, each feature is transformed to have a similar scale, preventing features
with larger magnitudes from dominating the algorithm's learning process. This normalization process promotes fair and effective feature
representation, contributing to more accurate and reliable model predictions. Overall, the concatenated dataframe and scaled data
collectively optimize the dataset for machine learning tasks, improving the algorithm's ability to discern patterns and make informed
predictions.
final_df = pd.concat([cat, log_scaled_numeric], axis=1)
final_df
Gender
Senior
Tenure Phone Multiple Internet
Online Online
Streaming Streaming
Partner Dependents
...
Contract
Citizen
Months Service
Lines Service Security Backup
TV
Movies
0
1
0
0
0
2
1
0
0
2
2 ...
0
0
0
1
0
0
0
1
2
1
0
1
0
0 ...
0
0
0
2
0
0
0
1
8
1
2
1
0
0 ...
2
2
0
3
0
0
1
1
28
1
2
1
0
0 ...
2
2
0
4
1
0
0
1
49
1
2
1
0
2 ...
2
2
0
...
...
...
...
...
...
...
...
...
...
... ...
...
...
...
7038
0
0
0
0
72
1
0
2
1
1 ...
1
1
2
7039
1
0
1
1
24
1
2
0
2
0 ...
2
2
1
7040
0
0
1
1
72
1
2
1
0
2 ...
2
2
1
7041
0
0
1
1
11
0
1
0
2
0 ...
0
0
0
7042
1
0
0
0
66
1
0
1
2
0 ...
2
2
2
7043 rows × 22 columns
This code checks for any duplicates in the final dataframe
#show if there are full duplicates
final_df.duplicated().sum()
0
This code checks for the sum of null values and shows that the data contains 11 null values in Total Charges columns. Tsk! something
needs to be done about it
final_df.isnull().sum()
Gender
Senior Citizen
Partner
Dependents
Tenure Months
Phone Service
Multiple Lines
Internet Service
Online Security
Online Backup
Device Protection
Tech Support
Streaming TV
Streaming Movies
Contract
Paperless Billing
Payment Method
Tenure Months
Monthly Charges
Total Charges
CLTV
Churn Value
dtype: int64
-
You guessed right i did something about it. I imputed the null values with their mean, cool right!
# Impute median values to the null values in 'Total Charges'
median_total_charges = final_df['Total Charges'].median()
final_df['Total Charges'].fillna(median_total_charges, inplace=True)
final_df.isnull().sum()
Gender
Senior Citizen
Partner
Dependents
Tenure Months
Phone Service
Multiple Lines
Internet Service
Online Security
Online Backup
Device Protection
Tech Support
Streaming TV
Streaming Movies
Contract
Paperless Billing
Payment Method
Tenure Months
Monthly Charges
Total Charges
CLTV
Churn Value
dtype: int64
-
Modelling
1.Import Packages
Let's start by importing the warnings.These warnings typically indicate potential changes in behavior or deprecation of certain features
that might occur in future versions of libraries or Python itself. By filtering out these warnings with warnings.filterwarnings("ignore",
category=FutureWarning), the code ensures a cleaner output without these specific warning messages, which can help improve
readability during code execution.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
The code imports essential libraries and modules for data analysis and machine learning in Python. It includes pandas and numpy for
data manipulation and numerical computations, matplotlib for plotting, and sklearn for machine learning tasks. Specific modules from
sklearn are imported for data preprocessing, model training, evaluation, and ensemble methods such as random forests, k-nearest
neighbors, support vector machines, gradient boosting, and AdaBoost. These tools collectively enable comprehensive data analysis and
facilitate the development of machine learning models for various tasks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import ListedColormap
from sklearn.feature_extraction.text import CountVectorizer
from
from
from
from
from
from
from
sklearn import metrics
sklearn.model_selection import train_test_split
sklearn.model_selection import cross_val_score
sklearn.ensemble import RandomForestClassifier
sklearn.neighbors import KNeighborsClassifier
sklearn.svm import SVC
sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
Let's Build Some Classification Models
# Define the names of the classifiers
names = ['Nearest Neighbors','Linear SVM', 'RBF SVM','RandomForest',
'AdaBoost','GradientBoost']
# Define the classifiers with their respective hyperparameters
classifiers = [
GradientBoostingClassifier(),
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
AdaBoostClassifier()
]
# Make a copy of our data
train_df = final_df.copy()
# Separate target variable from independent variables
y = final_df['Churn Value']
X = final_df.drop(columns=['Churn Value'])
print(X.shape)
print(y.shape)
(7043, 21)
(7043,)
The code divides the dataset into training and testing subsets using the train_test_split function. It assigns features and target labels for
both sets, specifying that 25% of the data will be allocated to the testing set while maintaining the random state for reproducibility. Finally,
it prints the shapes of the training and testing sets to confirm the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(5282, 21)
(5282,)
(1761, 21)
(1761,)
The code initializes empty lists and dictionaries to store evaluation metrics, trained models, confusion matrices, and classification reports
for each classifier. It iterates over each classifier, fitting the model to the training data, predicting on both training and testing sets, and
calculating evaluation metrics such as accuracy, precision, recall, and F1 scores. The results are saved to respective dictionaries and
appended to the results list. Finally, the results are converted into a DataFrame for easy visualization.
# Empty lists to store results
results = [] # Store evaluation metrics for each classifier
models = {} # Store trained models
confusion = {} # Store confusion matrices for each classifier
class_report = {} # Store classification reports for each classifier
# Iterate over each classifier
for name, clf in zip(names, classifiers):
print ('Fitting {:s} model...'.format(name))
# Measure the time taken to fit the model
run_time = %timeit -q -o clf.fit(X_train, y_train)
print ('... predicting')
# Predict on the training data
y_pred = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print ('...
# Calculate
accuracy =
precision =
recall
=
f1
f1_test
scoring')
evaluation metrics
metrics.accuracy_score(y_train, y_pred)
metrics.precision_score(y_train, y_pred)
metrics.recall_score(y_train, y_pred)
= metrics.f1_score(y_train, y_pred)
= metrics.f1_score(y_test, y_pred_test)
# Save the results to dictionaries
models[name] = clf
confusion[name] = metrics.confusion_matrix(y_train, y_pred)
class_report[name] = metrics.classification_report(y_train, y_pred)
# Append results to the list
results.append([name, accuracy, precision, recall, f1, f1_test, run_time.best])
# Convert results to DataFrame for easy visualisation
results = pd.DataFrame(results, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1 Train', 'F1 Test', 'Tra
results.set_index('Classifier', inplace= True)
print ('... All done!')
Fitting Nearest Neighbors model...
... predicting
... scoring
Fitting Linear SVM model...
... predicting
... scoring
Fitting RBF SVM model...
... predicting
... scoring
Fitting RandomForest model...
... predicting
... scoring
Fitting AdaBoost model...
... predicting
... scoring
Fitting GradientBoost model...
... predicting
... scoring
... All done!
results.sort_values('F1 Train', ascending=False)
Accuracy Precision
Recall
F1 Train
F1 Test Train Time
Classifier
RandomForest
-
-
-
Linear SVM
-
-
-
Nearest Neighbors
-
-
-
GradientBoost
-
-
-
RBF SVM
-
-
-
AdaBoost
-
-
-
A summary of the evaluation metrics for each classifier:
RandomForest:
Achieved the highest accuracy of 98.31% and precision of 97.98%, indicating strong performance in correctly identifying both positive
and negative cases. However, its recall score of 95.47% suggests that it may miss some instances of positive cases. The F1 scores for
both training and testing data are high, indicating a good balance between precision and recall. The training time is relatively high at 2.53
seconds.
Linear SVM:
Achieved an accuracy of 86.77% and a precision of 76.52%, with a recall of 70.66%. While it demonstrates acceptable performance, its
F1 scores and training time are relatively lower compared to RandomForest.
Nearest Neighbors:
Achieved an accuracy of 83.81% and a precision of 72.93%, with a recall of 59.78%. The F1 scores are moderate, and the training time
is considerably higher than Linear SVM.
GradientBoost:
Achieved an accuracy of 81.39% and a precision of 66.25%, with a recall of 57.59%. Similar to Nearest Neighbors, it shows moderate
performance in terms of accuracy and precision, with relatively lower recall. The training time is comparable to RandomForest.
RBF SVM:
Achieved an accuracy of 81.07% and a precision of 67.93%, with a recall of 51.17%. While its precision is higher compared to
GradientBoost, its recall is relatively lower. The training time is similar to RandomForest.
AdaBoost
Achieved an accuracy of 79.33% and a precision of 76.43%, with a recall of 29.34%. It shows the lowest recall among all classifiers,
indicating its limitation in identifying positive cases. However, it demonstrates a relatively high precision. The training time is the lowest
among all classifiers.
# Display confusion matrices and classification reports
for name, matrix in confusion.items():
print(f"Confusion Matrix for {name}:")
print(matrix)
print()
for name, report in class_report.items():
print(f"Classification Report for {name}:")
print(report)
print()
Confusion Matrix for Nearest Neighbors:
[[-]
[ 551 819]]
Confusion Matrix for Linear SVM:
[[-]
[ 402 968]]
Confusion Matrix for RBF SVM:
[[-]
[ 669 701]]
Confusion Matrix for RandomForest:
[[3885
27]
[ 62 1308]]
Confusion Matrix for AdaBoost:
[[-]
[ 968 402]]
Confusion Matrix for GradientBoost:
[[-]
[ 581 789]]
Classification Report for Nearest Neighbors:
precision
recall f1-score
support
0
1
0.87
0.73
0.92
0.60
0.89
0.66
-
accuracy
macro avg
weighted avg
0.80
0.83
0.76
0.84
-
-
Classification Report for Linear SVM:
precision
recall f1-score
support
0
1
0.90
0.77
0.92
0.71
0.91
0.73
-
accuracy
macro avg
weighted avg
0.83
0.86
0.82
0.87
-
-
Classification Report for RBF SVM:
precision
recall f1-score
support
0
0.84
0.92
0.88
3912
1
0.68
0.51
0.58
1370
accuracy
macro avg
weighted avg
0.76
0.80
0.71
0.81
-
-
Classification Report for RandomForest:
precision
recall f1-score
support
0
1
0.98
0.98
0.99
0.95
0.99
0.97
-
accuracy
macro avg
weighted avg
0.98
0.98
0.97
0.98
-
-
Classification Report for AdaBoost:
precision
recall f1-score
support
0
1
0.80
0.76
0.97
0.29
0.87
0.42
-
accuracy
macro avg
weighted avg
0.78
0.79
0.63
0.79
-
-
Classification Report for GradientBoost:
precision
recall f1-score
support
0
1
0.86
0.66
0.90
0.58
0.88
0.62
-
accuracy
macro avg
weighted avg
0.76
0.81
0.74
0.81
-
-
Based on the provided confusion matrices and classification reports:
Nearest Neighbors (KNN):
Performs reasonably well with an accuracy of 84%. Better at identifying negatives (0) than positives (1), as indicated by higher precision
and recall for class 0. Lower precision and recall for class 1 suggest some misclassification of positives.
Linear SVM:
Achieves an accuracy of 87%, indicating good performance. Shows a slight imbalance in precision and recall between the two classes,
with class 0 having higher values. Overall, precision and recall are balanced, indicating robust performance.
RBF SVM:
Demonstrates an accuracy of 81%. Shows relatively lower precision and recall for class 1, suggesting difficulty in correctly identifying
positives. Class 0 has higher precision and recall, indicating better performance in identifying negatives.
RandomForest:
Performs impressively well with an accuracy of 98%. High precision and recall for both classes indicate excellent performance in
classifying both positives and negatives. Minimal misclassification evident from the confusion matrix (27 false positives and 62 false
negatives).
AdaBoost:
Displays an accuracy of 79%. High precision but low recall for class 1 suggests it may be overly conservative in predicting positives.
Imbalance in precision and recall indicates the model's struggle with correctly identifying positives.
GradientBoost:
Achieves an accuracy of 81%. Slightly lower precision and recall for class 1 compared to class 0. Overall, performs reasonably well but
shows some difficulty in correctly identifying positives.
In summary, RandomForest stands out as the best performer with high accuracy and balanced precision and recall for both classes.
Linear SVM also performs well with balanced precision and recall, while other models show varying degrees of performance with
strengths and weaknesses in different aspects of classification.
Instanciate A RandomForest Instance
model = RandomForestClassifier(
n_estimators=1000
)
model.fit(X_train, y_train)
▾
RandomForestClassifier
i
?
RandomForestClassifier(n_estimators=1000)
Model Understanding
A simple way of understanding the results of a model is to look at feature importances. Feature importances indicate the importance of a
feature within the predictive model, there are several ways to calculate feature importance, but with the Random Forest classifier, we're
able to extract feature importances using the built-in method on the trained model. In the Random Forest case, the feature importance
represents the number of times each feature is used for splitting across all trees.
models
{'Nearest Neighbors': GradientBoostingClassifier(),
'Linear SVM': KNeighborsClassifier(n_neighbors=3),
'RBF SVM': SVC(C=0.025, kernel='linear'),
'RandomForest': SVC(C=1, gamma=2),
'AdaBoost': RandomForestClassifier(max_depth=5, max_features=1, n_estimators=10),
'GradientBoost': AdaBoostClassifier()}
Feature Importance Based on all the Models
feature_importances = pd.DataFrame({
'features': X_train.columns,
'importance': clf.feature_importances_
}).sort_values(by='importance', ascending=False).reset_index()
feature_importances
index
features importance
0
20
CLTV
0.20
1
14
Contract
0.14
2
19
Total Charges
0.14
3
18
Monthly Charges
0.14
4
4
Tenure Months
0.10
5
17
Tenure Months
0.08
6
3
Dependents
0.04
7
8
Online Security
0.04
8
16
Payment Method
0.04
9
2
Partner
0.02
10
6
Multiple Lines
0.02
11
11
Tech Support
0.02
12
15
Paperless Billing
0.02
13
0
Gender
0.00
14
13 Streaming Movies
0.00
15
12
Streaming TV
0.00
16
1
Senior Citizen
0.00
17
9
Online Backup
0.00
18
7
Internet Service
0.00
19
5
Phone Service
0.00
20
10
Device Protection
0.00
The feature importance analysis reveals insights into the predictive power of different features in the dataset. Among the features
considered, CLTV (Customer Lifetime Value) emerges as the most influential, with a relative importance score of 0.20, indicating its
significant role in predicting churn. Contract type and both monthly and total charges follow closely, each contributing 0.14 to the
predictive model. Tenure months, appearing twice in the list, also show importance, albeit to a lesser extent, with one instance ranking at
0.10 and the other at 0.08. Other features such as Dependents, Online Security, and Payment Method exhibit modest importance with
scores of 0.04. Interestingly, some features like Gender, Streaming Movies, and Streaming TV carry negligible importance, each scoring
zero, suggesting they have little impact on predicting churn in this context. Overall, this analysis highlights the key predictors of churn
within the dataset, providing valuable insights for targeted intervention strategies.
plt.figure(figsize=(15, 25))
plt.title('Feature Importances')
plt.barh(range(len(feature_importances)), feature_importances['importance'], color='b', align='center')
plt.yticks(range(len(feature_importances)), feature_importances['features'])
plt.xlabel('Importance')
plt.show()
Feature Importances Based on the RandomForest Model
feature_importances = pd.DataFrame({
'features': X_train.columns,
'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False).reset_index()
feature_importances
index
features importance
0
18
Monthly Charges
-
1
19
Total Charges
-
2
20
CLTV
-
3
17
Tenure Months
-
4
4
Tenure Months
-
5
14
Contract
-
6
8
Online Security
-
7
16
Payment Method
-
8
11
Tech Support
-
9
7
Internet Service
-
10
3
Dependents
-
11
10
Device Protection
-
12
2
Partner
-
13
15
Paperless Billing
-
14
6
Multiple Lines
-
15
0
Gender
-
16
9
Online Backup
-
17
1
Senior Citizen
-
18
12
Streaming TV
-
19
13 Streaming Movies
-
20
5
Phone Service
-
Comparing the latest feature importance analysis with the previous findings, Monthly Charges emerge as the most influential feature,
topping the list with a relative importance score of 0.135. This contrasts with the previous analysis where CLTV held the highest
importance score. The shift in importance highlights the dynamic nature of feature relevance in predictive modeling and underscores the
need for periodic reassessment of feature importance.
Interestingly, Total Charges and CLTV, which held significant importance in the previous analysis, follow closely behind Monthly Charges
in the latest findings. Despite their slightly lower importance scores, they remain pivotal predictors of churn.
Moreover, Tenure Months, represented twice in the list with varying importance scores, maintain their relevance in both analyses. While
their importance ranks slightly lower in the latest analysis compared to the previous one, they still demonstrate significant predictive
power.
Contract type, Online Security, and Payment Method also retain their positions in the list of influential features, albeit with minor
fluctuations in importance scores.
Notably, features like Gender, Streaming Movies, and Streaming TV continue to exhibit negligible importance in predicting churn,
corroborating previous observations.
Feature Importance Summary
Understanding feature importance is crucial for effective predictive modeling, particularly in the context of churn prediction. By identifying
the most influential features, businesses can prioritize their resources and strategies to address the underlying factors driving customer
attrition. For instance, knowing that CLTV, Total Charges,Tenure Month, Contract Type and Monthly Charges are significant
predictors allows companies to tailor retention efforts and pricing strategies accordingly. Moreover, insights gained from feature
importance analysis can inform decision-making processes, such as resource allocation for customer retention programs or the
development of personalized offers aimed at reducing churn. By leveraging this knowledge, organizations can optimize their efforts to
retain valuable customers, enhance customer satisfaction, and ultimately improve business performance. Therefore, understanding
feature importance not only enhances the predictive accuracy of churn models but also empowers businesses to proactively mitigate
churn and foster long-term customer loyalty.
Random Forest Evaluation
predictions = model.predict(X_test)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
y_test.value_counts()
Churn Value-
Name: count, dtype: int64
print(f"True positives: {tp}")
print(f"False positives: {fp}")
print(f"True negatives: {tn}")
print(f"False negatives: {fn}\n")
print(f"Accuracy: {metrics.accuracy_score(y_test, predictions)}")
print(f"Precision: {metrics.precision_score(y_test, predictions)}")
print(f"Recall: {metrics.recall_score(y_test, predictions)}")
True positives: 268
False positives: 131
True negatives: 1131
False negatives: 231
Accuracy:-
Precision:-
Recall:-
Hyperparameter Tuning
Important_features = ['Monthly Charges','Total Charges','CLTV','Tenure Months','Tenure Months','Contract','Online Sec
,'Gender']
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Assuming 'X' is your feature matrix and 'y' is your target variable
# Replace 'X' and 'y' with your actual feature matrix and target variable
# Assuming you already have your dataset loaded into 'data'
# Create a DataFrame with important features
important_features = ['Monthly Charges', 'Total Charges', 'CLTV', 'Tenure Months', 'Tenure Months', 'Contract',
'Online Security', 'Payment Method', 'Tech Support', 'Internet Service', 'Dependents',
'Device Protection', 'Partner', 'Paperless Billing', 'Multiple Lines','Gender']
data_subset = train_df[important_features]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_subset, y, test_size=0.2, random_state=42)
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Train the model
rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = rf_model.score(X_test, y_test)
print("Accuracy:", accuracy)
Accuracy:-
from sklearn.model_selection import GridSearchCV
# Define the hyperparameters grid to search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Initialize GridSearchCV with the Random Forest classifier and the hyperparameters grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Perform grid search to find the best hyperparameters
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
# Train a new Random Forest model using the best hyperparameters
best_rf_model = RandomForestClassifier(**best_params)
best_rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = best_rf_model.score(X_test, y_test)
print("Accuracy after hyperparameter tuning:", accuracy)
Accuracy after hyperparameter tuning:-
After performing hyperparameter tuning, the Random Forest classifier achieved an improved accuracy of 80.34%, compared to the initial
accuracy of 79.44%. This enhancement demonstrates the effectiveness of fine-tuning model parameters to achieve better performance.
Summary of Hyperparameter Tuning Process
Define the Hyperparameters Grid:
A range of potential values for various hyperparameters of the Random Forest model is specified. These include the number of trees
(n_estimators), maximum depth of trees (max_depth), minimum number of samples required to split a node (min_samples_split),
minimum number of samples required at each leaf node (min_samples_leaf), and whether to use bootstrap samples (bootstrap).
Initialize GridSearchCV:
The GridSearchCV function is used to systematically explore the defined hyperparameters grid. It evaluates each combination using
cross-validation to determine the optimal set of hyperparameters.
Fit the Model:
The fit method of GridSearchCV is applied to the training data, performing an exhaustive search over the hyperparameters grid. It
identifies the best hyperparameters based on the cross-validation scores.
Train the Final Model:
Using the best hyperparameters identified by GridSearchCV, a new Random Forest model is instantiated and trained on the full training
dataset.
Evaluate the Model:
The final model's performance is evaluated on the test dataset, resulting in the improved accuracy of 80.34%.
Final Remarks on Random Forest Evaluation
Despite the improvement in accuracy to 80.34% after hyperparameter tuning, further data exploration and analysis are required.
Specifically, addressing the imbalance in the dataset is crucial. Techniques such as downsampling the majority class or acquiring
additional data to balance the churn values should be considered to ensure that the model performs equitably across all classes.
Balancing the dataset can significantly enhance the model's ability to predict churn accurately. Additionally, continuous feature
importance analysis and periodic retraining with updated data can help maintain and potentially improve the model's performance over
time. These steps are essential for developing a robust and reliable churn prediction model using the Random Forest classifier.
Reporting
Project Summary Report: Customer Churn Prediction
Overview
The primary goal of this project was to develop a predictive model to identify customers likely to churn, enabling targeted intervention
strategies. I utilized a Kaggle dataset containing various customer attributes and trained multiple machine learning models to predict
churn. After extensive analysis, a Random Forest classifier was chosen due to its superior performance. This report summarizes the
project process, key findings, and recommended action steps based on the predictive model and feature importance analysis.
Data Analysis and Preprocessing
Initial Data Exploration:
The dataset was explored to understand the distribution of features and the target variable (churn). Key features were identified, including
Customer Lifetime Value (CLTV), contract type, monthly charges, and total charges.
Handling Imbalanced Data:
The dataset was imbalanced, with significantly more non-churners than churners. Techniques such as resampling or acquiring additional
data were suggested to address this issue.
Feature Selection:
Important features were selected based on their predictive power, including CLTV, contract type, monthly charges, total charges, tenure
months, online security, payment method, and tech support.
Model Training and Hyperparameter Tuning
Model Training: Various models were trained, including Nearest Neighbors, Linear SVM, RBF SVM, Random Forest, AdaBoost, and
Gradient Boosting. Random Forest emerged as the best performer.
Hyperparameter Tuning:
GridSearchCV was used to optimize the Random Forest classifier's hyperparameters, resulting in an improved accuracy of 80.34%.
Feature Importance Analysis
Key Predictors:
CLTV, monthly charges, total charges, contract type, and tenure months were identified as the most significant predictors of customer
churn. Less Important Features: Features like gender, streaming services (TV and movies), and phone service were found to have
negligible impact on churn prediction.
Impact on Customer Churn
The Random Forest model, with its improved accuracy, provides valuable insights into the factors influencing customer churn. By
understanding the key predictors, businesses can:
Tailor Retention Efforts:
Focus on customers with high CLTV and those with contracts nearing expiration to prevent churn.
Adjust Pricing Strategies:
Analyze and optimize monthly and total charges to improve customer satisfaction and retention.
Enhance Service Offerings:
Improve features like online security and tech support, which were found to be moderately important in predicting churn.
Segment Customers:
Use tenure months to identify and address the needs of long-term and short-term customers differently.
Recommended Action Steps
Balance the Dataset:
Implement techniques such as downsampling the majority class or acquiring more data to ensure an evenly balanced dataset for future
modeling.
Continuous Monitoring and Retraining:
Periodically retrain the model with updated data and conduct feature importance analysis to adapt to changing customer behavior.
Targeted Interventions:
Develop personalized retention strategies based on the key predictors identified, focusing resources on high-risk customers.
Enhanced Customer Engagement:
Proactively engage with customers through personalized offers and improved service quality to reduce churn rates.
Conclusion
This project successfully developed a robust predictive model for customer churn using the Random Forest classifier. The insights gained
from the feature importance analysis and the model's predictions can guide targeted interventions to reduce churn and enhance
customer loyalty. By addressing the imbalances in the dataset and continuously refining the model, businesses can achieve even better
performance and more effectively mitigate customer attrition.
Loading [MathJax]/extensions/Safe.js