Machine Learning | Data science | Data analytics
Predictive Model That Predicts Customer Churn
Introduction
This project focuses on developing a predictive model to identify customers likely to churn, allowing businesses to implement targeted
retention strategies. By analyzing a dataset containing various customer attributes, we aimed to uncover the key factors driving customer
attrition. The project involved training multiple machine learning models, with the Random Forest classifier ultimately chosen for its
superior performance. Key steps included data preprocessing, handling imbalanced data, feature selection, model training, and
hyperparameter tuning. The results provided actionable insights into customer behavior, enabling businesses to tailor their efforts to
retain valuable customers and reduce churn.
I'll start by importing the necessary packages for my initial ETL (Extract, Transform, Load) process that involves data extraction, cleaning,
transformation, and loading into a suitable format for analysis. The following libraries will be used:
Import Packages
# Importing necessary packages for ETL process
import pandas as pd                                    # For data manipulation and analysis
import numpy as np                                     # For numerical operations
import matplotlib.pyplot as plt                        # For data visualization
import seaborn as sns                                  # For statistical data visualization
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler       # For data normalization
from sklearn.ensemble import RandomForestClassifier    # For building the predictive model
from sklearn.model_selection import GridSearchCV       # For hyperparameter tuning
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from datetime import datetime
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Show plots in the Jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)
Load the data
This initial step involved installing the openpyxl package to handle Excel files and loading the Telco customer churn data into a pandas
DataFrame for analysis.
pip install openpyxl
Requirement already satisfied: openpyxl in c:\users\collins pc\anaconda3\envs\collonel\lib\site-packages (3.1.3)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: et-xmlfile in c:\users\collins pc\anaconda3\envs\collonel\lib\site-packages (from
openpyxl) (1.1.0)
telco_data= pd.read_excel('C:/Users/Collins PC/Downloads/Telco_customer_churn/churn.xlsx')
telco_data.head()
  CustomerID  Count        Country       State         City  ...        Contract Paperless Billing
0 3668-QPYBK      1  United States  California  Los Angeles  ...  Month-to-month               Yes
1 9237-HQITU      1  United States  California  Los Angeles  ...  Month-to-month               Yes
2 9305-CDSKC      1  United States  California  Los Angeles  ...  Month-to-month               Yes
3 7892-POOKP      1  United States  California  Los Angeles  ...  Month-to-month               Yes
4 0280-XJGEX      1  United States  California  Los Angeles  ...  Month-to-month               Yes

5 rows × 33 columns
Data Wrangling
The data wrangling process involves loading the Telco customer churn dataset and inspecting its structure, which consists of 7043 entries
and 33 columns. Initial checks revealed no missing values for most columns except for 'Churn Reason,' which had a significant number of
missing entries (5174 out of 7043). The 'Total Charges' column, originally an object type, was converted to numeric, resulting in 11
missing values due to non-numeric entries being coerced. Further analysis showed that the dataset is imbalanced, with 5174 non-churn
entries and 1869 churn entries. These steps set the foundation for subsequent data cleaning, transformation, and modeling efforts.
telco_data.info()
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   CustomerID         7043 non-null   object
 1   Count              7043 non-null   int64
 2   Country            7043 non-null   object
 3   State              7043 non-null   object
 4   City               7043 non-null   object
 5   Zip Code           7043 non-null   int64
 6   Lat Long           7043 non-null   object
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object
 10  Senior Citizen     7043 non-null   object
 11  Partner            7043 non-null   object
 12  Dependents         7043 non-null   object
 13  Tenure Months      7043 non-null   int64
 14  Phone Service      7043 non-null   object
 15  Multiple Lines     7043 non-null   object
 16  Internet Service   7043 non-null   object
 17  Online Security    7043 non-null   object
 18  Online Backup      7043 non-null   object
 19  Device Protection  7043 non-null   object
 20  Tech Support       7043 non-null   object
 21  Streaming TV       7043 non-null   object
 22  Streaming Movies   7043 non-null   object
 23  Contract           7043 non-null   object
 24  Paperless Billing  7043 non-null   object
 25  Payment Method     7043 non-null   object
 26  Monthly Charges    7043 non-null   float64
 27  Total Charges      7043 non-null   object
 28  Churn Label        7043 non-null   object
 29  Churn Value        7043 non-null   int64
 30  Churn Score        7043 non-null   int64
 31  CLTV               7043 non-null   int64
 32  Churn Reason       1869 non-null   object
dtypes: float64(3), int64(6), object(24)
memory usage: 1.8+ MB
telco_data.isnull().sum()
CustomerID              0
Count                   0
Country                 0
State                   0
City                    0
Zip Code                0
Lat Long                0
Latitude                0
Longitude               0
Gender                  0
Senior Citizen          0
Partner                 0
Dependents              0
Tenure Months           0
Phone Service           0
Multiple Lines          0
Internet Service        0
Online Security         0
Online Backup           0
Device Protection       0
Tech Support            0
Streaming TV            0
Streaming Movies        0
Contract                0
Paperless Billing       0
Payment Method          0
Monthly Charges         0
Total Charges           0
Churn Label             0
Churn Value             0
Churn Score             0
CLTV                    0
Churn Reason         5174
dtype: int64
telco_data['Churn Reason'].head()
0     Competitor made better offer
1                            Moved
2                            Moved
3                            Moved
4    Competitor had better devices
Name: Churn Reason, dtype: object
# Convert 'Total Charges' to numeric, using errors='coerce' to handle spaces and non-numeric values.
telco_data['Total Charges'] = pd.to_numeric(telco_data['Total Charges'], errors='coerce')
telco_data.info()
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   CustomerID         7043 non-null   object
 1   Count              7043 non-null   int64
 2   Country            7043 non-null   object
 3   State              7043 non-null   object
 4   City               7043 non-null   object
 5   Zip Code           7043 non-null   int64
 6   Lat Long           7043 non-null   object
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object
 10  Senior Citizen     7043 non-null   object
 11  Partner            7043 non-null   object
 12  Dependents         7043 non-null   object
 13  Tenure Months      7043 non-null   int64
 14  Phone Service      7043 non-null   object
 15  Multiple Lines     7043 non-null   object
 16  Internet Service   7043 non-null   object
 17  Online Security    7043 non-null   object
 18  Online Backup      7043 non-null   object
 19  Device Protection  7043 non-null   object
 20  Tech Support       7043 non-null   object
 21  Streaming TV       7043 non-null   object
 22  Streaming Movies   7043 non-null   object
 23  Contract           7043 non-null   object
 24  Paperless Billing  7043 non-null   object
 25  Payment Method     7043 non-null   object
 26  Monthly Charges    7043 non-null   float64
 27  Total Charges      7032 non-null   float64
 28  Churn Label        7043 non-null   object
 29  Churn Value        7043 non-null   int64
 30  Churn Score        7043 non-null   int64
 31  CLTV               7043 non-null   int64
 32  Churn Reason       1869 non-null   object
dtypes: float64(4), int64(6), object(23)
memory usage: 1.8+ MB
telco_data['Churn Value'].value_counts()
Churn Value
0    5174
1    1869
Name: count, dtype: int64
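The counts above confirm the imbalance noted earlier (5174 retained vs 1869 churned customers). The introduction lists handling imbalanced data as one of the project steps; as a minimal, hedged sketch (not necessarily the approach used later in this notebook), scikit-learn's class weighting is one common option:
# A minimal sketch (assumption: not the notebook's final approach) of handling
# the 5174/1869 class imbalance via class weighting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Option 1: let the classifier reweight classes inversely to their frequency
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)

# Option 2: compute the weights explicitly and inspect them
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=telco_data['Churn Value'])
print(dict(zip([0, 1], weights)))  # roughly {0: 0.68, 1: 1.88} for a 5174/1869 split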
Summary Statistics
telco_data.describe()
[Summary statistics output (count, mean, std, min, 25%, 50%, 75%, max) for the numeric columns: Count, Zip Code, Latitude, Longitude, Tenure Months, Monthly Charges, Total Charges, Churn Value, Churn Score, CLTV. The key figures are discussed below.]
The descriptive statistics of the Telco customer churn dataset provide key insights into its numerical features. The dataset includes 7043
entries for most columns, with the exception of 'Total Charges,' which has 7032 entries. The mean values for key features are as follows:
'Tenure Months' is 32.37, 'Monthly Charges' is 64.76, and 'CLTV' (Customer Lifetime Value) is 4400.30. The dataset shows variability,
with 'Tenure Months' ranging from 0 to 72, 'Monthly Charges' from 18.25 to 118.75, and 'Total Charges' from 18.80 to 8684.80. The
'Churn Value' indicates that approximately 26.5% of the customers have churned, as reflected in the mean value of 0.265. The 'Churn
Score' ranges from 5 to 100, with a mean of 58.70, indicating a broad distribution of customer churn risk scores. These statistics highlight
the diversity and range of customer behaviors captured in the dataset, which is essential for developing a robust churn prediction model.
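As a quick sanity check of the 26.5% figure quoted above, the mean of the binary 'Churn Value' column gives the churned fraction directly:
# Quick check: the mean of the 0/1 'Churn Value' column is the churn rate.
churn_rate = telco_data['Churn Value'].mean()
print(f"Churn rate: {churn_rate:.1%}")  # roughly 26.5%, i.e. 1869 / 7043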
Understanding the Data
col = list(telco_data.columns)
categorical_features = []
numerical_features = []

for i in col:
    unique_values = telco_data[i].unique()
    if len(unique_values) > 6:
        numerical_features.append(i)
    else:
        categorical_features.append(i)
    print(f"Column: {i}")
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    print("-" * 50)

print('Categorical Features:', *categorical_features)
print('Numerical Features:', *numerical_features)
Column: CustomerID
Unique Values (7043): ['3668-QPYBK' '9237-HQITU' '9305-CDSKC' ... '2234-XADUH' '4801-JZAZL' '3186-AJIEK']
--------------------------------------------------
Column: Count
Unique Values (1): [1]
--------------------------------------------------
Column: Country
Unique Values (1): ['United States']
--------------------------------------------------
Column: State
Unique Values (1): ['California']
--------------------------------------------------
Column: City
Unique Values (1129): ['Los Angeles' 'Beverly Hills' 'Huntington Park' ... 'Standish' 'Tulelake' 'Olympic Valley']
--------------------------------------------------
Column: Zip Code
Unique Values (1652): [...]
--------------------------------------------------
Column: Lat Long
Unique Values (1652): [...]
--------------------------------------------------
Column: Latitude
Unique Values (1652): [...]
--------------------------------------------------
Column: Longitude
Unique Values (1651): [...]
--------------------------------------------------
Column: Gender
Unique Values (2): ['Male' 'Female']
--------------------------------------------------
Column: Senior Citizen
Unique Values (2): ['No' 'Yes']
--------------------------------------------------
Column: Partner
Unique Values (2): ['No' 'Yes']
--------------------------------------------------
Column: Dependents
Unique Values (2): ['No' 'Yes']
--------------------------------------------------
Column: Tenure Months
Unique Values (73): [...]
--------------------------------------------------
Column: Phone Service
Unique Values (2): ['Yes' 'No']
--------------------------------------------------
Column: Multiple Lines
Unique Values (3): ['No' 'Yes' 'No phone service']
--------------------------------------------------
Column: Internet Service
Unique Values (3): ['DSL' 'Fiber optic' 'No']
--------------------------------------------------
Column: Online Security
Unique Values (3): ['Yes' 'No' 'No internet service']
--------------------------------------------------
Column: Online Backup
Unique Values (3): ['Yes' 'No' 'No internet service']
--------------------------------------------------
Column: Device Protection
Unique Values (3): ['No' 'Yes' 'No internet service']
--------------------------------------------------
Column: Tech Support
Unique Values (3): ['No' 'Yes' 'No internet service']
--------------------------------------------------
Column: Streaming TV
Unique Values (3): ['No' 'Yes' 'No internet service']
--------------------------------------------------
Column: Streaming Movies
Unique Values (3): ['No' 'Yes' 'No internet service']
--------------------------------------------------
Column: Contract
Unique Values (3): ['Month-to-month' 'Two year' 'One year']
--------------------------------------------------
Column: Paperless Billing
Unique Values (2): ['Yes' 'No']
--------------------------------------------------
Column: Payment Method
Unique Values (4): ['Mailed check' 'Electronic check' 'Bank transfer (automatic)' 'Credit card (automatic)']
--------------------------------------------------
Column: Monthly Charges
Unique Values (1585): [...]
--------------------------------------------------
Column: Total Charges
Unique Values (6531): [...]
--------------------------------------------------
Column: Churn Label
Unique Values (2): ['Yes' 'No']
--------------------------------------------------
Column: Churn Value
Unique Values (2): [1 0]
--------------------------------------------------
Column: Churn Score
Unique Values (85): [...]
--------------------------------------------------
Column: CLTV
Unique Values (3438): [...]
--------------------------------------------------
Column: Churn Reason
Unique Values (21): ['Competitor made better offer' 'Moved' 'Competitor had better devices'
 'Competitor offered higher download speeds' 'Competitor offered more data' 'Price too high'
 'Product dissatisfaction' 'Service dissatisfaction' 'Lack of self-service on Website'
 'Network reliability' 'Limited range of services' 'Lack of affordable download/upload speed'
 'Long distance charges' 'Extra data charges' "Don't know" 'Poor expertise of online support'
 'Poor expertise of phone support' 'Attitude of service provider' 'Attitude of support person'
 'Deceased' nan]
--------------------------------------------------
Categorical Features: Count Country State Gender Senior Citizen Partner Dependents Phone Service Multiple Lines Internet Service Online Security Online Backup Device Protection Tech Support Streaming TV Streaming Movies Contract Paperless Billing Payment Method Churn Label Churn Value
Numerical Features: CustomerID City Zip Code Lat Long Latitude Longitude Tenure Months Monthly Charges Total Charges Churn Score CLTV Churn Reason
Summary of Findings:
The dataset contains a mix of categorical and numerical features, with some initial observations needing clarification. Below are the key
points from the analysis:
Categorical Features:
Correctly identified: Country, State, Gender, Senior Citizen, Partner, Dependents, Phone Service, Multiple Lines, Internet Service, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies, Contract, Paperless Billing, Payment Method, Churn Label.
Incorrectly identified: Churn Reason (despite having 21 unique values, it is a categorical variable describing the reasons for customer churn) and City (despite having 1129 unique values, it represents city names and should be classified as categorical).
Numerical Features:
Correctly identified: Zip Code, Lat Long, Latitude, Longitude, Tenure Months, Monthly Charges, Total Charges, Churn Score, CLTV.
Incorrectly identified: CustomerID (despite having 7043 unique values, it is an identifier and should not be treated as a numerical feature).
Geographical Insight:
The data indicates that all customers are from the United States, specifically from the state of California. This is evident from the unique
values in the Country and State columns, which are both singularly populated with 'United States' and 'California', respectively.
Detailed Insights:
Numerical Variables:
Variables such as Zip Code, Lat Long, Latitude, and Longitude provide precise geographic information about the customers. Variables
like Tenure Months, Monthly Charges, and Total Charges offer crucial data for understanding customer behavior and financial aspects.
Categorical Variables:
Variables such as Gender, Senior Citizen, Partner, and Dependents provide demographic information about the customers. Variables like
Phone Service, Internet Service, Contract, and Payment Method give insights into the services customers use and their preferences.
Churn Analysis:
The Churn Reason column is essential for understanding why customers leave, despite being misclassified initially as numerical. Churn
Label and Churn Value correctly identify customers who have churned, aiding in churn prediction and prevention strategies.
Conclusion:
The dataset predominantly consists of well-categorized variables with some needing reclassification for accurate analysis. The
geographical scope limited to California can help tailor state-specific strategies. Overall, this dataset offers valuable insights for
demographic analysis, service usage patterns, and churn prediction. Proper classification and understanding of each variable are crucial
for effective data analysis and subsequent decision-making.
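To make the reclassification described above explicit in code, a small sketch (illustrative only, reusing the lists built in the earlier cell) could move the misassigned columns:
# Illustrative sketch: correct the heuristic feature lists per the findings above.
for feature in ['CustomerID', 'City', 'Churn Reason']:
    if feature in numerical_features:
        numerical_features.remove(feature)

categorical_features += ['City', 'Churn Reason']  # reclassified as categorical
identifier_columns = ['CustomerID']               # identifiers, excluded from modelling features

print('Categorical Features:', *categorical_features)
print('Numerical Features:', *numerical_features)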
Data Visualization
The provided code defines two functions: plot_stacked_bars and annotate_stacked_bars. The plot_stacked_bars function aims to create
a stacked bar plot from the given DataFrame, with annotations added to each bar representing the respective values. It allows for
customization of the plot's title, size, rotation of x-axis labels, and legend placement. The annotate_stacked_bars function is a helper
function used internally to add value annotations to the bars of the stacked plot. It iterates over the plotted bars, calculates the annotation
value, and annotates non-zero values onto the bars. Together, these functions facilitate the visualization of stacked bar plots, particularly
useful for comparing proportions of different categories across multiple groups, such as retention and churn rates in business contexts.
def plot_stacked_bars(dataframe, title_, size_=(18, 10), rot_=0, legend_="upper right"):
    """
    Plot stacked bars with annotations
    """
    ax = dataframe.plot(
        kind="bar",
        stacked=True,
        figsize=size_,
        rot=rot_,
        title=title_
    )
    # Annotate bars
    annotate_stacked_bars(ax, textsize=14)
    # Rename legend
    plt.legend(["Retention", "Churn"], loc=legend_)
    # Labels
    plt.ylabel("Company base (%)")
    plt.show()


def annotate_stacked_bars(ax, pad=0.99, colour="white", textsize=13):
    """
    Add value annotations to the bars
    """
    # Iterate over the plotted rectangles/bars
    for p in ax.patches:
        # Calculate annotation
        value = str(round(p.get_height(), 1))
        # If value is 0 do not annotate
        if value == '0.0':
            continue
        ax.annotate(
            value,
            ((p.get_x() + p.get_width() / 2) * pad - 0.05, (p.get_y() + p.get_height() / 2) * pad),
            color=colour,
            size=textsize
        )
churn = telco_data[['CustomerID', 'Churn Value']]
churn.columns = ['CustomerID', 'Churn Value']
churn_total = churn.groupby(churn['Churn Value']).count()
churn_percentage = churn_total / churn_total.sum() * 100
The code below plot_stacked_bars(churn_percentage.transpose(), "Churning status", (5, 5), legend_="lower right") utilizes the
plot_stacked_bars function to generate a stacked bar plot representing the churning status. The churn rate, calculated to be 26.5% based
on the descriptive statistics analysis, is depicted in this visualization. The plot size is set to (5, 5), and the legend is positioned in the
lower-right corner for improved readability. The purpose of this plot is to visually convey the distribution of churn and retention rates
across different categories or groups, providing valuable insights into customer behavior within the dataset.
plot_stacked_bars(churn_percentage.transpose(), "Churning status", (5, 5), legend_="lower right")
Location
location = telco_data[['State', 'City', 'Churn Value','CustomerID']]
location = location.groupby([location['State'], location['Churn Value']])['CustomerID'].count().unstack(level=1).fillna(0)
location_churn = (location.div(location.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(location_churn, 'Location churn', rot_=30)
# divide categorical columns in list to easily plot them
customer_info = telco_data[["Gender", "Senior Citizen", "Partner", "Dependents","Churn Value","CustomerID"]]
services = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method"]]
Customers Segmentation
The provided code segment aggregates customer information based on two key variables, 'Senior Citizen' status and 'Churn Value,'
which indicates whether a customer has churned or not. By grouping the data and counting the occurrences of customer IDs for each
combination of 'Senior Citizen' and 'Churn Value,' the code constructs a contingency table. This table is then transformed to represent the
percentage of churned customers relative to the total number of customers within each 'Senior Citizen' group. The resulting DataFrame,
customers_churn, showcases the churn rates for both senior and non-senior citizens, sorted in descending order based on the churn rate
for churned customers ('Churn Value' equals 1). This analysis provides insights into how churn rates vary between senior and non-senior
citizen customers, aiding in understanding the impact of age on customer attrition.
customers = customer_info.groupby([customer_info['Senior Citizen'], customer_info['Churn Value']])['CustomerID'].count().unstack(level=1).fillna(0)
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
customers = customer_info.groupby([customer_info['Partner'], customer_info['Churn Value']])['CustomerID'].count().unstack(level=1).fillna(0)
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
customers = customer_info.groupby([customer_info['Dependents'], customer_info['Churn Value']])['CustomerID'].count().unstack(level=1).fillna(0)
customers_churn = (customers.div(customers.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(customers_churn, 'customers_churn', rot_=30)
Customer Summary Report
Several customer factors influence the churn rate, as explored below:
Senior Citizen Status: Customers who are not senior citizens have a notably higher churn rate of 41.7%, compared to senior citizens,
who exhibit a lower churn rate of 23.6%.
Partner Status: Customers without a partner experience a churn rate of 33%, while those with a partner have a lower churn rate of
19.7%.
Dependents: Customers without dependents have a higher churn rate of 32.6%, whereas customers with dependents demonstrate a
significantly lower churn rate of 6.5%.
Proposed Improvements
Senior Citizen Status:
Implement targeted retention programs specifically tailored to non-senior citizen customers to address their needs and concerns,
potentially offering personalized incentives or discounts to encourage loyalty.
Partner Status:
Focus on enhancing the customer experience for those without partners by offering bundled services or loyalty rewards. Additionally,
consider implementing referral programs to incentivize existing customers to refer their partners, thereby potentially reducing churn
among this group.
Dependents:
Develop family-oriented packages or services aimed at attracting and retaining customers with dependents. Highlight the value of family
plans or discounts to emphasize the benefits of staying with the company for the long term.
Overall, prioritize personalized communication and offerings tailored to each customer segment's unique characteristics and preferences
to improve retention rates and foster long-term customer loyalty.
Services
services_info = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies","Churn Value","CustomerID"]]
The provided code segment organizes service-related data based on two criteria: 'Phone Service' status and 'Churn Value,' denoting
customer churn. By grouping the data accordingly and tallying the occurrences of customer IDs for each combination of 'Phone Service'
and 'Churn Value,' the code constructs a contingency table. This table is then transformed to represent the percentage of churned
customers relative to the total number of customers within each 'Phone Service' category. The resulting DataFrame, services_churn,
displays the churn rates for customers with and without phone service, sorted in descending order based on the churn rate for churned
customers ('Churn Value' equals 1). This analysis offers insights into how churn rates vary based on phone service subscription status,
aiding in understanding the impact of this service on customer retention.
services = services_info.groupby(['Phone Service', 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
plot_stacked_bars(services_churn, 'services_churn', rot_=30)
The provided code iterates over each service in the dataset, excluding 'Churn Value' and 'CustomerID,' and calculates the churn
percentages for each service. It groups the data by each service and churn value, constructs a contingency table, and then transforms it
to represent the percentage of churned customers relative to the total number of customers for each service category. The resulting
churn percentages are then visualized using stacked bar plots, with each plot illustrating the retention and churn rates for a specific
service. This iterative process allows for a comprehensive examination of churn behavior across different services offered by the
company, facilitating insights into service-specific customer retention challenges.
import matplotlib.pyplot as plt

# Grouping by each service and calculating churn percentages for each service
for service in services_info.columns[:-2]:  # Exclude 'Churn Value' and 'CustomerID'
    services = services_info.groupby([service, 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
    services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
    plot_stacked_bars(services_churn, f'Retention and Churn Rate for {service}')
services_info
     Phone Service    Multiple Lines Internet Service      Online Security        Online Backup    Device Protection         Tech Support         Streaming TV     Streaming Movies  Churn Value  CustomerID
0              Yes                No              DSL                  Yes                  Yes                   No                   No                   No                   No            1  3668-QPYBK
1              Yes                No      Fiber optic                   No                   No                   No                   No                   No                   No            1  9237-HQITU
2              Yes               Yes      Fiber optic                   No                   No                  Yes                   No                  Yes                  Yes            1  9305-CDSKC
3              Yes               Yes      Fiber optic                   No                   No                  Yes                  Yes                  Yes                  Yes            1  7892-POOKP
4              Yes               Yes      Fiber optic                   No                  Yes                  Yes                   No                  Yes                  Yes            1  0280-XJGEX
...            ...               ...              ...                  ...                  ...                  ...                  ...                  ...                  ...          ...         ...
7038           Yes                No               No  No internet service  No internet service  No internet service  No internet service  No internet service  No internet service            0  2569-WGERO
7039           Yes               Yes              DSL                  Yes                   No                  Yes                  Yes                  Yes                  Yes            0  6840-RESVB
7040           Yes               Yes      Fiber optic                   No                  Yes                  Yes                   No                  Yes                  Yes            0  2234-XADUH
7041            No  No phone service              DSL                  Yes                   No                   No                   No                   No                   No            0  4801-JZAZL
7042           Yes                No      Fiber optic                  Yes                   No                  Yes                  Yes                  Yes                  Yes            0  3186-AJIEK

7043 rows × 11 columns
Services churn report
Multiple Lines and Internet Service Availability:
Customers with multiple lines and access to internet service are less likely to churn, with a churn rate reduction of approximately 24.93%
compared to those without these services.
Type of Internet Service:
The type of internet service also significantly impacts churn rate. Customers with fiber optic internet service exhibit a higher churn rate of
41.89% compared to DSL users, who have a churn rate of 18.96%.
Online Security, Online Backup, Device Protection, and Tech Support:
Customers who do not have online security, online backup, device protection, or tech support services are more likely to churn. The
absence of these services correlates with higher churn rates: online security (41.77% without, 14.61% with), online backup (39.93%
without, 21.53% with), device protection (39.13% without, 22.50% with), and tech support (41.64% without, 15.17% with).
Streaming Services:
The availability of streaming services also affects churn rate. Customers with access to streaming TV and streaming movies are less
likely to churn compared to those without these services: streaming TV (33.52% without, 30.07% with) and streaming movies (33.68%
without, 29.84% with).
Understanding these factors provides valuable insights for strategizing customer retention efforts. Telco can focus on improving and
promoting services such as online security, device protection, and streaming options to mitigate churn and enhance customer
satisfaction. Additionally, offering reliable and diverse internet service options, such as DSL alongside fiber optic, can contribute to
retaining more customers.
Proposed Improvements
Improving Service Offerings:
Enhance the availability and quality of multiple lines and internet service, as customers with access to these services are less likely to
churn. Consider upgrading infrastructure and expanding coverage to ensure reliable and high-speed internet access for all customers.
Internet Service Type:
Evaluate the performance and reliability of fiber optic internet service to address the high churn rate associated with it. Consider offering
incentives or discounts to DSL users to encourage them to remain loyal to the service.
Enhanced Security and Support Services:
Invest in robust online security, online backup, device protection, and tech support services to reduce churn rates. Emphasize the
importance of these services in protecting customers' data and providing timely assistance to address any issues they may encounter.
Streaming Services:
Enhance the availability and diversity of streaming TV and streaming movie options to retain customers. Consider partnering with content
providers to offer exclusive content or bundles that appeal to a wide range of preferences and interests.
By addressing these key factors and improving service offerings, Telco can effectively reduce churn rates and foster long-term customer
satisfaction and loyalty.
Billing Information
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method","Churn Value","CustomerID"]]
The provided code loops through each billing information category, excluding 'Churn Value' and 'CustomerID,' and computes the churn
percentages for each category. It groups the data by each billing category and churn value, constructs a contingency table, and then
transforms it to represent the percentage of churned customers relative to the total number of customers for each billing category. The
resulting churn percentages are visualized using stacked bar plots, with each plot illustrating the retention and churn rates for a specific
billing category. This iterative process enables a detailed examination of churn behavior across various billing-related factors, facilitating
insights into their impact on customer retention.
# Grouping by each billing information category and calculating churn percentages for each category
for service in billing_info.columns[:-2]:  # Exclude 'Churn Value' and 'CustomerID'
    services = billing_info.groupby([service, 'Churn Value'])['CustomerID'].count().unstack(level=1).fillna(0)
    services_churn = (services.div(services.sum(axis=1), axis=0) * 100).sort_values(by=[1], ascending=False)
    plot_stacked_bars(services_churn, f'Retention and Churn Rate for {service}')
Billing Information Report
Contract Type:
Customers with a month-to-month contract exhibit a significantly higher churn rate of 42.71% compared to those with one-year contracts
(11.27%) and two-year contracts (2.83%). This suggests that offering longer-term contracts may help reduce churn rates.
Paperless Billing:
Customers who opt for paperless billing have a higher churn rate of 33.57% compared to those who prefer traditional billing methods
(16.33%). Presenting billing information in a more appealing and convenient manner could encourage these customers and build trust in the
payment systems.
Payment Method:
Customers who use electronic checks as their payment method have the highest churn rate at 45.29%, followed by those who use mailed
checks (19.11%). In contrast, customers who use bank transfer (automatic) and credit card (automatic) have lower churn rates of 16.71%
and 15.24%, respectively. Offering incentives or discounts for customers to switch to automatic payment methods could help reduce
churn rates associated with electronic checks.
Proposed Improvements
Contract Type:
To mitigate the high churn rate associated with month-to-month contracts, businesses could introduce incentives for customers to sign up
for longer-term contracts, such as discounted rates or additional benefits. Emphasizing the advantages of stability and predictability that
come with longer commitments may also help retain customers.
Paperless Billing:
To address the higher churn rate among customers using paperless billing, companies could focus on improving the user experience and
providing clearer billing information online. Implementing user-friendly interfaces, providing detailed billing breakdowns, and offering
personalized billing notifications could enhance customer satisfaction and trust in the paperless billing process.
Payment Method:
Since electronic checks have the highest churn rate, companies could incentivize customers to switch to more reliable payment methods
such as bank transfers or credit card payments. Offering discounts, rewards, or exclusive deals for customers who opt for automatic
payment methods could encourage the transition and reduce churn associated with electronic checks. Additionally, ensuring seamless
and secure payment processes can help build trust and confidence in the chosen payment methods.
Top Churn Features
Here is a list of the highest churn rates based on my analysis so far. Let me do a little feature engineering and dive deeper into the
feature analysis.
Payment Method:
Electronic Check: 45.29%, Mailed Check: 19.11%, Bank Transfer (Automatic): 16.71%, Credit Card (Automatic): 15.24%
Contract Type:
Month-to-Month: 42.71%, One Year: 11.27%, Two Year: 2.83%
Internet Service:
Fiber Optic: 41.89%, DSL: 18.96%
Online Security:
No: 41.77%, Yes: 14.61%
Tech Support:
No: 41.64%, Yes: 15.17%
Streaming TV:
No: 33.52%, Yes: 30.07%
Streaming Movies:
No: 33.68%, Yes: 29.84%
Device Protection:
No: 39.13%, Yes: 22.50%
Online Backup:
No: 39.93%, Yes: 21.53%
Partner:
No: 33.00%, Yes: 19.70%
Senior Citizen:
No: 41.70%, Yes: 23.60%
Paperless Billing:
Yes: 33.57%, No: 16.33%
Dependents:
No: 32.60%, Yes: 6.50%
Feature Engineering
Numerical Analysis
Tenure
The provided code groups the Telco data by 'Tenure Months' and calculates the mean of the 'Churn Value' within each group. The results
are sorted in descending order based on the churn value, indicating the average churn rate across different tenure periods. This analysis
offers insights into how customer churn varies depending on the length of their tenure with the company.
telco_data.groupby(['Tenure Months']).agg({'Churn Value': 'mean'}).sort_values(by='Churn Value', ascending=False)
[Output: mean 'Churn Value' for each of the 73 tenure values, sorted in descending order (73 rows × 1 column); the key figures are discussed below.]
The data indicates that customers who have been clients for just 1 month have a high churn rate of approximately 62%. This likelihood
decreases as tenure increases. Notably, customers at the 2-month mark show a significant drop to around 52%, and the trend continues
with a more gradual decrease. Interestingly, at 5 months, the churn rate is about 48%, while at 4 months, it’s approximately 47%,
showing a small but meaningful difference. This suggests that moving past the 4-month mark could be a critical milestone for customer
retention. As clients stay longer, the likelihood of churn significantly drops, reaching as low as around 5.6% at 63 months, and further
down to about 1.7% at 72 months. Clients who have been with the company for 72 months show the lowest churn rate, highlighting the
importance of long-term client relationships in reducing churn.
telco_data.describe(include='number')
[Summary statistics output for the numeric columns (Count, Zip Code, Latitude, Longitude, Tenure Months, Monthly Charges, Total Charges, Churn Value, Churn Score, CLTV), as shown earlier.]
numeric = telco_data[['Zip Code', 'Latitude', 'Longitude', 'Tenure Months', 'Monthly Charges', 'Total Charges', 'Churn Value', 'Churn Score', 'CLTV']]
numeric
      Zip Code  Latitude  Longitude  Tenure Months  Monthly Charges  Total Charges  Churn Value  Churn Score  CLTV
0          ...       ...        ...              2            53.85         108.15            1           86  3239
1          ...       ...        ...              2            70.70         151.65            1           67  2701
2          ...       ...        ...              8            99.65         820.50            1           86  5372
3          ...       ...        ...             28           104.80        3046.05            1           84  5003
4          ...       ...        ...             49           103.70        5036.30            1           89  5340
...        ...       ...        ...            ...              ...            ...          ...          ...   ...
7038       ...       ...        ...             72            21.15        1419.40            0           45  5306
7039       ...       ...        ...             24            84.80        1990.50            0           59  2140
7040       ...       ...        ...             72           103.20        7362.90            0           71  5560
7041       ...       ...        ...             11            29.60         346.45            0           59  2793
7042       ...       ...        ...             66           105.65        6844.50            0           38  5097

7043 rows × 9 columns
Data Distribution
from matplotlib.colors import ListedColormap
colors = ["windows blue", "amber", "coral", "faded green"]
# plot them as a palette
sns.palplot(sns.xkcd_palette(colors))
numeric_features =['Tenure Months','Monthly Charges','Total Charges','CLTV']
The provided code utilizes Seaborn to create a histogram plot for each numeric feature in the dataset. It sets the color palette using
XKCD colors and creates a figure with a size of 15x5 inches. For each numeric feature, it generates a subplot within a row, ensuring that
each feature has its histogram. The histograms display the distribution of the respective numeric feature values, overlaid with kernel
density estimation (KDE) curves for visual representation. The titles of the subplots indicate the name of the feature being visualized.
Finally, the plots are displayed with tight layout to avoid overlap using plt.tight_layout(), followed by plt.show() to render the plots. This
visualization provides an overview of the distributions of numeric features in the dataset, aiding in understanding their characteristics and
potential relationships.
cmap = sns.xkcd_palette(colors)
plt.figure(figsize=(15, 5))
for i in range(len(numeric_features)):
    plt.subplot(1, 4, i + 1)
    sns.histplot(numeric[numeric_features[i]], color=cmap[i % len(colors)], kde=True)
    title = 'Distribution: ' + numeric_features[i]
    plt.title(title)
plt.tight_layout()
plt.show()
The variables "tenure" and "MonthlyCharges" exhibit bimodal distributions, with notable peaks in the ranges of 0 to 70 for "tenure" and 20
to 80 for "MonthlyCharges". This suggests that there are two distinct groups within these variables.
In contrast, the "TotalCharges" variable demonstrates a positively skewed distribution, indicating that most of the data points are
clustered towards the lower end, with a long tail extending towards higher values.
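To put a number on the skewness visible in the histograms, pandas' built-in skew() can be used; this is a small supplementary check rather than part of the original analysis:
# Supplementary check: positive skew values indicate a right-skewed distribution,
# as described for 'Total Charges' above.
print(numeric[numeric_features].skew())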
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
# using plotly to plot the boxplot
numeric[numeric_features].iplot(kind='box', title="Boxplots of Numeric Features")
[Figure: "Boxplots of Numeric Features", showing box plots of Tenure Months, Monthly Charges, Total Charges and CLTV on a shared value axis.]
The box plots of the features "Tenure Months", "Monthly Charges", "Total Charges", and "CLTV" reveal uneven distributions. The
variability and range of values across these features suggest that the data is not uniformly scaled. This unevenness can obscure the true
relationships between these features and the churn rate. To ensure that the features are on a comparable scale and to provide a clearer
representation of their relationships, it is recommended to apply scaling techniques, such as standardization or normalization. Scaling will
help in mitigating the differences in magnitude and variance, leading to more accurate and interpretable results in subsequent analyses.
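As a sketch of the standardization option mentioned above (the notebook itself proceeds with a log transform in the next step), the StandardScaler imported at the top could be applied as follows; this is illustrative only:
# Illustrative sketch: standardize the numeric features to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(numeric[numeric_features]),
                            columns=numeric_features, index=numeric.index)
print(standardized.describe().loc[['mean', 'std']])  # each feature now has mean ~0 and std ~1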
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
import plotly.express as px
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Create individual box plots for each numeric feature versus churn rate
for feature in numeric_features:
fig = px.box(numeric, x='Churn Value', y=feature, color='Churn Value', title=f'{feature} vs Churn Rate')
fig.show()
[Figures: "Tenure Months vs Churn Rate", "Monthly Charges vs Churn Rate", "Total Charges vs Churn Rate" and "CLTV vs Churn Rate", each a box plot split by Churn Value (0 vs 1).]
Logarithmic Scaling to standardise the features
Log-scaled box plots provide insights into data with wide ranges or skewed distributions by highlighting relative differences and
multiplicative factors. When interpreting them, focus on the log-transformed nature of the axes and remember that differences in log values
correspond to multiplicative differences in the original data. Interpreting log-scaled values on a box plot involves understanding how the
transformation affects the data and how to read the modified axes.
Understanding Log Scaling:
Purpose: Log scaling is applied to reduce skewness and handle a wide range of values, making patterns in the data more apparent. It is
particularly useful when data spans several orders of magnitude.
Transformation: The log transformation compresses larger values more than smaller ones, which helps in visualizing data that would
otherwise be dominated by large outliers as witnessed above with Total Charges Vs Churn and Tenure Months Vs Churn boxplots.
Reading the Box Plot:
Axes Interpretation: The y-axis (or x-axis if horizontal) represents the log-transformed values. If np.log1p was used for the transformation, the axis represents log(1 + value).
Logarithmic Nature: The spacing on the axis is not linear. For instance, the difference between log(1) and log(10) is the same as the difference between log(10) and log(100).
Comparing Medians and Quartiles: The box plot elements (median, quartiles, and whiskers) are interpreted the same way as for non-transformed data, keeping in mind that the axis scale is logarithmic.
Median (Central Line): Indicates the central tendency of the log-transformed data.
Interquartile Range (Box): Shows the spread of the middle 50% of the data on the log scale.
Whiskers: Extend to the minimum and maximum values within 1.5 times the interquartile range from the quartiles.
Outliers: Data points outside the whiskers are plotted individually and represent extreme values.
Comparing Groups:
When comparing groups (such as "Retention" vs. "Churn"), observe the following:
Central Tendencies: Compare the medians of the groups. Differences in medians indicate differences in central tendencies on the log scale.
Spread and Variability: Compare the IQRs and whiskers. Differences in these indicate differences in variability and spread on the log scale.
Relative Differences: Since the axis is logarithmic, differences between values should be interpreted in terms of ratios or multiplicative factors rather than absolute differences, as the small numeric illustration below shows.
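A tiny numeric illustration of that last point, using plain logarithms (np.log1p behaves almost identically for values well above 1):
# Equal gaps on a log axis correspond to equal ratios in the original data.
import numpy as np

print(np.log(100) - np.log(10))        # ~2.30, a 10x ratio
print(np.log(1000) - np.log(100))      # ~2.30, also a 10x ratio
print(np.log1p(1000) - np.log1p(100))  # ~2.29, nearly the same gap under log1p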
# Apply log scaling to the numeric features
log_scaled_numeric = numeric[numeric_features].apply(lambda x: np.log1p(x))
# Add the Churn Value column back to the log-scaled dataframe
log_scaled_numeric['Churn Value'] = numeric['Churn Value']
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Plotting the log-scaled boxplots
for feature in numeric_features:
    log_scaled_numeric.iplot(kind='box', y=feature, title=f'Log-Scaled {feature} Boxplot', asFigure=True).show()
[Figures: "Log-Scaled Tenure Months Boxplot", "Log-Scaled Monthly Charges Boxplot", "Log-Scaled Total Charges Boxplot" and "Log-Scaled CLTV Boxplot", each showing the log-scaled columns (Tenure Months, Monthly Charges, Total Charges, CLTV, Churn Value).]
# Apply log scaling to the numeric features
log_scaled_numeric = numeric[numeric_features].apply(lambda x: np.log1p(x))
# Add the Churn Value column back to the log-scaled dataframe
log_scaled_numeric['Churn Value'] = numeric['Churn Value']
# Configure cufflinks to work offline
cf.go_offline()
init_notebook_mode(connected=True)
# Create individual box plots for each log-scaled feature versus churn value
for feature in numeric_features:
    fig = px.box(log_scaled_numeric, x='Churn Value', y=feature, color='Churn Value', title=f'Log-Scaled {feature} vs Churn Value')
    fig.show()
[Figures: "Log-Scaled Tenure Months vs Churn Value", "Log-Scaled Monthly Charges vs Churn Value", "Log-Scaled Total Charges vs Churn Value" and "Log-Scaled CLTV vs Churn Value", each a box plot split by Churn Value (0 vs 1).]
The analysis shows that churn increases with monthly charges: churned customers exhibit a higher mean log-scaled monthly charge (about 4.39)
than retained customers. There are some outliers within the monthly charges data that indicate further investigation should be conducted.
Tenure Months also has outliers towards the lower whisker, indicating a need for further analysis as well.
Analysis reveals that as 'Tenure Months' and 'CLTV' increase, 'Churn Value' tends to decrease, indicating longer customer tenure and
higher CLTV correlate with lower churn rates. Conversely, higher 'Monthly Charges' and 'Total Charges' do not exhibit a clear trend with
churn, suggesting that price alone may not be a significant factor in customer retention. This insight suggests that strategies aimed at
increasing tenure and CLTV may be effective in reducing churn rates, while further investigation is warranted to understand the
relationship between pricing and churn more comprehensively.
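One simple way to check the direction of these relationships (a supplementary step, not part of the original notebook) is to correlate each log-scaled feature with the churn label:
# Supplementary check: negative correlations with 'Churn Value' support the
# tenure/CLTV observation; values near zero suggest a weak linear relationship.
print(log_scaled_numeric.corr()['Churn Value'].drop('Churn Value').sort_values())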
log_scaled_numeric
[Output: the log-scaled DataFrame with columns Tenure Months, Monthly Charges, Total Charges, CLTV and Churn Value; 7043 rows × 5 columns.]
Data Preparation and Preprocessing
# divide categorical columns in list to easily plot them
customer_info = telco_data[["Gender", "Senior Citizen", "Partner", "Dependents","Churn Value","CustomerID"]]
services = telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing_info = telco_data[["Contract", "Paperless Billing", "Payment Method"]]
customers=telco_data[["Gender", "Senior Citizen", "Partner"]]
services_data=telco_data[["Phone Service", "Multiple Lines", "Internet Service", "Online Security",
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies"]]
billing= telco_data[["Contract", "Paperless Billing", "Payment Method"]]
billing
            Contract  Paperless Billing             Payment Method
0     Month-to-month                Yes               Mailed check
1     Month-to-month                Yes           Electronic check
2     Month-to-month                Yes           Electronic check
3     Month-to-month                Yes           Electronic check
4     Month-to-month                Yes  Bank transfer (automatic)
...              ...                ...                        ...
7038        Two year                Yes  Bank transfer (automatic)
7039        One year                Yes               Mailed check
7040        One year                Yes    Credit card (automatic)
7041  Month-to-month                Yes           Electronic check
7042        Two year                Yes  Bank transfer (automatic)

7043 rows × 3 columns
categorical_features=["Gender", "Senior Citizen", "Partner","Dependents","Phone Service", "Multiple Lines", "Internet
"Online Backup", "Device Protection", "Tech Support", "Streaming TV",
"Streaming Movies","Contract", "Paperless Billing", "Payment Method"]
The code below utilizes the LabelEncoder from the scikit-learn library to transform categorical features within the Telco dataset. It first
creates a deep copy of the original DataFrame to preserve the data integrity. Then, it initializes the LabelEncoder and iterates through
each categorical feature in the dataset. For each feature, it applies the label encoding transformation using the fit_transform method,
which assigns numerical labels to the categories. It also prints the unique values before and after encoding, providing insight into the
transformation. Finally, it displays the transformed DataFrame, showcasing the encoded categorical features. This process is useful for
preparing categorical data for machine learning algorithms that require numerical inputs.
from sklearn.preprocessing import LabelEncoder

# Create a deep copy of the DataFrame
cat = telco_data.copy(deep=True)

# Initialize LabelEncoder
le = LabelEncoder()

print('Label Encoder Transformation')
for feature in categorical_features:
    cat[feature] = le.fit_transform(cat[feature])
    print(f"{feature} : {cat[feature].unique()} = {le.inverse_transform(cat[feature].unique())}")

# Display the transformed DataFrame
print(cat.head())
Label Encoder Transformation
Gender : [1 0] = ['Male' 'Female']
Senior Citizen : [0 1] = ['No' 'Yes']
Partner : [0 1] = ['No' 'Yes']
Dependents : [0 1] = ['No' 'Yes']
Phone Service : [1 0] = ['Yes' 'No']
Multiple Lines : [0 2 1] = ['No' 'Yes' 'No phone service']
Internet Service : [0 1 2] = ['DSL' 'Fiber optic' 'No']
Online Security : [2 0 1] = ['Yes' 'No' 'No internet service']
Online Backup : [2 0 1] = ['Yes' 'No' 'No internet service']
Device Protection : [0 2 1] = ['No' 'Yes' 'No internet service']
Tech Support : [0 2 1] = ['No' 'Yes' 'No internet service']
Streaming TV : [0 2 1] = ['No' 'Yes' 'No internet service']
Streaming Movies : [0 2 1] = ['No' 'Yes' 'No internet service']
Contract : [0 2 1] = ['Month-to-month' 'Two year' 'One year']
Paperless Billing : [1 0] = ['Yes' 'No']
Payment Method : [3 2 0 1] = ['Mailed check' 'Electronic check' 'Bank transfer (automatic)'
'Credit card (automatic)']
[Output: the first five rows of the transformed DataFrame (5 rows × 33 columns), with the categorical columns now holding integer codes and the remaining columns (CustomerID, location fields, charges, churn fields, Churn Reason) unchanged.]
cat.columns
Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
'Multiple Lines', 'Internet Service', 'Online Security',
'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
'Churn Score', 'CLTV', 'Churn Reason'],
dtype='object')
cat.drop(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
'Lat Long', 'Latitude', 'Longitude','Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
'Churn Score', 'CLTV', 'Churn Reason'], axis=1,inplace=True)
cat
[Output: the label-encoded feature DataFrame with columns Gender, Senior Citizen, Partner, Dependents, Tenure Months, Phone Service, Multiple Lines, Internet Service, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies, Contract, Paperless Billing, Payment Method; 7043 rows × 17 columns.]
The provided code concatenates two clean and analyzed dataframes into one unified dataframe, facilitating easier interpretation by
machine learning algorithms. By combining the dataframes, the algorithm gains access to a comprehensive set of features, potentially
enhancing its predictive performance. Additionally, scaling is applied to the data to ensure even distribution across all features. Scaling is
crucial for algorithms sensitive to feature magnitudes, such as those based on distance metrics like k-nearest neighbors or those
employing gradient descent optimization. By scaling the data, each feature is transformed to have a similar scale, preventing features
with larger magnitudes from dominating the algorithm's learning process. This normalization process promotes fair and effective feature
representation, contributing to more accurate and reliable model predictions. Overall, the concatenated dataframe and scaled data
collectively optimize the dataset for machine learning tasks, improving the algorithm's ability to discern patterns and make informed
predictions.
final_df = pd.concat([cat, log_scaled_numeric], axis=1)
final_df
[Output: the combined DataFrame of encoded categorical features and log-scaled numeric features; 7043 rows × 22 columns.]
This code checks for any duplicates in the final dataframe
#show if there are full duplicates
final_df.duplicated().sum()
0
This code checks the sum of null values and shows that the data contains 11 null values in the 'Total Charges' column. Tsk! Something
needs to be done about it.
final_df.isnull().sum()
Gender               0
Senior Citizen       0
Partner              0
Dependents           0
Tenure Months        0
Phone Service        0
Multiple Lines       0
Internet Service     0
Online Security      0
Online Backup        0
Device Protection    0
Tech Support         0
Streaming TV         0
Streaming Movies     0
Contract             0
Paperless Billing    0
Payment Method       0
Tenure Months        0
Monthly Charges      0
Total Charges       11
CLTV                 0
Churn Value          0
dtype: int64
You guessed right, I did something about it. I imputed the null values with the median, cool right!
# Impute median values to the null values in 'Total Charges'
median_total_charges = final_df['Total Charges'].median()
final_df['Total Charges'].fillna(median_total_charges, inplace=True)
final_df.isnull().sum()
Gender
Senior Citizen
Partner
Dependents
Tenure Months
Phone Service
Multiple Lines
Internet Service
Online Security
Online Backup
Device Protection
Tech Support
Streaming TV
Streaming Movies
Contract
Paperless Billing
Payment Method
Tenure Months
Monthly Charges
Total Charges
CLTV
Churn Value
dtype: int64
-
Modelling
1. Import Packages
Let's start by filtering warnings. FutureWarnings typically indicate potential changes in behavior or deprecation of certain features
that might occur in future versions of libraries or Python itself. By filtering out these warnings with warnings.filterwarnings("ignore",
category=FutureWarning), the code ensures a cleaner output without these specific warning messages, which helps improve
readability during code execution.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
The code imports essential libraries and modules for data analysis and machine learning in Python. It includes pandas and numpy for
data manipulation and numerical computations, matplotlib for plotting, and sklearn for machine learning tasks. Specific modules from
sklearn are imported for data preprocessing, model training, evaluation, and ensemble methods such as random forests, k-nearest
neighbors, support vector machines, gradient boosting, and AdaBoost. These tools collectively enable comprehensive data analysis and
facilitate the development of machine learning models for various tasks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import ListedColormap
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
Let's Build Some Classification Models
# Define the names of the classifiers
names = ['Nearest Neighbors', 'Linear SVM', 'RBF SVM', 'RandomForest',
         'AdaBoost', 'GradientBoost']

# Define the classifiers with their respective hyperparameters, in the same order as their names
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
]
# Make a copy of our data
train_df = final_df.copy()
# Separate target variable from independent variables
y = final_df['Churn Value']
X = final_df.drop(columns=['Churn Value'])
print(X.shape)
print(y.shape)
(7043, 21)
(7043,)
The code divides the dataset into training and testing subsets using the train_test_split function. It assigns features and target labels for
both sets, specifying that 25% of the data will be allocated to the testing set while maintaining the random state for reproducibility. Finally,
it prints the shapes of the training and testing sets to confirm the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(5282, 21)
(5282,)
(1761, 21)
(1761,)
The code initializes empty lists and dictionaries to store evaluation metrics, trained models, confusion matrices, and classification reports
for each classifier. It iterates over each classifier, fitting the model to the training data, predicting on both training and testing sets, and
calculating evaluation metrics such as accuracy, precision, recall, and F1 scores. The results are saved to respective dictionaries and
appended to the results list. Finally, the results are converted into a DataFrame for easy visualization.
# Empty lists to store results
results = [] # Store evaluation metrics for each classifier
models = {} # Store trained models
confusion = {} # Store confusion matrices for each classifier
class_report = {} # Store classification reports for each classifier
# Iterate over each classifier
for name, clf in zip(names, classifiers):
    print('Fitting {:s} model...'.format(name))
    # Measure the time taken to fit the model
    run_time = %timeit -q -o clf.fit(X_train, y_train)

    print('... predicting')
    # Predict on the training and testing data
    y_pred = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)

    print('... scoring')
    # Calculate evaluation metrics on the training data (and F1 on the test data)
    accuracy = metrics.accuracy_score(y_train, y_pred)
    precision = metrics.precision_score(y_train, y_pred)
    recall = metrics.recall_score(y_train, y_pred)
    f1 = metrics.f1_score(y_train, y_pred)
    f1_test = metrics.f1_score(y_test, y_pred_test)

    # Save the results to dictionaries
    models[name] = clf
    confusion[name] = metrics.confusion_matrix(y_train, y_pred)
    class_report[name] = metrics.classification_report(y_train, y_pred)

    # Append results to the list
    results.append([name, accuracy, precision, recall, f1, f1_test, run_time.best])

# Convert results to DataFrame for easy visualisation
results = pd.DataFrame(results, columns=['Classifier', 'Accuracy', 'Precision', 'Recall',
                                         'F1 Train', 'F1 Test', 'Train Time'])
results.set_index('Classifier', inplace=True)

print('... All done!')
Fitting Nearest Neighbors model...
... predicting
... scoring
Fitting Linear SVM model...
... predicting
... scoring
Fitting RBF SVM model...
... predicting
... scoring
Fitting RandomForest model...
... predicting
... scoring
Fitting AdaBoost model...
... predicting
... scoring
Fitting GradientBoost model...
... predicting
... scoring
... All done!
results.sort_values('F1 Train', ascending=False)
                   Accuracy  Precision  Recall  F1 Train  F1 Test  Train Time
Classifier
RandomForest         0.9831     0.9798  0.9547         -        -        2.53
Linear SVM           0.8677     0.7652  0.7066         -        -           -
Nearest Neighbors    0.8381     0.7293  0.5978         -        -           -
GradientBoost        0.8139     0.6625  0.5759         -        -           -
RBF SVM              0.8107     0.6793  0.5117         -        -           -
AdaBoost             0.7933     0.7643  0.2934         -        -           -
A summary of the evaluation metrics for each classifier:
RandomForest:
Achieved the highest accuracy of 98.31% and precision of 97.98%, indicating strong performance in correctly identifying both positive
and negative cases. However, its recall score of 95.47% suggests that it may miss some instances of positive cases. The F1 scores for
both training and testing data are high, indicating a good balance between precision and recall. The training time is relatively high at 2.53
seconds.
Linear SVM:
Achieved an accuracy of 86.77% and a precision of 76.52%, with a recall of 70.66%. While it demonstrates acceptable performance, its
F1 scores and training time are relatively lower compared to RandomForest.
Nearest Neighbors:
Achieved an accuracy of 83.81% and a precision of 72.93%, with a recall of 59.78%. The F1 scores are moderate, and the training time
is considerably higher than Linear SVM.
GradientBoost:
Achieved an accuracy of 81.39% and a precision of 66.25%, with a recall of 57.59%. Similar to Nearest Neighbors, it shows moderate
performance in terms of accuracy and precision, with relatively lower recall. The training time is comparable to RandomForest.
RBF SVM:
Achieved an accuracy of 81.07% and a precision of 67.93%, with a recall of 51.17%. While its precision is higher compared to
GradientBoost, its recall is relatively lower. The training time is similar to RandomForest.
AdaBoost:
Achieved an accuracy of 79.33% and a precision of 76.43%, with a recall of 29.34%. It shows the lowest recall among all classifiers,
indicating its limitation in identifying positive cases. However, it demonstrates a relatively high precision. The training time is the lowest
among all classifiers.
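As a quick sanity check on these figures, the F1 score is simply the harmonic mean of precision and recall; a short sketch recomputing it from the RandomForest training scores quoted above:
# F1 is the harmonic mean of precision and recall.
# Recomputing it from the RandomForest training scores quoted above (0.9798 precision, 0.9547 recall):
precision, recall = 0.9798, 0.9547
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # ≈ 0.9671, consistent with the high F1 Train reported for RandomForest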
# Display confusion matrices and classification reports
for name, matrix in confusion.items():
    print(f"Confusion Matrix for {name}:")
    print(matrix)
    print()

for name, report in class_report.items():
    print(f"Classification Report for {name}:")
    print(report)
    print()
Confusion Matrix for Nearest Neighbors:
[[   -    -]
 [ 551  819]]

Confusion Matrix for Linear SVM:
[[   -    -]
 [ 402  968]]

Confusion Matrix for RBF SVM:
[[   -    -]
 [ 669  701]]

Confusion Matrix for RandomForest:
[[3885   27]
 [  62 1308]]

Confusion Matrix for AdaBoost:
[[   -    -]
 [ 968  402]]

Confusion Matrix for GradientBoost:
[[   -    -]
 [ 581  789]]
Classification Report for Nearest Neighbors:
              precision    recall  f1-score   support

           0       0.87      0.92      0.89      3912
           1       0.73      0.60      0.66      1370

    accuracy                           0.84      5282
   macro avg       0.80      0.76         -      5282
weighted avg       0.83      0.84         -      5282

Classification Report for Linear SVM:
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      3912
           1       0.77      0.71      0.73      1370

    accuracy                           0.87      5282
   macro avg       0.83      0.82         -      5282
weighted avg       0.86      0.87         -      5282

Classification Report for RBF SVM:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88      3912
           1       0.68      0.51      0.58      1370

    accuracy                           0.81      5282
   macro avg       0.76      0.71         -      5282
weighted avg       0.80      0.81         -      5282

Classification Report for RandomForest:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      3912
           1       0.98      0.95      0.97      1370

    accuracy                           0.98      5282
   macro avg       0.98      0.97         -      5282
weighted avg       0.98      0.98         -      5282

Classification Report for AdaBoost:
              precision    recall  f1-score   support

           0       0.80      0.97      0.87      3912
           1       0.76      0.29      0.42      1370

    accuracy                           0.79      5282
   macro avg       0.78      0.63         -      5282
weighted avg       0.79      0.79         -      5282

Classification Report for GradientBoost:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      3912
           1       0.66      0.58      0.62      1370

    accuracy                           0.81      5282
   macro avg       0.76      0.74         -      5282
weighted avg       0.81      0.81         -      5282
Based on the provided confusion matrices and classification reports:
Nearest Neighbors (KNN):
Performs reasonably well with an accuracy of 84%. Better at identifying negatives (0) than positives (1), as indicated by higher precision
and recall for class 0. Lower precision and recall for class 1 suggest some misclassification of positives.
Linear SVM:
Achieves an accuracy of 87%, indicating good performance. Shows a slight imbalance in precision and recall between the two classes,
with class 0 having higher values. Overall, precision and recall are balanced, indicating robust performance.
RBF SVM:
Demonstrates an accuracy of 81%. Shows relatively lower precision and recall for class 1, suggesting difficulty in correctly identifying
positives. Class 0 has higher precision and recall, indicating better performance in identifying negatives.
RandomForest:
Performs impressively well with an accuracy of 98%. High precision and recall for both classes indicate excellent performance in
classifying both positives and negatives. Minimal misclassification evident from the confusion matrix (27 false positives and 62 false
negatives).
AdaBoost:
Displays an accuracy of 79%. High precision but low recall for class 1 suggests it may be overly conservative in predicting positives.
Imbalance in precision and recall indicates the model's struggle with correctly identifying positives.
GradientBoost:
Achieves an accuracy of 81%. Slightly lower precision and recall for class 1 compared to class 0. Overall, performs reasonably well but
shows some difficulty in correctly identifying positives.
In summary, RandomForest stands out as the best performer with high accuracy and balanced precision and recall for both classes.
Linear SVM also performs well with balanced precision and recall, while other models show varying degrees of performance with
strengths and weaknesses in different aspects of classification.
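To tie the classification reports back to the confusion matrices, the headline metrics can be re-derived directly from the counts; a short check using the RandomForest training matrix shown above:
# Re-deriving the RandomForest training metrics from its confusion matrix above
tn, fp, fn, tp = 3885, 27, 62, 1308
accuracy = (tp + tn) / (tp + tn + fp + fn)   # ≈ 0.983
precision = tp / (tp + fp)                   # ≈ 0.980
recall = tp / (tp + fn)                      # ≈ 0.955
print(round(accuracy, 3), round(precision, 3), round(recall, 3))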
Instantiate a RandomForest Classifier
model = RandomForestClassifier(n_estimators=1000)
model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=1000)
Model Understanding
A simple way of understanding a model's results is to look at feature importances, which indicate how much each feature contributes to the predictive model. There are several ways to calculate feature importance; with the Random Forest classifier, we can extract importances directly from the trained model via its built-in attribute. In scikit-learn's Random Forest, each feature's importance is the mean decrease in impurity (Gini importance) it contributes across all trees, rather than simply a count of how often it is used for splitting.
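Because impurity-based importances can favour continuous or high-cardinality features, a complementary check (a minimal sketch, not part of the original analysis) is permutation importance of the fitted RandomForest on the held-out test set:
# Sketch: permutation importance on the held-out test set, using the model fitted above
from sklearn.inspection import permutation_importance

perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_importances.sort_values(ascending=False).head(10))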
models
{'Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
 'Linear SVM': SVC(C=0.025, kernel='linear'),
 'RBF SVM': SVC(C=1, gamma=2),
 'RandomForest': RandomForestClassifier(max_depth=5, max_features=1, n_estimators=10),
 'AdaBoost': AdaBoostClassifier(),
 'GradientBoost': GradientBoostingClassifier()}
Feature Importances from the Last Classifier Fitted in the Loop
feature_importances = pd.DataFrame({
'features': X_train.columns,
'importance': clf.feature_importances_
}).sort_values(by='importance', ascending=False).reset_index()
feature_importances
    index           features  importance
0      20               CLTV        0.20
1      14           Contract        0.14
2      19      Total Charges        0.14
3      18    Monthly Charges        0.14
4       4      Tenure Months        0.10
5      17      Tenure Months        0.08
6       3         Dependents        0.04
7       8    Online Security        0.04
8      16     Payment Method        0.04
9       2            Partner        0.02
10      6     Multiple Lines        0.02
11     11       Tech Support        0.02
12     15  Paperless Billing        0.02
13      0             Gender        0.00
14     13   Streaming Movies        0.00
15     12       Streaming TV        0.00
16      1     Senior Citizen        0.00
17      9      Online Backup        0.00
18      7   Internet Service        0.00
19      5      Phone Service        0.00
20     10  Device Protection        0.00
The feature importance analysis reveals insights into the predictive power of different features in the dataset. Among the features
considered, CLTV (Customer Lifetime Value) emerges as the most influential, with a relative importance score of 0.20, indicating its
significant role in predicting churn. Contract type and both monthly and total charges follow closely, each contributing 0.14 to the
predictive model. Tenure months, appearing twice in the list, also show importance, albeit to a lesser extent, with one instance ranking at
0.10 and the other at 0.08. Other features such as Dependents, Online Security, and Payment Method exhibit modest importance with
scores of 0.04. Interestingly, some features like Gender, Streaming Movies, and Streaming TV carry negligible importance, each scoring
zero, suggesting they have little impact on predicting churn in this context. Overall, this analysis highlights the key predictors of churn
within the dataset, providing valuable insights for targeted intervention strategies.
plt.figure(figsize=(15, 25))
plt.title('Feature Importances')
plt.barh(range(len(feature_importances)), feature_importances['importance'], color='b', align='center')
plt.yticks(range(len(feature_importances)), feature_importances['features'])
plt.xlabel('Importance')
plt.show()
Feature Importances Based on the RandomForest Model
feature_importances = pd.DataFrame({
'features': X_train.columns,
'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False).reset_index()
feature_importances
    index           features  importance
0      18    Monthly Charges       0.135
1      19      Total Charges           -
2      20               CLTV           -
3      17      Tenure Months           -
4       4      Tenure Months           -
5      14           Contract           -
6       8    Online Security           -
7      16     Payment Method           -
8      11       Tech Support           -
9       7   Internet Service           -
10      3         Dependents           -
11     10  Device Protection           -
12      2            Partner           -
13     15  Paperless Billing           -
14      6     Multiple Lines           -
15      0             Gender           -
16      9      Online Backup           -
17      1     Senior Citizen           -
18     12       Streaming TV           -
19     13   Streaming Movies           -
20      5      Phone Service           -
Comparing the latest feature importance analysis with the previous findings, Monthly Charges emerge as the most influential feature,
topping the list with a relative importance score of 0.135. This contrasts with the previous analysis where CLTV held the highest
importance score. The shift in importance highlights the dynamic nature of feature relevance in predictive modeling and underscores the
need for periodic reassessment of feature importance.
Interestingly, Total Charges and CLTV, which held significant importance in the previous analysis, follow closely behind Monthly Charges
in the latest findings. Despite their slightly lower importance scores, they remain pivotal predictors of churn.
Moreover, Tenure Months, represented twice in the list with varying importance scores, maintain their relevance in both analyses. While
their importance ranks slightly lower in the latest analysis compared to the previous one, they still demonstrate significant predictive
power.
Contract type, Online Security, and Payment Method also retain their positions in the list of influential features, albeit with minor
fluctuations in importance scores.
Notably, features like Gender, Streaming Movies, and Streaming TV continue to exhibit negligible importance in predicting churn,
corroborating previous observations.
Feature Importance Summary
Understanding feature importance is crucial for effective predictive modeling, particularly in the context of churn prediction. By identifying
the most influential features, businesses can prioritize their resources and strategies to address the underlying factors driving customer
attrition. For instance, knowing that CLTV, Total Charges, Tenure Months, Contract Type, and Monthly Charges are significant
predictors allows companies to tailor retention efforts and pricing strategies accordingly. Moreover, insights gained from feature
importance analysis can inform decision-making processes, such as resource allocation for customer retention programs or the
development of personalized offers aimed at reducing churn. By leveraging this knowledge, organizations can optimize their efforts to
retain valuable customers, enhance customer satisfaction, and ultimately improve business performance. Therefore, understanding
feature importance not only enhances the predictive accuracy of churn models but also empowers businesses to proactively mitigate
churn and foster long-term customer loyalty.
Random Forest Evaluation
predictions = model.predict(X_test)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
y_test.value_counts()
Churn Value
0    1262
1     499
Name: count, dtype: int64
print(f"True positives: {tp}")
print(f"False positives: {fp}")
print(f"True negatives: {tn}")
print(f"False negatives: {fn}\n")
print(f"Accuracy: {metrics.accuracy_score(y_test, predictions)}")
print(f"Precision: {metrics.precision_score(y_test, predictions)}")
print(f"Recall: {metrics.recall_score(y_test, predictions)}")
True positives: 268
False positives: 131
True negatives: 1131
False negatives: 231
Accuracy: 0.7944
Precision: 0.6717
Recall: 0.5371
Hyperparameter Tuning
Important_features = ['Monthly Charges', 'Total Charges', 'CLTV', 'Tenure Months', 'Tenure Months', 'Contract',
                      'Online Security', 'Payment Method', 'Tech Support', 'Internet Service', 'Dependents',
                      'Device Protection', 'Partner', 'Paperless Billing', 'Multiple Lines', 'Gender']
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Assuming 'X' is your feature matrix and 'y' is your target variable
# Replace 'X' and 'y' with your actual feature matrix and target variable
# Assuming you already have your dataset loaded into 'data'
# Create a DataFrame with important features
important_features = ['Monthly Charges', 'Total Charges', 'CLTV', 'Tenure Months', 'Tenure Months', 'Contract',
'Online Security', 'Payment Method', 'Tech Support', 'Internet Service', 'Dependents',
'Device Protection', 'Partner', 'Paperless Billing', 'Multiple Lines','Gender']
data_subset = train_df[important_features]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_subset, y, test_size=0.2, random_state=42)
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Train the model
rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = rf_model.score(X_test, y_test)
print("Accuracy:", accuracy)
Accuracy:-
from sklearn.model_selection import GridSearchCV
# Define the hyperparameters grid to search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Initialize GridSearchCV with the Random Forest classifier and the hyperparameters grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Perform grid search to find the best hyperparameters
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
# Train a new Random Forest model using the best hyperparameters
best_rf_model = RandomForestClassifier(**best_params)
best_rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = best_rf_model.score(X_test, y_test)
print("Accuracy after hyperparameter tuning:", accuracy)
Accuracy after hyperparameter tuning: 0.8034
After performing hyperparameter tuning, the Random Forest classifier achieved an improved accuracy of 80.34%, compared to the initial
accuracy of 79.44%. This enhancement demonstrates the effectiveness of fine-tuning model parameters to achieve better performance.
Summary of Hyperparameter Tuning Process
Define the Hyperparameters Grid:
A range of potential values for various hyperparameters of the Random Forest model is specified. These include the number of trees
(n_estimators), maximum depth of trees (max_depth), minimum number of samples required to split a node (min_samples_split),
minimum number of samples required at each leaf node (min_samples_leaf), and whether to use bootstrap samples (bootstrap).
Initialize GridSearchCV:
The GridSearchCV function is used to systematically explore the defined hyperparameters grid. It evaluates each combination using
cross-validation to determine the optimal set of hyperparameters.
Fit the Model:
The fit method of GridSearchCV is applied to the training data, performing an exhaustive search over the hyperparameters grid. It
identifies the best hyperparameters based on the cross-validation scores.
Train the Final Model:
Using the best hyperparameters identified by GridSearchCV, a new Random Forest model is instantiated and trained on the full training
dataset.
Evaluate the Model:
The final model's performance is evaluated on the test dataset, resulting in the improved accuracy of 80.34%.
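Beyond the single accuracy number, GridSearchCV exposes the full search results; a small sketch (assuming the grid_search object fitted above) for inspecting them:
# Inspecting the grid search: best parameters, best cross-validated score, and top parameter combinations
print(grid_search.best_params_)
print(grid_search.best_score_)  # mean cross-validated accuracy of the best parameter set

cv_results = pd.DataFrame(grid_search.cv_results_).sort_values('rank_test_score')
print(cv_results[['params', 'mean_test_score', 'std_test_score']].head())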
Final Remarks on Random Forest Evaluation
Despite the improvement in accuracy to 80.34% after hyperparameter tuning, further data exploration and analysis are required.
Specifically, addressing the imbalance in the dataset is crucial. Techniques such as downsampling the majority class or acquiring
additional data to balance the churn values should be considered to ensure that the model performs equitably across all classes.
Balancing the dataset can significantly enhance the model's ability to predict churn accurately. Additionally, continuous feature
importance analysis and periodic retraining with updated data can help maintain and potentially improve the model's performance over
time. These steps are essential for developing a robust and reliable churn prediction model using the Random Forest classifier.
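As one concrete option for the rebalancing suggested above, here is a minimal sketch (not run in this notebook; the names X_bal, y_bal, and balanced_rf are illustrative) that downsamples the majority class in the current training split before refitting a Random Forest:
# Sketch: downsample the non-churn (majority) class so both classes are equally represented
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)
majority = train[train['Churn Value'] == 0]
minority = train[train['Churn Value'] == 1]

majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

X_bal = balanced.drop(columns=['Churn Value'])
y_bal = balanced['Churn Value']

balanced_rf = RandomForestClassifier(random_state=42)
balanced_rf.fit(X_bal, y_bal)
print(balanced_rf.score(X_test, y_test))  # accuracy on the untouched test split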
Reporting
Project Summary Report: Customer Churn Prediction
Overview
The primary goal of this project was to develop a predictive model to identify customers likely to churn, enabling targeted intervention
strategies. I utilized a Kaggle dataset containing various customer attributes and trained multiple machine learning models to predict
churn. After extensive analysis, a Random Forest classifier was chosen due to its superior performance. This report summarizes the
project process, key findings, and recommended action steps based on the predictive model and feature importance analysis.
Data Analysis and Preprocessing
Initial Data Exploration:
The dataset was explored to understand the distribution of features and the target variable (churn). Key features were identified, including
Customer Lifetime Value (CLTV), contract type, monthly charges, and total charges.
Handling Imbalanced Data:
The dataset was imbalanced, with significantly more non-churners than churners. Techniques such as resampling or acquiring additional
data were suggested to address this issue.
Feature Selection:
Important features were selected based on their predictive power, including CLTV, contract type, monthly charges, total charges, tenure
months, online security, payment method, and tech support.
Model Training and Hyperparameter Tuning
Model Training:
Various models were trained, including Nearest Neighbors, Linear SVM, RBF SVM, Random Forest, AdaBoost, and Gradient Boosting. Random Forest emerged as the best performer.
Hyperparameter Tuning:
GridSearchCV was used to optimize the Random Forest classifier's hyperparameters, resulting in an improved accuracy of 80.34%.
Feature Importance Analysis
Key Predictors:
CLTV, monthly charges, total charges, contract type, and tenure months were identified as the most significant predictors of customer churn.
Less Important Features:
Features like gender, streaming services (TV and movies), and phone service were found to have negligible impact on churn prediction.
Impact on Customer Churn
The Random Forest model, with its improved accuracy, provides valuable insights into the factors influencing customer churn. By
understanding the key predictors, businesses can:
Tailor Retention Efforts:
Focus on customers with high CLTV and those with contracts nearing expiration to prevent churn.
Adjust Pricing Strategies:
Analyze and optimize monthly and total charges to improve customer satisfaction and retention.
Enhance Service Offerings:
Improve features like online security and tech support, which were found to be moderately important in predicting churn.
Segment Customers:
Use tenure months to identify and address the needs of long-term and short-term customers differently.
Recommended Action Steps
Balance the Dataset:
Implement techniques such as downsampling the majority class or acquiring more data to ensure an evenly balanced dataset for future
modeling.
Continuous Monitoring and Retraining:
Periodically retrain the model with updated data and conduct feature importance analysis to adapt to changing customer behavior.
Targeted Interventions:
Develop personalized retention strategies based on the key predictors identified, focusing resources on high-risk customers.
Enhanced Customer Engagement:
Proactively engage with customers through personalized offers and improved service quality to reduce churn rates.
Conclusion
This project successfully developed a robust predictive model for customer churn using the Random Forest classifier. The insights gained
from the feature importance analysis and the model's predictions can guide targeted interventions to reduce churn and enhance
customer loyalty. By addressing the imbalances in the dataset and continuously refining the model, businesses can achieve even better
performance and more effectively mitigate customer attrition.