Machine Learning | Bank Marketing Campaign
Data Analysis of a Bank Marketing Campaign
Introduction
In my analysis, "deposit" refers to whether or not a client has subscribed to a term deposit, which is the target variable I'm trying to
predict. The model analyzes various features, such as client characteristics, financial status, and past interactions, to predict whether a
client will subscribe to a term deposit as part of a financial product offering.
The Bank Marketing Campaign Analysis project aims to evaluate the effectiveness of a direct marketing campaign run by a banking
institution to promote its term deposit products. The dataset used for this analysis consists of 11,162 customer records, capturing various
attributes such as demographics, financial information, and customer interactions during the campaign. The key objective of the project is
to identify factors that influence customer decisions to subscribe to a term deposit and to build predictive models that can help the bank
target potential customers more effectively.
The dataset includes variables such as age, balance, job type, education level, contact methods, and the number of previous campaign
contacts. Through data exploration, preprocessing, and modeling, the project aims to uncover insights about customer behavior,
campaign effectiveness, and the likelihood of deposit subscription. By using machine learning techniques such as random forests, the
project will help in predicting which customers are more likely to subscribe, ultimately enabling the bank to optimize its marketing
strategies for higher conversion rates.
This project not only demonstrates the power of data-driven decision-making but also provides actionable recommendations for
improving the bank's future marketing campaigns.
import pandas as pd
import numpy as np
data = pd.read_csv('C:/Users/Collins PC/Downloads/bank_marketting_campaign/bank.csv')
data
       age          job  marital  education default  balance housing loan   contact  day month  \
0       59       admin.  married  secondary      no     2343     yes   no   unknown    5   may
1       56       admin.  married  secondary      no       45      no   no   unknown    5   may
2       41   technician  married  secondary      no     1270     yes   no   unknown    5   may
3       55     services  married  secondary      no     2476     yes   no   unknown    5   may
4       54       admin.  married   tertiary      no      184      no   no   unknown    5   may
...    ...          ...      ...        ...     ...      ...     ...  ...       ...  ...   ...
11157   33  blue-collar   single    primary      no        1     yes   no  cellular   20   apr
11158   39     services  married  secondary      no      733      no   no   unknown   16   jun
11159   32   technician   single  secondary      no       29      no   no  cellular   19   aug
11160   43   technician  married  secondary      no        0      no  yes  cellular    8   may
11161   34   technician  married  secondary      no        0      no   no  cellular    9   jul

       duration  campaign  pdays  previous poutcome deposit
0          1042         1     -1         0  unknown     yes
1          1467         1     -1         0  unknown     yes
2          1389         1     -1         0  unknown     yes
3           579         1     -1         0  unknown     yes
4           673         2     -1         0  unknown     yes
...         ...       ...    ...       ...      ...     ...
11157       257         1     -1         0  unknown      no
11158        83         4     -1         0  unknown      no
11159       156         2     -1         0  unknown      no
11160         9         2    172         5  failure      no
11161       628         1     -1         0  unknown      no

11162 rows × 17 columns
data.info()
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        11162 non-null  int64
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64
 12  campaign   11162 non-null  int64
 13  pdays      11162 non-null  int64
 14  previous   11162 non-null  int64
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
data.columns
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'deposit'],
dtype='object')
# Check Missing Values
data.isnull().sum().sort_values(ascending = False)
age          0
day          0
poutcome     0
previous     0
pdays        0
campaign     0
duration     0
month        0
contact      0
job          0
loan         0
housing      0
balance      0
default      0
education    0
marital      0
deposit      0
dtype: int64
print(f"Records: {data.shape[0]}")
print(f"Columns: {data.shape[1]}")
Records: 11162
Columns: 17
Customer Segmentation
customer_data = data[['age', 'job', 'marital', 'education','balance','duration', 'campaign','poutcome', 'deposit']]
customer_data
       age          job  marital  education  balance  duration  campaign poutcome deposit
0       59       admin.  married  secondary     2343      1042         1  unknown     yes
1       56       admin.  married  secondary       45      1467         1  unknown     yes
2       41   technician  married  secondary     1270      1389         1  unknown     yes
3       55     services  married  secondary     2476       579         1  unknown     yes
4       54       admin.  married   tertiary      184       673         2  unknown     yes
...    ...          ...      ...        ...      ...       ...       ...      ...     ...
11157   33  blue-collar   single    primary        1       257         1  unknown      no
11158   39     services  married  secondary      733        83         4  unknown      no
11159   32   technician   single  secondary       29       156         2  unknown      no
11160   43   technician  married  secondary        0         9         2  failure      no
11161   34   technician  married  secondary        0       628         1  unknown      no

11162 rows × 9 columns
customer_data['poutcome'].value_counts()
poutcome
unknown    8326
failure    1228
success    1071
other       537
Name: count, dtype: int64
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
k = 4  # number of clusters (one per segment defined below)
# Selecting relevant features for clustering
X = data[['age', 'balance', 'duration', 'campaign', 'pdays']]
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means clustering (choosing 4 clusters)
kmeans = KMeans(n_clusters=k, random_state=42)
data['Cluster'] = kmeans.fit_predict(X_scaled)
# Define the cluster labels for 4 clusters
cluster_labels = {
0: 'High-Balance Long-Term Customers',
1: 'Low-Balance Short-Term Customers',
2: 'Frequent Contacts, Low Engagement',
3: 'New or Low-Balance Customers'
}
# Check the cluster centers for analysis
print("Cluster Centers:\n", kmeans.cluster_centers_)
# Scatter plot with colors based on clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data['balance'], data['duration'], c=data['Cluster'], cmap='viridis')
centers = scaler.inverse_transform(kmeans.cluster_centers_)
# Create the legend by mapping cluster numbers to labels
handles, _ = scatter.legend_elements()
labels = [cluster_labels[i] for i in range(k)]  # one label per cluster
# Visualizing the clusters based on two important features: balance vs duration
plt.xlabel('Balance')
plt.ylabel('Duration')
plt.title('Customer Segmentation Based on Balance and Duration')
# Show cluster centroid locations
centroid_scatter = plt.scatter(centers[:, 1], centers[:, 2], c='red', s=200, marker='X', label='Centroids')
# Adding centroid label to the legend
plt.legend(handles + [centroid_scatter], labels + ['Centroids'], title="Customer Segments")
# Show the plot
plt.show()
# View the segmented customers
print(data)
Cluster Centers: [standardized centroid coordinates omitted]
(full DataFrame printout omitted; the appended 'Cluster' column assigns each customer to a segment)

[11162 rows x 18 columns]
Key Business Insights:
High-Balance Long-Term Customers:
This segment represents customers who have maintained a high balance over time and exhibit longer engagement with the company.
These customers likely have a strong relationship with the business and may respond well to personalized offers and loyalty programs
aimed at retention and satisfaction.
Low-Balance Short-Term Customers:
Customers in this group have lower account balances and short engagement durations. They may be less financially invested in the
company and could benefit from incentives to encourage higher balance maintenance or increased interaction, such as discounts or
entry-level financial products.
Frequent Contacts, Low Engagement:
This segment is characterized by frequent contact from the company but low levels of engagement (short durations per interaction).
These customers may require a different approach, such as improving the relevance of marketing efforts or refining communication
strategies to increase their interest in the company’s offerings.
New or Low-Balance Customers:
These customers are either new to the business or have low balances and limited interaction history. Nurturing these customers with
introductory offers, educational content, or tailored products could help convert them into more valuable long-term customers.
Visual Representation:
The scatter plot visualizes the clustering of customers based on balance and duration. The clusters are color-coded, with distinct
centroids (marked with red 'X') indicating the center of each group. The visualization reveals the distinct nature of each customer
segment, highlighting differences in engagement and financial behavior.
By understanding these customer segments, the company can develop tailored marketing strategies and customer retention efforts,
ultimately improving customer satisfaction and driving revenue growth.
duration measures the length of the last contact (in seconds) between the customer and the bank during a marketing campaign.
Specifically, it refers to the duration of the last phone call or interaction made by the bank with the customer.
A longer duration generally indicates a more engaged or interested customer who stayed on the call longer. A shorter duration
might suggest that the customer was less interested or cut the conversation short. The duration feature is important because longer
interactions could be associated with a higher chance of the customer subscribing to the bank's offer or service. However, this variable is
only known after the call, so it’s not usually used in predictive models before the interaction takes place.
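Because duration is only known after the call has taken place, any model intended to score customers before they are contacted should exclude it to avoid target leakage. A minimal sketch of such a pre-call feature split, assuming the data DataFrame loaded above (the names pre_call_features and target are illustrative):

# A minimal sketch of a leakage-free feature set: 'duration' is only
# observed once the call has happened, so it is dropped before training
# any model meant to score customers *before* contact.
pre_call_features = data.drop(columns=['duration', 'deposit'])
target = data['deposit'].map({'yes': 1, 'no': 0})
print(pre_call_features.columns.tolist())  # 'duration' is no longer present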
balance refers to the customer's account balance at the time of the last contact. It represents the amount of money the customer
had in their bank account when the bank reached out to them during the marketing campaign.
A higher balance indicates that the customer has more money in their account, which may reflect their financial well-being. A lower
balance could suggest that the customer has fewer financial resources available. This feature can help the bank understand the
customer's financial position, which may influence how likely they are to invest in the bank's products or services.
Checking whether the number of clusters was appropriate
def within_cluster_variation(data, label_col='cluster_label'):
    # Select only numeric columns for mean calculation
    numeric_data = data.select_dtypes(include=[np.number])
    # Add back the label column for grouping
    numeric_data[label_col] = data[label_col]
    # Calculate centroids for each cluster by grouping by the label column
    centroids = numeric_data.groupby(label_col).mean()
    # Initialize WCSS
    wcss = 0
    # Iterate over each cluster
    for label, centroid in centroids.iterrows():
        # Select the data points that belong to the current cluster
        cluster_points = numeric_data[numeric_data[label_col] == label].drop(label_col, axis=1)
        # Calculate the squared distance between each point and the centroid
        squared_diff = (cluster_points - centroid).pow(2)
        # Sum the squared distances and add to the WCSS
        wcss += squared_diff.sum(axis=1).sum()
    return wcss
# Ensure necessary libraries are imported
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

# Assuming X_scaled is the scaled feature matrix
# and 'data' is the DataFrame holding the campaign records.
# Define the range of clusters
n_clusters = np.arange(2, 10)

# Initialize an empty list to store errors for each value of k
errors = []

# Iterate over the range of cluster numbers
for k in n_clusters:
    # Perform k-means clustering with the current number of clusters
    kmeans = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    # Fit the k-means model on the scaled data
    kmeans.fit(X_scaled)
    # Predict cluster labels
    y_preds = kmeans.predict(X_scaled)
    # Add cluster labels to the data DataFrame (or replace if it already exists)
    data['cluster_label'] = y_preds
    # Compute the within-cluster variation for this labelling
    error = within_cluster_variation(data, 'cluster_label')
    # Append the calculated error to the errors list
    errors.append(error)

# After this, the errors list holds the variation for each k between 2 and 9.
The code above evaluates the optimal number of clusters (k) for the dataset by applying the K-means clustering algorithm and
calculating the within-cluster sum of squares (WCSS) for each k. The process begins by applying K-means clustering to a scaled feature
matrix, with the number of clusters ranging from 2 to 9. For each k, the model assigns cluster labels, and the WCSS is calculated using
the within_cluster_variation function, which measures the squared distances between data points and their respective cluster centroids.
The goal of this process is to identify the k that minimizes WCSS, as a lower WCSS indicates that data points are closer to their
centroids, implying more cohesive clusters. This method helps in determining the optimal k by balancing between compactness within
clusters and the number of clusters, which is critical in cluster analysis for making informed decisions about data segmentation.
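As a cross-check, scikit-learn already reports this quantity: after fitting, KMeans exposes the attribute inertia_, which is the WCSS computed on the data the model was fit on. Note that the hand-rolled function above measures distances over every numeric column of data in its original units, so its absolute values will differ from inertia_ on the scaled features, though the shape of the elbow curve is comparable. A short sketch, assuming X_scaled and n_clusters from the earlier cells:

# Cross-check: KMeans.inertia_ is the within-cluster sum of squares
# on the fitted (scaled) features.
inertia_errors = []
for k in n_clusters:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertia_errors.append(km.inertia_)
print(inertia_errors)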
plt.figure(figsize=(12,8))
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Determining Optimal Value of k')
plt.scatter(n_clusters, errors)
plt.plot(n_clusters, errors)
plt.xticks(n_clusters)
plt.show()
The code generates an Elbow Method plot to help determine the optimal number of clusters (k) for a dataset. The plot visualizes the
relationship between the number of clusters and the within-cluster sum of squares (WCSS), with the x-axis representing the number of
clusters and the y-axis showing the WCSS values. As the number of clusters increases, the WCSS typically decreases, indicating tighter
clusters. The "elbow" point in the curve represents the optimal k, where adding more clusters results in diminishing returns in terms of
WCSS reduction. This point is where the curve bends or flattens, suggesting a balance between compactness and simplicity.
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Number of clusters
k = 7
# Selecting relevant features for clustering
X = data[['age', 'balance', 'duration', 'campaign', 'pdays']]
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means clustering with k=7 clusters
kmeans = KMeans(n_clusters=k, random_state=42)
y_preds = kmeans.fit_predict(X_scaled) # Save predicted clusters in y_preds
data['Cluster'] = y_preds # Assign predicted clusters to the dataset
# Define cluster labels for 7 clusters
cluster_labels = {
0: 'High-Balance Long-Term Customers',
1: 'Low-Balance Short-Term Customers',
2: 'Frequent Contacts, Low Engagement',
3: 'New or Low-Balance Customers',
4: 'High Engagement, Low Balance',
5: 'Infrequent Contacts, Moderate Balance',
6: 'Low Balance, Low Engagement'
}
# Check the cluster centers for analysis
print("Cluster Centers:\n", kmeans.cluster_centers_)
# Scatter plot with colors based on clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data['balance'], data['duration'], c=data['Cluster'], cmap='viridis')
centers = scaler.inverse_transform(kmeans.cluster_centers_)
# Create the legend by mapping cluster numbers to labels for 7 clusters
handles, _ = scatter.legend_elements()
labels = [cluster_labels[i] for i in range(k)]  # one label per cluster
# Visualizing the clusters based on two important features: balance vs duration
plt.xlabel('Balance')
plt.ylabel('Duration')
plt.title('Customer Segmentation Based on Balance and Duration')
# Show cluster centroid locations
centroid_scatter = plt.scatter(centers[:, 1], centers[:, 2], c='red', s=200, marker='X', label='Centroids')
# Adding centroid label to the legend
plt.legend(handles + [centroid_scatter], labels + ['Centroids'], title="Customer Segments")
# Show the plot
plt.show()
# View the segmented customers
print(data)
Cluster Centers: [standardized centroid coordinates omitted]

(full DataFrame printout omitted; 'Cluster' and 'cluster_label' columns are appended)

[11162 rows x 19 columns]
Using the WCSS method to determine the optimal number of clusters, the K-Means clustering analysis identified seven distinct customer
segments in a banking dataset based on key features such as age, balance, duration, campaign, and pdays. This segmentation uncovers
meaningful patterns in customer behavior that can guide targeted marketing and retention strategies. The resulting scatter plot, which
uses balance and duration as the primary axes, highlights the centroids of each cluster, representing the central tendencies of each
customer segment.
Key insights include:
High-Balance, Long-Term Customers:
These are high-value clients with substantial balances and extended engagement periods, making them prime candidates for loyalty
programs or premium services.
Low-Balance, Short-Term Customers:
These customers are less engaged and may benefit from promotions designed to increase both their interaction and financial commitment.
Frequent Contacts, Low Engagement:
Customers in this segment are often contacted but show low engagement, suggesting potential over-contact or the need to refine communication strategies.
New or Low-Balance Customers:
These individuals offer growth potential through cross-selling and onboarding strategies to deepen their engagement and increase their balance.
High Engagement, Low Balance:
While these customers are highly engaged, their financial value is relatively low, indicating a need for tailored financial planning services to grow their balances.
Infrequent Contacts, Moderate Balance:
This stable but passive customer base could benefit from low-touch, high-impact interactions to maintain or increase their loyalty.
Low Balance, Low Engagement:
These at-risk customers may require focused retention efforts or personalized offers to prevent churn and re-engage them.
These insights enable the bank to drive more targeted marketing efforts, enhance customer retention, and better allocate resources toward strengthening customer relationships.
# Assuming 'data' is already a pandas DataFrame and you've selected your features:
X = data[['age', 'balance', 'duration', 'campaign', 'pdays']]
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
y_preds = kmeans.fit_predict(X_scaled) # Save the predictions
# Adding the predicted clusters (y_preds) to the data DataFrame
data['Predicted_Cluster'] = y_preds
# Define cluster labels for easier interpretation (optional)
cluster_labels = {
0: 'High-Balance Long-Term Customers',
1: 'Low-Balance Short-Term Customers',
2: 'Frequent Contacts, Low Engagement',
3: 'New or Low-Balance Customers',
4: 'High Engagement, Low Balance',
5: 'Infrequent Contacts, Moderate Balance',
6: 'Low Balance, Low Engagement'
}
# Check the cluster centers for analysis
print("Cluster Centers:\n", kmeans.cluster_centers_)
# Scatter plot with colors based on predicted clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data['balance'], data['duration'], c=data['Predicted_Cluster'], cmap='viridis')
# Inverse transform the cluster centers back to original scale for plotting
centers = scaler.inverse_transform(kmeans.cluster_centers_)
# Plotting the cluster centroids
plt.scatter(centers[:, 1], centers[:, 2], c='red', s=200, marker='X', label='Centroids')
# Customize plot
plt.xlabel('Balance')
plt.ylabel('Duration')
plt.title('Customer Segmentation: Balance vs Duration')
plt.colorbar(scatter, label='Cluster')
# Display the plot
plt.show()
# View the updated data with predicted clusters
print(data.head()) # Show the first few rows of the data with the new column
Cluster Centers: [standardized centroid coordinates omitted]

(truncated printout of the updated DataFrame omitted; see data.head() below)
data.head()
   age         job  marital  education default  balance housing loan  contact  day month  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown    5   may
1   56      admin.  married  secondary      no       45      no   no  unknown    5   may
2   41  technician  married  secondary      no     1270     yes   no  unknown    5   may
3   55    services  married  secondary      no     2476     yes   no  unknown    5   may
4   54      admin.  married   tertiary      no      184      no   no  unknown    5   may

   duration  campaign  pdays  previous poutcome deposit  Cluster  cluster_label  Predicted_Cluster
0      1042         1     -1         0  unknown     yes        0              1                  0
1      1467         1     -1         0  unknown     yes        0              1                  0
2      1389         1     -1         0  unknown     yes        0              1                  0
3       579         1     -1         0  unknown     yes        3              8                  3
4       673         2     -1         0  unknown     yes        3              8                  3
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.boxplot(x='Predicted_Cluster', y='balance', data=data)
plt.title('Distribution of Balance per Predicted Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Balance')
plt.show()
The box plot generated from this analysis highlights the distribution of customer balances across the predicted clusters identified through
a clustering algorithm. Each cluster represents a segment of customers with similar behavioral or financial characteristics, based on key
variables such as age, balance, and engagement metrics.
Key Business Insights:
Cluster-Specific Balance Trends:
The plot reveals significant variations in balance levels across different clusters. Some clusters show a wide range of balances, indicating
a diverse group of customers, while others are more concentrated, suggesting that customers in those segments have more uniform
financial profiles. This allows the company to better understand the financial behavior of distinct customer segments.
High-Balance Customer Segments:
Certain clusters exhibit higher median balances compared to others, indicating these clusters represent more financially valuable
customers. Targeting these segments with tailored financial products, premium services, or personalized offers could increase customer
satisfaction and retention, leading to higher revenue generation.
Low-Balance Customer Segments:
Conversely, clusters with lower balances may require different strategies, such as offering incentives to increase engagement or cross-sell opportunities that encourage these customers to invest more in their accounts. Understanding the balance distribution within each cluster provides actionable insights for addressing the needs of less engaged or lower-balance customers.
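To back the visual reading with numbers, the box plot can be accompanied by a per-cluster five-number summary of balance; a minimal sketch, assuming data and its Predicted_Cluster column from the cells above:

# Numeric companion to the box plot: centre and spread of 'balance'
# within each predicted cluster.
balance_summary = data.groupby('Predicted_Cluster')['balance'].describe()
print(balance_summary[['min', '25%', '50%', '75%', 'max']])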
import seaborn as sns
# Group data by clusters and calculate the mean balance for each cluster
cluster_balance = data.groupby('Predicted_Cluster')['balance'].mean().reset_index()
# Plot the average balance per cluster using a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Predicted_Cluster', y='balance', data=cluster_balance)
plt.title('Average Balance per Cluster')
plt.xlabel('Predicted Cluster')
plt.ylabel('Average Balance')
plt.show()
Key Business Insights:
Cluster-Specific Financial Value:
The bar plot highlights significant differences in average balances across customer clusters. Certain clusters stand out with notably
higher balances, indicating that these customers provide greater financial value. These high-value segments should be prioritized for
premium services, investment opportunities, and loyalty programs, as they contribute substantially to the company's revenue.
Identification of Low-Balance Clusters:
Clusters with lower average balances represent customers who may require more targeted approaches. These could include
personalized marketing campaigns, financial education initiatives, or balance-building strategies like discounts and incentivized savings
programs. By nurturing these customers, the company can elevate them into higher-value segments over time.
Strategic Resource Allocation:
The analysis enables the company to allocate resources more strategically by focusing on high-value clusters for growth and retention,
while simultaneously developing tailored strategies to enhance the engagement and financial contribution of lower-balance clusters.
This data-driven approach allows the company to refine its financial products, strengthen customer retention efforts, and optimize
marketing strategies, ultimately driving sustained growth and profitability.
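Account balances are typically right-skewed, so a cluster's mean can be pulled up by a few very large accounts; comparing the mean with the per-cluster median is a quick robustness check. A sketch over the same data:

# Mean vs. median balance per cluster: a large gap flags clusters whose
# average is driven by a handful of very large accounts.
print(data.groupby('Predicted_Cluster')['balance'].agg(['mean', 'median']))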
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
pd.set_option('display.precision', 2)
customer_data
       age          job  marital  education  balance  duration  campaign poutcome deposit
0       59       admin.  married  secondary     2343      1042         1  unknown     yes
1       56       admin.  married  secondary       45      1467         1  unknown     yes
2       41   technician  married  secondary     1270      1389         1  unknown     yes
3       55     services  married  secondary     2476       579         1  unknown     yes
4       54       admin.  married   tertiary      184       673         2  unknown     yes
...    ...          ...      ...        ...      ...       ...       ...      ...     ...
11157   33  blue-collar   single    primary        1       257         1  unknown      no
11158   39     services  married  secondary      733        83         4  unknown      no
11159   32   technician   single  secondary       29       156         2  unknown      no
11160   43   technician  married  secondary        0         9         2  failure      no
11161   34   technician  married  secondary        0       628         1  unknown      no

11162 rows × 9 columns
# Get the top 4 ages
top_age = customer_data['age'].value_counts().nlargest(4).index
display(top_age)
# Plot the scatter plot
plt.figure(figsize=(15, 6))
sns.scatterplot(x='age', y='duration', data=customer_data[customer_data['age'].isin(top_age)], hue='age')
# Set plot title and labels
plt.title('Popular Age with Call Duration')
plt.xlabel('Age')
plt.ylabel('Call Duration')
# Adjust the legend
plt.legend(title='Popular Age', bbox_to_anchor=(1.05, 1), loc='upper left')
# Show the plot
plt.show()
Index([31, 32, 34, 33], dtype='int64', name='age')
This project examines the relationship between call duration and the most frequent customer age groups, based on a scatter plot
visualization of the top four ages with the highest occurrence in the dataset. By analyzing these age groups, the company gains insights
into customer engagement based on age and interaction time.
Key Business Insights:
Top Age Groups Engagement:
The most popular age groups demonstrate distinct call duration patterns, indicating that certain ages tend to engage more or less during
interactions. These insights can guide customer service and marketing teams to adjust their outreach and communication strategies
based on age-related preferences.
Call Duration and Customer Interaction:
Age appears to influence the length of calls, with some age groups engaging in longer conversations. This could signal a higher interest
or willingness to interact for specific ages, which may correspond to their likelihood of responding to marketing efforts, customer service
initiatives, or follow-up calls.
Tailored Customer Strategies:
Understanding which age groups spend more time on calls can help the company develop targeted campaigns, customer service
approaches, and product offerings tailored to these high-engagement age groups. Additionally, shorter call durations for other age groups
may suggest a need for more concise communication or a different engagement strategy.
By identifying these patterns, the business can refine its customer engagement tactics to improve call effectiveness, customer
satisfaction, and overall interaction success.
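The visual impression can be quantified by summarising call duration within each of the four most frequent ages; a short sketch reusing the top_age index computed above:

# Summary statistics of call duration for the four most common ages,
# quantifying what the scatter plot shows visually.
top_age_durations = (customer_data[customer_data['age'].isin(top_age)]
                     .groupby('age')['duration']
                     .agg(['count', 'mean', 'median']))
print(top_age_durations)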
call_duration = customer_data[['age','duration','deposit']]
# Value counts for the 'age' column
age_counts = call_duration['age'].value_counts()
print(age_counts)
# Value counts for the 'duration' column
duration_counts = call_duration['duration'].value_counts()
print(duration_counts)
(value counts omitted: 'age' has 76 unique values; 'duration' has 1428 unique values)
Name: count, Length: 76, dtype: int64
Name: count, Length: 1428, dtype: int64
import plotly.express as px
import plotly.io as pio
# Check which renderer is being used and switch if necessary
print(pio.renderers)
# If you're working in Jupyter Notebook or JupyterLab
pio.renderers.default = 'jupyterlab'
# Create the scatter plot
fig = px.scatter(call_duration, x='duration', y='age', color='deposit')
# Update the layout
fig.update_layout(width=1000, height=500)
fig.update_layout(title_text='Scatter Plot of Duration vs. Age (colored by Deposits)')
# Display the plot
fig.show()
Renderers configuration
-----------------------
    Default renderer: 'kaggle'
Available renderers:
['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']
The plot above analyzes the relationship between call duration and customer age and how these factors influence whether a customer subscribes to a deposit. Using a scatter plot created with Plotly Express, call duration (x-axis) and customer age (y-axis) are plotted, with points color-coded by whether a deposit was made. This allows us to observe patterns and trends in customer behavior.
Key insights from the scatter plot include:
Younger customers with shorter call durations tend to have a lower likelihood of subscribing to a deposit. This suggests that younger
customers may require different engagement strategies, such as more personalized or targeted communications. Customers with longer
call durations, regardless of age, show a higher probability of making a deposit, indicating that extended conversations may be linked to
successful outcomes. This can inform sales teams to prioritize longer interactions with potential customers or improve the quality of
conversations to drive engagement.
Middle-aged to older customers with moderate call durations show a balanced distribution of deposits, indicating that while age is an
important factor, the quality of interaction (measured by duration) plays a more significant role in influencing deposits. These insights can
guide the bank's strategy for optimizing customer outreach and improving campaign results by focusing on conversation quality and
tailoring engagement approaches based on age and interaction duration.
from plotly.offline import iplot
fig = px.box(x=call_duration["age"],
             labels={"x": "Age"},
             title="5-Number Summary (Box Plot) of Age of Customers")
iplot(fig)
The code above uses a box plot to provide a five-number summary of the age distribution of customers, offering a clear overview of how age is spread across the dataset. The plot reveals the minimum, first quartile (Q1), median, third quartile (Q3), and maximum age values, as well as any potential outliers.
Key insights from the box plot include:
Median age:
The median age reflects the central tendency of the customer base, indicating the age group that the bank engages with most frequently. This can be useful in tailoring services or products to meet the needs of this core group.
Interquartile range (IQR):
The spread between the first and third quartiles shows where the bulk of the customer base lies in terms of age, helping to identify the most common age range for customers. For example, a wide IQR suggests a diverse age group, while a narrow IQR indicates a more concentrated age demographic.
Outliers:
Any outliers in the box plot represent customers who fall significantly outside the normal age range. These outliers may represent niche customer segments, such as younger or older customers, who may require specialized marketing or services.
Skewness:
The shape and position of the box plot can indicate if the age distribution is skewed. A left or right skew suggests that either younger or older customers dominate, influencing how campaigns might be adjusted to balance customer engagement across age groups.
These insights help the bank understand the age composition of its customer base, identify key age segments, and make data-driven decisions for product offerings, marketing campaigns, and customer engagement strategies tailored to various age demographics.
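The same five numbers the box plot draws can be computed directly with quantiles; a minimal sketch:

# Five-number summary of age, matching what the box plot displays.
five_num = customer_data['age'].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
print('IQR:', five_num[0.75] - five_num[0.25])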
outliers_age = call_duration[call_duration['age'] > 75]
outliers_age
       age  duration deposit
1236    85       165     yes
1243    90       152     yes
1274    85       355     yes
1320    83       181     yes
1371    76       170     yes
...    ...       ...     ...
10438   77       663      no
10562   88       318      no
10570   77       207      no
10618   78       838      no
10843   86       147      no

153 rows × 3 columns
outliers_age['deposit'].value_counts()
deposit
yes    117
no      36
Name: count, dtype: int64
import matplotlib.pyplot as plt
# Store the deposit value counts for customers above 75
outliers_age_counts = outliers_age['deposit'].value_counts()
# Plot the data
plt.figure(figsize=(10, 8))
outliers_age_counts.plot(kind='barh', color='skyblue')
# Add labels and title
plt.xlabel('count')
plt.ylabel('Deposits')
plt.title('Frequency of positive deposits for customers above 75')
# Show the plot
plt.tight_layout()
plt.show()
This part of the project focuses on understanding the frequency of positive deposits for customers above the age of 75, a segment
identified as outliers in the age distribution. The bar chart visualizes the count of customers in this older demographic who made a
deposit, offering insights into their behavior and engagement with the bank's services.
Key insights from the bar chart include:
Engagement of older customers:
The frequency of positive deposits among customers above 75 highlights how engaged this older age group is with the bank's deposit
offerings. A higher frequency suggests that this group may still be responsive to targeted financial products, indicating a valuable
opportunity for tailored senior-focused campaigns.
Potential for specialized services:
If the count is low, it may signal an under-served customer segment that requires specialized products or services to increase
engagement. This could involve offering products that align more with the financial needs of older customers, such as retirement planning
or fixed-term deposits with specific benefits.
Understanding outliers:
As outliers, customers above 75 represent a smaller portion of the overall customer base. However, their behavior can be critical in
understanding the full scope of the bank's demographic reach and uncovering opportunities for inclusive financial products. This analysis
helps the bank better understand and cater to its senior customers, ensuring that this age group is effectively engaged with suitable
products and services.
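The counts above translate into a subscription rate, which makes the engagement of this segment concrete: 117 of the 153 customers above 75 subscribed. A one-line sketch using outliers_age:

# Subscription rate among customers above 75 (117 'yes' out of 153).
senior_rate = (outliers_age['deposit'] == 'yes').mean()
print(f"Deposit rate for customers above 75: {senior_rate:.1%}")  # ~76.5%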
import plotly.express as px
# Filter out rows with negative or zero balance
filtered_data = customer_data[customer_data['balance'] > 0]
# Create the scatter plot using filtered data
fig = px.scatter(
data_frame=filtered_data,
x="age",
y="job",
color="deposit",
size='balance', # Size based on positive balance
hover_data=['deposit'],
marginal_x="histogram",
marginal_y="box"
)
# Update layout with an appropriate title
fig.update_layout(
title_text="Customer Age, Job, and Deposits: Impact of Balance on Deposit Choice",
titlefont={'size': 24, 'family': 'Serif'},
width=1000,
height=550
)
# Show the plot
fig.show()
In this visualization, a scatter plot was created to explore the relationship between customer age, job type, and their decision to make a
deposit, with the size of the points representing the customers' account balance. The data used in the plot only includes customers with a
positive balance, ensuring focus on customers who have a financial stake in the company.
Key Business Insights:
Age and Job Influence on Deposit Choices:
The scatter plot reveals patterns in how different age groups and job types influence the likelihood of making a deposit. Certain jobs and
age ranges exhibit a higher propensity to make deposits, which could help the company tailor its financial products and marketing
strategies to these segments.
Balance as a Key Indicator:
The size of the scatter points represents customer balance, and larger points indicate customers with higher balances. This allows for a
quick visual correlation between customer balance and deposit decisions. Customers with higher balances appear more likely to make
deposits, suggesting that those with greater financial resources tend to continue engaging with the company's deposit products.
Distribution of Deposits by Age:
The marginal histogram alongside the x-axis (age) provides insights into the distribution of ages among customers, allowing the company
to identify which age groups are more frequent in their customer base. This data can assist in segmenting customers and customizing
engagement strategies.
Job Type and Balance Variability:
The marginal box plot on the y-axis (job) highlights the variation in balance across different job categories. This insight allows the
company to recognize which professions tend to maintain higher balances, potentially aiding in developing tailored financial solutions for
specific professional groups.
Visual Representation:
The interactive scatter plot, enhanced by histograms and box plots, gives a comprehensive view of how age, job, and balance influence deposit decisions. This information is vital for crafting targeted campaigns, identifying high-potential customers, and optimizing deposit-related offerings.
By leveraging these insights, the company can better understand customer behavior across different segments, improve deposit
acquisition strategies, and enhance customer satisfaction.
Conclusion
The scatter plot reveals that older customers, particularly those in certain job categories like management and professionals, tend to have
higher balances and are more likely to make deposits. Younger customers, especially those in lower-income job types, show a lower
tendency to engage in deposit products. The plot also highlights that customers with higher balances are significantly more inclined to
make deposits, suggesting that financial stability plays a key role in deposit behavior. Additionally, the box plot shows that balance
variability is more pronounced in higher-paying professions, indicating potential opportunities for targeted financial products.
customer_data['job'].value_counts()
job
management       2566
blue-collar      1944
technician       1823
admin.           1334
services          923
retired           778
self-employed     405
student           360
unemployed        357
entrepreneur      328
housemaid         274
unknown            70
Name: count, dtype: int64
customer_data.columns
Index(['age', 'job', 'marital', 'education', 'balance', 'duration', 'campaign',
'poutcome', 'deposit'],
dtype='object')
job_analysis = customer_data[['job','age','poutcome', 'deposit']]
job_analysis
               job  age poutcome deposit
0           admin.   59  unknown     yes
1           admin.   56  unknown     yes
2       technician   41  unknown     yes
3         services   55  unknown     yes
4           admin.   54  unknown     yes
...            ...  ...      ...     ...
11157  blue-collar   33  unknown      no
11158     services   39  unknown      no
11159   technician   32  unknown      no
11160   technician   43  failure      no
11161   technician   34  unknown      no

11162 rows × 4 columns
job_analysis['job'].value_counts()
job
management       2566
blue-collar      1944
technician       1823
admin.           1334
services          923
retired           778
self-employed     405
student           360
unemployed        357
entrepreneur      328
housemaid         274
unknown            70
Name: count, dtype: int64
job_analysis.columns
Index(['job', 'age', 'poutcome', 'deposit'], dtype='object')
import matplotlib.pyplot as plt
import pandas as pd
# Group by 'job' and 'deposit', then count occurrences of each
job_deposit_counts = job_analysis.groupby(['job', 'deposit']).size().unstack(fill_value=0)
# Plot the counts of deposits for each job
job_deposit_counts.plot(kind='bar', figsize=(12, 8), stacked=True, color=['skyblue', 'orange'])
# Add labels and title
plt.xlabel('Job')
plt.ylabel('Number of Deposits')
plt.title('Frequency of deposit status per Customer Occupation')
plt.xticks(rotation=45, ha='right')
# Show the plot with a tight layout
plt.tight_layout()
plt.show()
Key Insights from Job vs. Deposit Analysis:
Job Categories and Deposits:
The visualization reveals that customers in management, technicians, and blue-collar jobs have the highest frequency of term deposits.
On the other hand, occupations such as housemaid and unemployed show significantly fewer deposits. This suggests that job type has a
direct influence on the likelihood of subscribing to a term deposit, potentially due to varying levels of income or financial stability.
Business Implications:
Customers in higher-paying occupations, such as management and technicians, may have more disposable income, making them ideal
targets for long-term savings products like term deposits. Conversely, lower-income job categories may need more personalized financial
products or incentives to consider such investments.
Strategic Recommendations:
The bank could refine its marketing efforts by targeting high-income job categories with premium savings products, while also creating
affordable and flexible deposit plans for customers in lower-income occupations. This differentiation will ensure that the bank maximizes
its outreach and caters to the financial needs of all customer segments.
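Raw counts favour the largest occupations (management alone has 2,566 customers), so it is worth normalising: the share of 'yes' within each job separates genuine propensity from group size. A sketch reusing job_deposit_counts from the cell above:

# Subscription *rate* per occupation: counts alone favour the biggest
# job categories, so normalise by the size of each group.
deposit_rate = (job_deposit_counts['yes']
                / job_deposit_counts.sum(axis=1)).sort_values(ascending=False)
print(deposit_rate)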
customer_data.columns
Index(['age', 'job', 'marital', 'education', 'balance', 'duration', 'campaign',
'poutcome', 'deposit'],
dtype='object')
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create subsets based on job categories
categories = ['management', 'blue-collar', 'technician', 'admin.', 'services', 'retired',
'self-employed', 'student', 'unemployed', 'entrepreneur', 'housemaid', 'unknown']
# Create subplots with 4 rows and 3 columns (since there are 12 job categories)
fig = make_subplots(rows=4, cols=3,
                    specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}],
                           [{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}],
                           [{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}],
                           [{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=categories)

# Add pie charts for each job category, analyzing deposit status distribution
row_col_mapping = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
                   (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3)]

for i, category in enumerate(categories):
    # Subset for the current job category
    job_subset = customer_data[customer_data['job'] == category]
    # Add a pie chart for the current job category
    fig.add_trace(go.Pie(labels=job_subset["deposit"].value_counts().index,
                         values=job_subset["deposit"].value_counts().values,
                         name=category,
                         hole=.4),
                  row_col_mapping[i][0], row_col_mapping[i][1])
# Update layout to increase the size of the plot and add main title
fig.update_layout(
height=1000,
width=1200,
title_text="Deposit Status Distribution by Job Category",
title_font_size=24
)
# Update traces for better display, showing percentage and distinct colors
fig.update_traces(textinfo='percent+label', textfont_size=12,
marker=dict(colors=['#636EFA', '#EF553B'], line=dict(color='#FFFFFF', width=2)))
# Update annotations for subplot titles
fig.update_annotations(font_size=16)
# Show the plot
fig.show()
The plot above visualizes the distribution of deposit status (whether a customer subscribed to a term deposit or not) across different job categories using pie charts. By analyzing the deposit behavior for each job category, the bank can uncover trends and patterns in customer engagement based on their occupation.
Key insights from the pie charts include:
Job-based engagement:
Each job category displays a different proportion of customers who subscribed to a term deposit. For example, categories such as management or retired may have higher percentages of positive deposits, suggesting that these groups are more responsive to deposit offers.
Targeted marketing opportunities:
Categories like blue-collar or unemployed may show lower engagement, indicating a need for more tailored financial products or marketing strategies to better capture their interest. Understanding the nuances between job categories helps the bank refine its approach and optimize resource allocation.
Diverse customer behavior:
The pie charts highlight the diversity in financial behavior across occupations. Job categories like self-employed or students might have specific financial needs that could be addressed by offering customized deposit schemes, such as flexible savings accounts or educational financial plans.
Overall, this visualization allows the bank to identify which job categories have the highest deposit engagement and where there are opportunities to improve through more focused communication or specialized financial products, enhancing overall customer retention and growth.
import plotly.express as px
from plotly.offline import iplot
# Filter the data for customers who have made a deposit ("yes")
deposit_yes = customer_data[customer_data["deposit"] == "yes"]
# Count the occurrences of each job category for customers with deposits marked "yes"
JobDeposits = deposit_yes["job"].value_counts()
# Create the pie chart
fig = px.pie(values=JobDeposits, names=JobDeposits.index,
             color_discrete_sequence=["#98EECC", "#FFB6D9", "#99DBF5", "#F4A261",
                                      "#2A9D8F", "#E76F51", "#264653"],
             template="plotly_dark"
             )
# Update traces for better text display inside the pie slices
fig.update_traces(textposition='inside', textfont_size=20, textinfo='percent+label')
# Update layout to customize the chart size and show the legend
fig.update_layout(showlegend=True, width=1000, height=600, title_text="Customers with Deposits ('Yes') by Job Category")
# Show the pie chart
iplot(fig)
import plotly.express as px
from plotly.offline import iplot
# Filter the data for customers who have not made a deposit ("no")
deposit_no = customer_data[customer_data["deposit"] == "no"]
# Count the occurrences of each job category for customers with deposits marked "no"
JobDepositsNo = deposit_no["job"].value_counts()
# Create the pie chart
fig = px.pie(values=JobDepositsNo, names=JobDepositsNo.index,
             color_discrete_sequence=["#98EECC", "#FFB6D9", "#99DBF5", "#F4A261",
                                      "#2A9D8F", "#E76F51", "#264653"],
             template="plotly_dark"
             )
# Update traces for better text display inside the pie slices
fig.update_traces(textposition='inside', textfont_size=20, textinfo='percent+label')
# Update layout to customize the chart size and show the legend
fig.update_layout(showlegend=True, width=1000, height=600, title_text="Customers without Deposits ('No') by Job Category")
# Show the pie chart
iplot(fig)
from plotly.offline import iplot
import plotly.express as px
# Get the counts of deposits per job
job_deposit_counts = customer_data.groupby('job')['deposit'].value_counts().unstack(fill_value=0)
# Summarize the total number of deposits ('yes' or 'no') for each job
total_deposits_by_job = customer_data['job'].value_counts()
# Display the summary
display(total_deposits_by_job.to_frame())
# Plot the bar chart for the total deposits by job
fig = px.bar(
data_frame=total_deposits_by_job,
x=total_deposits_by_job.index, # Job categories
y=total_deposits_by_job.values, # Frequency of deposits
color=total_deposits_by_job.index, # Color by job
text_auto=".2s", # Add auto text formatting
labels={"y": "Frequency", "index": "Job Categories"} # Rename axes
)
# Update trace properties (font size, etc.)
fig.update_traces(textfont_size=24)
# Add a title to the plot
fig.update_layout(title_text="Total Number of Customers by Job Category", title_font_size=24)
# Show the plot using iplot
iplot(fig)
               count
job
management      2566
blue-collar     1944
technician      1823
admin.          1334
services         923
retired          778
self-employed    405
student          360
unemployed       357
entrepreneur     328
housemaid        274
unknown           70
import plotly.express as px
from plotly.offline import iplot
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(1,2,subplot_titles=('Age Distribution','Log Age Distribution'))
fig.append_trace(go.Histogram(x=customer_data['age'],
name='Age Distribution') ,1,1)
fig.append_trace(go.Histogram(x=np.log10(customer_data['age']),
name='Log Age Distribution') ,1,2)
iplot(fig)
Purpose:
Age Distribution:
The first histogram shows the original distribution of ages in the dataset.
Log Age Distribution:
The second histogram shows the distribution of the logarithmic transformation of ages, which can help detect patterns or trends that
might not be visible in the raw age data (such as clustering in lower age ranges).
This is useful for understanding both the raw distribution of ages and the transformed distribution.
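Whether the transform actually makes the distribution more symmetric can be checked numerically with the sample skewness; a brief sketch:

# Compare skewness before and after the log10 transform; values closer
# to zero indicate a more symmetric distribution.
print('raw age skew:   ', customer_data['age'].skew())
print('log10 age skew: ', np.log10(customer_data['age']).skew())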
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot
# Calculate quartiles and IQR for the 'balance' column
Q25 = np.quantile(customer_data['balance'], q=0.25)
Q75 = np.quantile(customer_data['balance'], q=0.75)
IQR = Q75 - Q25
cut_off = IQR * 1.5
# Print number of outliers in the 'balance' column
print('Number of Balance Lower Outliers:', customer_data[customer_data['balance'] <= (Q25 - cut_off)]['balance'].count())
print('Number of Balance Upper Outliers:', customer_data[customer_data['balance'] >= (Q75 + cut_off)]['balance'].count())

# Group the dataset by 'job' and sum the numeric columns, sorting by 'balance' for analysis
temp = customer_data.groupby('job').sum(numeric_only=True).sort_values('balance', ascending=False)
# Create bar chart for balance distribution by job category
data = [
go.Bar(x=temp.index, y=temp['balance'], name='Balance', text=temp['balance'], textposition='auto')
]
# Define layout for the plot
layout = go.Layout(
title='Total Balance Distribution by Job Category',
xaxis=dict(title='Job Categories', titlefont=dict(size=25)),
yaxis=dict(title='Total Balance', titlefont=dict(size=25)),
showlegend=True,
width=1300,
height=600
)
# Create figure and plot
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Number of Balance Lower Outliers: 4
Number of Balance Upper Outliers: 1052
from matplotlib.colors import ListedColormap
colors = ["windows blue", "amber", "coral", "faded green"]
# plot them as a palette
sns.palplot(sns.xkcd_palette(colors))
customer_data
       age          job  marital  education  balance  duration  campaign poutcome deposit
0       59       admin.  married  secondary     2343      1042         1  unknown     yes
1       56       admin.  married  secondary       45      1467         1  unknown     yes
2       41   technician  married  secondary     1270      1389         1  unknown     yes
3       55     services  married  secondary     2476       579         1  unknown     yes
4       54       admin.  married   tertiary      184       673         2  unknown     yes
...    ...          ...      ...        ...      ...       ...       ...      ...     ...
11157   33  blue-collar   single    primary        1       257         1  unknown      no
11158   39     services  married  secondary      733        83         4  unknown      no
11159   32   technician   single  secondary       29       156         2  unknown      no
11160   43   technician  married  secondary        0         9         2  failure      no
11161   34   technician  married  secondary        0       628         1  unknown      no

11162 rows × 9 columns
import pandas as pd
import plotly.figure_factory as ff
# Grouping the data
heatmap_data = customer_data.groupby(['job', 'marital']).size().unstack()
# Preparing the data for the heatmap
z = heatmap_data.values.tolist()
x = heatmap_data.columns.tolist()
y = heatmap_data.index.tolist()
# Define your color palette
colors = ['#FF5733', '#33FF57', '#3357FF', '#F1C40F']
# Create a colorscale with normalized levels
cmap = [[i / (len(colors) - 1), color] for i, color in enumerate(colors)]
# Function to format title
def format_title(title, subtitle=None, subtitle_font=None, subtitle_font_size=None):
    if not subtitle:
        return title
    # Render the subtitle as a smaller second line under the main title
    return f'{title}<br><sup>{subtitle}</sup>'
# Create the annotated heatmap
fig = ff.create_annotated_heatmap(
z=z,
x=x,
y=y,
xgap=3,
ygap=3,
colorscale=cmap # Use your color palette of choice
)
# Update the layout
title = format_title('Job vs Marital Status', 'Count of Customers', 'Proxima Nova', 12)
fig.update_layout(
title_text=title,
title_x=0.5,
titlefont={'size': 24, 'family': 'Proxima Nova'},
template='plotly_dark',
paper_bgcolor='#2B2E4A',
plot_bgcolor='#2B2E4A',
xaxis=dict(side='bottom', showgrid=False, title='Marital Status'),
yaxis=dict(showgrid=False, autorange='reversed', title='Job'),
)
# Show the figure
fig.show()
# Sunburst of education level, split by deposit status
fig = px.sunburst(customer_data,
path=['education', 'deposit'])
fig.update_layout(title_text="Education vs Deposit",
titlefont={'size': 24, 'family':'Serif'},
width=750,
height=750,
)
fig.show()
fig = px.histogram(customer_data, x="education",
width=600,
height=400,
histnorm='percent',
category_orders={
"education": ["secondary", "tertiary", "primary", "unknown"],
"deposit": ["yes", "no"]
},
color_discrete_map={
"yes": "RebeccaPurple", "no": "lightsalmon",
},
template="simple_white"
)
fig.update_layout(title="Education",
font_family="San Serif",
titlefont={'size': 20},
legend=dict(
orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )
).update_xaxes(categoryorder='total descending')
# custom color
colors = ['gray',] * 4
colors[3] = 'crimson'
colors[0] = 'lightseagreen'
fig.update_traces(marker_color=colors, marker_line_color=None,
marker_line_width=2.5, opacity=None)
fig.show()
Key Insights from the Education vs. Deposit Analysis:
Education's Role in Decision-Making:
The plot reveals that customers with secondary and tertiary education levels make up the majority of term deposit subscribers. Customers with primary education or those whose education is unknown have significantly lower subscription rates. This suggests that higher education levels are correlated with increased financial literacy and, consequently, a higher likelihood to invest in term deposits.
Business Implications:
This insight indicates that the bank's marketing campaigns for term deposits could be tailored more effectively. Customers with higher education levels may respond better to product features related to long-term savings and investment potential. On the other hand, those with lower education levels may need simpler messaging or incentives to consider such products.
Strategic Recommendations:
The bank could design targeted financial literacy programs or simplified product offerings for customers with primary education or unknown education levels. Personalized marketing strategies could further drive engagement among this segment, thereby improving conversion rates.
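The claimed gap in subscription rates can be made explicit with a normalised cross-tabulation, which removes the effect of unequal group sizes; a minimal sketch:

# Share of 'yes'/'no' deposits within each education level.
edu_rates = pd.crosstab(customer_data['education'],
                        customer_data['deposit'],
                        normalize='index')
print(edu_rates.round(2))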
import pandas as pd
import plotly.figure_factory as ff
# Grouping the data by education and marital status
heatmap_data = customer_data.groupby(['education', 'marital']).size().unstack()
# Preparing the data for the heatmap
z = heatmap_data.values.tolist()    # Values for the heatmap
x = heatmap_data.columns.tolist()   # 'marital' categories
y = heatmap_data.index.tolist()     # 'education' categories
# Define your color palette for the heatmap
colors = ['#FF5733', '#33FF57', '#3357FF', '#F1C40F'] # Custom colors
# Create a colorscale with normalized levels
cmap = [[i / (len(colors) - 1), color] for i, color in enumerate(colors)]
# Function to format the title (from your previous code)
def format_title(title, subtitle=None, subtitle_font=None, subtitle_font_size=None):
    title = f'<b>{title}</b>'
    if not subtitle:
        return title
    subtitle = f'<span style="font-size: {subtitle_font_size}px; font-family: {subtitle_font};">{subtitle}</span>'
    return f'{title}<br>{subtitle}'
# Create the annotated heatmap
fig = ff.create_annotated_heatmap(
z=z,
x=x,
y=y,
xgap=3, # Adjust gap between cells on the x-axis
ygap=3, # Adjust gap between cells on the y-axis
colorscale=cmap, # Use your custom color palette
showscale=True
# Display the color scale
)
# Format title
title = format_title('Education vs Marital Status', 'Count of Customers', 'Proxima Nova', 12)
# Update the layout
fig.update_layout(
title_text=title,
title_x=0.5,
titlefont={'size': 24, 'family': 'Proxima Nova'},
template='plotly_dark', # Keep dark background similar to the histogram
paper_bgcolor='#2B2E4A', # Dark background for the paper
plot_bgcolor='#2B2E4A',
# Dark background for the plot
xaxis=dict(side='bottom', showgrid=False, title='Marital Status'), # No grid lines
yaxis=dict(showgrid=False, autorange='reversed', title='Education') # Reversed y-axis for better reading
)
# Show the figure
fig.show()
import pandas as pd
import plotly.figure_factory as ff
# Filter the data for customers with deposits marked as 'yes'
yes_deposit_data = customer_data[customer_data['deposit'] == 'yes']
# Grouping the filtered data by education and marital status
heatmap_data = yes_deposit_data.groupby(['education', 'marital']).size().unstack()
# Preparing the data for the heatmap
z = heatmap_data.values.tolist()    # Values for the heatmap
x = heatmap_data.columns.tolist()   # 'marital' categories
y = heatmap_data.index.tolist()     # 'education' categories
# Define your color palette for the heatmap
colors = ['#FF5733', '#33FF57', '#3357FF', '#F1C40F'] # Custom colors
# Create a colorscale with normalized levels
cmap = [[i / (len(colors) - 1), color] for i, color in enumerate(colors)]
# Function to format the title (from your previous code)
def format_title(title, subtitle=None, subtitle_font=None, subtitle_font_size=None):
    title = f'<b>{title}</b>'
    if not subtitle:
        return title
    subtitle = f'<span style="font-size: {subtitle_font_size}px; font-family: {subtitle_font};">{subtitle}</span>'
    return f'{title}<br>{subtitle}'
# Create the annotated heatmap
fig = ff.create_annotated_heatmap(
z=z,
x=x,
y=y,
xgap=3, # Adjust gap between cells on the x-axis
ygap=3, # Adjust gap between cells on the y-axis
colorscale=cmap, # Use your custom color palette
showscale=True
# Display the color scale
)
# Format title
title = format_title('Education vs Marital Status (Deposits = Yes)', 'Count of Customers', 'Proxima Nova', 12)
# Update the layout
fig.update_layout(
title_text=title,
title_x=0.5,
titlefont={'size': 24, 'family': 'Proxima Nova'},
template='plotly_dark', # Keep dark background similar to the histogram
paper_bgcolor='#2B2E4A', # Dark background for the paper
plot_bgcolor='#2B2E4A',
# Dark background for the plot
xaxis=dict(side='bottom', showgrid=False, title='Marital Status'), # No grid lines
yaxis=dict(showgrid=False, autorange='reversed', title='Education') # Reversed y-axis for better reading
)
# Show the figure
fig.show()
import pandas as pd
import plotly.figure_factory as ff
# Filter the data for customers with deposits marked as 'no'
no_deposit_data = customer_data[customer_data['deposit'] == 'no']
# Grouping the filtered data by education and marital status
heatmap_data = no_deposit_data.groupby(['education', 'marital']).size().unstack()
# Preparing the data for the heatmap
z = heatmap_data.values.tolist()    # Values for the heatmap
x = heatmap_data.columns.tolist()   # 'marital' categories
y = heatmap_data.index.tolist()     # 'education' categories
# Define your color palette for the heatmap
colors = ['#FF5733', '#33FF57', '#3357FF', '#F1C40F'] # Custom colors
# Create a colorscale with normalized levels
cmap = [[i / (len(colors) - 1), color] for i, color in enumerate(colors)]
# Function to format the title (from your previous code)
def format_title(title, subtitle=None, subtitle_font=None, subtitle_font_size=None):
    title = f'<b>{title}</b>'
    if not subtitle:
        return title
    subtitle = f'<span style="font-size: {subtitle_font_size}px; font-family: {subtitle_font};">{subtitle}</span>'
    return f'{title}<br>{subtitle}'
# Create the annotated heatmap
fig = ff.create_annotated_heatmap(
z=z,
x=x,
y=y,
xgap=3, # Adjust gap between cells on the x-axis
ygap=3, # Adjust gap between cells on the y-axis
colorscale=cmap, # Use your custom color palette
showscale=True
# Display the color scale
)
# Format title
title = format_title('Education vs Marital Status (Deposits = No)', 'Count of Customers', 'Proxima Nova', 12)
# Update the layout
fig.update_layout(
title_text=title,
title_x=0.5,
titlefont={'size': 24, 'family': 'Proxima Nova'},
template='plotly_dark', # Keep dark background similar to the histogram
paper_bgcolor='#2B2E4A', # Dark background for the paper
plot_bgcolor='#2B2E4A',
# Dark background for the plot
xaxis=dict(side='bottom', showgrid=False, title='Marital Status'), # No grid lines
yaxis=dict(showgrid=False, autorange='reversed', title='Education') # Reversed y-axis for better reading
)
# Show the figure
fig.show()
import plotly.express as px
# Color palette
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)',
'rgba(122, 120, 168, 0.8)', 'rgba(164, 163, 204, 0.85)',
'rgba(190, 192, 213, 1)']
# Creating the histogram with the 'deposit' column
fig = px.histogram(customer_data, y="deposit",
orientation='h',
width=800,
height=350,
histnorm='percent',
template="plotly_dark")
# Updating layout
fig.update_layout(
title="Deposit Distribution",
font_family="Sans Serif",
bargap=0.2,
barmode='group',
titlefont={'size': 28},
paper_bgcolor='lightgray',
plot_bgcolor='lightgray',
legend=dict(
orientation="v",
y=1,
yanchor="top",
x=1.250,
xanchor="right"
)
)
# Adding annotations
annotations = []
annotations.append(dict(xref='paper', yref='paper',
x=0.0, y=1.2,
text='no',
font=dict(family='Arial', size=16, color=colors[2]),
showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
x=0.50, y=0.85,
text='47.39%',
font=dict(family='Arial', size=20, color=colors[2]),
showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
x=1.08, y=0.19,
text='52.61%',
font=dict(family='Arial', size=20, color=colors[2]),
showarrow=False))
fig.update_layout(
autosize=False,
width=800,
height=350,
margin=dict(
l=50,
r=50,
b=50,
t=120,
),
annotations=annotations
)
# Removing gridlines
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
# Display the figure
fig.show()
customer_data['deposit'].value_counts()
deposit
no     5873
yes    5289
Name: count, dtype: int64
data = pd.read_csv('C:/Users/Collins PC/Downloads/bank_marketting_campaign/bank.csv')
data
       age          job  marital education default  balance housing loan   contact  day month  duration  campaign  pdays  previous
0       59       admin.  married secondary      no     2343     yes   no   unknown    5   may      1042         1     -1         0
1       56       admin.  married secondary      no       45      no   no   unknown    5   may      1467         1     -1         0
2       41   technician  married secondary      no     1270     yes   no   unknown    5   may      1389         1     -1         0
3       55     services  married secondary      no     2476     yes   no   unknown    5   may       579         1     -1         0
4       54       admin.  married  tertiary      no      184      no   no   unknown    5   may       673         2     -1         0
...    ...          ...      ...       ...     ...      ...     ...  ...       ...  ...   ...       ...       ...    ...       ...
11157   33  blue-collar   single   primary      no        1     yes   no  cellular   20   apr       257         1     -1         0
11158   39     services  married secondary      no      733      no   no   unknown   16   jun        83         4     -1         0
11159   32   technician   single secondary      no       29      no   no  cellular   19   aug       156         2     -1         0
11160   43   technician  married secondary      no        0      no  yes  cellular    8   may         9         2    172         5
11161   34   technician  married secondary      no        0      no   no  cellular    9   jul       628         1     -1         0

11162 rows × 17 columns
import pandas as pd
import datetime as dt
# Convert 'day' and 'month' into a datetime column
data['date'] = pd.to_datetime(data['day'].astype(str) + '-' + data['month'] + '-2023', format='%d-%b-%Y')
# Today's date (for reference)
today = dt.datetime.today()
# Recency: Based on 'pdays' (handling -1 as a large number like 999)
data['Recency'] = data['pdays'].apply(lambda x: x if x != -1 else 999)
# Frequency: Using 'campaign' to represent how many times the customer was contacted
data['Frequency'] = data['campaign']
# Monetary: Using 'balance' to represent the monetary value
data['Monetary'] = data['balance']
# Aggregating RFM metrics per customer (assuming each age is unique for simplicity)
rfm = data.groupby('age').agg({
    'Recency': 'min',      # Minimum recency for the latest interaction
    'Frequency': 'sum',    # Sum of campaign interactions
    'Monetary': 'sum'      # Total balance as monetary value
}).reset_index()
# Create RFM segments by quartiles
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])
# Combine RFM scores
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# Segment customers based on RFM score
def assign_segment(score):
    if score == '444':
        return 'Best Customers'
    elif score[0] == '4':    # Recency score is high
        return 'Recent Customers'
    elif score[1] == '4':    # Frequency score is high
        return 'Loyal Customers'
    elif score[2] == '4':    # Monetary score is high
        return 'High Monetary Customers'
    elif score in ['111', '211', '311']:  # Low across all metrics
        return 'Low Value Customers'
    else:
        return 'Others'
rfm['Segment'] = rfm['RFM_Score'].apply(assign_segment)
# Display the result
print(rfm)
# Optionally save the result to a CSV
rfm.to_csv('RFM_Customer_Segmentation.csv', index=False)
    age  Recency  Frequency  Monetary R_Score F_Score M_Score RFM_Score              Segment
0    18       93         13      1896       1       1       1       111  Low Value Customers
1    19      180         25      3681       1       1       1       111  Low Value Customers
2    20       91         40     20264       2       2       1       221               Others
3    21       89         59     31926       2       2       1       221               Others
4    22       38         79     40559       3       2       1       321               Others
..  ...      ...        ...       ...     ...     ...     ...       ...                  ...
71   89      999          5       553       1       1       1       111  Low Value Customers
72   90      999          4       713       1       1       1       111  Low Value Customers
73   92       96          7      1550       1       1       1       111  Low Value Customers
74   93       13          4      1550       3       1       1       311  Low Value Customers
75   95      999         17      2282       1       1       1       111  Low Value Customers

[76 rows x 9 columns]
rfm
    age  Recency  Frequency  Monetary R_Score F_Score M_Score RFM_Score              Segment
0    18       93         13      1896       1       1       1       111  Low Value Customers
1    19      180         25      3681       1       1       1       111  Low Value Customers
2    20       91         40     20264       2       2       1       221               Others
3    21       89         59     31926       2       2       1       221               Others
4    22       38         79     40559       3       2       1       321               Others
..  ...      ...        ...       ...     ...     ...     ...       ...                  ...
71   89      999          5       553       1       1       1       111  Low Value Customers
72   90      999          4       713       1       1       1       111  Low Value Customers
73   92       96          7      1550       1       1       1       111  Low Value Customers
74   93       13          4      1550       3       1       1       311  Low Value Customers
75   95      999         17      2282       1       1       1       111  Low Value Customers

76 rows × 9 columns
rfm['Segment'].value_counts()
Segment
Others
Low Value Customers
Recent Customers
Loyal Customers
Best Customers
High Monetary Customers
Name: count, dtype: int64
rfm['RFM_Score'].value_counts()
RFM_Score
...
Name: count, dtype: int64
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming rfm DataFrame already exists with 'F_Score' and 'M_Score'
# Segmenting Loyal Customers
loyal_customers = rfm[rfm['RFM_Score'].isin(['444', '433', '344'])]
# Plotting Loyal Customers
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Frequency', y='Monetary', data=loyal_customers, color='blue', s=100)
# Adding titles and labels
plt.title('Loyal Customers: High Frequency and High Monetary', fontsize=16)
plt.xlabel('Frequency (Number of Interactions)', fontsize=12)
plt.ylabel('Monetary (Customer Balance)', fontsize=12)
# Display the plot
plt.show()
Key Insights from Loyal Customer Segment: Loyal Customer Behavior: The scatter plot of Frequency vs. Monetary clearly shows that
loyal customers tend to engage with the bank frequently and have high monetary contributions. This indicates a strong relationship
between the bank and these customers, suggesting they are satisfied with the bank's offerings and may be prime candidates for upselling
opportunities.
Business Implications: These loyal customers are essential to the bank’s revenue and retention strategies. Focusing on this group can
help the bank maintain its strong customer base while also exploring opportunities to increase customer lifetime value through premium
products, personalized services, or rewards programs.
Strategic Recommendations: The bank can further enhance loyalty by offering exclusive promotions, personalized financial advice, or
reward programs to these customers. Such actions would not only reinforce their loyalty but could also drive higher revenue growth by
deepening customer relationships.
loyal_customers
    age  Recency  Frequency  Monetary R_Score F_Score M_Score RFM_Score         Segment
12   30        6       1179    594468       4       4       4       444  Best Customers
13   31        2       1295    661376       4       4       4       444  Best Customers
14   32        4       1223    704363       4       4       4       444  Best Customers
16   34        1       1070    615477       4       4       4       444  Best Customers
17   35        5       1238    537845       4       4       4       444  Best Customers
20   38        1        926    486068       4       4       4       444  Best Customers
24   42        1        925    444490       4       4       4       444  Best Customers
# Filter "Others" segment and save it in a new DataFrame
others_segment = rfm[rfm['Segment'] == 'Others']
others_segment
    age  Recency  Frequency  Monetary R_Score F_Score M_Score RFM_Score Segment
2    20       91         40     20264       2       2       1       221  Others
3    21       89         59     31926       2       2       1       221  Others
4    22       38         79     40559       3       2       1       321  Others
5    23       83        136     66006       2       3       2       232  Others
6    24       80        228    140024       2       3       2       232  Others
7    25       10        349    161797       3       3       3       333  Others
29   47       10        605    379871       3       3       3       333  Others
34   52       48        580    343639       3       3       3       333  Others
36   54       70        452    277483       2       3       3       233  Others
41   59       56        432    301095       2       3       3       233  Others
43   61       91        130    266878       2       2       3       223  Others
45   63       91         84     88671       2       2       2       222  Others
46   64       89         65     81518       2       2       2       222  Others
47   65       92         38     81474       1       2       2       122  Others
48   66       87         59     42964       2       2       2       222  Others
49   67       84         64     51366       2       2       2       222  Others
51   69       56         36     54154       2       2       2       222  Others
52   70       88         30     49003       2       1       2       212  Others
53   71       91         54     74787       2       2       2       222  Others
54   72       64         56     53442       2       2       2       222  Others
55   73       40         64     51668       3       2       2       322  Others
56   74       93         37     71374       1       2       2       122  Others
57   75       94         38     86341       1       2       2       122  Others
58   76       87         39     53909       2       2       2       222  Others
59   77       60         61    100104       2       2       2       222  Others
66   84       92          8    167808       1       1       3       113  Others
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming 'rfm' DataFrame already exists and 'others_segment' has been filtered
# Filter "Others" segment (if not already done)
others_segment = rfm[rfm['Segment'] == 'Others']
# Plotting the "Others" Segment
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Frequency', y='Monetary', data=others_segment, color='orange', s=100)
# Adding titles and labels
plt.title('Customer Segment: "Others" (Low Recency, Frequency, or Monetary)', fontsize=16)
plt.xlabel('Frequency (Number of Interactions)', fontsize=12)
plt.ylabel('Monetary (Customer Balance)', fontsize=12)
# Display the plot
plt.show()
Key Business Insights from RFM Analysis & "Others" Segment: Customer Segment Insights: The “Others” segment, visualized through a
scatter plot, represents customers with low recency, frequency, or monetary value. These are likely to be disengaged or inactive
customers who haven't interacted much with the bank recently or who may not hold significant balances. Understanding this segment can
help the bank in designing targeted re-engagement strategies, focusing on improving their experience and offering personalized
incentives.
Segmentation Value: By focusing on low-engagement customers, the bank can identify patterns that contribute to reduced interactions
and devise retention strategies to prevent churn. Furthermore, increasing engagement within this group could lead to improved monetary
contributions from underperforming segments.
Strategic Decisions: Insights from this segment highlight the need for improved communication and offers that cater to the financial habits
of these low-engagement customers. This analysis provides a starting point for more effective reactivation campaigns, personalized
follow-ups, and possible product adjustments.
data
       age          job  marital education default  balance housing loan   contact  day  ...  duration  campaign  pdays  previous
0       59       admin.  married secondary      no     2343     yes   no   unknown    5  ...      1042         1     -1         0
1       56       admin.  married secondary      no       45      no   no   unknown    5  ...      1467         1     -1         0
2       41   technician  married secondary      no     1270     yes   no   unknown    5  ...      1389         1     -1         0
3       55     services  married secondary      no     2476     yes   no   unknown    5  ...       579         1     -1         0
4       54       admin.  married  tertiary      no      184      no   no   unknown    5  ...       673         2     -1         0
...    ...          ...      ...       ...     ...      ...     ...  ...       ...  ...  ...       ...       ...    ...       ...
11157   33  blue-collar   single   primary      no        1     yes   no  cellular   20  ...       257         1     -1         0
11158   39     services  married secondary      no      733      no   no   unknown   16  ...        83         4     -1         0
11159   32   technician   single secondary      no       29      no   no  cellular   19  ...       156         2     -1         0
11160   43   technician  married secondary      no        0      no  yes  cellular    8  ...         9         2    172         5
11161   34   technician  married secondary      no        0      no   no  cellular    9  ...       628         1     -1         0

11162 rows × 21 columns
Feature Engineering
Understand the data
col = list(data.columns)
object_columns = data.select_dtypes(include=['object']).columns.tolist()
categorical_features = []
numerical_features = []
for i in col:
    unique_values = data[i].unique()
    if i in object_columns:
        categorical_features.append(i)
    else:
        numerical_features.append(i)
    print(f"Column: {i}")
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    print("-" * 50)
print('Categorical Features:', *categorical_features)
print('Numerical Features:', *numerical_features)
Column: age
Unique Values (76): [...]
--------------------------------------------------
Column: job
Unique Values (12): ['admin.' 'technician' 'services' 'management' 'retired' 'blue-collar'
 'unemployed' 'entrepreneur' 'housemaid' 'unknown' 'self-employed' 'student']
--------------------------------------------------
Column: marital
Unique Values (3): ['married' 'single' 'divorced']
--------------------------------------------------
Column: education
Unique Values (4): ['secondary' 'tertiary' 'primary' 'unknown']
--------------------------------------------------
Column: default
Unique Values (2): ['no' 'yes']
--------------------------------------------------
Column: balance
Unique Values (3805): [...]
--------------------------------------------------
Column: housing
Unique Values (2): ['yes' 'no']
--------------------------------------------------
Column: loan
Unique Values (2): ['no' 'yes']
--------------------------------------------------
Column: contact
Unique Values (3): ['unknown' 'cellular' 'telephone']
--------------------------------------------------
Column: day
Unique Values (31): [...]
--------------------------------------------------
Column: month
Unique Values (12): ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']
--------------------------------------------------
Column: duration
Unique Values (1428): [...]
--------------------------------------------------
Column: campaign
Unique Values (36): [...]
--------------------------------------------------
Column: pdays
Unique Values (472): [...]
--------------------------------------------------
Column: previous
Unique Values (34): [...]
--------------------------------------------------
Column: poutcome
Unique Values (4): ['unknown' 'other' 'failure' 'success']
--------------------------------------------------
Column: deposit
Unique Values (2): ['yes' 'no']
--------------------------------------------------
Column: date
Unique Values (300): [...]
Length: 300, dtype: datetime64[ns]
--------------------------------------------------
Column: Recency
Unique Values (472): [...]
--------------------------------------------------
Column: Frequency
Unique Values (36): [...]
--------------------------------------------------
Column: Monetary
Unique Values (3805): [...]
--------------------------------------------------
Categorical Features: job marital education default housing loan contact month poutcome deposit
Numerical Features: age balance day duration campaign pdays previous date Recency Frequency Monetary
Summary Report:
Customer Dataset Overview
The dataset consists of customer demographic, financial, and engagement-related information. Below is an overview of the key features,
including their unique values and distribution.
1. Categorical Features
Job:
This column contains 12 distinct job categories, such as 'admin.', 'technician', 'management', and 'retired', with an 'unknown' category
present.
Marital Status:
There are 3 marital status values: 'married', 'single', and 'divorced'.
Education:
Educational backgrounds are categorized into 4 groups: 'secondary', 'tertiary', 'primary', and 'unknown'.
Default:
Binary feature indicating if the customer has credit default, with values 'yes' and 'no'.
Housing:
Indicates whether the customer has a housing loan ('yes' or 'no').
Loan:
A binary column showing whether the customer has taken a personal loan ('yes' or 'no').
Contact:
Represents the method of contact during campaigns: 'unknown', 'cellular', or 'telephone'.
Month:
The month in which the campaign contact was made, ranging across 12 months from 'jan' to 'dec'.
Poutcome:
The outcome of previous marketing campaigns, with 4 possible values: 'unknown', 'other', 'failure', and 'success'.
Deposit:
The target variable indicating whether the customer subscribed to a term deposit ('yes' or 'no').
2. Numerical Features
Age:
Ranges from 18 to 95, with 76 unique values, indicating a wide age distribution.
Balance:
Represents the bank balance of customers, with 3,805 unique values, including both positive and negative balances.
Day:
The day of the month the contact was made, with values ranging from 1 to 31.
Duration:
The length of the last contact, with 1,428 unique durations, reflecting significant variability.
Campaign:
The number of contacts made during the current campaign, ranging from 1 to 63, with 36 unique values.
Pdays:
Number of days since the client was last contacted from a previous campaign, with 472 unique values.
Previous:
Indicates the number of contacts performed before the current campaign, with values ranging from 0 to 58.
Date:
Dates are synthetic, constructed by pairing each record's day and month with the year 2023, and so span the 2023 calendar year of customer interactions.
Recency:
Similar to pdays, with values ranging from 1 to 999, and 472 unique values indicating time since the last contact.
Frequency:
The number of contacts made to the customer, ranging from 1 to 63 across 36 unique values.
Monetary:
Similar to balance, this feature represents transactional amounts, with a wide range and 3,805 unique values.
Key Insights:
Demographics:
The customer base spans diverse ages and occupations, with a notable presence of unknown values for job and education.
Campaign Engagement:
Contact methods vary, with both short-term and long-term interactions represented.
Financial Attributes:
Customers have a broad range of balances, indicating varying financial standing, which may affect their deposit decisions.
This dataset provides a wealth of information, useful for segmentation, financial profiling, and predictive modeling to optimize marketing
campaigns and understand customer behavior.
# Select only the categorical columns (copied so columns can be added later without warnings)
categorical_data = data[categorical_features].copy()
# statistics about categorical data
description = categorical_data.describe()
description
# printing more details specific to categorical data:
for column in categorical_features:
    print(f"\nUnique values in {column}:")
    print(data[column].value_counts())
Unique values in job:
job
management       2566
blue-collar      1944
technician       1823
admin.           1334
services          923
retired           778
self-employed     405
student           360
unemployed        357
entrepreneur      328
housemaid         274
unknown            70
Name: count, dtype: int64

Unique values in marital:
marital
married     6351
single      3518
divorced    1293
Name: count, dtype: int64

Unique values in education:
education
secondary    5476
tertiary     3689
primary      1500
unknown       497
Name: count, dtype: int64

Unique values in default:
default
no     10994
yes      168
Name: count, dtype: int64

Unique values in housing:
housing
no     5881
yes    5281
Name: count, dtype: int64

Unique values in loan:
loan
no     9702
yes    1460
Name: count, dtype: int64

Unique values in contact:
contact
cellular     8042
unknown      2346
telephone     774
Name: count, dtype: int64

Unique values in month:
month
may    2824
aug    1519
jul    1514
jun    1222
nov     943
apr     923
feb     776
oct     392
jan     344
sep     319
mar     276
dec     110
Name: count, dtype: int64

Unique values in poutcome:
poutcome
unknown    8326
failure    1228
success    1071
other       537
Name: count, dtype: int64

Unique values in deposit:
deposit
no     5873
yes    5289
Name: count, dtype: int64
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Example data for the month counts
month_counts = pd.DataFrame({
'month': ['may', 'aug', 'jul', 'jun', 'nov', 'apr', 'feb', 'oct', 'jan', 'sep', 'mar', 'dec'],
'count': [2824, 1519, 1514, 1222, 943, 923, 776, 392, 344, 319, 276, 110]
})
# Plotting the month counts
plt.figure(figsize=(10, 6))
sns.barplot(x='month', y='count', data=month_counts, color='skyblue')
# Using a single color instead of palette
# Adding titles and labels
plt.title('Number of Transactions per Month', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Count', fontsize=12)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Display the plot
plt.show()
categorical_data.describe()
               job  marital  education default housing   loan   contact  month poutcome deposit
count        11162    11162      11162   11162   11162  11162     11162  11162    11162   11162
unique          12        3          4       2       2      2         3     12        4       2
top     management  married  secondary      no      no     no  cellular    may  unknown      no
freq          2566     6351       5476   10994    5881   9702      8042   2824     8326    5873
Bank Marketing Campaign Analysis
Objective:
The goal of this project is to analyze customer data from a bank’s marketing campaign to understand patterns and factors influencing the
success of deposit subscriptions. The dataset consists of several categorical variables related to customer demographics, financial
behavior, and marketing outreach, along with a target variable indicating whether the customer subscribed to a deposit.
Dataset Overview:
The dataset contains 11,162 records, each representing a customer contacted during the marketing campaign. The variables included in
the dataset provide insight into customer demographics, financial history, and engagement with the bank’s marketing efforts.
Key Variables:
Job:
Represents the customer’s profession, with 12 unique job categories (e.g., management, technician, blue-collar).
Most common: Management (2,566 customers)
Marital Status:
Indicates the marital status of the customer, with three categories (single, married, divorced).
Most common: Married (6,351 customers)
Education:
Level of education, categorized into four levels (primary, secondary, tertiary, unknown).
Most common: Secondary education (5,476 customers)
Default:
Indicates whether the customer has credit in default (yes or no).
Most common: No (10,994 customers)
Housing Loan:
Whether the customer has a housing loan (yes or no).
Most common: No (5,881 customers)
Personal Loan:
Indicates whether the customer has a personal loan (yes or no).
Most common: No (9,702 customers)
Contact Method:
Type of communication used to contact the customer (cellular, telephone, unknown).
Most common: Cellular (8,042 customers)
Month:
The month during which the customer was last contacted in the campaign, with 12 unique months (e.g., May, July, September).
Most common: May (2,824 customers)
Previous Campaign Outcome (Poutcome):
Outcome of the previous marketing campaign (success, failure, unknown, other).
Most common: Unknown (8,326 customers)
Deposit Subscription (Target Variable):
Indicates whether the customer subscribed to a deposit during the current campaign (yes or no).
Most common: No (5,873 customers)
Analysis Strategy:
Exploratory Data Analysis (EDA):
Examine the distribution of the categorical variables and explore any correlations between customer characteristics and their likelihood of
subscribing to a deposit.
Data Preprocessing:
Handle missing values, encode categorical variables, and prepare the data for machine learning models.
Predictive Modeling:
Use classification algorithms such as logistic regression, decision trees, and random forests to predict whether a customer will subscribe
to a deposit.
Insights and Recommendations:
Provide actionable insights for improving future marketing campaigns based on key influencing factors such as customer profession,
previous campaign outcomes, and contact methods.
Conclusion:
This project will help the bank better understand customer segments that are more likely to subscribe to deposits, optimize targeting
strategies, and improve the overall efficiency of future marketing campaigns.
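As one possible shape for the preprocessing step listed above, the sketch below uses scikit-learn's ColumnTransformer with one-hot encoding; it is an illustrative alternative to the label encoding applied later in this notebook, not the pipeline actually used here:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Columns as they appear in this dataset
categorical_cols = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'poutcome']
numerical_cols = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

# One-hot encode the categoricals and min-max scale the numericals in one step
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', MinMaxScaler(), numerical_cols),
])

X_encoded = preprocess.fit_transform(data[categorical_cols + numerical_cols])
print(X_encoded.shape)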
categorical_data
               job  marital education default housing loan   contact month poutcome deposit
0           admin.  married secondary      no     yes   no   unknown   may  unknown     yes
1           admin.  married secondary      no      no   no   unknown   may  unknown     yes
2       technician  married secondary      no     yes   no   unknown   may  unknown     yes
3         services  married secondary      no     yes   no   unknown   may  unknown     yes
4           admin.  married  tertiary      no      no   no   unknown   may  unknown     yes
...            ...      ...       ...     ...     ...  ...       ...   ...      ...     ...
11157  blue-collar   single   primary      no     yes   no  cellular   apr  unknown      no
11158     services  married secondary      no      no   no   unknown   jun  unknown      no
11159   technician   single secondary      no      no   no  cellular   aug  unknown      no
11160   technician  married secondary      no      no  yes  cellular   may  failure      no
11161   technician  married secondary      no      no   no  cellular   jul  unknown      no

11162 rows × 10 columns
month_mapping = {
'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# Apply the mapping to convert the 'month' column to numerical values
categorical_data['month_numerical'] = categorical_data['month'].map(month_mapping)
# View the updated DataFrame
categorical_data
               job  marital education default housing loan   contact month poutcome deposit  month_numerical
0           admin.  married secondary      no     yes   no   unknown   may  unknown     yes                5
1           admin.  married secondary      no      no   no   unknown   may  unknown     yes                5
2       technician  married secondary      no     yes   no   unknown   may  unknown     yes                5
3         services  married secondary      no     yes   no   unknown   may  unknown     yes                5
4           admin.  married  tertiary      no      no   no   unknown   may  unknown     yes                5
...            ...      ...       ...     ...     ...  ...       ...   ...      ...     ...              ...
11157  blue-collar   single   primary      no     yes   no  cellular   apr  unknown      no                4
11158     services  married secondary      no      no   no   unknown   jun  unknown      no                6
11159   technician   single secondary      no      no   no  cellular   aug  unknown      no                8
11160   technician  married secondary      no      no  yes  cellular   may  failure      no                5
11161   technician  married secondary      no      no   no  cellular   jul  unknown      no                7

11162 rows × 11 columns
# Select only the numerical columns
numerical_data = data[numerical_features]
# statistics about numerical data
numeric_description = numerical_data.describe()
numeric_description
# printing more details specific to numerical data:
for column in numerical_features:
    print(f"\nUnique values in {column}:")
    print(data[column].value_counts())
Unique values in age:
age
...
Name: count, Length: 76, dtype: int64

Unique values in balance:
balance
...
Name: count, Length: 3805, dtype: int64

Unique values in day:
day
...
Name: count, dtype: int64

Unique values in duration:
duration
...
Name: count, Length: 1428, dtype: int64

Unique values in campaign:
campaign
...
Name: count, dtype: int64

Unique values in pdays:
pdays
...
Name: count, Length: 472, dtype: int64

Unique values in previous:
previous
...
Name: count, dtype: int64

Unique values in date:
date
...
Name: count, Length: 300, dtype: int64

Unique values in Recency:
Recency
...
Name: count, Length: 472, dtype: int64

Unique values in Frequency:
Frequency
...
Name: count, dtype: int64

Unique values in Monetary:
Monetary
 0      774
 1        3
 2        4
...
-267      1
-134      1
Name: count, Length: 3805, dtype: int64
numerical_data
       age  balance  day  duration  campaign  pdays  previous        date  Recency  Frequency  Monetary
0       59     2343    5      1042         1     -1         0  2023-05-05      999          1      2343
1       56       45    5      1467         1     -1         0  2023-05-05      999          1        45
2       41     1270    5      1389         1     -1         0  2023-05-05      999          1      1270
3       55     2476    5       579         1     -1         0  2023-05-05      999          1      2476
4       54      184    5       673         2     -1         0  2023-05-05      999          2       184
...    ...      ...  ...       ...       ...    ...       ...         ...      ...        ...       ...
11157   33        1   20       257         1     -1         0  2023-04-20      999          1         1
11158   39      733   16        83         4     -1         0  2023-06-16      999          4       733
11159   32       29   19       156         2     -1         0  2023-08-19      999          2        29
11160   43        0    8         9         2    172         5  2023-05-08      172          2         0
11161   34        0    9       628         1     -1         0  2023-07-09      999          1         0

11162 rows × 11 columns
numerical_data.describe()
         age   balance    day  duration  campaign   pdays  previous  date  Recency  Frequency  Monetary
count  11162     11162  11162     11162     11162   11162     11162 11162    11162      11162     11162
mean   41.23   1528.54  15.66    371.99      2.51   51.33      0.83   ...   797.07       2.51   1528.54
min    18.00  -6847.00   1.00      2.00      1.00   -1.00      0.00   ...     1.00       1.00  -6847.00
25%    32.00    122.00   8.00    138.00      1.00   -1.00      0.00   ...   521.00       1.00    122.00
50%    39.00    550.00  15.00    255.00      2.00   -1.00      0.00   ...   999.00       2.00    550.00
75%    49.00   1708.00  22.00    496.00      3.00   20.75      1.00   ...   999.00       3.00   1708.00
max    95.00  81204.00  31.00   3881.00     63.00  854.00     58.00   ...   999.00      63.00  81204.00
std    11.91   3225.41   8.42    347.13      2.72  108.76      2.29   NaN   351.28       2.72   3225.41
Dataset Overview:
The dataset contains 11,162 records, each with multiple numerical features capturing customer demographics, financial account status,
and marketing campaign interactions. The dataset includes attributes such as customer age, balance, duration of engagement, and
transactional details.
Key Variables:
Age:
The age of the customers ranges from 18 to 95 years, with an average age of 41.23. This suggests the campaign targets both younger
and older demographics, with a focus on middle-aged individuals (25% percentile: 32 years, 75% percentile: 49 years).
Account Balance:
The balance in customer accounts varies widely, with a mean balance of 1,528.54 and a significant range between -6,847 and 81,204.
This indicates a mix of customers with negative balances (likely overdrawn) and others with substantial savings. The median balance is
550, showing that most customers have modest savings.
Day:
This represents the day of the month when the customer was last contacted during the campaign, with values ranging from 1 to 31. The
median day is 15, indicating an even distribution of contact throughout the month.
Duration:
This measures the duration (in seconds) of the last call in the marketing campaign. The mean call duration is around 372 seconds, with a
minimum of 2 seconds and a maximum of 3,881 seconds (about 65 minutes). This wide variance suggests very different engagement levels
during customer interactions, with longer durations possibly correlating with higher conversion rates.
Campaign:
The number of contacts made to a customer during the campaign ranges from 1 to 63, with an average of 2.51 contacts per customer.
This suggests most customers are contacted 2-3 times, while some outliers have received more frequent follow-ups.
Pdays (Days since Previous Campaign Contact):
The number of days since the customer was last contacted in a previous campaign has an average of 51 days, but ranges from -1 (no
previous contact) to 854 days. This large range highlights the need for better-targeted follow-up strategies for customers who were last
contacted long ago.
Previous Contacts:
The number of previous contacts with the customer before the current campaign averages at 0.83, meaning that most customers are new
or have minimal prior interactions with the bank’s marketing efforts.
Date:
This represents the date of the campaign contact, with records from January 2023 to December 2023. Most contacts occurred between
May and August, with peaks likely in specific months.
Recency (in days):
Recency, which measures the number of days since the customer's last interaction with the bank, has a mean of 797 days. However, the
distribution skews heavily: many customers have recent interactions (1 day), while the maximum value of 999 is the placeholder assigned to
customers with no recorded previous contact.
Frequency:
The frequency of transactions, measured as the number of customer interactions, averages 2.51, mirroring the number of campaign
contacts. This variable provides insight into customer engagement levels over time.
Monetary (Balance):
This feature duplicates the balance variable, representing the monetary value of each customer’s account. It again highlights the wide
variation in customer financial behavior, with most customers holding low balances (median: 550) and a few holding very high balances
(maximum: 81,204).
Analysis Strategy:
Exploratory Data Analysis (EDA):
Understand the distribution and relationships between variables, focusing on customer behavior across different age groups, balance
levels, and campaign engagement metrics.
Customer Segmentation:
Cluster customers based on their age, balance, recency, frequency, and monetary value to create distinct customer profiles that can be
used to tailor marketing strategies.
Predictive Modeling:
Build machine learning models to predict which customers are likely to subscribe to deposits based on their engagement metrics (e.g.,
duration of calls, campaign contacts) and financial profile.
Campaign Effectiveness:
Analyze the impact of variables such as campaign duration, the number of contacts, and recency on deposit subscription rates, to
optimize future marketing efforts.
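A rough first pass at the campaign-effectiveness question above is to correlate each numeric feature with the target. The sketch below assumes the numerical_data frame from the cell above and the raw deposit column (still 'yes'/'no' at this point):
# Pearson correlation of each numeric feature with a 0/1 version of the
# deposit target, ordered by absolute strength.
corr = (
    numerical_data.drop(columns=['date'])            # datetime column excluded
                  .assign(deposit=(data['deposit'] == 'yes').astype(int))
                  .corr()['deposit']
                  .drop('deposit')
)
print(corr.reindex(corr.abs().sort_values(ascending=False).index))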
numerical_data.columns
Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
'date', 'Recency', 'Frequency', 'Monetary'],
dtype='object')
numeric = numerical_data[['age', 'balance', 'day', 'duration', 'campaign', 'Recency', 'previous']]
numeric
       age  balance  day  duration  campaign  Recency  previous
0       59     2343    5      1042         1      999         0
1       56       45    5      1467         1      999         0
2       41     1270    5      1389         1      999         0
3       55     2476    5       579         1      999         0
4       54      184    5       673         2      999         0
...    ...      ...  ...       ...       ...      ...       ...
11157   33        1   20       257         1      999         0
11158   39      733   16        83         4      999         0
11159   32       29   19       156         2      999         0
11160   43        0    8         9         2      172         5
11161   34        0    9       628         1      999         0

11162 rows × 7 columns
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create the scatter plot
plt.figure(figsize=(10, 6))
# Shorten the range for hue (recency) and sizes (duration)
scatter = sns.scatterplot(
x='age',
y='balance',
size='duration',
hue='Recency',
sizes=(40, 150), # Adjust size range
palette='coolwarm_r', # Reverse palette for better clarity
data=numeric,
legend=False # Disable default legend
)
# Customize the legend manually to make it shorter
from matplotlib.lines import Line2D
# Creating custom markers for the legend
legend_elements = [
    Line2D([0], [0], marker='o', color='w', label='Low Recency', markersize=10, markerfacecolor='blue'),
    Line2D([0], [0], marker='o', color='w', label='High Recency', markersize=10, markerfacecolor='red'),
    Line2D([0], [0], marker='o', color='w', label='Short Duration', markersize=6, markerfacecolor='gray'),
    Line2D([0], [0], marker='o', color='w', label='Long Duration', markersize=12, markerfacecolor='gray'),
]
# Add the custom legend to the plot
plt.legend(handles=legend_elements, loc='upper right')
# Set plot title and labels
plt.title('Relationship Between Age, Balance, Duration, and Recency')
plt.xlabel('Age')
plt.ylabel('Balance')
# Show the plot
plt.show()
monthly_numeric = categorical_data[['month_numerical']]
monthly_numeric
       month_numerical
0                    5
1                    5
2                    5
3                    5
4                    5
...                ...
11157                4
11158                6
11159                8
11160                5
11161                7

11162 rows × 1 columns
numeric_df = pd.concat([numeric, monthly_numeric ], axis=1)
numeric_df
       age  balance  day  duration  campaign  Recency  previous  month_numerical
0       59     2343    5      1042         1      999         0                5
1       56       45    5      1467         1      999         0                5
2       41     1270    5      1389         1      999         0                5
3       55     2476    5       579         1      999         0                5
4       54      184    5       673         2      999         0                5
...    ...      ...  ...       ...       ...      ...       ...              ...
11157   33        1   20       257         1      999         0                4
11158   39      733   16        83         4      999         0                6
11159   32       29   19       156         2      999         0                8
11160   43        0    8         9         2      172         5                5
11161   34        0    9       628         1      999         0                7

11162 rows × 8 columns
categorical_data.drop('month_numerical', axis=1, inplace=True)
categorical_data
               job  marital education default housing loan   contact month poutcome deposit
0           admin.  married secondary      no     yes   no   unknown   may  unknown     yes
1           admin.  married secondary      no      no   no   unknown   may  unknown     yes
2       technician  married secondary      no     yes   no   unknown   may  unknown     yes
3         services  married secondary      no     yes   no   unknown   may  unknown     yes
4           admin.  married  tertiary      no      no   no   unknown   may  unknown     yes
...            ...      ...       ...     ...     ...  ...       ...   ...      ...     ...
11157  blue-collar   single   primary      no     yes   no  cellular   apr  unknown      no
11158     services  married secondary      no      no   no   unknown   jun  unknown      no
11159   technician   single secondary      no      no   no  cellular   aug  unknown      no
11160   technician  married secondary      no      no  yes  cellular   may  failure      no
11161   technician  married secondary      no      no   no  cellular   jul  unknown      no

11162 rows × 10 columns
categorical_data.drop('month', axis=1, inplace=True)
categorical_data
               job  marital education default housing loan   contact poutcome deposit
0           admin.  married secondary      no     yes   no   unknown  unknown     yes
1           admin.  married secondary      no      no   no   unknown  unknown     yes
2       technician  married secondary      no     yes   no   unknown  unknown     yes
3         services  married secondary      no     yes   no   unknown  unknown     yes
4           admin.  married  tertiary      no      no   no   unknown  unknown     yes
...            ...      ...       ...     ...     ...  ...       ...      ...     ...
11157  blue-collar   single   primary      no     yes   no  cellular  unknown      no
11158     services  married secondary      no      no   no   unknown  unknown      no
11159   technician   single secondary      no      no   no  cellular  unknown      no
11160   technician  married secondary      no      no  yes  cellular  failure      no
11161   technician  married secondary      no      no   no  cellular  unknown      no

11162 rows × 9 columns
from sklearn.preprocessing import LabelEncoder
# Create a deep copy of the DataFrame
cat = categorical_data.copy(deep=True)
# Initialize LabelEncoder
le = LabelEncoder()
print('Label Encoder Transformation')
for feature in categorical_data:
    cat[feature] = le.fit_transform(cat[feature])
    print(f"{feature} : {cat[feature].unique()} = {le.inverse_transform(cat[feature].unique())}")
# Display the transformed DataFrame
print(cat.head())
Label Encoder Transformation
job : [ 0  9  7  4  5  1 10  2  3 11  6  8] = ['admin.' 'technician' 'services' 'management' 'retired' 'blue-collar'
 'unemployed' 'entrepreneur' 'housemaid' 'unknown' 'self-employed' 'student']
marital : [1 2 0] = ['married' 'single' 'divorced']
education : [1 2 0 3] = ['secondary' 'tertiary' 'primary' 'unknown']
default : [0 1] = ['no' 'yes']
housing : [1 0] = ['yes' 'no']
loan : [0 1] = ['no' 'yes']
contact : [2 0 1] = ['unknown' 'cellular' 'telephone']
poutcome : [3 1 0 2] = ['unknown' 'other' 'failure' 'success']
deposit : [1 0] = ['yes' 'no']
   job  marital  education  default  housing  loan  contact  poutcome  deposit
0    0        1          1        0        1     0        2         3        1
1    0        1          1        0        0     0        2         3        1
2    9        1          1        0        1     0        2         3        1
3    7        1          1        0        1     0        2         3        1
4    0        1          2        0        0     0        2         3        1
cat
       job  marital  education  default  housing  loan  contact  poutcome  deposit
0        0        1          1        0        1     0        2         3        1
1        0        1          1        0        0     0        2         3        1
2        9        1          1        0        1     0        2         3        1
3        7        1          1        0        1     0        2         3        1
4        0        1          2        0        0     0        2         3        1
...    ...      ...        ...      ...      ...   ...      ...       ...      ...
11157    1        2          0        0        1     0        0         3        0
11158    7        1          1        0        0     0        2         3        0
11159    9        2          1        0        0     0        0         3        0
11160    9        1          1        0        0     1        0         0        0
11161    9        1          1        0        0     0        0         3        0

11162 rows × 9 columns
combined_df = pd.concat([cat, numeric_df], axis=1)
combined_df
       job  marital  education  default  housing  loan  contact  poutcome  deposit  age  balance  day  duration  campaign  Recency
0        0        1          1        0        1     0        2         3        1   59     2343    5      1042         1      999
1        0        1          1        0        0     0        2         3        1   56       45    5      1467         1      999
2        9        1          1        0        1     0        2         3        1   41     1270    5      1389         1      999
3        7        1          1        0        1     0        2         3        1   55     2476    5       579         1      999
4        0        1          2        0        0     0        2         3        1   54      184    5       673         2      999
...    ...      ...        ...      ...      ...   ...      ...       ...      ...  ...      ...  ...       ...       ...      ...
11157    1        2          0        0        1     0        0         3        0   33        1   20       257         1      999
11158    7        1          1        0        0     0        2         3        0   39      733   16        83         4      999
11159    9        2          1        0        0     0        0         3        0   32       29   19       156         2      999
11160    9        1          1        0        0     1        0         0        0   43        0    8         9         2      172
11161    9        1          1        0        0     0        0         3        0   34        0    9       628         1      999

11162 rows × 17 columns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
final_df = pd.DataFrame(scaler.fit_transform(combined_df), columns=combined_df.columns)
final_df
        job  marital  education  default  housing  loan  contact  poutcome  deposit   age  balance   day  duration  campaign  Recency
0      0.00      0.5       0.33      0.0      1.0   0.0      1.0       1.0      1.0  0.53     0.10  0.13  2.68e-01      0.00     1.00
1      0.00      0.5       0.33      0.0      0.0   0.0      1.0       1.0      1.0  0.49     0.08  0.13  3.78e-01      0.00     1.00
2      0.82      0.5       0.33      0.0      1.0   0.0      1.0       1.0      1.0  0.30     0.09  0.13  3.58e-01      0.00     1.00
3      0.64      0.5       0.33      0.0      1.0   0.0      1.0       1.0      1.0  0.48     0.11  0.13  1.49e-01      0.00     1.00
4      0.00      0.5       0.67      0.0      0.0   0.0      1.0       1.0      1.0  0.47     0.08  0.13  1.73e-01      0.02     1.00
...     ...      ...        ...      ...      ...   ...      ...       ...      ...   ...      ...   ...       ...       ...      ...
11157  0.09      1.0       0.00      0.0      1.0   0.0      0.0       1.0      0.0  0.19     0.08  0.63  6.57e-02      0.00     1.00
11158  0.64      0.5       0.33      0.0      0.0   0.0      1.0       1.0      0.0  0.27     0.09  0.50  2.09e-02      0.05     1.00
11159  0.82      1.0       0.33      0.0      0.0   0.0      0.0       1.0      0.0  0.18     0.08  0.60  3.97e-02      0.02     1.00
11160  0.82      0.5       0.33      0.0      0.0   1.0      0.0       0.0      0.0  0.32     0.08  0.23  1.80e-03      0.02     0.17
11161  0.82      0.5       0.33      0.0      0.0   0.0      0.0       1.0      0.0  0.21     0.08  0.27  1.61e-01      0.00     1.00

11162 rows × 17 columns
Building a Model
final_df.isnull().sum()
job                0
marital            0
education          0
default            0
housing            0
loan               0
contact            0
poutcome           0
deposit            0
age                0
balance            0
day                0
duration           0
campaign           0
Recency            0
previous           0
month_numerical    0
dtype: int64
final_df.columns
Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
'poutcome', 'deposit', 'age', 'balance', 'day', 'duration', 'campaign',
'Recency', 'previous', 'month_numerical'],
dtype='object')
# Separate target variable from independent variables
y = final_df['deposit']
X = final_df.drop(columns=['deposit'])
print(X.shape)
print(y.shape)
(11162, 16)
(11162,)
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(8371, 16)
(8371,)
(2791, 16)
(2791,)
from sklearn.ensemble import RandomForestClassifier
# Initialize and fit the RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000)
model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=1000)
feature_importances = pd.DataFrame({
'features': X_train.columns, # Feature names from X_train
'importance': model.feature_importances_ # Feature importance values from the fitted model
})
# Sorting the feature importance in descending order
feature_importances = feature_importances.sort_values(by='importance', ascending=False).reset_index(drop=True)
# Display the result
print(feature_importances)
           features  importance
0          duration    3.73e-01
1           balance    8.41e-02
2   month_numerical    8.32e-02
3               age    8.10e-02
4               day    7.74e-02
5           Recency    4.87e-02
6           contact    4.20e-02
7               job    3.75e-02
8          campaign    3.39e-02
9          poutcome    3.31e-02
10          housing    2.89e-02
11         previous    2.57e-02
12        education    2.29e-02
13          marital    1.78e-02
14             loan    9.84e-03
15          default    1.34e-03
feature_importances
           features  importance
0          duration    3.73e-01
1           balance    8.41e-02
2   month_numerical    8.32e-02
3               age    8.10e-02
4               day    7.74e-02
5           Recency    4.87e-02
6           contact    4.20e-02
7               job    3.75e-02
8          campaign    3.39e-02
9          poutcome    3.31e-02
10          housing    2.89e-02
11         previous    2.57e-02
12        education    2.29e-02
13          marital    1.78e-02
14             loan    9.84e-03
15          default    1.34e-03
The feature importance analysis provides insights into the key drivers of the model's predictions. The most impactful feature is
duration (0.373), indicating that the length of a client's last interaction has the strongest influence on the predicted likelihood of
subscribing. Other important factors include balance (0.0841), month_numerical (0.0832), and age (0.0810), highlighting the
significance of a client's financial standing, the time of year, and age in predicting whether a client takes a term deposit. The least
influential features are loan (0.00984) and default (0.00134), suggesting they have minimal effect on the model's predictions. Overall,
the analysis underscores the critical role of interaction time (duration) and financial factors in shaping the model's decisions, while
personal demographics such as education, marital status, and job contribute moderately. This ranking helps guide focus areas for
further model improvement and potential business strategies.
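Impurity-based importances like the ones above can overstate high-cardinality features such as duration and balance. A permutation check on the held-out set is a common sanity test; a minimal sketch, assuming the fitted model, X_test, and y_test from the cells above:
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the drop in score;
# features whose shuffling hurts accuracy the most matter the most.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_df = (pd.DataFrame({'features': X_test.columns,
                         'importance': perm.importances_mean})
             .sort_values(by='importance', ascending=False)
             .reset_index(drop=True))
print(perm_df)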
plt.figure(figsize=(15, 25))
plt.title('Feature Importances')
plt.barh(range(len(feature_importances)), feature_importances['importance'], color='b', align='center')
plt.yticks(range(len(feature_importances)), feature_importances['features'])
plt.xlabel('Importance')
plt.show()
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Predicting the target values for the test set
predictions = model.predict(X_test)
# Calculating regression metrics
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5 # or use sklearn.metrics.mean_squared_error with squared=False
r2 = r2_score(y_test, predictions)
# Printing the metrics
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
Mean Absolute Error: 0.156
Mean Squared Error: 0.156
Root Mean Squared Error: 0.395
R-squared: 0.376
The model's performance metrics indicate moderate predictive accuracy when read as regression diagnostics. The Mean Absolute Error
(MAE) of 0.156 reflects the average difference between the predicted and actual labels; because both predictions and labels are 0/1, every
error is 0 or 1 and the Mean Squared Error (MSE) takes the same value. The Root Mean Squared Error (RMSE) of 0.395 is the typical
deviation of a prediction from the true 0/1 label, and the R-squared value of 0.376 suggests that about 37.6% of the variability in the
target is explained by the model. Since deposit is binary, these regression metrics are only a rough diagnostic; the classification metrics
computed next are more informative.
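Because the target is binary, probability-based metrics are usually more telling than MAE/MSE. A sketch using the classifier's predicted probabilities (same model, X_test, and y_test as above):
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

# Predicted probability of the positive class (deposit = yes)
proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC:     {roc_auc_score(y_test, proba):.3f}")     # ranking quality
print(f"Brier score: {brier_score_loss(y_test, proba):.3f}")  # calibration + accuracy
print(f"Log loss:    {log_loss(y_test, proba):.3f}")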
predictions = model.predict(X_test)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
y_test.value_counts()
deposit
0.0    1450
1.0    1341
Name: count, dtype: int64
print(f"True positives: {tp}")
print(f"False positives: {fp}")
print(f"True negatives: {tn}")
print(f"False negatives: {fn}\n")
print(f"Accuracy: {metrics.accuracy_score(y_test, predictions)}")
print(f"Precision: {metrics.precision_score(y_test, predictions)}")
print(f"Recall: {metrics.recall_score(y_test, predictions)}")
True positives: 1173
False positives: 267
True negatives: 1183
False negatives: 168
Accuracy: 0.8441
Precision: 0.8146
Recall: 0.8747
The model demonstrated strong performance in classifying the target variable, as evidenced by an accuracy of 84.41%, meaning it
correctly predicted the outcome for the majority of cases. With 1173 true positives and 1183 true negatives, the model accurately
identified both positive and negative classes. The precision of 81.46% indicates that when the model predicted a positive outcome, it was
correct 81.46% of the time, while the recall of 87.47% shows that it correctly identified 87.47% of all actual positives. Despite some false
positives (267) and false negatives (168), the model's balance between precision and recall suggests reliable performance, though
further tuning could reduce misclassifications.
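One inexpensive way to rebalance those misclassifications without retraining is to move the decision threshold away from the default 0.5. A minimal sketch, assuming the same fitted model and test split:
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]
# Raising the threshold trades recall for precision; lowering it does the reverse.
for threshold in (0.4, 0.5, 0.6):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")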
Hyperparameter Tuning
important_features = ['duration','balance','month_numerical','age','day','Recency','contact','job','campaign',
                      'poutcome','housing','previous','education','marital','loan']
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Assuming 'X' is your feature matrix and 'y' is your target variable
# Replace 'X' and 'y' with your actual feature matrix and target variable
# Assuming you already have your dataset loaded into 'data'
# Create a DataFrame with important features
important_features = ['duration','balance','month_numerical','age','day','Recency','contact','job','campaign',
'poutcome','housing','previous','education','marital','loan']
# Make a copy of our data
train_df = final_df.copy()
data_subset = train_df[important_features]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_subset, y, test_size=0.2, random_state=42)
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Train the model
rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = rf_model.score(X_test, y_test)
print("Accuracy:", accuracy)
Accuracy: ...
from sklearn.model_selection import GridSearchCV
# Define the hyperparameters grid to search
param_grid = {
'n_estimators': [1000, 1500, 1100],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
# Initialize Random Forest classifier
rf_model = RandomForestClassifier()
# Initialize GridSearchCV with the Random Forest classifier and the hyperparameters grid
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Perform grid search to find the best hyperparameters
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
# Train a new Random Forest model using the best hyperparameters
best_rf_model = RandomForestClassifier(**best_params)
best_rf_model.fit(X_train, y_train)
# Evaluate the model
accuracy = best_rf_model.score(X_test, y_test)
print("Accuracy after hyperparameter tuning:", accuracy)
Accuracy after hyperparameter tuning: ...
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Predicting the target values for the test set
predictions = best_rf_model.predict(X_test)
# Calculating regression metrics
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5 # or use sklearn.metrics.mean_squared_error with squared=False
r2 = r2_score(y_test, predictions)
# Printing the metrics
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
Mean Absolute Error: ...
Mean Squared Error: ...
Root Mean Squared Error: ...
R-squared: ...
import pickle
# Save the tuned model and scaler
with open('best_rf_model.pkl', 'wb') as file:
    pickle.dump((best_rf_model, scaler), file)
best_params
{'bootstrap': True,
'max_depth': 20,
'min_samples_leaf': 2,
'min_samples_split': 5,
'n_estimators': 1000}
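For completeness, a sketch of reading the pickled artifacts back for inference. It assumes the best_rf_model.pkl file written above, which bundles the tuned model with the MinMaxScaler fitted earlier (note the scaler was fitted on the full 17-column frame, so raw inputs would need to be arranged the same way before scaling):
import pickle

# Load the (model, scaler) tuple saved above
with open('best_rf_model.pkl', 'rb') as file:
    loaded_model, loaded_scaler = pickle.load(file)

# X_test here is already scaled, so it can be fed to the model directly
print(loaded_model.predict(X_test.iloc[:5]))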
A/B Testing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import scipy.stats as stats
# Load the dataset
data = pd.read_csv('C:/Users/Collins PC/Downloads/bank_marketting_campaign/bank.csv')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Keep track of the original test indices
X_test_original = X_test.copy() # This keeps the test set as a DataFrame to work with indices
# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize and train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train) # Train the model
# Initialize the model
# A/B Testing: Contact Method
# Use the original X_test (before scaling) to identify the 'contact' column values
contact_test = data.loc[X_test_original.index, 'contact'] # Get 'contact' column values for the test set
group_a_idx = contact_test[contact_test == 'cellular'].index # Indices for 'cellular'
group_b_idx = contact_test[contact_test == 'unknown'].index # Indices for 'unknown'
# Get the actual subscription outcomes (y_test) for each group
actual_a = y_test.loc[group_a_idx]
actual_b = y_test.loc[group_b_idx]
# Calculate the mean subscription rate for each group
subscription_rate_a = actual_a.mean()
subscription_rate_b = actual_b.mean()
# Perform t-test to compare subscription rates
t_stat, p_value = stats.ttest_ind(actual_a, actual_b)
# Print results
print(f"Subscription Rate (Cellular): {subscription_rate_a:.4f}")
print(f"Subscription Rate (Unknown): {subscription_rate_b:.4f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpret results
if p_value < 0.05:  # Assuming a significance level of 0.05
    print("There is a statistically significant difference in subscription rates between the two contact methods.")
else:
    print("There is no statistically significant difference in subscription rates between the two contact methods.")
Subscription Rate (Cellular): 0.5459
Subscription Rate (Unknown): 0.2363
T-statistic: 12.4389
P-value: 0.0000
There is a statistically significant difference in subscription rates between the two contact methods.
Key Findings:
Subscription Rate for "Cellular" Contact Method:
The average subscription rate for individuals contacted via cellular was 54.59%.
Subscription Rate for "Unknown" Contact Method:
The average subscription rate for individuals with an unknown contact method was 23.63%.
Statistical Test Results:
A t-test was performed to compare the subscription rates between the two contact methods.
The t-statistic was 12.44, with a p-value of 0.0000 (rounded).
Since the p-value is significantly below the conventional threshold of 0.05, this indicates that the difference in subscription rates
between the two groups is statistically significant.
Conclusion:
The A/B test results suggest that the contact method has a substantial impact on subscription rates. Specifically, individuals contacted via
cellular had a significantly higher likelihood of subscribing compared to those for whom the contact method was unknown. This difference
is unlikely to be due to random chance (p < 0.05).
Recommendation:
Based on these findings, it is advisable for the organization to prioritize or increase efforts toward using cellular contact methods for
customer outreach, as this appears to lead to higher subscription rates.
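Since the outcome is binary, a two-proportion z-test is the textbook alternative to the t-test used above and should give an essentially identical conclusion here. A sketch with statsmodels, reusing actual_a and actual_b from the A/B-testing cell:
from statsmodels.stats.proportion import proportions_ztest

# Number of subscribers and group sizes for cellular vs unknown contact
successes = [int(actual_a.sum()), int(actual_b.sum())]
observations = [len(actual_a), len(actual_b)]

z_stat, p_val = proportions_ztest(successes, observations)
print(f"z-statistic: {z_stat:.4f}, p-value: {p_val:.4f}")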