Loan Default Prediction Enhancement
10/11/2023
FINAL PROJECT
PREDICTING LOAN DEFAULTERS FOR WELLS FARGO
Mobasshir Hasan, Ruhani Arora
Executive Summary
Business Problem: The financial institution faces the pivotal challenge of differentiating between potential
loan defaulters and genuine loan takers. An initial survey of our data shows a significant imbalance
between these cohorts. This imbalance not only jeopardizes the robustness of our predictions but also
poses substantial risks to the institution's financial health.
Objective: The primary aim was to implement an analytical model that could reliably discern potential
defaulters, thereby assisting in making informed loan approvals. This objective was twofold: firstly, to
preprocess and balance the skewed dataset and, secondly, to pinpoint the major influencing factors
dictating loan default tendencies.
Approach: We commenced with rigorous data preprocessing, converting variables to their rightful data
types and handling missing values. Recognizing the stark imbalance in our target variable, we employed
oversampling to obtain a balanced dataset. The heart of our approach was the decision tree model, which
provided insights into the most impactful variables after being fed the balanced data.
Our recent analysis of loan default prediction, utilizing a decision tree model, has yielded valuable insights
into the factors that influence loan defaults and the potential financial implications of using this model in
its current form. The objective was to predict and prevent loan defaults, thereby safeguarding the bank's
interests.
Key Influencing Factors:
• DAYS_BIRTH (Age of the Applicant): Our analysis indicates that younger applicants exhibit a heightened risk of loan defaults.
• OCCUPATION_TYPE: Certain professional categories, especially laborers, demonstrated a notably higher propensity to default.
• CODE_GENDER: Gender surfaced as a significant influencer in our model, indicating that socioeconomic factors linked with gender can affect loan repayment behaviors.
• ORGANIZATION_TYPE and CNT_CHILDREN: The nature of an applicant's professional affiliation and their number of dependents further refine our understanding of default risks.
Financial Implications:
• Potential Savings: By declining the defaults that the model correctly flags in the test set, the bank could save approximately $22 million under our cost assumptions.
• Potential Loss: Once the losses from false negatives and false positives are subtracted, the model in its current state leads to an estimated net loss of $6.29 million.
Recommendation:
• Before full-scale implementation, the model requires further refinement and validation to reduce false predictions.
• Enhance the model by incorporating additional data or employing more advanced techniques.
• Regularly review and update the model to adapt to changing customer behaviors and economic conditions.
Analysis
1. Data Preprocessing
(Note: for the Analysis part (data type conversion, handling missing values, data partition, balancing
the data, Data visualization, Decision tree, confusion Matrix), refer to the source code, credit.R file.)
Packages We Used: In this analysis, we leveraged several R packages tailored for specific functions:
• rpart: This was instrumental for building the decision tree models.
• rpart.plot: Enabled us to visualize the trees effectively.
• readxl: This was vital for importing data from Excel files.
• ROSE: Addressed the dataset's imbalance through its oversampling tools.
• dplyr: A key tool we used for data manipulation, making tasks like filtering and grouping seamless.
Data Type Conversion: Upon importing the data, we noticed some variables were not correctly recognized.
To rectify this, we meticulously converted them to their proper data types, ensuring:
• Accurate numerical operations.
• Appropriate handling during the modeling phase.
• Correct visualization output.
Handling Missing Values: To ensure the integrity of the dataset, we addressed missing values. We used
the median to fill gaps in numeric data because of its resistance to outliers. For categorical data we
used the mode, which keeps every imputed value within the categories already observed.
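The imputation rule described above can be illustrated with a minimal sketch. The project's own code is in R (credit.R); this is a separate Python illustration, and the helper name `impute` and the toy columns are hypothetical, not taken from the dataset.

```python
from statistics import median, mode

def impute(values, numeric):
    """Fill None entries with the median (numeric) or the mode (categorical)
    of the values that are actually observed."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, used only to show the behavior.
incomes = [40000, None, 52000, 1000000, 48000]          # one extreme outlier
occupations = ["Laborers", None, "Drivers", "Laborers"]

filled_incomes = impute(incomes, numeric=True)
filled_occupations = impute(occupations, numeric=False)

print(filled_incomes)
print(filled_occupations)
```

In the toy income column the single very large value would pull a mean far upward, while the median stays near the typical incomes, which is the outlier resistance mentioned above.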
Data Partition: We split the data into a training set and a test set so that the model could be evaluated
on records it had not seen during training. 80% was allocated for training, with the remaining 20% reserved
for testing. By setting the random seed to 123, we ensured this split could be consistently replicated.
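A seeded 80/20 split can be sketched as follows. This is a Python illustration rather than the R code in credit.R; the function name `train_test_split` and the stand-in rows are hypothetical, and seed 123 in Python's generator will not reproduce the same rows as the same seed in R.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=123):
    """Shuffle row positions with a fixed seed, then cut at train_fraction."""
    rng = random.Random(seed)          # private generator, so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(100))                # hypothetical stand-in for loan records
train, test = train_test_split(rows)
again_train, again_test = train_test_split(rows)

print(len(train), len(test))
print(train == again_train)            # the same seed gives the same partition
```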
2. Addressing Data Imbalance:
The Imbalance Challenge: We identified a notable class imbalance in our dataset. Recognizing the issues
this could pose for our machine-learning model, we took steps to address it (see Appendix A).
Oversampling Technique: We applied oversampling from the ROSE package so that our model would not be biased
towards the majority class. This approach balanced our training data, improving the model's ability to
learn the minority (default) class. (The output after addressing the data imbalance is shown in Appendix B).
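The idea behind oversampling can be shown with a plain random-oversampling sketch: minority-class rows are resampled with replacement until both classes are the same size. This is an illustration in Python and not the ROSE implementation used in the project; the function name, the `default` field, and the toy rows are assumptions.

```python
import random
from collections import Counter

def oversample_minority(rows, label_of, seed=123):
    """Resample smaller classes with replacement until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)                                      # keep originals
        balanced.extend(rng.choices(group, k=target - len(group)))  # add resampled copies
    rng.shuffle(balanced)
    return balanced

# Hypothetical training rows: 8 defaulters (1) and 92 non-defaulters (0).
rows = [{"id": i, "default": 1 if i < 8 else 0} for i in range(100)]
balanced = oversample_minority(rows, lambda r: r["default"])
counts = Counter(r["default"] for r in balanced)

print(counts)
```

Only the training partition should be balanced this way; the test set keeps its original class proportions so that the evaluation reflects real conditions.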
3. Data Visualization Insights:
Understanding the nature and relationships within our data was paramount. To achieve this, we utilized:
Histogram: Distribution of Ages of Loan Applicants (Refer to Appendix C)
This histogram (Appendix C) illustrates the age distribution of loan applicants, with most falling between
40 and 55 years (represented by -15000 to -20000 days on the x-axis). The data reveals a steady uptick in
applicants from young adulthood, peaking in middle age and then sharply declining post the 55-year mark,
suggesting fewer applications from older individuals, possibly due to factors like retirement. The far left
suggests a smaller number of very young applicants. Overall, the distribution suggests that middle-aged
applicants are predominantly driven by mid-life financial needs, while other age groups might have
different financial stability or obligations influencing their loan-seeking behavior.
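Because the x-axis is expressed in days rather than years, a small conversion helps when reading the histogram. The sketch below assumes that DAYS_BIRTH counts days before the application as a negative number, as the axis values suggest, and uses 365.25 days per year as an approximation; the function name is hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

def days_birth_to_years(days_birth):
    """Convert a negative day count (days before application) to age in years."""
    return -days_birth / DAYS_PER_YEAR

ages = {days: round(days_birth_to_years(days), 1) for days in (-15000, -20000, -25000)}
for days, years in ages.items():
    print(days, "days is about", years, "years")
```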
Scatter Plot: Relationship between Age and Employment Duration by Gender (Refer to Appendix D)
The scatter plot (Appendix D) showcases loan applicants' age and employment duration, differentiated by
gender. Most data points align around the 0 mark on the Days Employed axis, indicating many applicants
either recently started working or lack employment data. Ages primarily cluster around the -25000 mark,
or about 68 years old. Females, represented by pink/red dots, are denser in the data than males, with a
notable number near the 0 Days Employed line. There's no discernible correlation between age and
employment length. The chart underscores the dominance of female applicants and the significance of
the unspecified "XNA" gender category, suggesting further examination of data around the 0-employment
mark.
Bar Chart: Distribution of Loan Applicants by Gender (Refer to Appendix E)
The bar chart (Appendix E) reveals a clear gender distribution among loan applicants. Female applicants
substantially outnumber males, nearing 200,000, while males are around the 100,000 mark. A tiny fraction
labeled "XNA" might indicate missing data or undisclosed gender. This dominance of female applicants
underscores their significant presence in our dataset and can influence strategies in marketing and product
offerings tailored to customer demographics.
Bar Chart: Distribution of Loan Applicants by Occupation Type (Refer to Appendix F)
The bar chart (Appendix F) visualizes loan applicants' distribution based on their occupation types
(OCCUPATION_TYPE). Evidently, "Laborers" represent the largest group of loan applicants, significantly
outnumbering other professions. There's a diverse mix of occupations seeking loans, ranging from
"Accountants" to more niche roles like "Waiters/barmen staff." While certain jobs, such as "Cleaning staff"
and "Drivers," have a noticeable presence, others like "HR staff" and "IT staff" are less represented. It's
intriguing to see how dominant the "Laborers" category is, providing an insightful glimpse into the
professional backgrounds of those applying for loans.
4. Decision Tree Modeling Insights:
In our analysis (refer to Appendix G & H), we employed a decision tree to discern patterns that might
indicate an individual's likelihood to default on a loan. The decision tree model was constructed using the
rpart package in R, and specific control parameters were set for the tree's formation. The goal was to
ensure that the tree was neither too complex (which would overfit the training data) nor too simplistic (which
would underfit and be less informative). Upon visualizing the constructed decision tree, it became evident
to us how certain attributes play a pivotal role in determining an individual's propensity to default on a
loan. We also ran print(full_tree$variable.importance) to see which variables are the most
significant and whether they align with our decision tree (Refer to Appendix I). Significant variables
to take into account are:
DAYS_BIRTH: The topmost node of the tree bifurcates based on the DAYS_BIRTH attribute. Observing the
splits, we noted a clear distinction between the two age cohorts. This elucidates that age is a primary
determinant when it comes to predicting loan defaults.
CODE_GENDER & OCCUPATION_TYPE: As we delved deeper into the branches, the gender attribute,
specifically the 'XNA' category, grabbed our attention. A total of 108,773 individuals fell into this category,
out of which 26,002 defaulted on their loans. Furthermore, the occupation of an applicant seemed to sway
the outcome significantly. For instance, occupations like "Accountants," "Core staff," and "Managers"
formed one subset, whereas another contrasting group comprised occupations such as "Cleaning staff"
and "Drivers." This indicates that the profession plays a non-trivial role in influencing loan default
tendencies.
NAME_FAMILY_STATUS: Venturing further, we observed that marital or family status was yet another
potent determinant. The tree prominently differentiated between those who are "Married," with 40,867
defaulters out of 138,401 in this category, and those with other statuses like "Civil marriage" or "Single."
These splits reinforced the idea that an applicant's familial status can impact their likelihood of default.
Each leaf node of the tree provided a clear count of defaulters and non-defaulters. This granularity allowed
us to gauge the predominant outcome for specific criteria combinations.
Metrics Calculations and Interpretations (Refer to Appendix J):
• Error Rate: (FP + FN) / Total = (21644 + 2186) / 61503 = 0.387
Our model misclassifies 38.7% of samples, indicating potential financial losses either by approving likely
defaulters or denying creditworthy individuals.
• Benchmark Error Rate: Proportion of actual "1"s = 4938 / 61503 = 0.080
With an error rate of 8%, simply predicting the most frequent class ("No Default") outperforms our model,
suggesting potential overfitting or other model inefficiencies.
• Sensitivity: TP / (TP + FN) = 2752 / (2752 + 2186) = 0.557
Our model only detects 55.7% of actual defaulters, hinting at possible financial losses from undetected
defaults.
• Specificity: TN / (TN + FP) = 34921 / (34921 + 21644) = 0.617
The model correctly identifies 61.7% of non-defaulters. However, a 38.3% false positive rate suggests
missed business opportunities and potential dissatisfaction among valid applicants.
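The four metrics can be recomputed directly from the confusion-matrix counts reported in Appendix J. The sketch below is a Python check of the arithmetic, not part of the project's R code; the variable names are chosen here for illustration.

```python
# Confusion-matrix counts on the test set, as reported in Appendix J.
TP, FN = 2752, 2186      # actual defaulters: caught / missed
TN, FP = 34921, 21644    # actual non-defaulters: cleared / wrongly flagged

total = TP + FN + TN + FP

error_rate = (FP + FN) / total     # share of all test cases misclassified
benchmark = (TP + FN) / total      # error of always predicting "No Default"
sensitivity = TP / (TP + FN)       # share of actual defaulters detected
specificity = TN / (TN + FP)       # share of non-defaulters correctly cleared

for name, value in [("total", total), ("error rate", error_rate),
                    ("benchmark error rate", benchmark),
                    ("sensitivity", sensitivity), ("specificity", specificity)]:
    print(name, round(value, 3))
```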
Analysis and Interpretation:
The model's error rate stands at 38.7%, starkly contrasting with the benchmark rate of 8.0%. This
discrepancy raises concerns about potential overfitting or underfitting in the model design, emphasizing
the need to re-evaluate our chosen features. Using such a model in a loan scenario could result in
financially detrimental decisions, such as approving likely defaulters or denying valid applicants. Given this,
we should consider alternative modeling techniques, potentially leveraging Random Forest, Logistic
Regression, ensemble methods, or other algorithms better tailored to the dataset's characteristics.
Model Evaluation (Business Model):
• True Positives (TP): Correctly predicted defaults - Loans that the bank would have approved without our model but would lead to default. By predicting these, the bank saves money.
• True Negatives (TN): Correctly predicted non-defaults - Loans that are safe and the bank should approve.
• False Positives (FP): Incorrectly predicted defaults - Loans that the bank would decline based on our model, potentially losing a loyal and creditworthy customer.
• False Negatives (FN): Missed defaults - Loans that the bank would approve, even with our model, leading to an unexpected default.
Let's make some assumptions:
1. Average Loan Amount: $10,000, Interest Rate: 5%
2. Expected Profit per Loan: 5% of $10,000 = $500 (ignoring costs for simplicity)
3. Loss if a customer defaults: 80% of loan amount = $8,000 (assuming only 20% is recovered)
4. Cost of False Positive: Lost profit due to declining a creditworthy customer = $500
Business Value Calculation:
1. Savings from True Positives: Our model correctly identified these defaults, so the bank did not
sanction these loans.
Savings from TP = TP * Loss if a customer defaults
Savings from TP = 2,752 * $8,000 = $22,016,000
2. Loss from False Negatives: Our model missed these defaults, so the bank approved these loans
and faced a default.
Loss from FN = FN * Loss if a customer defaults
Loss from FN = 2,186 * $8,000 = $17,488,000
3. Loss from False Positives: Our model incorrectly classified these good loans as defaults, so the
bank lost out on the profit.
Loss from FP = FP * Cost of False Positive
Loss from FP = 21,644 * $500 = $10,822,000
Net Business Value:
• Net Value = Savings from TP - Loss from FN - Loss from FP
• Net Value = $22,016,000 - $17,488,000 - $10,822,000 = -$6,294,000
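The business-value arithmetic can be reproduced with a short script, which also makes it easy to see how the result changes if the assumptions change. This is a Python sketch of the calculation above; the constant names are chosen here, and the dollar figures are the report's stated assumptions rather than observed bank data.

```python
# Assumptions stated in the report (illustrative, not observed bank figures).
AVERAGE_LOAN = 10_000        # dollars
INTEREST_RATE_PCT = 5        # expected profit as a percentage of the loan
LOSS_ON_DEFAULT_PCT = 80     # share of the loan lost when a customer defaults

profit_per_loan = AVERAGE_LOAN * INTEREST_RATE_PCT // 100      # dollars per good loan
loss_per_default = AVERAGE_LOAN * LOSS_ON_DEFAULT_PCT // 100   # dollars per default

# Test-set confusion-matrix counts from Appendix J.
TP, FN, FP = 2752, 2186, 21644

savings_tp = TP * loss_per_default   # defaults avoided by declining flagged loans
loss_fn = FN * loss_per_default      # defaults the model missed
loss_fp = FP * profit_per_loan       # profit forgone on wrongly declined good loans

net_value = savings_tp - loss_fn - loss_fp

print(f"Savings from TP: ${savings_tp:,}")
print(f"Loss from FN:    ${loss_fn:,}")
print(f"Loss from FP:    ${loss_fp:,}")
print(f"Net value:       ${net_value:,}")
```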
Conclusion:
Using the model in its current state would result in an estimated net loss of $6,294,000, primarily due to
a high number of False Positives and False Negatives. This financial metric provides a tangible perspective
on the model's implications and highlights the urgent need for its refinement before implementation.
Relying on this model could potentially lead to significant financial losses and customer dissatisfaction. It's
imperative to refine the model, considering the points highlighted above, before contemplating its real-world implementation.
Recommendations & Next Steps:
1. Model Diversity and Robustness:
• Different Techniques: We should explore alternative models such as Logistic Regression and ensemble methods like Random Forest. Ensembles often perform better than a single decision tree, especially on complex datasets.
• Cross-Validation: To ensure our model's robustness, it's vital to employ cross-validation techniques, ensuring it generalizes well across different sets of data.
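As an illustration of the cross-validation recommendation, the sketch below generates k-fold index splits: each row is held out for validation exactly once and used for training in the other folds. It is a Python sketch with hypothetical names, not part of the project's code; in practice any oversampling would be applied inside each training fold only.

```python
import random

def k_fold_indices(n_rows, k=5, seed=123):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield training, validation

splits = list(k_fold_indices(n_rows=23, k=5))
for training, validation in splits:
    print(len(training), "training rows,", len(validation), "validation rows")
```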
2. Enhancing Model Sensitivity:
Reducing False Negatives: Given the significant financial implications of false negatives, it's crucial to focus
on improving the model's sensitivity.
3. Data Re-examination and Feature Engineering:
• Regular Feature Importance Analysis: By regularly evaluating the significance of our variables, we can ensure the model is capturing the most impactful factors, leading to better predictive power.
• Reassess and Introduce Variables: We should continuously re-evaluate our existing variables and consider introducing new features, enhancing our model's capability to capture loan default nuances.
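One possible way to act on the sensitivity recommendation is to choose the decision threshold from the cost assumptions used in the business model rather than relying on a default cut-off. The sketch below is an illustration under those assumptions ($500 profit per good loan, $8,000 loss per default); it assumes the model can output a default probability, and the function names are hypothetical. Declining is worthwhile when the expected loss p * 8000 exceeds the expected profit (1 - p) * 500.

```python
PROFIT_PER_GOOD_LOAN = 500    # report assumption: 5% of a $10,000 loan
LOSS_PER_DEFAULT = 8_000      # report assumption: 80% of a $10,000 loan

def break_even_probability(profit, loss):
    """Default probability at which approving and declining have equal expected value.

    Approving earns (1 - p) * profit - p * loss; declining earns 0.
    Setting the two equal gives p = profit / (profit + loss).
    """
    return profit / (profit + loss)

def should_decline(p_default, threshold):
    """Decline when the predicted default probability exceeds the threshold."""
    return p_default > threshold

threshold = break_even_probability(PROFIT_PER_GOOD_LOAN, LOSS_PER_DEFAULT)
print(round(threshold, 4))

# Hypothetical predicted probabilities for three applicants.
decisions = [should_decline(p, threshold) for p in (0.02, 0.06, 0.30)]
print(decisions)
```

Under these assumptions a missed default costs sixteen times as much as a wrongly declined loan, so the break-even probability sits well below 50%. Whether such a low threshold is acceptable depends on how many creditworthy applicants it would turn away.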
APPENDIX:
Dataset Source Link: https://www.kaggle.com/datasets/mishra5001/credit-card
Appendix A
Appendix B
Appendix C: Histogram: Distribution of Ages of Loan Applicants
Appendix D: Scatter Plot: Relationship between Age and Employment Duration by Gender
Appendix E: Bar Chart: Distribution of Loan Applicants by Gender
Appendix F: Bar Chart: Distribution of Loan Applicants by Occupation Type
Appendix G: Decision tree at maxdepth = 3
Appendix H: Pruned tree
Appendix I: Table according to the variable importance
Appendix J: Confusion Matrix
References:
Course Module and resources
General Machine Learning and Predictive Analytics:
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
SMOTE and Imbalanced Learning:
• Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority
Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
https://www.jair.org/index.php/jair/article/view/10302
Credit Scoring and Loan Defaults:
• Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring:
a Review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523-541.
https://academic.oup.com/jrsssa/article/160/3/523/-
• Thomas, L. C. (2000). A survey of credit and behavioral scoring: forecasting financial risk of
lending to consumers. International Journal of Forecasting, 16(2), 149-172.
https://www.sciencedirect.com/science/article/abs/pii/S-