Loan Default Prediction Enhancement

10/11/2023

FINAL PROJECT
PREDICTING LOAN DEFAULTERS FOR WELLS FARGO

Mobasshir Hasan, Ruhani Arora

Executive Summary

Business Problem: The financial institution faces the pivotal challenge of differentiating between potential loan defaulters and genuine loan takers. An initial survey of our data shows a significant imbalance between these cohorts. This imbalance not only jeopardizes the robustness of our predictions but also poses substantial risks to the institution's financial health.

Objective: The primary aim was to implement an analytical model that could reliably discern potential defaulters, thereby supporting informed loan approvals. This objective was twofold: first, to preprocess and balance the skewed dataset and, second, to pinpoint the major factors influencing loan default tendencies.

Approach: We commenced with rigorous data preprocessing, converting variables to their correct data types and handling missing values. Recognizing the stark imbalance in our target variable, we employed oversampling to obtain a balanced dataset. The heart of our approach was the decision tree model, which, once trained on the balanced data, provided insights into the most impactful variables.

Our analysis of loan default prediction with the decision tree model has yielded valuable insights into the factors that influence loan defaults and the potential financial implications of using this model in its current form. The objective was to predict and prevent loan defaults, thereby safeguarding the bank's interests.

Key Influencing Factors:
• DAYS_BIRTH (Age of the Applicant): Our analysis indicates that younger applicants exhibit a heightened risk of loan defaults.
• OCCUPATION_TYPE: Certain professional categories, especially laborers, demonstrated a notably higher propensity to default.
• CODE_GENDER: Gender surfaced as a significant influencer in our model, indicating that socioeconomic factors linked with gender can affect loan repayment behaviors.
• ORGANIZATION_TYPE and CNT_CHILDREN: The nature of an applicant's professional affiliation and their number of dependents further refine our understanding of default risks.

Financial Implications:
• Potential Savings: Defaults that the model correctly flags represent potential savings of approximately $22 million.
• Potential Loss: Once the costs of false negatives and false positives are subtracted, the model in its current state leads to an estimated net loss of $6.29 million.

Recommendation:
• Before full-scale implementation, the model requires further refinement and validation to reduce false predictions.
• Enhance the model by incorporating additional data or employing more advanced techniques.
• Regularly review and update the model to adapt to changing customer behaviors and economic conditions.

Analysis

1. Data Preprocessing

(Note: for the Analysis part (data type conversion, handling missing values, data partition, balancing the data, data visualization, decision tree, confusion matrix), refer to the source code, credit.R file.)

Packages We Used: In this analysis, we leveraged several R packages tailored for specific functions:
• rpart: This was instrumental for building the decision tree models.
• rpart.plot: Enabled us to visualize the trees effectively.
• readxl: This was vital for importing data from Excel files.
• ROSE: Addressed the dataset's imbalance through its oversampling tools.
• dplyr: A key tool we used for data manipulation, making tasks like filtering and grouping seamless.

Data Type Conversion: Upon importing the data, we noticed some variables were not correctly recognized. To rectify this, we converted them to their proper types, ensuring:
• Accurate numerical operations.
• Appropriate handling during the modeling phase.
• Correct visualization output.

Handling Missing Values: To ensure the integrity of the dataset, we addressed missing values. For numeric data we filled gaps with the median because of its resistance to outliers. For categorical data we used the mode, so that imputed values fall into the most common category.

Data Partition: We partitioned the data so that the model could be trained on one portion and evaluated on another. 80% was allocated for training, with the remaining 20% reserved for testing. By setting the random seed to 123, we ensured this split could be consistently replicated.

2. Addressing Data Imbalance:

The Imbalance Challenge: We identified a notable class imbalance in our dataset. Recognizing the issues this could pose for our machine-learning model, we took steps to address it (see Appendix A).

Oversampling Technique: We applied oversampling from the ROSE package so that our model was not biased towards the majority class. This approach balanced our training data, improving the model's potential to recognize defaulters. (The output after addressing the data imbalance is shown in Appendix B.)

3. Data Visualization Insights:

Understanding the nature and relationships within our data was paramount. To achieve this, we utilized:

Histogram: Distribution of Ages of Loan Applicants (Refer to Appendix C)
This histogram (Appendix C) illustrates the age distribution of loan applicants, with most falling between 40 and 55 years (represented by -15000 to -20000 days on the x-axis). The data reveals a steady uptick in applicants from young adulthood, peaking in middle age and then sharply declining after the 55-year mark, suggesting fewer applications from older individuals, possibly due to factors like retirement. The far left suggests a smaller number of very young applicants. Overall, the distribution suggests that applications come predominantly from middle-aged individuals, plausibly driven by mid-life financial needs, while other age groups might have different financial stability or obligations influencing their loan-seeking behavior.

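
The histogram's x-axis uses DAYS_BIRTH, which the text treats as a negative day count. As a minimal sketch of how those axis values map to ages (written in Python purely for illustration, not taken from the authors' credit.R; the helper name days_to_years and the 365.25-day year are assumptions), the conversion is a single division:

```python
# Assumption: DAYS_BIRTH is a negative number of days, as implied by the
# "-15000 to -20000 days" axis range quoted in the text.
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def days_to_years(days_birth: int) -> float:
    """Convert a (negative) DAYS_BIRTH value to an approximate age in years."""
    return abs(days_birth) / DAYS_PER_YEAR


# Axis values quoted in the report.
for days in (-15000, -20000):
    print(f"{days} days -> about {days_to_years(days):.1f} years")
```

Under these assumptions the two axis values land at roughly 41 and 55 years, consistent with the 40 to 55 range described for the histogram.
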
Scatter Plot: Relationship between Age and Employment Duration by Gender (Refer to Appendix D)
The scatter plot (Appendix D) showcases loan applicants' age and employment duration, differentiated by gender. Most data points align around the 0 mark on the Days Employed axis, indicating that many applicants either recently started working or lack employment data. Ages primarily cluster around the -25000 mark, or about 68 years old. Females, represented by pink/red dots, are denser in the data than males, with a notable number near the 0 Days Employed line. There is no discernible correlation between age and employment length. The chart underscores the dominance of female applicants and the presence of the unspecified "XNA" gender category, and it suggests that the data around the 0-employment mark warrants further examination.

Bar Chart: Distribution of Loan Applicants by Gender (Refer to Appendix E)
The bar chart (Appendix E) reveals a clear gender distribution among loan applicants. Female applicants substantially outnumber males, nearing 200,000, while males are around the 100,000 mark. A tiny fraction labeled "XNA" might indicate missing data or undisclosed gender. This dominance of female applicants underscores their significant presence in our dataset and can influence strategies in marketing and product offerings tailored to customer demographics.

Bar Chart: Distribution of Loan Applicants by Occupation Type (Refer to Appendix F)
The bar chart (Appendix F) visualizes loan applicants' distribution based on their occupation types (OCCUPATION_TYPE). "Laborers" represent the largest group of loan applicants, significantly outnumbering other professions. There is a diverse mix of occupations seeking loans, ranging from "Accountants" to more niche roles like "Waiters/barmen staff." While certain jobs, such as "Cleaning staff" and "Drivers," have a noticeable presence, others like "HR staff" and "IT staff" are less represented.
The dominance of the "Laborers" category provides an insightful glimpse into the professional backgrounds of those applying for loans.

4. Decision Tree Modeling Insights:

In our analysis (refer to Appendix G & H), we employed a decision tree to discern patterns that might indicate an individual's likelihood to default on a loan. The decision tree model was constructed using the rpart package in R, and specific control parameters were set for the tree's formation. The goal was to ensure that the tree was neither too complex (which would overfit the training data) nor too simplistic (which would underfit and be less informative).

Upon visualizing the constructed decision tree, it became evident to us how certain attributes play a pivotal role in determining an individual's propensity to default on a loan. We also printed full_tree$variable.importance to see which variables are the most significant and whether they align with our decision tree (Refer to Appendix I).

Significant variables to take into account are:

DAYS_BIRTH: The topmost node of the tree splits on the DAYS_BIRTH attribute. Observing the splits, we noted a clear distinction between the two age cohorts. This shows that age is a primary determinant when it comes to predicting loan defaults.

CODE_GENDER & OCCUPATION_TYPE: As we delved deeper into the branches, the gender attribute, specifically the 'XNA' category, grabbed our attention. A total of 108,773 individuals fell into this category, out of which 26,002 defaulted on their loans. Furthermore, the occupation of an applicant seemed to sway the outcome significantly. For instance, occupations like "Accountants," "Core staff," and "Managers" formed one subset, whereas another contrasting group comprised occupations such as "Cleaning staff" and "Drivers." This indicates that the profession plays a non-trivial role in influencing loan default tendencies.

NAME_FAMILY_STATUS: Venturing further, we observed that marital or family status was yet another potent determinant. The tree prominently differentiated between those who are "Married," with 40,867 defaulters out of 138,401 in this category, and those with other statuses like "Civil marriage" or "Single." These splits reinforced the idea that an applicant's familial status can impact their likelihood of default.

Each leaf node of the tree provided a clear count of defaulters and non-defaulters. This granularity allowed us to gauge the predominant outcome for specific criteria combinations.

Metrics Calculations and Interpretations (Refer to Appendix J):
• Error Rate: (FP + FN) / Total = (21644 + 2186) / 61503 = 0.387
  Our model misclassifies 38.7% of samples, indicating potential financial losses either by approving likely defaulters or denying creditworthy individuals.
• Benchmark Error Rate: Proportion of actual "1"s = 4938 / 61503 = 0.080
  A naive classifier that always predicts the most frequent class ("No Default") has an error rate of only 8%, which outperforms our model, suggesting potential overfitting or other model inefficiencies.
• Sensitivity: TP / (TP + FN) = 2752 / (2752 + 2186) = 0.557
  Our model only detects 55.7% of actual defaulters, hinting at possible financial losses from undetected defaults.
• Specificity: TN / (TN + FP) = 34921 / (34921 + 21644) = 0.617
  The model correctly identifies 61.7% of non-defaulters. However, a 38.3% false positive rate suggests missed business opportunities and potential dissatisfaction among valid applicants.

Analysis and Interpretation: The model's error rate stands at 38.7%, starkly contrasting with the benchmark rate of 8.0%. This discrepancy raises concerns about potential overfitting or underfitting in the model design, emphasizing the need to re-evaluate our chosen features.

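
The four metrics above follow directly from the confusion-matrix counts quoted in the text (TP = 2752, FN = 2186, TN = 34921, FP = 21644). The following is a minimal Python sketch for recomputing them, not the authors' credit.R code; the class name ConfusionMatrix and its property names are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier where the positive class is 'default'."""
    tp: int  # defaults correctly flagged
    fn: int  # defaults missed
    tn: int  # non-defaults correctly passed
    fp: int  # non-defaults wrongly flagged

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.tn + self.fp

    @property
    def error_rate(self) -> float:
        return (self.fp + self.fn) / self.total

    @property
    def benchmark_error_rate(self) -> float:
        # Error of always predicting "No Default": every actual default is missed.
        return (self.tp + self.fn) / self.total

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


# Counts as quoted in the report's metric formulas.
cm = ConfusionMatrix(tp=2752, fn=2186, tn=34921, fp=21644)
for name in ("error_rate", "benchmark_error_rate", "sensitivity", "specificity"):
    print(f"{name}: {getattr(cm, name):.3f}")
```

Keeping the counts in one place makes it easy to re-run the same checks whenever the tree is re-fitted or pruned differently.
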
Using such a model in a loan scenario could result in financially detrimental decisions, such as approving likely defaulters or denying valid applicants. Given this, we should consider alternative modeling techniques, potentially leveraging Random Forest, Logistic Regression, ensemble methods, or other algorithms better tailored to the dataset's characteristics.

Model Evaluation (Business Model):
• True Positives (TP): Correctly predicted defaults - Loans that the bank would have approved without our model but that would have ended in default. By predicting these, the bank saves money.
• True Negatives (TN): Correctly predicted non-defaults - Loans that are safe and the bank should approve.
• False Positives (FP): Incorrectly predicted defaults - Loans that the bank would decline based on our model, potentially losing a loyal and creditworthy customer.
• False Negatives (FN): Missed defaults - Loans that the bank would approve, even with our model, leading to an unexpected default.

Let's make some assumptions:
1. Average Loan Amount: $10,000, Interest Rate: 5%
2. Expected Profit per Loan: 5% of $10,000 = $500 (ignoring costs for simplicity)
3. Loss if a customer defaults: 80% of loan amount = $8,000 (assuming only 20% is recovered)
4. Cost of False Positive: Lost profit due to declining a creditworthy customer = $500

Business Value Calculation:
1. Savings from True Positives: Our model correctly identified these defaults, so the bank did not sanction these loans.
   Savings from TP = TP * Loss if a customer defaults
   Savings from TP = 2,752 * $8,000 = $22,016,000
2. Loss from False Negatives: Our model missed these defaults, so the bank approved these loans and faced a default.
   Loss from FN = FN * Loss if a customer defaults
   Loss from FN = 2,186 * $8,000 = $17,488,000
3. Loss from False Positives: Our model incorrectly classified these good loans as defaults, so the bank lost out on the profit.
   Loss from FP = FP * Cost of False Positive
   Loss from FP = 21,644 * $500 = $10,822,000

Net Business Value:
• Net Value = Savings from TP - Loss from FN - Loss from FP
• Net Value = $22,016,000 - $17,488,000 - $10,822,000 = -$6,294,000

Conclusion: Using the model in its current state would result in an estimated net loss of $6,294,000, primarily due to a high number of False Positives and False Negatives. This financial metric provides a tangible perspective on the model's implications: relying on the model as it stands could lead to significant financial losses and customer dissatisfaction. It is imperative to refine the model, considering the points highlighted above, before contemplating its real-world implementation.

Recommendations & Next Steps:

1. Model Diversity and Robustness:
   • Different Techniques: We should explore alternative techniques such as Logistic Regression and ensemble methods like Random Forest. Ensemble methods often perform better than individual models, especially on complex datasets.
   • Cross-Validation: To ensure our model's robustness, it is vital to employ cross-validation techniques, ensuring it generalizes well across different sets of data.
2. Enhancing Model Sensitivity:
   • Reducing False Negatives: Given the significant financial implications of false negatives, it is crucial to focus on improving the model's sensitivity.
3. Data Re-examination and Feature Engineering:
   • Regular Feature Importance Analysis: By regularly evaluating the significance of our variables, we can ensure the model is capturing the most impactful factors, leading to better predictive power.
   • Reassess and Introduce Variables: We should continuously re-evaluate our existing variables and consider introducing new features, enhancing our model's capability to capture loan default nuances.

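
The net business value from the Business Value Calculation, and the reason the second recommendation targets false negatives, can be checked with a few lines of arithmetic. This is a minimal Python sketch under the report's stated assumptions ($8,000 loss per default, $500 lost profit per false positive); the function name net_business_value and the 100-loan what-if scenario are hypothetical illustrations, not results from the authors' analysis:

```python
def net_business_value(
    tp: int,
    fn: int,
    fp: int,
    loss_per_default: int = 8_000,       # report assumption: 80% of a $10,000 loan
    cost_per_false_positive: int = 500,  # report assumption: 5% profit on $10,000
) -> int:
    """Net Value = Savings from TP - Loss from FN - Loss from FP."""
    savings_tp = tp * loss_per_default
    loss_fn = fn * loss_per_default
    loss_fp = fp * cost_per_false_positive
    return savings_tp - loss_fn - loss_fp


# Counts quoted in the report.
current = net_business_value(tp=2752, fn=2186, fp=21644)
print(f"Current net value: {current:,}")

# Hypothetical what-if: 100 defaults that were missed are caught instead,
# with the false positive count held fixed.
what_if = net_business_value(tp=2752 + 100, fn=2186 - 100, fp=21644)
print(f"Change from catching 100 more defaults: {what_if - current:,}")
```

Under these assumptions, each default moved from the false negative column to the true positive column changes the net value by twice the per-default loss, because it both adds a saving and removes a loss.
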
APPENDIX

Dataset Source Link: https://www.kaggle.com/datasets/mishra5001/credit-card

Appendix A
Appendix B
Appendix C: Histogram: Distribution of Ages of Loan Applicants
Appendix D: Scatter Plot: Relationship between Age and Employment Duration by Gender
Appendix E: Bar Chart: Distribution of Loan Applicants by Gender
Appendix F: Bar Chart: Distribution of Loan Applicants by Occupation Type
Appendix G: Decision tree at maxdepth = 3
Appendix H: Pruned tree
Appendix I: Table according to the variable importance
Appendix J: Confusion Matrix