Mobasshir Hasan | Freelancer Financial Inclusion Model Development

Financial Inclusion Model Development

Predict Creditworthiness for Underserved & Small Businesses
Team: The Outlier
Mobasshir Hasan

Executive Summary

Challenge Overview
Addressing the need for a fair, private, and explainable credit model that excludes traditional credit scores, for underserved demographics and small businesses.

Data Integration Objective
Integration of three diverse datasets covering demographics, socioeconomic factors, financial behavior in terms of utility bills, and small business information.

Progression through Models
Evolution from Random Forest (76.53% accuracy) to refined features (77.02%), and a breakthrough with the Stacking Ensemble (87.8% accuracy).

Discovery of Influential Variables
Identification of pivotal factors:
• Sales from small business owners
• Utility Bill - Repayment status
• Amount of previous payment from utility bills

Path Forward & Implementation Call
Recommendations include model refinement, expanded data utilization, and adaptability to changing trends.

Broader Impact & Innovation
With its high accuracy and balanced accuracy, our model minimizes financial risk by accurately identifying creditworthy individuals, thus preventing defaults and optimizing loan portfolio performance.

Our Understanding

Metropolitan Bank
Provides a broad range of business, commercial and personal banking products and services to small and middle-market businesses, public entities and affluent individuals in the New York metropolitan area.

Key Figures
• Underserved Communities: 33.9M
• Small Businesses in the US: 33.2M
• Share of all firms: 99%
• Rejection of credit loans for startups of color: 53%
• Rejection of credit loans for white-owned startups: 27%

Factors Determining Credit Worthiness
• Most small businesses fail due to cash flow problems and a lack of demand for their product or service.
• Startups of color were more likely than white-owned firms to say they were denied because they did not have the necessary documentation required by the lender.
Dataset Selection

Kaggle dataset (17 variables)
• Personal information
• Socioeconomic information
• Loan application context
• Living conditions
Source: https://www.kaggle.com/datasets/mishra5001/credit-card

UCI dataset (23 variables)
• Demographic information
• Utility Bill Statements
• Payment History
• Payment amount
Source: UCI dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. https://www.divaportal.org/smash/get/diva2:829365/FULLTEXT01.pdf

U.S. Census Bureau dataset (19 variables)
• Small Business Sales Details
• Business employment
• Venture Capital Source
• Cash Flow
Source: https://www.census.gov/econ/sbo/

New data set (59 variables, with 1 Target Variable common to each dataset)
• Personal information
• Socioeconomic information
• Loan application context
• Living conditions
• Utility Bill Statements
• Payment amount
• Demographic information
• Small Businesses Information

Reason to use: The variables we chose for our model collectively provide a comprehensive picture of the client's information (a secondary source of data) without considering the traditional metrics used in determining creditworthiness. We also merged these datasets into a single combined_data frame to unify the variables and form a comprehensive view of each applicant's profile.

Data Preprocessing & Balancing

Initial Setup: Essential Tools and Libraries
• Loaded vital R packages: dplyr, ggplot2, caret, ROSE, readxl, and randomForest.
• Set up the working environment and imported datasets.

Data Standardization: Uniformity in Data
• Standardized categorical variables for consistent analysis.
• Converted all data into appropriate formats.

Missing Values: Integrity Through Imputation
• Filled numeric gaps with medians.
• Applied modes to categorical gaps to maintain distribution.

Class Imbalance Addressed: Ensuring Balanced Analysis
• Identified significant class imbalance in the dataset.
• Applied ROSE's oversampling to neutralize bias.
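The imputation and balancing steps above can be sketched in a few lines. The deck's own work was done in R with ROSE; the block below is only a minimal Python illustration under stated assumptions: the column names (income, housing, credit_worth) and the sample rows are hypothetical, and plain random oversampling with replacement stands in for ROSE's oversampling.

```python
import random
from statistics import median, mode


def impute(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None filled: median for numeric columns, mode for categorical ones."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    for col in categorical_cols:
        fill = mode(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def oversample(rows, target, seed=0):
    """Randomly resample smaller classes (with replacement) up to the size of the largest class."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[target], []).append(r)
    largest = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=largest - len(g)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical applicants: four non-creditworthy (0) and one creditworthy (1).
rows = [
    {"income": 52000, "housing": "rent", "credit_worth": 0},
    {"income": None, "housing": "own", "credit_worth": 0},
    {"income": 74000, "housing": None, "credit_worth": 0},
    {"income": 41000, "housing": "rent", "credit_worth": 0},
    {"income": 63000, "housing": "rent", "credit_worth": 1},
]

clean = impute(rows, numeric_cols=["income"], categorical_cols=["housing"])
balanced = oversample(clean, target="credit_worth")
print(len(balanced), sum(r["credit_worth"] for r in balanced))
```

Filling with the median rather than the mean keeps a few very high incomes from pulling the imputed value upward, which matters for the outlier-heavy income distribution described in the appendix.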
Dataset Integration: Comprehensive Data Melding
• Unified data, default, and pums_var into a single dataset.
• Ensured alignment and consistency across all variables.

Model Building

Input datasets: Credit Card Dataset (Kaggle), Utility Bills Dataset, Small Business Dataset
• Data Preprocessing
• Datasets Balancing
• Datasets Integration

Models in the pipeline: Random Forest Classification, Feature Selection, Refined Random Forest, SVM, Logistic Regression Model, GBM, Stacking Ensemble Method

Random Forest Model
• Its robustness against overfitting.
• Its ability to handle a large number of input variables.

Feature Selection
• Identified the most significant predictors.
• Selected a subset of top features.

Stacking Ensemble Method
• Used multiple base models (variations of decision trees) and a meta-model (logistic regression).
• We split our combined_data into training (train_data) and testing (test_data) sets, maintaining an 80/20 split.

Feature Selection

Key Influencing Factors: The model highlighted the following variables as the most significant predictors of creditworthiness:
• Sales from the Survey of Business Owners and Self-Employed Persons (SBO)
• Utility Bill Repayment status in September, from the UCI dataset
• Amount of previous payment (NT dollar) from utility bills, in the UCI dataset
• Employment experience of owner, from the Kaggle dataset

Model Accuracy

Random Forest Confusion Matrix
         0       1
0    17434    1422
1     4416    1604

Feature Selection Random Forest Confusion Matrix
         0       1
0    20328    1697
1     4866    1665

Stacking Ensemble Method Confusion Matrix
         M       R
M    20328    1697
R     4866    1665
(The counts shown on this slide repeat the Feature Selection matrix and do not by themselves yield the 87.8% reported for the ensemble.)

Model                              Accuracy   P-value    Sensitivity   Specificity
Random Forest                      76.53%     2e-16      79%           53%
Feature Selection Random Forest    77.02%     2e-16      80.69%        49.52%
Stacking Ensemble Method           87.8%      3.487e-6   94.74%        81.82%

Conclusion & Recommendations

Conclusion
• "Low Sales" and "Utility Bill Payment" emerged as the major factors influencing creditworthiness for the underserved community.
• Startups of color were more likely than white-owned firms to say they were denied because they did not have the necessary documentation required by the lender.
• Accuracy Achieved: 87.8%
• Balanced Accuracy: 88.28%
• High Sensitivity: 94.74%, effectively identifying non-creditworthy applicants
• Low False Positive Rate: Just 1 out of 41 cases
• Model Strengths: Robust in managing class imbalance, showcasing reliable predictions for the 'non-creditworthy' class

Recommendation
• Refine algorithm parameters to improve specificity without compromising sensitivity. This aims to strike a better balance between detecting true positives and reducing false positives.
• Incorporate broader data sets to deepen insights and refine predictions. Consider factors that may affect creditworthiness beyond the current scope.
• Regularly update the model to reflect evolving economic trends, and validate it using advanced techniques to ensure robustness across diverse scenarios.
• Proactively identify and correct for biases. This includes implementing fairness constraints to uphold ethical standards and comply with regulatory requirements.

Recommended Sources of Data Collection
• Partnerships with national grid, brokers etc. for determining utility bills and payment trends
• Using data from existing small businesses within Metropolitan Bank to build predictive analytics for identifying future trends
• Partnerships with government surveys and data like U.S. Survey Bureau GDPR Reports and Analysis

Appendix – 1. Prohibited Variables for Credit Worthiness

Appendix – 2. Assumptions
• Underserved group has annual income less than 100k grants
• Average annual pay, Hispanic: $63k
• Average annual pay, Black: $52k
• Average annual pay, all races: $74k
• Average annual pay, women: $41k
• Underserved: low income, geographic location (rural area), racial and ethnic minorities, immigrants, non-native people, students, small business owners (such as freelancers)

Appendix – 3.
Box-Plot Income vs. Credit_worth (Target Variable)
• The graph compares total income levels between two groups: non-creditworthy (0) and creditworthy (1) individuals.
• Income Insights: The median income of creditworthy individuals is higher, with a wider range of incomes indicating greater variability.
• Outliers Noted: Several outliers, particularly in the creditworthy group, suggest the presence of individuals with significantly higher incomes than their peers.

Appendix – 4. Variable Importance based on Random Forest model

Appendix – 5. Variable Importance: Ensemble Method

Appendix – 6. Summary Statistics: Ensemble Model

Appendix – 7. Research
• https://www.fedsmallbusiness.org/reports/survey/2023/2023-report-on-startup-firms-owned-by-people-of-color
• https://www.frbsf.org/research-and-insights/publications/community-development-investment-review/2021/08/the-racialized-rootsof-financial-exclusion/
• Credit scores and revenue are crucial inputs into a lender's decision but are not the only factors. Other factors that are potentially relevant to the decision to lend, including a firm's collateral, cash flows, and documentation, may account for some of the gaps in access to financing.

General Machine Learning and Predictive Analytics:
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.

SMOTE and Imbalanced Learning:
• Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357. https://www.jair.org/index.php/jair/article/view/10302

Credit Scoring and Loan Defaults:
• Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a Review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523-541. https://academic.oup.com/jrsssa/article/160/3/523/-
• Thomas, L. C. (2000). A survey of credit and behavioral scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, 16(2), 149-172. https://www.sciencedirect.com/science/article/abs/pii/S-
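
Worked check for the Model Accuracy slide
The accuracy, sensitivity, and specificity reported for the two Random Forest models can be recomputed from their confusion matrices. The sketch below is a Python illustration (the deck's analysis was done in R) and rests on an assumption the slides do not state: rows are predictions, columns are reference labels, and class 0 (non-creditworthy) is the positive class. The function name summarize is hypothetical. Under that reading, the counts are consistent with the reported percentages.

```python
def summarize(matrix):
    """Compute metrics from a 2x2 matrix laid out as [[TP, FP], [FN, TN]].

    Assumed layout: rows = predicted class, columns = reference class,
    with class 0 treated as the positive class.
    """
    (tp, fp), (fn, tn) = matrix
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # share of actual positives that were caught
    specificity = tn / (tn + fp)  # share of actual negatives that were caught
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts copied from the Model Accuracy slide.
random_forest = summarize([[17434, 1422], [4416, 1604]])
feature_selected = summarize([[20328, 1697], [4866, 1665]])

for name, result in [("Random Forest", random_forest),
                     ("Feature Selection Random Forest", feature_selected)]:
    print(name, {k: round(v * 100, 2) for k, v in result.items()})
```

Balanced accuracy is the mean of sensitivity and specificity, which is why the ensemble's reported 94.74% and 81.82% average to the stated 88.28%.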