Financial Inclusion Model Development
Predict Creditworthiness for
Underserved & Small Businesses
Team: The Outlier
Mobasshir Hasan
Executive Summary
Challenge Overview
Addressing the need for a fair, private, and explainable credit model that excludes traditional credit scores, built for underserved demographics and small businesses.
Data Integration Objective
Integration of three diverse datasets covering demographics, socioeconomic factors, financial behavior in terms of utility bills, and small business information.
Progression through Models
Evolution from Random Forest (76.53% accuracy) to Random Forest on refined features (77.02%), then a breakthrough with the Stacking Ensemble (87.8% accuracy).
Discovery of Influential Variables
Identification of pivotal factors:
• Sales from small business owners
• Utility bill repayment status
• Amount of previous payment from utility bills
Path Forward & Implementation Call
Recommendations include model refinement, expanded data utilization, and adaptability to changing trends.
Broader Impact & Innovation
With its high accuracy and balanced accuracy, our model helps reduce financial risk by accurately identifying creditworthy individuals, which helps prevent defaults and improves loan portfolio performance.
Our Understanding
Metropolitan Bank
Provides a broad range of business, commercial and personal banking products and services to small and middle-market
businesses, public entities and affluent individuals in the New York metropolitan area.
• Underserved Communities: 33.9M
• Small Businesses in the US: 33.2M (99% of all firms)
• Rejection of credit loans, startups of color: 53%
• Rejection of credit loans, white-owned startups: 27%
Factors Determining Credit Worthiness
• Most small businesses fail due to cash flow problems and a lack of demand for their product or service.
• Startups of color were more likely than white-owned firms to say they were denied because they did not have the
necessary documentation required by the lender.
Dataset Selection
Kaggle: 17 variables
• Personal information
• Socioeconomic information
• Loan application context
• Living conditions
Source: https://www.kaggle.com/datasets/mishra5001/credit-card
UCI: 23 variables
• Demographic information
• Utility bill statements
• Payment history
• Payment amounts
Source: UCI dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. https://www.divaportal.org/smash/get/diva2:829365/FULLTEXT01.pdf
U.S. Census Bureau: 19 variables
• Small business sales details
• Business employment
• Venture capital source
• Cash flow
Source: https://www.census.gov/econ/sbo/
New data set: 59 variables (1 target variable common to each dataset)
• Personal information
• Socioeconomic information
• Loan application context
• Living conditions
• Utility bill statements
• Payment amounts
• Demographic information
• Small business information
• Reason to use: The variables we chose for our model collectively provide a comprehensive picture of the client's information (a secondary source of data) without considering the traditional metrics used in determining creditworthiness. We also merged these datasets into a single combined_data data frame to unify the variables and form a comprehensive view of each applicant's profile.
Data Preprocessing & Balancing
Initial Setup: Essential Tools and Libraries
• Loaded vital R packages: dplyr, ggplot2, caret, ROSE, readxl, and randomForest.
• Set up the working environment and imported datasets.
Data Standardization: Uniformity in Data
• Standardized categorical variables for consistent analysis.
• Converted all data into appropriate formats.
Missing Values: Integrity Through Imputation
• Strategically filled numeric gaps with medians.
• Applied modes to categorical gaps to maintain distribution.
Class Imbalance Addressed: Ensuring Balanced Analysis
• Identified significant class imbalance in the dataset.
• Applied ROSE's oversampling to neutralize bias.
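The imputation and balancing steps above can be sketched in a few lines. The analysis itself was done in R; the Python below is only an illustration, not the team's code. The column names and toy records are made up, and plain random oversampling with replacement stands in for whichever ROSE routine was actually used.

```python
import random
from collections import Counter
from statistics import median


def impute(rows, numeric_cols, categorical_cols):
    """Fill None values: median for numeric columns, mode for categorical ones."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


def oversample(rows, target, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[target], []).append(r)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy records; column names and values are hypothetical.
records = [
    {"income": 40.0, "housing": "rent", "credit_worth": 0},
    {"income": None, "housing": "own", "credit_worth": 0},
    {"income": 55.0, "housing": None, "credit_worth": 0},
    {"income": 70.0, "housing": "rent", "credit_worth": 1},
]
clean = impute(records, ["income"], ["housing"])
balanced = oversample(clean, "credit_worth")
print(dict(Counter(r["credit_worth"] for r in balanced)))
```

Oversampling is best applied to the training portion only, so that duplicated rows do not leak into the evaluation set.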
Dataset Integration: Comprehensive Data Melding
• Unified data, default, and pums_var into a single dataset.
• Ensured alignment and consistency across all variables.
Model Building
Datasets: Credit Card Dataset (Kaggle), Utility Bills Dataset, Small Business Dataset
• Data Preprocessing
• Datasets Balancing
• Datasets Integration
Models: Random Forest Classification, Feature Selection, Refined Random Forest, SVM, GBM, Logistic Regression Model
• Random Forest Model: robust against overfitting and able to handle a large number of input variables.
• Feature Selection: identified the most significant predictors and selected a subset of top features.
Stacking Ensemble Method
• Used multiple base models (variations of decision
trees) and a meta-model (logistic regression).
• We split our combined_data into training (train_data) and
testing (test_data) sets, maintaining an 80/20 split.
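As a rough illustration of the 80/20 split described above, here is a minimal Python sketch (the deck's own work was in R). The names combined_data, train_data, and test_data come from the slide; the function, the seed, and the toy rows are assumptions. This version is a simple shuffled split and is not stratified by class.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=123):
    """Shuffle a copy of the rows and hold out the first test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the merged data frame.
combined_data = [{"id": i, "credit_worth": i % 2} for i in range(100)]
train_data, test_data = train_test_split(combined_data)
print(len(train_data), len(test_data))
```

Fixing the seed keeps the split reproducible, which matters when comparing the three models on the same held-out rows.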
Feature Selection
Key Influencing Factors: The model highlighted the following variables as the most significant predictors of creditworthiness:
• Sales from the Survey of Business Owners and Self-Employed Persons (SBO)
• Utility bill repayment status in September, from the UCI dataset
• Amount of previous payment (NT dollar) from utility bills, in the UCI dataset
• Employment experience of owner, from the Kaggle dataset
Model Accuracy
Random Forest
Confusion Matrix
        0       1
0   17434    1422
1    4416    1604
Accuracy: 76.53%
P-value: 2e-16
Specificity: 53%
Sensitivity: 79%
Feature Selection Random Forest
Confusion Matrix
        0       1
0   20328    1697
1    4866    1665
Accuracy: 77.02%
P-value: 2e-16
Specificity: 49.52%
Sensitivity: 80.69%
Stacking Ensemble Method
Confusion Matrix
        M       R
M   20328    1697
R    4866    1665
Accuracy: 87.8%
P-value: 3.487e-6
Sensitivity: 94.74%
Specificity: 81.82%
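The reported figures for the two Random Forest models can be recomputed from their confusion matrices. The Python sketch below assumes that rows are predicted classes, columns are actual classes, and class 0 is the positive class; the slide does not state this layout, but under that reading the counts reproduce the reported accuracy, sensitivity, and specificity. The balanced accuracy of the stacking ensemble is taken as the mean of its reported sensitivity and specificity.

```python
def metrics(matrix):
    """Accuracy, sensitivity, specificity and balanced accuracy from a 2x2 matrix.

    Assumed layout: rows are predicted classes, columns are actual classes,
    and the first class (0) is the positive class.
    """
    (tp, fp), (fn, tn) = matrix
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts taken from the slide.
random_forest = metrics([[17434, 1422], [4416, 1604]])
refined_forest = metrics([[20328, 1697], [4866, 1665]])

# Stacking ensemble: mean of the reported sensitivity and specificity.
stacking_balanced = (0.9474 + 0.8182) / 2

for name, result in [("Random Forest", random_forest),
                     ("Feature Selection Random Forest", refined_forest)]:
    print(name, {k: f"{100 * v:.2f}%" for k, v in result.items()})
print("Stacking balanced accuracy", f"{100 * stacking_balanced:.2f}%")
```

Both Random Forest models have much lower specificity than sensitivity, so accuracy alone overstates how well they separate the two classes.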
Conclusion & Recommendations
Conclusion
• "Low Sales" and "Utility Bill Payment" emerged as the major factors influencing creditworthiness for the underserved community.
• Startups of color were more likely than white-owned firms to say they were denied because they did not have the necessary documentation required by the lender.
• Accuracy Achieved: 87.8%
• Balanced Accuracy: 88.28%
• High Sensitivity: 94.74%, effectively identifying non-creditworthy applicants
• Low False Positive Rate: Just 1 out of 41 cases
• Model Strengths: Robust in managing class imbalance, showcasing reliable predictions for the 'non-creditworthy' class
Recommendation
• Refine algorithm parameters to improve specificity without compromising sensitivity. This aims to strike a better balance between detecting true positives and reducing false positives.
• Incorporate broader data sets to deepen insights and refine predictions. Consider factors that may affect creditworthiness beyond the current scope.
• Regularly update the model to reflect evolving economic trends and validate using advanced techniques to ensure robustness across diverse scenarios.
• Proactively identify and correct for biases. This includes implementing fairness constraints to uphold ethical standards and comply with regulatory requirements.
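One simple way to start on the bias recommendation above is to compare approval rates across groups. The Python sketch below is an illustrative assumption, not a method used in this project: the group labels and decisions are hypothetical, and the lowest-to-highest approval ratio is just one of several possible fairness measures.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Map each group to its share of approved applications.

    decisions: iterable of (group, approved) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, approved in decisions:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {group: approved / total for group, (approved, total) in counts.items()}


def approval_ratio(rates):
    """Lowest group approval rate divided by the highest (1.0 means parity)."""
    return min(rates.values()) / max(rates.values())


# Hypothetical model decisions for two groups, A and B.
decisions = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
rates = approval_rates(decisions)
ratio = approval_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 would flag the model for closer review; what threshold is acceptable depends on the applicable regulation and is not specified here.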
Recommended Sources of Data Collection
• Partnerships with national grid, brokers, etc. to obtain utility bill and payment trend data
• Using data from existing small business customers of Metropolitan Bank to build predictive analytics for identifying future trends
• Partnerships with government surveys and data sources such as the U.S. Census Bureau
• GDPR Reports and Analysis
Appendix – 1. Prohibited Variables for Credit Worthiness
Appendix – 2. Assumptions
• Underserved group has annual income less than 100k grants
• Average annual pay, Hispanic: $63k
• Average annual pay, Black: $52k
• Average annual pay, all races: $74k
• Average annual pay, women: $41k
• Underserved: low income, geographic location (rural areas), racial and ethnic minorities, immigrants, non-native people, students, small business owners (such as freelancers)
Appendix – 3. Box-Plot Income Vs Credit_worth (Target Variable)
• The graph compares total income levels between two groups: non-creditworthy (0) and creditworthy (1) individuals.
• Income Insights: The median income of creditworthy individuals is higher, with a wider range of incomes indicating greater variability.
• Outliers Noted: Several outliers, particularly in the creditworthy group, suggest the presence of individuals with significantly higher incomes than their peers.
4. Variable Importance based on Random Forest model
Appendix
5. Variable Importance: Ensemble Method
6. Summary Statistics: Ensemble Model
Appendix –
7. Research
• https://www.fedsmallbusiness.org/reports/survey/2023/2023-report-on-startup-firms-owned-by-people-of-color
• https://www.frbsf.org/research-and-insights/publications/community-development-investment-review/2021/08/the-racialized-rootsof-financial-exclusion/
• Credit scores and revenue are crucial inputs into a lender’s decision but are not the only factors. Other factors that are potentially
relevant to the decision to lend, including a firm’s collateral, cash flows, and documentation, may account for some of the gaps in access
to financing
• General Machine Learning and Predictive Analytics:
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
• SMOTE and Imbalanced Learning:
• Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357. https://www.jair.org/index.php/jair/article/view/10302
• Credit Scoring and Loan Defaults:
• Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a Review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523-541. https://academic.oup.com/jrsssa/article/160/3/523/-
• Thomas, L. C. (2000). A survey of credit and behavioral scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, 16(2), 149-172. https://www.sciencedirect.com/science/article/abs/pii/S-