MS Business Analytics Capstone Project - Yamin Sun, Team Leader
Predicting 30-day Patient Readmissions
Final Summary
Submitted to
Vince Baldasare of Dayton Children’s Hospital
by
The University of Dayton BAN 710 DCH Project Team:
Ruinan Chen
Venkat Nurukurthi
Cameron Luckhaupt
Yamin Sun
March 24, 2024
Table of Contents
1. Executive Summary: .............................................................................................................................. 1
2. Background Information:....................................................................................................................... 1
3. Problem Background: ............................................................................................................................ 1
3. Problem Statement:................................................................................................................................ 2
4. Project Timeline:..................................................................................................................................... 2
5. Scope:...................................................................................................................................................... 4
5.1 In Scope ........................................................................................................................................... 4
5.2 Out of Scope: ................................................................................................................................... 4
6. Data:......................................................................................................................................................... 4
7. Assumptions:.......................................................................................................................................... 5
8. Method ..................................................................................................................................................... 6
8.1 Data Cleansing and Pre-Processing ................................................................................................ 6
8.2 Exploratory Data Analysis ................................................................................................................ 6
8.3 Feature Extraction ............................................................................................................................ 7
8.3.1 Correlation: ............................................................................................................................. 8
8.3.2 Feature Importance Scores .................................................................................................... 9
8.4 Modeling .......................................................................................................................................... 9
8.4.1 Oversampling ....................................................................................................................... 10
8.4.2 Undersampling ..................................................................................................................... 10
8.5 Final Model ..................................................................................................................... 12
8.6 Hyper Parameter Tuning ................................................................................................................ 12
8.7 Model Calibration ........................................................................................................................... 13
9. Results: ................................................................................................................................................. 14
9.1 Output Comparisons: .................................................................................................................. 15
10. Benefits: .............................................................................................................................................. 16
11. Risks: ................................................................................................................................... 17
12. Conclusion: ......................................................................................................................................... 17
References ............................................................................................................................ 18
Appendix ............................................................................................................................................. 19
13. Dayton Children's Hospital Readmission Prediction System ......................................................... 19
14. Application Setup ............................................................................................................................... 19
14.1 Environment Setup ...................................................................................................................... 19
14.2 Requirements Setup .................................................................................................................... 19
14.3 Start Application ........................................................................................................................... 20
15. Application Screens ........................................................................................................................... 21
15.1 Login Screen ................................................................................................................................ 21
15.2 Django Admin Users .................................................................................................................... 22
15.3 Django Admin Groups .................................................................................................................. 23
15.4 Dashboard Page .......................................................................................................................... 23
15.5 Patients Page ............................................................................................................................... 24
1. Executive Summary:
Dayton Children’s Hospital needs a predictive model to identify patients who are likely to be
readmitted within 30 days. To address this problem, we developed eight machine learning
classification models, each evaluated at multiple thresholds. Three models that meet Dayton
Children’s Hospital’s needs in different ways were identified by manipulating the threshold values
of a gradient boosting methodology implemented in Python. These three models provide a
significant improvement in combinations of sensitivity and precision over Dayton Children’s
Hospital’s internal algorithm executed at either a 0.4 or 0.5 threshold. Dayton Children’s Hospital
can now choose one model, or experiment across all three, to integrate into its existing Power BI
dashboard and identify, at the time of discharge, patients who are at risk of returning within 30 days.
2. Background Information:
Dayton Children’s Hospital has been providing healthcare since 1919. It currently has nine
locations: Dayton, Vandalia, Beavercreek, Springboro, Kettering, Springfield, Warren County,
Lima, and Sugarcreek. Dayton Children’s is a verified Level 1 Pediatric Trauma Center and
currently provides care for about 150 patients across multiple areas of care. The hospital offers 35
specialty areas ranging from Anesthesiology to Urology to Sports Medicine, and serves patients
from newborns to age 21. It is staffed with more than 1,700 full-time and part-time employees and
50 physicians, and is currently the only pediatric hospital in the Dayton area.
3. Problem Background:
The main problem facing Dayton Children’s Hospital is that it can only serve a certain number of
children at a time, and it wants the ability to better know when a patient is at higher risk of
readmission. A screening process is needed to determine whether a patient is at elevated risk of
readmission within 30 days. The current process is based on both clinical and social factors that
add up to a score out of 14. Dayton Children’s Hospital has found some success with the current
approach, which has an accuracy of 89.5% and a sensitivity of 16.6% (see table below), but it
needs a more precise and scalable way to determine, at the time of discharge, whether a patient is
likely to be readmitted.
Fig 1. DCH 0.5 Performance Metrics
Dayton Children’s Hospital wants the BAN 710 DCH team to develop a model that achieves more
success using advanced modeling techniques.
Another requirement from Dayton Children’s Hospital is a model that is scalable, so that the
hospital can adopt it for any type of patient.
3. Problem Statement:
Dayton Children’s Hospital currently faces a challenge in efficiently predicting the likelihood of
patient readmission within 30 days post-discharge. The existing algorithm works by averaging a
set of clinically scored columns related to the patient. It neither accurately reflects the complexities
of these factors nor fully accounts for their relative weights in predicting readmissions. Current
precision is low at 16.6% and needs to be improved.
4. Project Timeline:
Week of Jan. 15 - Jan. 21: Project proposal and data set received from DCH; descriptive analysis
Week of Jan. 22 - Jan. 28: Draft analysis proposal; further descriptive analysis including binning, feature selection, and initial logistic regression analysis
Week of Jan. 29 - Feb. 4: Analysis proposal review with Vince; decision tree analysis; exploration of alternate models (lasso, ridge, bootstrapping, and gradient boosting)
Week of Feb. 5 - Feb. 11: Further model optimization and convergence
Week of Feb. 12 - Feb. 18: Midterm presentation of model progression with Vince
Week of Feb. 19 - Feb. 25: Validation of improved model with new data
Week of Feb. 26 - Mar. 3: Exploration of SMOTE; improvement of confusion matrix, AUC-ROC, sensitivity, and accuracy
Week of Mar. 4 - Mar. 10: Feature analysis and model selection
Week of Mar. 11 - Mar. 17: Final model and business insights
Week of Mar. 18 - Mar. 24: Final presentation to Vince and final paper
5. Scope:
5.1 In Scope
● Data Analysis and Feature Selection: Finding the features that most affect readmission
within 30 days using descriptive analysis and decision-tree feature importance scores.
● Model Training and Testing: Training the models using a 70-30 train-test split and tuning
them to improve accuracy.
● Algorithm Selection: Selecting a machine learning algorithm based on the best-performing
predictive model given the relationships and interactions of the selected features.
5.2 Out of Scope:
● Long-term Patient Outcome Prediction: Prediction of readmission beyond 30 days. The
current project scope targets readmission within 30 days.
● Integration with Hospital Systems: The team does not have access to the hospital’s existing
systems, such as the existing Power BI dashboard.
6. Data:
● Four types of identifiers
○ Individual patients
○ Unique hospital admission
○ The department the patient was admitted to
○ The department the patient was discharged from
● Discharged date and Admission date
● Age of patients
● Flag admission data
○ Specific primary payor identification
○ Presence of Social Factors 1 and 2
○ Indicator that the medication history was not reviewed during the admission
○ Presence of a complex chronic condition
○ Preterm delivery indication based on types 1 and 2
● Measurement data
○ Flag for Social risk level at admission
○ Counts of utilization types 1, 2, and 3 over the past 6 months
○ Count of medications the patient was taking before admission
● The risk probability for each encounter using the current algorithm
● The variable to predict
○ The patient was readmitted within 30 days of discharge
7. Assumptions:
The following assumptions were made throughout this project:
● Data (from 1/1/2023 to 12/31/2023) are assumed to be representative of future patient data.
● The machine learning models are scalable to handle larger amounts of data and
computing resources.
● All factors are assumed to be independent.
● The data collection process is standardized and reproducible, with everyone scoring
uniformly, particularly regarding the basis for choosing when not to collect data (such as
for reviewing medications or Social Factor 3).
● The patterns and relationships in the data will remain stable over time.
● The models assume the features used are relevant and have a predictive relationship with
the outcome.
8. Method
8.1 Data Cleansing and Pre-Processing
Data cleaning is identifying and correcting errors or inaccuracies in data to ensure quality,
transforming raw data ready for analysis. We began by addressing removing any missing values.
We found that Medications_Prior_To_Arrival had 311 missing values, so we replaced the missing
values with -1.
Furthermore, we created new columns from existing ones. The first column, Duration_of_Stay,
calculates “Discharge_Date” subtracted by “Admission_Date”. “Duration_of_Stay” allowed us to
determine whether extended stays versus short stays affect readmission status. Additionally, we
extracted Admission_Month and Discharge_Month from the respective date columns. These
month columns allowed us to determine if the time of the year affects readmission status.
Certain columns in our data, such as “Primary_Payor_Type_1”,
“Medications_Reviewed_During_Admission”, “Social_Factor_1” and “Social_Factor_2”,
“Complex_Chronic_Condition”, “Preterm_Type_1” and “Preterm_Type_2”, and
“ReadmitWithin30Days”, can only take the values 0 or 1. However, Python treats these columns
as int64, which can hold any value in a much wider range. To make our data more accurate and
our work more efficient, we converted these columns to Boolean (True or False) to better
represent the nature of the data.
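The cleansing steps above can be sketched in pandas. The column names come from the dataset described in Section 6, but the rows below are invented for illustration:

```python
import pandas as pd

# Toy frame with the columns the cleansing steps operate on (values are illustrative).
df = pd.DataFrame({
    "Admission_Date": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "Discharge_Date": pd.to_datetime(["2023-01-09", "2023-02-11"]),
    "Medications_Prior_To_Arrival": [4.0, None],
    "ReadmitWithin30Days": [0, 1],
})

# Fill missing medication counts with the sentinel value -1.
df["Medications_Prior_To_Arrival"] = df["Medications_Prior_To_Arrival"].fillna(-1)

# Derive Duration_of_Stay (in days) and the month columns from the date fields.
df["Duration_of_Stay"] = (df["Discharge_Date"] - df["Admission_Date"]).dt.days
df["Admission_Month"] = df["Admission_Date"].dt.month
df["Discharge_Month"] = df["Discharge_Date"].dt.month

# Convert 0/1 flag columns (here just the target) from int64 to Boolean.
df["ReadmitWithin30Days"] = df["ReadmitWithin30Days"].astype(bool)
```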
8.2 Exploratory Data Analysis
Our exploratory analysis of the initial dataset presented our first issue: of the 12,663 observations
across 21 variables (numeric, datetime, and categorical), only 1,340 represented true cases of
readmission, or 10.6% of the entire dataset. Cleaning the data involved handling missing values
and converting certain columns to appropriate data types; the cleaned dataset has 16 variables and
12,305 entries. Key variables such as 'Patient_Age' range from 0 to 26, while others, such as
'Social_Factor_1', show imbalances. 'Duration_of_Stay' varies widely, and 'ReadmitWithin30Days'
indicates that most patients are not readmitted within 30 days. This dataset is ready for in-depth
analysis and machine learning models. We performed our exploratory data analysis using a Python
data profiling package. Please see Appendix 13.1 for full data profiling reports of the initial and
cleaned datasets.
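The headline checks the profiling report automates — the class balance of the target and per-variable ranges — can be reproduced with plain pandas. The rows below are made up for illustration; the real dataset has 12,663 rows:

```python
import pandas as pd

# Small stand-in for the readmission dataset.
df = pd.DataFrame({
    "Patient_Age": [0, 3, 17, 26, 5, 12],
    "Duration_of_Stay": [1, 4, 2, 9, 3, 6],
    "ReadmitWithin30Days": [False, True, False, False, False, True],
})

# Share of positive cases; on the real data this is about 10.6%.
readmit_rate = df["ReadmitWithin30Days"].mean()

# Per-variable ranges, analogous to the profiling report's summaries.
age_min, age_max = df["Patient_Age"].min(), df["Patient_Age"].max()
```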
Duration_of_Stay: The derived column Duration_of_Stay is significant in predicting the target
ReadmitWithin30Days, as shown in Fig 2.
Figure 2. Mosaic plot of Duration
8.3 Feature Extraction
Feature extraction involves identifying the features in the dataset that contribute most significantly
to the desired model output. In simple terms, it is the process of finding the independent variables
that truly contribute to predicting the target variable. Reducing the number of features helps
machine learning models process the data and extract meaningful patterns and relationships.
As part of feature extraction, we performed correlation analysis and feature importance scoring.
8.3.1 Correlation:
Correlation measures how closely two variables move together in the same or opposite directions.
To identify relationships between different features and select optimal features, we created a
correlation matrix shown in Fig 3.
Fig 3. Correlation Matrix of Feature Columns.
● Preterm_Type_1 and Preterm_Type_2 show a high positive correlation.
● Admission_Department_ID_Transformed and Discharge_Department_ID_Transformed
are highly correlated.
Based on clinical domain knowledge, we removed Preterm_Type_2 and
Admission_Department_ID_Transformed. We also removed Discharge_Month during this phase,
as it is not a significant feature in terms of contributing to the prediction of readmissions.
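The correlation check can be sketched with pandas `DataFrame.corr`. The values below are invented, and the 0.7 cutoff is an illustrative choice, not the project’s actual rule:

```python
import pandas as pd

# Invented values for three of the feature columns.
df = pd.DataFrame({
    "Preterm_Type_1": [1, 0, 1, 0, 1, 0],
    "Preterm_Type_2": [1, 0, 1, 1, 1, 0],
    "Duration_of_Stay": [2, 5, 1, 7, 3, 4],
})

# Pairwise Pearson correlation matrix, as visualized in Fig 3.
corr = df.corr()

# Flag a strongly correlated feature pair as a candidate for removal.
strong = corr.loc["Preterm_Type_1", "Preterm_Type_2"] > 0.7
```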
8.3.2 Feature Importance Scores
Importance scores are metrics that assess the significance of each feature in predicting the target
variable. We calculated the importance score of each feature using the Random Forest algorithm.
Fig 4. Feature Importance Scores
Figure 4 shows the importance score of each feature. Models developed using 13 or fewer features
(selected in descending order of importance score) performed worse than models developed using
14 or 15 features. Models with 14 and 15 features showed the same results across the performance
metrics precision, sensitivity, AUC, accuracy, and F1 score. Based on clinical domain knowledge,
we used all 15 features for our final model.
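The scoring step can be sketched with scikit-learn’s `RandomForestClassifier`, which exposes impurity-based `feature_importances_`. The data here are synthetic, not the hospital’s:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: feature 0 drives the label, feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

# Fit a forest and read off the impurity-based importance of each feature.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = forest.feature_importances_

# Rank features from most to least important, as in Fig 4.
ranking = np.argsort(scores)[::-1]
```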
8.4 Modeling
We implemented oversampling using SMOTE and undersampling using random majority class
removal. Fig 5 below shows an example of these techniques.
Fig 5. Image showing comparison of how oversampling and undersampling works
8.4.1 Oversampling
The oversampling technique was implemented using SMOTE rather than simply duplicating the
minority class. SMOTE (Synthetic Minority Over-sampling Technique) creates additional
examples of the minority class (i.e., readmissions within 30 days) by generating synthetic records
that resemble the existing minority class records.
SMOTE uses k nearest neighbors (default k = 5) internally. For each minority class instance, one
of its k nearest minority-class neighbors is chosen, a line is drawn between the instance and the
chosen neighbor, and synthetic instances are generated along this line. Fig 6 shows a simple
visual of how SMOTE works.
Fig 6. SMOTE Working Image
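In practice SMOTE usually comes from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`); the interpolation step described above can also be sketched directly in NumPy. This is an illustrative simplification, not the project’s actual implementation:

```python
import numpy as np

def smote_like(X_min, k=5, n_new=10, seed=0):
    """Generate synthetic minority samples by interpolating between each
    sampled instance and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from instance i to every other minority instance.
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf                      # exclude the instance itself
        neighbor = rng.choice(np.argsort(dist)[:k])
        lam = rng.random()                    # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[neighbor] - X_min[i]))
    return np.array(synthetic)

# Five minority-class points in 2-D; generate ten synthetic ones.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_syn = smote_like(X_min, k=3, n_new=10)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.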
8.4.2 Undersampling
Undersampling using random majority deletion reduces the number of records in the majority
class (i.e., not readmitted within 30 days) by randomly removing some of them. Fig 5 shows a
visual representation of how undersampling works.
It turned out that neither the oversampling nor the undersampling technique improved the model’s
performance in predicting readmission, so we used the dataset with the original imbalance in the
target variable.
Fig 7. Multiple classifier models’ performance metrics with oversampling and undersampling.
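Random majority-class removal, as described above, can be sketched as follows on synthetic labels (imbalanced-learn’s `RandomUnderSampler` does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy labels: 90 "not readmitted" (0) vs. 10 "readmitted" (1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Keep every minority row, plus an equal-sized random sample of majority rows.
majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)
keep = np.concatenate([rng.choice(majority, size=len(minority), replace=False),
                       minority])

X_bal, y_bal = X[keep], y[keep]
```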
After evaluating the performance of the selected classifiers with undersampling and oversampling,
these models were unable to reach the performance of the baseline models, DCH 0.4 and
DCH 0.5.
In the algorithm selection process we trained multiple classification machine learning models,
including Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting,
K-Nearest Neighbors, Naive Bayes, Decision Tree, and XGBoost. We compared each to Dayton
Children’s model at prediction thresholds of 0.5 and 0.4. Gradient Boosting, K-Nearest Neighbors,
and Naive Bayes outperformed both of Dayton Children’s models.
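The comparison loop can be sketched with scikit-learn on a synthetic imbalanced dataset (roughly 10% positive, mirroring the readmission rate); the exact figures in Fig 8 come from the hospital’s data, not from this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data with 15 features, like the final feature set.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Precision and sensitivity (recall of the positive class) per classifier.
results = {}
for name, clf in [("Gradient Boosting", GradientBoostingClassifier(random_state=0)),
                  ("K-Nearest Neighbors", KNeighborsClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (precision_score(y_te, pred, zero_division=0),
                     recall_score(y_te, pred, zero_division=0))
```

The 70-30 split matches the in-scope training approach described in Section 5.1.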
Fig 8. Performance Metric of Multiple Classifiers along with DCH
Fig 8 is a visual representation of how different classifiers perform compared to the baseline
model DCH 0.5. Performance is measured with two metrics: precision and sensitivity. Classifiers
that fall in quadrant 1 (Q1) have both higher precision and higher sensitivity than DCH 0.5. The
default K-Nearest Neighbors classifier falls in Q1, while the default Gradient Boosting classifier
falls in quadrant 4 (Q4) because its precision is slightly lower (by 0.8%) than DCH 0.5.
8.5 Final Model
The Gradient Boosting model was chosen as the final model due to its superior performance
metrics, its robustness in handling imbalanced datasets, and its capacity to learn from mistakes: it
builds n trees sequentially, each correcting the errors of the previous tree.
It is well suited for anomaly detection, fraud detection, and similar applications in which the target
column’s minority class makes up roughly 1% to 10% of the data. By implementing
hyperparameter tuning and model calibration, the precision can be improved further.
8.6 Hyper Parameter Tuning
We performed hyperparameter tuning to determine the optimal hyperparameters, testing gradient
boosting parameters such as n_estimators, learning_rate, max_depth, and max_features. The
model with 151 n_estimators (trees), a learning rate of 0.13, 15 features, and a max_depth of 3
performed best among the tested parameter combinations. We compared it to the default model
with 100 trees and a learning rate of 0.1; the performance metrics in Fig 9 show no significant
difference between the two models.
Fig 9. Best fit models with custom hyper parameters
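The tuning step can be sketched with scikit-learn’s `GridSearchCV` over a small grid around the reported values. On synthetic data the winning combination will generally differ from the 151-tree / 0.13 result above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the training data.
X, y = make_classification(n_samples=600, n_features=15, weights=[0.9],
                           random_state=0)

# An illustrative grid around the hyperparameters discussed above.
grid = {"n_estimators": [100, 151],
        "learning_rate": [0.1, 0.13],
        "max_depth": [3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, cv=3, scoring="f1")
search.fit(X, y)
best = search.best_params_
```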
8.7 Model Calibration
We tested the final Gradient Boosting model with 100 threshold values from 0 to 1 in steps of
0.01, then plotted how well the model performed at each threshold, as shown in Figure 10. We
measured performance using two key metrics, precision and sensitivity, at each threshold. For
better visualization and understanding, we highlighted with solid lines the precision and
sensitivity values that exceed the baseline model, DCH 0.4.
The threshold values that lie in the highlighted green area are the best thresholds to use: all
threshold values in this area have higher sensitivity, precision, and AUC than the baseline model
DCH 0.4.
Fig 10. Gradient Boosting Precision & Sensitivity vs Threshold w.r.t DCH 0.4
We identified 11 threshold values that showed the best performance in terms of sensitivity,
precision, and AUC. To find the optimal threshold, we plotted the zoomed precision-recall curve
in Fig 11 below, which shows the model’s performance at each of these thresholds. The optimal
threshold is the one that strikes the best balance between sensitivity and precision; it lies at the top
right corner of the curve. Based on this curve, 0.32 is the optimal threshold.
Fig 11. Best Thresholds - Zoomed Precision vs Recall Curve Compared with DCH 0.4
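The threshold sweep can be sketched as follows, again on synthetic data. At threshold 0 every patient is flagged, so sensitivity starts at 1.0 and falls as the threshold rises:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Predicted probability of readmission for each held-out encounter.
proba = (GradientBoostingClassifier(random_state=0)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Evaluate precision and sensitivity at 100 thresholds: 0.00, 0.01, ..., 0.99.
sweep = []
for t in np.arange(0.0, 1.0, 0.01):
    pred = (proba >= t).astype(int)
    sweep.append((round(t, 2),
                  precision_score(y_te, pred, zero_division=0),
                  recall_score(y_te, pred, zero_division=0)))
```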
9. Results:
We trained multiple machine learning models in Python, JMP, and R. Based on this modeling, we
helped better predict whether a patient is at higher risk of readmission using both clinical and
social factors. Gradient Boosting was ultimately chosen as the final model to improve the
hospital’s ability to predict and manage patient readmissions.
We fine-tuned the Gradient Boosting model by adjusting its hyperparameters and tested it at
multiple thresholds to find the balance between precision and sensitivity, resulting in three models
that offer different trade-offs.
Dayton Children’s Hospital has the opportunity to enhance results by implementing a Gradient
Boosting Model tailored to different operational needs. Here are three carefully calibrated options:
● Trade Off Sensitivity for Precision (Threshold 0.38)
○ This model prioritizes precision, which means it aims to minimize false positives.
○ With a threshold of 0.38, this model can identify a smaller, more precise group of
patients for intervention.
● Trade Off Precision for Sensitivity (Threshold 0.28)
○ This model emphasizes sensitivity, capturing a larger number of patients who might
be at risk.
○ It is suitable when the hospital aims to cast a wider net to ensure fewer high-risk
patients are missed.
● Balanced Threshold (Threshold 0.32)
○ This model offers a balance between sensitivity and precision.
○ It provides a well-rounded approach to patient readmission prediction.
9.1 Output Comparisons:
A key metric for Dayton Children’s Hospital is “identifying the most at-risk patients as
accurately as possible.” By this guideline, the 0.28-threshold gradient boosting model consults
the same number of patients as the DCH 0.4 model but identifies 36.6% more actual
readmissions, a significant improvement in the identification of at-risk patients. The
0.38-threshold model captures the same number of likely-to-be-readmitted patients while
requiring consultations with 29.3% fewer patients, as shown in Figure 12.
Figure 12. Consultation Efficiency Results within Different Thresholds
All three gradient boosting models predict readmissions in 8 of 9 discharge departments,
whereas the DCH 0.4 and 0.5 models predict readmissions in 7 of 9 discharge departments.
However, all three gradient boosting models disproportionately focus on one discharge
department, as shown in Figure 13.
Figure 13. Pie Chart of Models Predictions in Departments.
10. Benefits:
Implementation of advanced machine learning classification techniques to predict patient
readmission within 30 days offers a range of benefits. First, there is a significant improvement in
prediction accuracy and sensitivity compared to the current algorithm. This improvement ensures
more reliable and precise predictions, contributing to better decision-making processes.
Additionally, resource management within the hospital is streamlined, as the predictions enable
efficient allocation of resources. Identifying high-risk patients at an early stage gives Dayton
Children’s Hospital opportunities for early intervention, ultimately preventing unnecessary
readmissions.
Furthermore, the scalability of machine learning models is an advantage, allowing incorporation
of new significant factors without requiring a fundamental change in the overall approach. This
adaptability enhances the model’s capability to evolve with changing requirements.
11. Risks:
One such risk is overfitting, where models might excel on training data but fail to predict new
data accurately. To counter this, we implemented validation techniques and maintained model
simplicity.
Another important consideration is threshold selection. The threshold for classifying probabilities
into predictions can greatly influence performance metrics; we calibrated it carefully to ensure
optimal results for our project.
Furthermore, bias was a concern throughout the project. Models may inadvertently learn biases
present in the training data, which could lead to unfair outcomes. We are committed to using
diverse datasets to minimize this risk.
Lastly, the quality of data is paramount. Inaccurate data can lead to erroneous predictions. We
emphasize rigorous data cleaning and validation to uphold the highest quality standards, and we
aim to build robust, fair, and accurate predictive models by acknowledging and addressing these
risks.
12. Conclusion:
The University of Dayton BAN 710 team has developed three versions of the Gradient Boosting
model, each calibrated with a different threshold to optimize sensitivity, precision, or a balance of
both. This multi-solution approach aims to enhance patient readmission predictions for Dayton
Children’s Hospital. The solutions incorporate an advanced machine learning model, specifically
gradient boosting, coupled with threshold optimization, allowing a choice across various
combinations of sensitivity and precision. The development process involved data preprocessing,
exploratory data analysis, and feature extraction. The team addressed class imbalance, handled
missing values, and identified relevant features for the models. The models were trained and
tested on various platforms, including JMP, R, and Python. The final model was programmed in
Python to facilitate easy integration into Dayton Children’s Hospital’s existing Power BI
dashboard, and can readily be used to help the hospital balance resource utilization with
improved prediction of patient readmissions. The BAN 710 team’s work represents a significant
step forward in the use of data to improve outcomes in the healthcare industry.
References
Michailidis, P., Dimitriadou, A., Papadimitriou, T., & Gogas, P. (2022). Forecasting hospital
readmissions with machine learning. Healthcare (Basel, Switzerland), 10(6), 981.
https://doi.org/10.3390/healthcare-
Davis, S., Zhang, J., Lee, I., Rezaei, M., Greiner, R., McAlister, F. A., & Padwal, R. (2022).
Effective hospital readmission prediction models using machine-learned features. BMC
Health Services Research, 22(1), 1415. https://doi.org/10.1186/s--y
Chaudhary, K. (2023, September 24). How to deal with imbalanced data in classification?
Medium. https://medium.com/game-of-bits/how-to-deal-with-imbalanced-data-in-classification-bd03cfc66066
Wang, S., & Zhu, X. (2022). Predictive modeling of hospital readmission: Challenges and
solutions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(5).
Epub 2022 Oct 10. https://ieeexplore.ieee.org/document/-
Appendix
13. Dayton Children's Hospital Readmission Prediction System
We developed an application, the Dayton Children’s Hospital Readmission Prediction System
(DCH RPS), using the Python Django framework with an MVC architecture. All features of the
dashboard are secured by a role-based authentication mechanism: only users in the Admin group
have access to all features of the dashboard, while users in the Department group have access to a
limited set of features.
A user-friendly dashboard displays statistics such as readmission data, discharge data, and
department-wise readmissions within the last 30 days compared with the previous 30 days (i.e.,
days 31-60 before the current date). The dashboard shows readmission predictions for patients
discharged in the last 30 days. Users can navigate to the Patients page, where they can access all
patients’ predictions and add new patients to see their predictions.
13.1 EDA Output
● Data profiling with initial dataset: Click here.
● Data profiling with cleaned dataset: Click here.
14. Application Setup
14.1 Environment Setup
Install Python from https://www.python.org/downloads/ (version 3.9.18 is recommended). Once
downloaded and installed, make sure its PATH is added to the environment variables. Check the
Python version after installation from a terminal using the following command:
python --version
14.2 Requirements Setup
In the project root directory, run the following command to install the required Python packages:
pip install -r requirements.txt
14.3 Start Application
Make sure you upload a dataset file in CSV format, named data.csv, to the data folder, then run
the following commands to start the application server in development mode:
python manage.py makemigrations
python manage.py migrate
python manage.py runserver
The application is then accessible at http://localhost:8000
15. Application Screens
15.1 Login Screen
Users of the application need to be created from the Django Admin page; based on the group a
user is added to, they will see different content in the application.
15.2 Django Admin Users
The Django Admin page is accessible at http://localhost:8000/admin/login/?next=/admin/
The first admin superuser must be created from the terminal using the following command:
python manage.py createsuperuser
15.3 Django Admin Groups
15.4 Dashboard Page
15.5 Patients Page