MS Business Analytics Capstone Project - Yamin Sun, Team Leader
Predicting 30-day Patient Readmissions
Final Summary
Submitted to
Vince Baldasare of Dayton Children’s Hospital
by
The University of Dayton BAN 710 DCH Project Team:
Ruinan Chen
Venkat Nurukurthi
Cameron Luckhaupt
Yamin Sun
March 24, 2024
Table of Contents
1. Executive Summary: .............................................................................................................................. 1
2. Background Information:....................................................................................................................... 1
3. Problem Background: ............................................................................................................................ 1
3. Problem Statement:................................................................................................................................ 2
4. Project Timeline:..................................................................................................................................... 2
5. Scope:...................................................................................................................................................... 4
5.1 In Scope ........................................................................................................................................... 4
5.2 Out of Scope: ................................................................................................................................... 4
6. Data:......................................................................................................................................................... 4
7. Assumptions:.......................................................................................................................................... 5
8. Method ..................................................................................................................................................... 6
8.1 Data Cleansing and Pre-Processing ................................................................................................ 6
8.2 Exploratory Data Analysis ................................................................................................................ 6
8.3 Feature Extraction ............................................................................................................................ 7
8.3.1 Correlation: ............................................................................................................................. 8
8.3.2 Feature Importance Scores .................................................................................................... 9
8.4 Modeling .......................................................................................................................................... 9
8.4.1 Oversampling ....................................................................................................................... 10
8.4.2 Undersampling ..................................................................................................................... 10
8.5 Final Model ..................................................................................................................... 12
8.6 Hyper Parameter Tuning ................................................................................................................ 12
8.7 Model Calibration ........................................................................................................................... 13
9. Results: ................................................................................................................................................. 14
9.1 Output Comparisons: .................................................................................................................. 15
10. Benefits: .............................................................................................................................................. 16
11. Risks: ................................................................................................................................... 17
12. Conclusion: ......................................................................................................................................... 17
References ............................................................................................................................ 18
Appendix ............................................................................................................................................. 19
13. Dayton Children's Hospital Readmission Prediction System ......................................................... 19
14. Application Setup ............................................................................................................................... 19
14.1 Environment Setup ...................................................................................................................... 19
14.2 Requirements Setup .................................................................................................................... 19
14.3 Start Application ........................................................................................................................... 20
15. Application Screens ........................................................................................................................... 21
15.1 Login Screen ................................................................................................................................ 21
15.2 Django Admin Users .................................................................................................................... 22
15.3 Django Admin Groups .................................................................................................................. 23
15.4 Dashboard Page .......................................................................................................................... 23
15.5 Patients Page ............................................................................................................................... 24
1. Executive Summary:
Dayton Children’s Hospital needs a predictive model to identify patients who are likely to be
readmitted within 30 days. To address this problem, we developed eight machine learning
classification models, each evaluated at multiple thresholds. Three models that meet Dayton
Children’s Hospital’s needs in different ways were identified by manipulating the threshold values
of a gradient boosting methodology implemented in Python. These three models provide a
significant improvement in combinations of sensitivity and precision over Dayton Children’s
Hospital’s internal algorithm executed at either a 0.4 or 0.5 threshold. Dayton Children’s Hospital
can now choose one model, or experiment across all three, to integrate into its existing Power BI
dashboard and identify, at the time of discharge, patients who are at risk of returning within 30 days.
2. Background Information:
Dayton Children’s Hospital has been providing healthcare since 1919. It currently has nine
locations: Dayton, Vandalia, Beavercreek, Springboro, Kettering, Springfield, Warren County,
Lima, and Sugarcreek. Dayton Children’s is a verified Level 1 Pediatric Trauma Center and
currently provides care for about 150 patients across multiple areas of care. The hospital offers 35
specialty areas ranging from Anesthesiology to Urology to Sports Medicine, and serves patients
from newborns to age 21. It is staffed with more than 1,700 full-time and part-time employees and
50 physicians, and is currently the only pediatric hospital in the Dayton area.
3. Problem Background:
The main problem facing Dayton Children’s Hospital is that it can only serve a certain number of
children at a time, and it wants the ability to better know when a patient is at higher risk of
readmission. A screening process is needed to determine whether a patient is at elevated risk of
readmission within 30 days. The current process is based on both clinical and social factors that
add up to a score out of 14. Dayton Children’s Hospital has found some success with the current
approach, which has an accuracy of 89.5% and a sensitivity of 16.6% (see table below), but it
needs a more precise and scalable way to determine, at the time of discharge, whether a patient is
likely to be readmitted.
Fig 1. DCH 0.5 Performance Metrics
Dayton Children’s Hospital wants the BAN 710 DCH team to develop a model that achieves more
success using advanced modeling techniques.
Another requirement from Dayton Children’s Hospital is a model that is scalable, so that the
hospital can adopt it for any type of patient.
3. Problem Statement:
Dayton Children’s Hospital currently faces a challenge in efficiently predicting the likelihood of
patient readmission within 30 days post-discharge. The existing algorithm works by averaging a
set of clinically scored columns related to the patient. It neither accurately reflects the complexities
of these factors nor fully accounts for their relative weights in predicting readmissions. Current
precision is low at 16.6% and needs to be improved.
4. Project Timeline:
Week of Jan. 15 - Jan. 21: Project proposal and data set received from DCH; descriptive analysis
Week of Jan. 22 - Jan. 28: Draft analysis proposal; further descriptive analysis including binning, feature selection, and initial logistic regression analysis
Week of Jan. 29 - Feb. 4: Analysis proposal review with Vince; decision tree analysis; exploration of alternate models (lasso, ridge, bootstrapping, and gradient boosting)
Week of Feb. 5 - Feb. 11: Further model optimization and convergence
Week of Feb. 12 - Feb. 18: Midterm presentation of model progression with Vince
Week of Feb. 19 - Feb. 25: Validation of improved model with new data
Week of Feb. 26 - Mar. 3: Exploration of SMOTE; improvement of confusion matrix, AUC-ROC, sensitivity, and accuracy
Week of Mar. 4 - Mar. 10: Feature analysis and model selection
Week of Mar. 11 - Mar. 17: Final model and business insights
Week of Mar. 18 - Mar. 24: Final presentation to Vince and final paper
5. Scope:
5.1 In Scope
● Data Analysis and Feature Selection: Finding the features that most affect readmission
within 30 days using descriptive analysis and decision-tree feature importance scores.
● Model Training and Testing: Training the models using a 70-30 train-test split and tuning
them to improve accuracy.
● Algorithm Selection: Selecting a machine learning algorithm based on the best-performing
predictive model given the relationships and interactions of the selected features.
5.2 Out of Scope:
● Long-term Patient Outcome Prediction: Prediction of readmission beyond 30 days. The
current project scope targets readmission within 30 days.
● Integration with Hospital Systems: The team does not have access to the hospital’s existing
systems, such as the existing Power BI dashboard.
6. Data:
● Four types of identifiers
○ Individual patients
○ Unique hospital admission
○ The department the patient was admitted to
○ The department the patient was discharged from
● Discharged date and Admission date
● Age of patients
● Flag admission data
○ Specific primary payor identification
○ Presence of Social Factors 1 and 2
○ Indicator that the medication history was not reviewed during the admission
○ Presence of a complex chronic condition
○ Preterm delivery indication based on types 1 and 2
● Measurement data
○ Flag for Social risk level at admission
○ Counts of utilization types 1, 2, and 3 over the past 6 months
○ Count of medications the patient was taking before admission
● The risk probability for each encounter using the current algorithm
● The variable to predict
○ The patient was readmitted within 30 days of discharge
7. Assumptions:
The following assumptions were made throughout this project:
● Data (from 1/1/2023 to 12/31/2023) are assumed to be representative of future patient data.
● The machine learning models are scalable to handle larger amounts of data and
computing resources.
● All factors are assumed to be independent.
● The data collection process is standardized and reproducible, with everyone scoring
uniformly, particularly regarding the basis for choosing when not to collect data (such as
for reviewing medications or Social Factor 3).
● The patterns and relationships in the data will remain stable over time.
● The models assume the features used are relevant and have a predictive relationship with
the outcome.
8. Method
8.1 Data Cleansing and Pre-Processing
Data cleaning is identifying and correcting errors or inaccuracies in data to ensure quality,
transforming raw data ready for analysis. We began by addressing removing any missing values.
We found that Medications_Prior_To_Arrival had 311 missing values, so we replaced the missing
values with -1.
Furthermore, we created new columns from existing ones. The first column, Duration_of_Stay,
calculates “Discharge_Date” subtracted by “Admission_Date”. “Duration_of_Stay” allowed us to
determine whether extended stays versus short stays affect readmission status. Additionally, we
extracted Admission_Month and Discharge_Month from the respective date columns. These
month columns allowed us to determine if the time of the year affects readmission status.
Certain columns in our data, such as “Primary_Payor_Type_1”,
“Medications_Reviewed_During_Admission”, “Social_Factor_1” and “Social_Factor_2”,
“Complex_Chronic_Condition”, “Preterm_Type_1” and “Preterm_Type_2”, and
“ReadmitWithin30Days”, can only take the values 0 or 1. However, Python treats these columns
as int64, which can hold any value in a much wider range. To make our data more accurate and
our work more efficient, we converted these columns to Boolean (True or False) to better
represent the nature of the data.
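The cleansing steps above can be sketched in pandas. The column names come from the dataset described in Section 6, but the rows below are invented for illustration:

```python
import pandas as pd

# Toy frame with the columns the cleansing steps operate on (values are illustrative).
df = pd.DataFrame({
    "Admission_Date": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "Discharge_Date": pd.to_datetime(["2023-01-09", "2023-02-11"]),
    "Medications_Prior_To_Arrival": [4.0, None],
    "ReadmitWithin30Days": [0, 1],
})

# Fill missing medication counts with the sentinel value -1.
df["Medications_Prior_To_Arrival"] = df["Medications_Prior_To_Arrival"].fillna(-1)

# Derive Duration_of_Stay (in days) and the month columns from the date fields.
df["Duration_of_Stay"] = (df["Discharge_Date"] - df["Admission_Date"]).dt.days
df["Admission_Month"] = df["Admission_Date"].dt.month
df["Discharge_Month"] = df["Discharge_Date"].dt.month

# Convert 0/1 flag columns (here just the target) from int64 to Boolean.
df["ReadmitWithin30Days"] = df["ReadmitWithin30Days"].astype(bool)
```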
8.2 Exploratory Data Analysis
Our exploratory analysis of the initial dataset presented our first issue: of the 12,663 observations
across 21 variables (numeric, datetime, and categorical), only 1,340 represented true cases of
readmission, or 10.6% of the entire dataset. Cleaning the data involved handling missing values
and converting certain columns to appropriate data types; the cleaned dataset has 16 variables and
12,305 entries. Key variables such as 'Patient_Age' range from 0 to 26, while others, such as
'Social_Factor_1', show imbalances. 'Duration_of_Stay' varies widely, and 'ReadmitWithin30Days'
indicates that most patients are not readmitted within 30 days. This dataset is ready for in-depth
analysis and machine learning models. We performed our exploratory data analysis using a Python
data profiling package. Please see Appendix 13.1 for full data profiling reports of the initial and
cleaned datasets.
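The headline checks the profiling report automates — the class balance of the target and per-variable ranges — can be reproduced with plain pandas. The rows below are made up for illustration; the real dataset has 12,663 rows:

```python
import pandas as pd

# Small stand-in for the readmission dataset.
df = pd.DataFrame({
    "Patient_Age": [0, 3, 17, 26, 5, 12],
    "Duration_of_Stay": [1, 4, 2, 9, 3, 6],
    "ReadmitWithin30Days": [False, True, False, False, False, True],
})

# Share of positive cases; on the real data this is about 10.6%.
readmit_rate = df["ReadmitWithin30Days"].mean()

# Per-variable ranges, analogous to the profiling report's summaries.
age_min, age_max = df["Patient_Age"].min(), df["Patient_Age"].max()
```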
Duration_of_Stay: The derived column Duration_of_Stay is significant in predicting the target
ReadmitWithin30Days, as shown in Fig 2.
Figure 2. Mosaic plot of Duration
8.3 Feature Extraction
Feature extraction involves identifying the features in the dataset that contribute most significantly
to the desired model output. In simple terms, it is the process of finding the independent variables
that truly contribute to predicting the target variable. Reducing the number of features helps
machine learning models process the data and extract meaningful patterns and relationships.
As part of feature extraction, we performed correlation analysis and feature importance scoring.
8.3.1 Correlation:
Correlation measures how closely two variables move together in the same or opposite directions.
To identify relationships between different features and select optimal features, we created a
correlation matrix shown in Fig 3.
Fig 3. Correlation Matrix of Feature Columns.
● Preterm_Type_1 and Preterm_Type_2 show a high positive correlation.
● Admission_Department_ID_Transformed and Discharge_Department_ID_Transformed
are highly correlated.
Based on clinical domain knowledge, we removed Preterm_Type_2 and
Admission_Department_ID_Transformed. We also removed Discharge_Month during this phase,
as it is not a significant feature in terms of contributing to the prediction of readmissions.
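The correlation check can be sketched with pandas `DataFrame.corr`. The values below are invented, and the 0.7 cutoff is an illustrative choice, not the project’s actual rule:

```python
import pandas as pd

# Invented values for three of the feature columns.
df = pd.DataFrame({
    "Preterm_Type_1": [1, 0, 1, 0, 1, 0],
    "Preterm_Type_2": [1, 0, 1, 1, 1, 0],
    "Duration_of_Stay": [2, 5, 1, 7, 3, 4],
})

# Pairwise Pearson correlation matrix, as visualized in Fig 3.
corr = df.corr()

# Flag a strongly correlated feature pair as a candidate for removal.
strong = corr.loc["Preterm_Type_1", "Preterm_Type_2"] > 0.7
```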
8.3.2 Feature Importance Scores
Importance scores are metrics that assess the significance of each feature in predicting the target
variable. We calculated the importance score of each feature using the Random Forest algorithm.
Fig 4. Feature Importance Scores
Figure 4 shows the importance score of each feature. Models developed using 13 or fewer features
(selected in descending order of importance score) performed worse than models developed using
14 or 15 features. Models with 14 and 15 features showed the same results across the performance
metrics precision, sensitivity, AUC, accuracy, and F1 score. Based on clinical domain knowledge,
we used all 15 features for our final model.
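The scoring step can be sketched with scikit-learn’s `RandomForestClassifier`, which exposes impurity-based `feature_importances_`. The data here are synthetic, not the hospital’s:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: feature 0 drives the label, feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

# Fit a forest and read off the impurity-based importance of each feature.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = forest.feature_importances_

# Rank features from most to least important, as in Fig 4.
ranking = np.argsort(scores)[::-1]
```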
8.4 Modeling
We implemented oversampling using SMOTE and undersampling using random majority class
removal. Fig 5 below shows an example of these techniques.
Fig 5. Image showing comparison of how oversampling and undersampling works
8.4.1 Oversampling
The oversampling technique was implemented using SMOTE rather than simply duplicating the
minority class. SMOTE (Synthetic Minority Over-sampling Technique) creates additional
examples of the minority class (i.e., readmissions within 30 days) by generating synthetic records
that resemble the existing minority class records.
SMOTE uses k nearest neighbors (default k = 5) internally. For each minority class instance, one
of its k nearest minority-class neighbors is chosen, a line is drawn between the instance and the
chosen neighbor, and synthetic instances are generated along this line. Fig 6 shows a simple
visual of how SMOTE works.
Fig 6. SMOTE Working Image
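In practice SMOTE usually comes from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`); the interpolation step described above can also be sketched directly in NumPy. This is an illustrative simplification, not the project’s actual implementation:

```python
import numpy as np

def smote_like(X_min, k=5, n_new=10, seed=0):
    """Generate synthetic minority samples by interpolating between each
    sampled instance and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from instance i to every other minority instance.
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf                      # exclude the instance itself
        neighbor = rng.choice(np.argsort(dist)[:k])
        lam = rng.random()                    # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[neighbor] - X_min[i]))
    return np.array(synthetic)

# Five minority-class points in 2-D; generate ten synthetic ones.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_syn = smote_like(X_min, k=3, n_new=10)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.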
8.4.2 Undersampling
Undersampling using random majority deletion reduces the number of records in the majority
class (i.e., not readmitted within 30 days) by randomly removing some of them. Fig 5 shows a
visual representation of how undersampling works.
It turned out that neither the oversampling nor the undersampling technique improved the model’s
performance in predicting readmission, so we used the dataset with the original imbalance in the
target variable.
Fig 7. Multiple classifier models’ performance metrics with oversampling and undersampling.
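Random majority-class removal, as described above, can be sketched as follows on synthetic labels (imbalanced-learn’s `RandomUnderSampler` does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy labels: 90 "not readmitted" (0) vs. 10 "readmitted" (1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Keep every minority row, plus an equal-sized random sample of majority rows.
majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)
keep = np.concatenate([rng.choice(majority, size=len(minority), replace=False),
                       minority])

X_bal, y_bal = X[keep], y[keep]
```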
After evaluating the performance of the selected classifiers with undersampling and oversampling,
these models were unable to reach the performance of the baseline models, DCH 0.4 and
DCH 0.5.
In the algorithm selection process we trained multiple classification machine learning models,
including Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting,
K-Nearest Neighbors, Naive Bayes, Decision Tree, and XGBoost. We compared each to Dayton
Children’s model at prediction thresholds of 0.5 and 0.4. Gradient Boosting, K-Nearest Neighbors,
and Naive Bayes outperformed both of Dayton Children’s models.
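The comparison loop can be sketched with scikit-learn on a synthetic imbalanced dataset (roughly 10% positive, mirroring the readmission rate); the exact figures in Fig 8 come from the hospital’s data, not from this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data with 15 features, like the final feature set.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Precision and sensitivity (recall of the positive class) per classifier.
results = {}
for name, clf in [("Gradient Boosting", GradientBoostingClassifier(random_state=0)),
                  ("K-Nearest Neighbors", KNeighborsClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (precision_score(y_te, pred, zero_division=0),
                     recall_score(y_te, pred, zero_division=0))
```

The 70-30 split matches the in-scope training approach described in Section 5.1.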
Fig 8. Performance Metric of Multiple Classifiers along with DCH
Fig 8 is a visual representation of how different classifiers perform compared to the baseline
model DCH 0.5. Performance is measured with two metrics: precision and sensitivity. Classifiers
that fall in quadrant 1 (Q1) have both higher precision and higher sensitivity than DCH 0.5. The
default K-Nearest Neighbors classifier falls in Q1, while the default Gradient Boosting classifier
falls in quadrant 4 (Q4) because its precision is slightly lower (by 0.8%) than DCH 0.5.
8.5 Final Model
The Gradient Boosting model was chosen as the final model due to its superior performance
metrics, its robustness in handling imbalanced datasets, and its capacity to learn from mistakes: it
builds n trees sequentially, each correcting the errors of the previous tree.
It is well suited for anomaly detection, fraud detection, and similar applications in which the target
column’s minority class makes up roughly 1% to 10% of the data. By implementing
hyperparameter tuning and model calibration, the precision can be improved further.
8.6 Hyper Parameter Tuning
We performed hyperparameter tuning to determine the optimal hyperparameters, testing gradient
boosting parameters such as n_estimators, learning_rate, max_depth, and max_features. The
model with 151 n_estimators (trees), a learning rate of 0.13, 15 features, and a max_depth of 3
performed best among the tested parameter combinations. We compared it to the default model
with 100 trees and a learning rate of 0.1; the performance metrics in Fig 9 show no significant
difference between the two models.
Fig 9. Best fit models with custom hyper parameters
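The tuning step can be sketched with scikit-learn’s `GridSearchCV` over a small grid around the reported values. On synthetic data the winning combination will generally differ from the 151-tree / 0.13 result above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the training data.
X, y = make_classification(n_samples=600, n_features=15, weights=[0.9],
                           random_state=0)

# An illustrative grid around the hyperparameters discussed above.
grid = {"n_estimators": [100, 151],
        "learning_rate": [0.1, 0.13],
        "max_depth": [3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, cv=3, scoring="f1")
search.fit(X, y)
best = search.best_params_
```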
8.7 Model Calibration
We tested the final Gradient Boosting model with 100 threshold values from 0 to 1 in steps of
0.01, then plotted how well the model performed at each threshold, as shown in Figure 10. We
measured performance using two key metrics, precision and sensitivity, at each threshold. For
better visualization and understanding, we highlighted with solid lines the precision and
sensitivity values that exceed the baseline model, DCH 0.4.
The threshold values that lie in the highlighted green area are the best thresholds to use: all
threshold values in this area have higher sensitivity, precision, and AUC than the baseline model
DCH 0.4.
Fig 10. Gradient Boosting Precision & Sensitivity vs Threshold w.r.t DCH 0.4
We identified 11 threshold values that showed the best performance in terms of sensitivity,
precision, and AUC. To find the optimal threshold, we plotted the zoomed precision-recall curve
in Fig 11 below, which shows the model’s performance at each of these thresholds. The optimal
threshold is the one that strikes the best balance between sensitivity and precision; it lies at the top
right corner of the curve. Based on this curve, 0.32 is the optimal threshold.
Fig 11. Best Thresholds - Zoomed Precision vs Recall Curve Compared with DCH 0.4
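The threshold sweep can be sketched as follows, again on synthetic data. At threshold 0 every patient is flagged, so sensitivity starts at 1.0 and falls as the threshold rises:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Predicted probability of readmission for each held-out encounter.
proba = (GradientBoostingClassifier(random_state=0)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Evaluate precision and sensitivity at 100 thresholds: 0.00, 0.01, ..., 0.99.
sweep = []
for t in np.arange(0.0, 1.0, 0.01):
    pred = (proba >= t).astype(int)
    sweep.append((round(t, 2),
                  precision_score(y_te, pred, zero_division=0),
                  recall_score(y_te, pred, zero_division=0)))
```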
9. Results:
We trained multiple machine learning models in Python, JMP, and R. Based on this modeling, we
helped better predict whether a patient is at higher risk of readmission using both clinical and
social factors. Gradient Boosting was ultimately chosen as the final model to improve the
hospital’s ability to predict and manage patient readmissions.
We fine-tuned the Gradient Boosting model by adjusting its hyperparameters and tested it at
multiple thresholds to find the balance between precision and sensitivity, resulting in three models
that offer different trade-offs.
Dayton Children’s Hospital has the opportunity to enhance results by implementing a Gradient
Boosting Model tailored to different operational needs. Here are three carefully calibrated options:
● Trade Off Sensitivity for Precision (Threshold 0.38)
○ This model prioritizes precision, which means it aims to minimize false positives.
○ With a threshold of 0.38, this model can identify a smaller, more precise group of
patients for intervention.
● Trade Off Precision for Sensitivity (Threshold 0.28)
○ This model emphasizes sensitivity, capturing a larger number of patients who might
be at risk.
○ It is suitable when the hospital aims to cast a wider net to ensure fewer high-risk
patients are missed.
● Balanced Threshold (Threshold 0.32)
○ This model offers a balance between sensitivity and precision.
○ It provides a well-rounded approach to patient readmission prediction.
9.1 Output Comparisons:
A key metric for Dayton Children’s Hospital is “identifying the most at-risk patients as
accurately as possible.” By this guideline, the 0.28-threshold gradient boosting model consults
the same number of patients as the DCH 0.4 model but identifies 36.6% more actual
readmissions, a significant improvement in the identification of at-risk patients. The
0.38-threshold model captures the same number of likely-to-be-readmitted patients while
requiring consultations with 29.3% fewer patients, as shown in Figure 12.
Figure 12. Consultation Efficiency Results within Different Thresholds
All three gradient boosting models predict readmissions in 8 of 9 discharge departments,
whereas the DCH 0.4 and 0.5 models predict readmissions in 7 of 9 discharge departments.
However, all three gradient boosting models disproportionately focus on one discharge
department, as shown in Figure 13.
Figure 13. Pie Chart of Models Predictions in Departments.
10. Benefits:
Implementation of advanced machine learning classification techniques to predict patient
readmission within 30 days offers a range of benefits. First, there is a significant improvement in
prediction accuracy and sensitivity compared to the current algorithm. This improvement ensures
more reliable and precise predictions, contributing to better decision-making processes.
Additionally, resource management within the hospital is streamlined, as the predictions enable
efficient allocation of resources. Identifying high-risk patients at an early stage gives Dayton
Children’s Hospital opportunities for early intervention, ultimately preventing unnecessary
readmissions.
Furthermore, the scalability of machine learning models is an advantage, allowing incorporation
of new significant factors without requiring a fundamental change in the overall approach. This
adaptability enhances the model’s capability to evolve with changing requirements.
11. Risks:
One such risk is overfitting, where models might excel on training data but fail to predict new
data accurately. To counter this, we implemented validation techniques and maintained model
simplicity.
Another important consideration is threshold selection. The threshold for classifying probabilities
into predictions can greatly influence performance metrics; we calibrated it carefully to ensure
optimal results for our project.
Furthermore, bias was a concern throughout the project. Models may inadvertently learn biases
present in the training data, which could lead to unfair outcomes. We are committed to using
diverse datasets to minimize this risk.
Lastly, the quality of data is paramount. Inaccurate data can lead to erroneous predictions. We
emphasize rigorous data cleaning and validation to uphold the highest quality standards, and we
aim to build robust, fair, and accurate predictive models by acknowledging and addressing these
risks.
12. Conclusion:
The University of Dayton BAN 710 team has developed three versions of the Gradient Boosting
model, each calibrated with a different threshold to optimize sensitivity, precision, or a balance of
both. This multi-solution approach aims to enhance patient readmission predictions for Dayton
Children’s Hospital. The solutions incorporate an advanced machine learning model, specifically
gradient boosting, coupled with threshold optimization, allowing a choice across various
combinations of sensitivity and precision. The development process involved data preprocessing,
exploratory data analysis, and feature extraction. The team addressed class imbalance, handled
missing values, and identified relevant features for the models. The models were trained and
tested on various platforms, including JMP, R, and Python. The final model was programmed in
Python to facilitate easy integration into Dayton Children’s Hospital’s existing Power BI
dashboard, and can readily be used to help the hospital balance resource utilization with
improved prediction of patient readmissions. The BAN 710 team’s work represents a significant
step forward in the use of data to improve outcomes in the healthcare industry.
References
Michailidis, P., Dimitriadou, A., Papadimitriou, T., & Gogas, P. (2022). Forecasting hospital
readmissions with machine learning. Healthcare (Basel, Switzerland), 10(6), 981.
https://doi.org/10.3390/healthcare-
Davis, S., Zhang, J., Lee, I., Rezaei, M., Greiner, R., McAlister, F. A., & Padwal, R. (2022).
Effective hospital readmission prediction models using machine-learned features. BMC
Health Services Research, 22(1), 1415. https://doi.org/10.1186/s--y
Chaudhary, K. (2023, September 24). How to deal with imbalanced data in classification?
Medium. https://medium.com/game-of-bits/how-to-deal-with-imbalanced-data-in-classification-bd03cfc66066
Wang, S., & Zhu, X. (2022). Predictive modeling of hospital readmission: Challenges and
solutions. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(5).
Epub 2022 Oct 10. https://ieeexplore.ieee.org/document/-
Appendix
13. Dayton Children's Hospital Readmission Prediction System
We developed an application, the Dayton Children’s Hospital Readmission Prediction System
(DCH RPS), using the Python Django framework with an MVC architecture. All features of the
dashboard are secured by a role-based authentication mechanism: only users in the Admin group
have access to all features of the dashboard, while users in the Department group have access to a
limited set of features.
A user-friendly dashboard displays statistics such as readmission data, discharge data, and
department-wise readmissions within the last 30 days compared with the previous 30 days (i.e.,
days 31-60 before the current date). The dashboard shows readmission predictions for patients
discharged in the last 30 days. Users can navigate to the Patients page, where they can access all
patients’ predictions and add new patients to see their predictions.
13.1 EDA Output
● Data profiling with initial dataset: Click here.
● Data profiling with cleaned dataset: Click here.
14. Application Setup
14.1 Environment Setup
Install Python from https://www.python.org/downloads/ (version 3.9.18 is recommended). Once
downloaded and installed, make sure its PATH is added to the environment variables. Check the
Python version after installation from a terminal using the following command:
python --version
14.2 Requirements Setup
In the project root directory, run the following command to install the required Python packages:
pip install -r requirements.txt
14.3 Start Application
Make sure you upload a dataset file in CSV format, named data.csv, to the data folder, then run
the following commands to start the application server in development mode:
python manage.py makemigrations
python manage.py migrate
python manage.py runserver
The application is then accessible at http://localhost:8000
15. Application Screens
15.1 Login Screen
Users of the application need to be created from the Django Admin page; based on the group a
user is added to, they will see different content in the application.
15.2 Django Admin Users
The Django Admin page is accessible at http://localhost:8000/admin/login/?next=/admin/
The first admin superuser must be created from the terminal using the following command:
python manage.py createsuperuser
15.3 Django Admin Groups
15.4 Dashboard Page
15.5 Patients Page