Mordecai Oladayo Abayomi | Freelancer Diabetes Milletus Analysis With R

DIABETES MILLETUS ANALYSIS WITH R

THE DIABETES MELLITUS MODEL USING R Fig 1; Receiver Operating Characteristic (ROC) curve for the model ROC Curve 1.5 1.0 0.5 0.0 −0.5 Specificity Fig 1 represent Receiver Operating Characteristic (ROC) curve, a key tool for evaluating the performance of a binary classification model, for predicting diabetes in this case. The curve plots the True Positive Rate (Sensitivity) on the y-axis against the False Positive Rate (1 - Specificity) on the x-axis. The curve in this plot rises steeply from the origin and bends towards the top-left corner before leveling off, indicating a good model performance. This shape suggests that the model achieves a high true positive rate while maintaining a low false positive rate across various classification thresholds. The curve's significant distance from the diagonal line visually confirms the model's strong discriminative ability. The curve's shape is consistent with the previously mentioned high Area Under the Curve (AUC) value of approximately 0.81, reinforcing the model's effectiveness in distinguishing between diabetic and non-diabetic cases. Fig 2; Box plot for the model Predicted Probabilities by Actual Outcome 0 1 Fig 2 illustrates the distribution of predicted probabilities for diabetes by actual outcome in the logistic regression model. The x-axis represents the actual outcome, with 0 indicating non-diabetic cases and 1 indicating diabetic cases. The y-axis shows the predicted probabilities ranging from 0 to 1. There's a clear separation between the two groups: for non-diabetic individuals (0), the box is lower, with the median around 0.2-0.3 and most predictions falling below 0.4. For diabetic individuals (1), the box is higher, with the median around 0.7-0.8 and most predictions above 0.6. This distinct separation demonstrates the model's strong discriminative ability. The minimal overlap between the boxes suggests that the model effectively distinguishes between diabetic and non-diabetic cases. Some outliers are visible, particularly for the non-diabetic group, indicating a few cases where the model's predictions were less accurate. APPENDIX library(readxl) > data <- read_excel("C:/Users/HP/Downloads/data.xlsx") > View(data) > names(data) <- c("BMI", "FamilyHistory", "Sex", "Age", "Diabetes") > data$FamilyHistory <- as.factor(data$FamilyHistory) > data$Sex <- as.factor(data$Sex) > data$Diabetes <- as.factor(data$Diabetes) > # Standardize continuous variables > data$BMI_scaled <- scale(data$BMI) > data$Age_scaled <- scale(data$Age) > # Objective 1: Investigate effect of demographic variables > model <- glm(Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled, + data = data, family = binomial) > model Call: glm(formula = Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled, family = binomial, data = data) Coefficients: (Intercept) -0.48980 BMI_scaled FamilyHistory- 2.67905 Sex1 -0.71748 Degrees of Freedom: 298 Total (i.e. Null); 294 Residual Null Deviance: 410.4 Residual Deviance: 306.9 AIC: 316.9 > # Summary of the model > summary_model <- summary(model) > print(summary_model) Call: Age_scaled -0.06077 glm(formula = Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled, family = binomial, data = data) Deviance Residuals: Min 1Q Median 3Q Max -2.1832 - Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.48980 BMI_scaled 0.23827 - * 0.04015 - FamilyHistory- Sex1 -0.71748 Age_scaled - <2e-16 *** 0.30120 - * -0.06077 0.14349 - --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 410.40 on 298 degrees of freedom Residual deviance: 306.94 on 294 degrees of freedom AIC: 316.94 Number of Fisher Scoring iterations: 4 > # Objective 2: Determine lowest and highest effect > # Calculate odds ratios and confidence intervals > odds_ratios <- exp(coef(model)) > ci <- exp(confint(model)) Waiting for profiling to be done... > # Combine odds ratios and CI into a data frame > effect_sizes <- data.frame( + OddsRatio = odds_ratios, + CI_Lower = ci[,1], + CI_Upper = ci[,2] +) > # Sort by absolute effect size > effect_sizes$Variable <- rownames(effect_sizes) > effect_sizes <- effect_sizes[order(abs(log(effect_sizes$OddsRatio))), ] > print(effect_sizes) OddsRatio CI_Lower CI_Upper Variable BMI_scaled - BMI_scaled Age_scaled - Age_scaled (Intercept) Sex1 - (Intercept) Sex1 FamilyHistory- FamilyHistory1 > # Objective 3: Model fit assessment > # Calculate various fit statistics > hoslem_test <- hoslem.test(as.numeric(data$Diabetes) - 1, fitted(model)) > pseudo_r2 <- 1 - model$deviance/model$null.deviance > pseudo_r2 [1]- > # ROC curve and AUC > roc_obj <- roc(data$Diabetes, fitted(model)) Setting levels: control = 0, case = 1 Setting direction: controls < cases > auc_value <- auc(roc_obj) > auc_value Area under the curve: 0.8127 > # Print model fit statistics > cat("\nModel Fit Statistics:\n") Model Fit Statistics: > cat("Hosmer-Lemeshow Test p-value:", hoslem_test$p.value, "\n") Hosmer-Lemeshow Test p-value:- > cat("Pseudo R-squared:", pseudo_r2, "\n") Pseudo R-squared:- > cat("AUC:", auc_value, "\n") AUC:- > # Optional: Visualization > # Plot ROC curve > plot(roc_obj, main="ROC Curve") > # Plot predicted probabilities by actual outcome > predicted_probs <- predict(model, type = "response") > boxplot(predicted_probs ~ data$Diabetes, + main="Predicted Probabilities by Actual Outcome", + ylab="Predicted Probability", xlab="Actual Outcome") > # Additional: Checking multicollinearity > vif_values <- vif(model) > print(vif_values) BMI_scaled FamilyHistory- - Sex- Age_scaled- model_linearity <- glm(Diabetes ~ BMI_scaled + I(BMI_scaled^2) + + Age_scaled + I(Age_scaled^2) + + FamilyHistory + Sex, + data = data, family = binomial) > summary(model_linearity) Call: glm(formula = Diabetes ~ BMI_scaled + I(BMI_scaled^2) + Age_scaled + I(Age_scaled^2) + FamilyHistory + Sex, family = binomial, data = data) Deviance Residuals: Min 1Q Median 3Q Max -2.1934 - Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) BMI_scaled -0.40083 0.27919 - 0.06591 - I(BMI_scaled^2) -0.01005 Age_scaled -0.09547 0.04384 - - I(Age_scaled^2) -0.09030 0.15811 - FamilyHistory- - <2e-16 *** Sex1 -0.70452 0.30214 - * --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 410.40 on 298 degrees of freedom Residual deviance: 306.57 on 292 degrees of freedom AIC: 320.57 Number of Fisher Scoring iterations: 4