DIABETES MILLETUS ANALYSIS WITH R
THE DIABETES MELLITUS MODEL USING R
Fig 1; Receiver Operating Characteristic (ROC) curve for the model
ROC Curve
1.5
1.0
0.5
0.0
−0.5
Specificity
Fig 1 represent Receiver Operating Characteristic (ROC) curve, a key tool for evaluating the
performance of a binary classification model, for predicting diabetes in this case. The curve plots
the True Positive Rate (Sensitivity) on the y-axis against the False Positive Rate (1 - Specificity)
on the x-axis. The curve in this plot rises steeply from the origin and bends towards the top-left
corner before leveling off, indicating a good model performance. This shape suggests that the
model achieves a high true positive rate while maintaining a low false positive rate across various
classification thresholds. The curve's significant distance from the diagonal line visually confirms
the model's strong discriminative ability. The curve's shape is consistent with the previously
mentioned high Area Under the Curve (AUC) value of approximately 0.81, reinforcing the model's
effectiveness in distinguishing between diabetic and non-diabetic cases.
Fig 2; Box plot for the model
Predicted Probabilities by Actual Outcome
0
1
Fig 2 illustrates the distribution of predicted probabilities for diabetes by actual outcome in the
logistic regression model. The x-axis represents the actual outcome, with 0 indicating non-diabetic
cases and 1 indicating diabetic cases. The y-axis shows the predicted probabilities ranging from 0
to 1. There's a clear separation between the two groups: for non-diabetic individuals (0), the box
is lower, with the median around 0.2-0.3 and most predictions falling below 0.4. For diabetic
individuals (1), the box is higher, with the median around 0.7-0.8 and most predictions above 0.6.
This distinct separation demonstrates the model's strong discriminative ability. The minimal
overlap between the boxes suggests that the model effectively distinguishes between diabetic and
non-diabetic cases. Some outliers are visible, particularly for the non-diabetic group, indicating a
few cases where the model's predictions were less accurate.
APPENDIX
library(readxl)
> data <- read_excel("C:/Users/HP/Downloads/data.xlsx")
> View(data)
> names(data) <- c("BMI", "FamilyHistory", "Sex", "Age", "Diabetes")
> data$FamilyHistory <- as.factor(data$FamilyHistory)
> data$Sex <- as.factor(data$Sex)
> data$Diabetes <- as.factor(data$Diabetes)
> # Standardize continuous variables
> data$BMI_scaled <- scale(data$BMI)
> data$Age_scaled <- scale(data$Age)
> # Objective 1: Investigate effect of demographic variables
> model <- glm(Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled,
+
data = data, family = binomial)
> model
Call: glm(formula = Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled,
family = binomial, data = data)
Coefficients:
(Intercept)
-0.48980
BMI_scaled FamilyHistory-
2.67905
Sex1
-0.71748
Degrees of Freedom: 298 Total (i.e. Null); 294 Residual
Null Deviance:
410.4
Residual Deviance: 306.9 AIC: 316.9
> # Summary of the model
> summary_model <- summary(model)
> print(summary_model)
Call:
Age_scaled
-0.06077
glm(formula = Diabetes ~ BMI_scaled + FamilyHistory + Sex + Age_scaled,
family = binomial, data = data)
Deviance Residuals:
Min
1Q Median
3Q
Max
-2.1832 -
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)
-0.48980
BMI_scaled
0.23827 - *
0.04015
-
FamilyHistory-
Sex1
-0.71748
Age_scaled
- <2e-16 ***
0.30120 - *
-0.06077
0.14349 -
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 410.40 on 298 degrees of freedom
Residual deviance: 306.94 on 294 degrees of freedom
AIC: 316.94
Number of Fisher Scoring iterations: 4
> # Objective 2: Determine lowest and highest effect
> # Calculate odds ratios and confidence intervals
> odds_ratios <- exp(coef(model))
> ci <- exp(confint(model))
Waiting for profiling to be done...
> # Combine odds ratios and CI into a data frame
> effect_sizes <- data.frame(
+
OddsRatio = odds_ratios,
+
CI_Lower = ci[,1],
+
CI_Upper = ci[,2]
+)
> # Sort by absolute effect size
> effect_sizes$Variable <- rownames(effect_sizes)
> effect_sizes <- effect_sizes[order(abs(log(effect_sizes$OddsRatio))), ]
> print(effect_sizes)
OddsRatio CI_Lower CI_Upper
Variable
BMI_scaled
-
BMI_scaled
Age_scaled
-
Age_scaled
(Intercept)
Sex1
-
(Intercept)
Sex1
FamilyHistory- FamilyHistory1
> # Objective 3: Model fit assessment
> # Calculate various fit statistics
> hoslem_test <- hoslem.test(as.numeric(data$Diabetes) - 1, fitted(model))
> pseudo_r2 <- 1 - model$deviance/model$null.deviance
> pseudo_r2
[1]-
> # ROC curve and AUC
> roc_obj <- roc(data$Diabetes, fitted(model))
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> auc_value <- auc(roc_obj)
> auc_value
Area under the curve: 0.8127
> # Print model fit statistics
> cat("\nModel Fit Statistics:\n")
Model Fit Statistics:
> cat("Hosmer-Lemeshow Test p-value:", hoslem_test$p.value, "\n")
Hosmer-Lemeshow Test p-value:-
> cat("Pseudo R-squared:", pseudo_r2, "\n")
Pseudo R-squared:-
> cat("AUC:", auc_value, "\n")
AUC:-
> # Optional: Visualization
> # Plot ROC curve
> plot(roc_obj, main="ROC Curve")
> # Plot predicted probabilities by actual outcome
> predicted_probs <- predict(model, type = "response")
> boxplot(predicted_probs ~ data$Diabetes,
+
main="Predicted Probabilities by Actual Outcome",
+
ylab="Predicted Probability", xlab="Actual Outcome")
> # Additional: Checking multicollinearity
> vif_values <- vif(model)
> print(vif_values)
BMI_scaled FamilyHistory-
-
Sex-
Age_scaled-
model_linearity <- glm(Diabetes ~ BMI_scaled + I(BMI_scaled^2) +
+
Age_scaled + I(Age_scaled^2) +
+
FamilyHistory + Sex,
+
data = data, family = binomial)
> summary(model_linearity)
Call:
glm(formula = Diabetes ~ BMI_scaled + I(BMI_scaled^2) + Age_scaled +
I(Age_scaled^2) + FamilyHistory + Sex, family = binomial,
data = data)
Deviance Residuals:
Min
1Q Median
3Q
Max
-2.1934 -
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)
BMI_scaled
-0.40083
0.27919 -
0.06591
-
I(BMI_scaled^2) -0.01005
Age_scaled
-0.09547
0.04384 - -
I(Age_scaled^2) -0.09030
0.15811 -
FamilyHistory-
- <2e-16 ***
Sex1
-0.70452
0.30214 - *
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 410.40 on 298 degrees of freedom
Residual deviance: 306.57 on 292 degrees of freedom
AIC: 320.57
Number of Fisher Scoring iterations: 4