Rizki Mayandi Hasibuan

Paper 6

Project Introduction

Body fat prediction is the process of estimating the percentage of body fat in an individual. It is an important part of health and fitness assessment, because excess body fat can lead to health problems such as heart disease, diabetes, and hypertension. Several methods can be used to predict body fat percentage, including anthropometric measurements such as skinfold thickness, body mass index (BMI), and waist circumference. Other methods include bioelectrical impedance analysis, dual-energy X-ray absorptiometry (DXA), and air displacement plethysmography.

In recent years, machine learning techniques have also been used to predict body fat percentage. These methods train a model on a data set of individuals with known body fat percentages and then use the model to predict the body fat percentage of new individuals from their anthropometric measurements and other characteristics such as age, sex, and physical activity level.

However, while body fat prediction is useful in assessing health and fitness, it is only one aspect of overall health and should not be used as the sole indicator of health. A balanced diet, regular exercise, and a healthy lifestyle are all important factors in maintaining overall health and well-being.

In this project, we want to identify the factors that affect body density. Our data set includes the features of 252 individuals. We investigate the impact of wrist circumference, height, abdomen 2 circumference, and hip circumference on body density using a multiple regression model. The data were generously supplied by Dr. A. Garth Fisher, who gave permission to freely distribute the data and use them for non-commercial purposes, and were obtained from the website www.kaggle.com.

Research question: Which factors have a significant effect on body fat?

There are several variables in this data set.
We consider just four of them when forming hypotheses. We first describe these variables.

Wrist: Wrist circumference is the measurement of the distance around the wrist. This measurement is often used in health assessments as an indicator of overall body size and health.

Height: Height can affect body fat percentage. Taller individuals generally have a larger frame, with more muscle and bone mass contributing to their body weight, so body fat percentage can differ between taller and shorter individuals of the same weight.

Abdomen: The abdomen can have a significant effect on body fat percentage. Excess abdominal fat, particularly visceral fat, can be especially harmful to health and is associated with an increased risk of cardiovascular disease, type 2 diabetes, and other health problems.

Hip: The effect of hip circumference on body fat percentage is related to the distribution of body fat. Some individuals tend to accumulate fat around their hips and thighs, which is known as gynoid or pear-shaped fat distribution. Other individuals accumulate fat more around their abdomen, which is known as android or apple-shaped fat distribution.

Hypotheses:
1. Wrist has a significant effect on body fat.
2. Height has a significant effect on body fat.
3. Abdomen has a significant effect on body fat.
4. Hip has a significant effect on body fat.

Because body fat percentage in this data set is derived from body density, the hypotheses are tested with a regression of Density on the four variables.

# Load the data and fit the multiple regression model
setwd("C:/Users/Sepehr/Desktop")
df <- read.csv("bodyfat.csv", header = TRUE)
model <- lm(Density ~ Wrist + Height + Abdomen + Hip, data = df)
summary(model)
##
## Call:
## lm(formula = Density ~ Wrist + Height + Abdomen + Hip, data = df)
##
## Residuals:
##  Min  1Q  Median  3Q  Max
##    -   -       -   -    -
##
## Coefficients:
##              Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)         -           -        -   < 2e-16 ***
## Wrist               -           -        -     -e-05 ***
## Height              -           -        -         - *
## Abdomen             -           -  -15.937   < 2e-16 ***
## Hip                 -           -        -         - ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01032 on 247 degrees of freedom
## Multiple R-squared: 0.7108, Adjusted R-squared: 0.7062
## F-statistic: 151.8 on 4 and 247 DF, p-value: < 2.2e-16

Report

The overall p-value of the regression model (from the F-test) is less than 0.05. Therefore, we can state that the regression is significant. The p-value of each variable is shown in the table of coefficients. All of these p-values are less than 0.05, and the absolute values of all the t statistics are greater than 1.96 (the t value for Abdomen, for example, is -15.937). Therefore, we can conclude that all four variables have a significant effect on body density.

Adjusted R-squared

Adjusted R-squared is the proportion of variance explained by the model after taking into account both the number of predictor variables and the number of observations used in the regression analysis. It increases only when adding a variable improves the model more than would be expected by chance alone. In our model, the adjusted R-squared is 0.7062. This means that the independent variables explain about 70.6% of the variance of the dependent variable, body density. The remainder is attributable to factors not included in the model and to random error.

Regression assumptions

1. Linearity: The relationship between the dependent and independent variables should be linear, which means that the change in the dependent variable should be proportional to the change in each independent variable.
2. No autocorrelation: The residuals should be independent of each other, which means that there should be no correlation between the residuals of different observations. Autocorrelation typically occurs in time series or spatial data.
3. No multicollinearity: There should be no perfect or high correlation among the independent variables. Such correlation causes problems when interpreting the coefficients of the regression model.
4. Homoscedasticity: The variance of the errors (or residuals) should be constant across all levels of the independent variables. In other words, the spread of the residuals should be the same for all values of the independent variables.
5. Normality: The residuals should be normally distributed. This means that the frequency distribution of the residuals should follow a normal (bell-shaped) curve.

Now we check all five assumptions to establish that our model is valid.

1. Linearity

res <- residuals(model)
plot(df$Density, res, ylab = "Residuals", xlab = "density of body")
abline(0, 0)

[Figure: residuals plotted against density of body, with a horizontal reference line at zero]

Here linearity seems to hold reasonably well, as the residuals scatter around the horizontal zero line.

2. No autocorrelation

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
dwtest(model)
##
##  Durbin-Watson test
##
## data:  model
## DW = 1.6786, p-value = -
## alternative hypothesis: true autocorrelation is greater than 0

To check for autocorrelation, we used the Durbin-Watson test. Because the value of DW is between 1.5 and 2.5, we can state that there is little autocorrelation between the residuals.

3. No multicollinearity

library(car)
## Loading required package: carData
vif(model)
##   Wrist  Height  Abdomen  Hip
##   -

We can use the variance inflation factor (VIF). A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Alternatively, a correlation table of the independent variables can serve the same purpose. The VIFs for the independent variables are all less than 4. Therefore, there is no multicollinearity.
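The VIF check can also be reproduced by hand, which makes clear what the statistic measures. The following is a minimal sketch, assuming synthetic stand-in data (the values are simulated and are not the body fat data; only the column names mirror the model). For each predictor j it computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the remaining ones, and compares the result with the diagonal of the inverse correlation matrix, which gives the same quantity.

```r
# Synthetic stand-in data (NOT the bodyfat.csv data); names mirror the model.
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Hip     <- 60 + 0.45 * Abdomen + rnorm(n, sd = 4)  # deliberately correlated with Abdomen
Wrist   <- rnorm(n, mean = 18, sd = 1)
Height  <- rnorm(n, mean = 70, sd = 3)
predictors <- data.frame(Wrist, Height, Abdomen, Hip)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on all the others.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_from_cor <- diag(solve(cor(predictors)))

print(round(vif_by_hand, 3))
print(round(vif_from_cor, 3))
```

On the real data, replacing `predictors` with `df[, c("Wrist", "Height", "Abdomen", "Hip")]` should give the same values as `vif(model)`, since the VIF depends only on the predictors and not on the response.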
4. Homoscedasticity

library(lmtest)
bptest(model)
##
##  studentized Breusch-Pagan test
##
## data:  model
## BP = 2.863, df = 4, p-value = 0.581

Non-constant variance in the error terms is called heteroscedasticity. It generally arises in the presence of outliers or extreme leverage values. Such values get too much weight and thereby disproportionately influence the model's performance. When this occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.

The Breusch-Pagan test checks for heteroscedasticity. Because the p-value of the test (0.581) is greater than 0.05, we cannot reject the hypothesis of constant variance, and we conclude that there is no evidence of heteroscedasticity in the model.

5. Normality

library(tseries)
## Warning: package 'tseries' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
jarque.bera.test(res)
##
##  Jarque Bera Test
##
## data:  res
## X-squared = 4.7211, df = 2, p-value = 0.09437

Normal distribution of error terms: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow, which makes inference about the least-squares coefficient estimates unreliable. A non-normal distribution also suggests that there are a few unusual data points which must be studied closely to build a better model.

The Jarque-Bera test is a statistical test for the normality of the residuals. Because the p-value (0.09437) is greater than 0.05, we cannot reject normality, and we conclude that the residuals are consistent with a normal distribution.

Obtained results based on hypotheses:
1. Because the p-value for "wrist" is less than 0.05 and the absolute value of its t-statistic is greater than 1.96, we can conclude that this factor has a significant effect on body fat.
2. Because the p-value for "height" is less than 0.05 and the absolute value of its t-statistic is greater than 1.96, we can conclude that this factor has a significant effect on body fat.
3. Because the p-value for "abdomen" is less than 0.05 and the absolute value of its t-statistic is greater than 1.96, we can conclude that this factor has a significant effect on body fat.
4. Because the p-value for "hip" is less than 0.05 and the absolute value of its t-statistic is greater than 1.96, we can conclude that this factor has a significant effect on body fat.

Conclusion

The multiple regression model shows the relationship between the independent variables and body density. The explanatory power of the independent variables was quite acceptable, with an adjusted R-squared of about 0.71.
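As a consistency check on the reported summary statistics, the adjusted R-squared and the F-statistic can be recomputed from the multiple R-squared alone. The sketch below assumes only the figures printed in the regression output (R-squared of 0.7108, 252 observations, 4 predictors); the variable names are chosen here for illustration.

```r
# Values taken from the printed regression summary.
r2 <- 0.7108   # multiple R-squared
n  <- 252      # number of individuals
p  <- 4        # number of predictors (Wrist, Height, Abdomen, Hip)

# Residual degrees of freedom: n - p - 1 (reported as 247).
df_resid <- n - p - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(df_resid)
print(round(adj_r2, 3))
print(round(f_stat, 1))
```

Both recomputed values agree with the reported 0.7062 and 151.8 up to the rounding of R-squared in the printed output.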
