Paper 6
Project
Introduction
Body fat prediction is a process that involves estimating the percentage of body fat in an
individual. This is an important aspect of health and fitness assessment as excessive body fat can lead to
various health problems such as heart disease, diabetes, and hypertension. There are different methods
that can be used to predict body fat percentage, including anthropometric measurements such as skinfold
thickness, body mass index (BMI), and waist circumference. Other methods include bioelectrical impedance
analysis, dual-energy X-ray absorptiometry (DXA), and air displacement plethysmography.
In recent years, machine learning techniques have also been used to predict body fat percentage. These
methods involve training a machine learning model on a dataset of individuals with known body fat percentages
and using the model to predict the body fat percentage of new individuals based on their anthropometric
measurements and other characteristics such as age, sex, and physical activity level. However, it is important
to note that while body fat prediction can be useful in assessing health and fitness, it is only one aspect of
overall health and should not be used as the sole indicator of health. A balanced diet, regular exercise, and a
healthy lifestyle are all important factors in maintaining overall health and well-being.
In this project, we want to identify factors that affect body density. Our data set contains measurements
for 252 individuals. We investigate the effect of wrist circumference, height, abdomen 2 circumference,
and hip circumference on body density, using a multiple regression model. Body density is used as the
response because percent body fat is commonly derived from it, with lower density corresponding to higher
body fat. The data were generously supplied by Dr. A. Garth Fisher, who gave permission to freely distribute
the data and use them for non-commercial purposes; they are available on the website “www.kaggle.com”.
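The model in this report uses body density as the response, while the hypotheses are stated in terms of body fat. One common way to convert between the two is Siri's equation. The sketch below assumes that this equation applies here; the report does not say which conversion, if any, was used, and the function name and density values are hypothetical.

```r
# Siri's equation: percent body fat from body density (g/cm^3).
# Assumption: this is one common conversion; the report does not state
# which formula links its density values to body fat.
siri_body_fat <- function(density) {
  495 / density - 450
}

# Hypothetical density values, for illustration only.
example_density <- c(1.00, 1.05, 1.10)
body_fat_pct <- siri_body_fat(example_density)
print(round(body_fat_pct, 2))
```

Under this conversion, body fat falls as density rises, so a predictor with a negative coefficient in the density model is positively associated with body fat.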
Research question: Which factors have a significant effect on body fat?
The data set contains several variables. We consider just four of them when forming hypotheses. First, we
describe these variables.
Wrist: Wrist circumference is the measurement of the distance around the wrist. This measurement is often
used in health assessments as an indicator of overall body size and health.
Height: Height is related to body fat percentage through overall body size. Taller individuals generally
have a larger frame and tend to have more muscle mass and bone mass, which contribute to their overall
body weight. As a result, the same body weight or circumference can correspond to a different body fat
percentage in a tall individual than in a short one.
Abdomen: Abdomen circumference is closely related to body fat percentage. Excess fat around the abdomen,
in particular visceral fat, can be especially harmful to health and is associated with an increased risk of
cardiovascular disease, type 2 diabetes, and other health problems.
Hip: Hip circumference is related to body fat percentage through the distribution of body fat. Some individuals tend
to accumulate fat around their hips and thighs, which is known as gynoid or pear-shaped fat distribution.
Other individuals may accumulate fat more around their abdomen, which is known as android or apple-shaped
fat distribution.
Hypotheses:
1. Wrist has a significant effect on body fat.
2. Height has a significant effect on body fat.
3. Abdomen has a significant effect on body fat.
4. Hip has a significant effect on body fat.
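# Set the working directory to the folder that contains bodyfat.csv
setwd("C:/Users/Sepehr/Desktop")
# Read the data set
df <- read.csv("bodyfat.csv", header = TRUE)
# Regress body density on the four selected measurements
model <- lm(Density ~ Wrist + Height + Abdomen + Hip, data = df)
summary(model)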
##
## Call:
## lm(formula = Density ~ Wrist + Height + Abdomen + Hip, data = df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  [...]  [...]  [...]  [...]  [...]
##
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    [...]      [...]   [...]   < 2e-16 ***
## Wrist          [...]      [...]   [...] [...]e-05 ***
## Height         [...]      [...]   [...]     [...] *
## Abdomen        [...]      [...] -15.937   < 2e-16 ***
## Hip            [...]      [...]   [...]     [...] ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01032 on 247 degrees of freedom
## Multiple R-squared: 0.7108, Adjusted R-squared: 0.7062
## F-statistic: 151.8 on 4 and 247 DF, p-value: < 2.2e-16
Report
The p-value of the overall F-test is less than 0.05, so the regression as a whole is significant. The p-value
of each coefficient is shown in the table of coefficients. All of these p-values are less than 0.05, and all
the t values are greater than 1.96 in absolute value. We can therefore conclude that all four variables have
a significant effect on body density.
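The 1.96 threshold used above is the large-sample (normal) approximation. As a rough check, the exact two-sided 5% critical value for the model's 247 residual degrees of freedom can be computed with base R. The sketch uses only numbers that appear in the summary output; the variable names are illustrative.

```r
# Residual degrees of freedom, taken from the summary output
df_resid <- 247

# Exact two-sided 5% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# t value reported for Abdomen, and its two-sided p-value
t_abdomen <- -15.937
p_abdomen <- 2 * pt(-abs(t_abdomen), df = df_resid)

print(t_crit)                     # slightly above 1.96
print(abs(t_abdomen) > t_crit)    # significance is judged on |t|
print(p_abdomen < 2e-16)          # consistent with "< 2e-16" in the table
```

Because the comparison is made on the absolute value of t, a large negative t value such as the one for Abdomen is just as significant as a large positive one.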
Adjusted R-squared
Adjusted R-squared is the proportion of variance explained by the model, adjusted for the number of
predictor variables relative to the number of observations used in the regression analysis. It increases
only when adding a variable improves the model more than would be expected by chance alone. In our model,
the adjusted R-squared is equal to 0.7062. This means that the independent variables explain about 71% of
the variance of the dependent variable, body density. The remainder is due to other factors that are not
in the model and to random error.
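As a quick consistency check, the adjusted R-squared and the F-statistic can be recomputed from the values printed in the summary output (R-squared 0.7108, 252 observations, 4 predictors). This is a sketch based on the rounded R-squared, so agreement is only expected to about three decimal places.

```r
n  <- 252      # number of individuals
p  <- 4        # number of predictors
r2 <- 0.7108   # multiple R-squared as printed (rounded)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

Both values agree with the reported 0.7062 and 151.8 up to rounding of the printed R-squared.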
Regression assumptions
1. Linearity: The relationship between the dependent and independent variables should be linear, which
means that the change in the dependent variable should be proportional to the change in the independent
variable.
2. No autocorrelation: The residuals should be independent of each other, which means that there should
be no correlation between the residuals of different observations. Autocorrelation typically occurs in
time series or spatial data.
3. No multicollinearity: There should be no perfect or high correlation among the independent variables.
Such correlation causes problems when interpreting the coefficients of the regression model.
4. Homoscedasticity: The variance of the errors (or residuals) should be constant across all levels of the
independent variables. In other words, the spread of the residuals should be the same for all values of the
independent variables.
5. Normality: The residuals should be normally distributed. This means that the frequency distribution of the
residuals should follow a normal (bell-shaped) curve.
Now we check all five assumptions to establish that our model is valid.
1. Linearity
res <- residuals(model)
plot(df$Density, res,
     ylab = "Residuals", xlab = "density of body")
abline(0, 0)  # horizontal reference line at zero
[Figure: residuals (y-axis, about −0.02 to 0.04) plotted against density of body (x-axis, 1.00 to 1.10)]
Here we see that linearity seems to hold reasonably well, as the smoothed trend of the residuals (the red line) is close to the horizontal zero line (the dashed line).
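A common variant of this check plots the residuals against the fitted values instead of the observed response. In ordinary least squares the residuals are uncorrelated with the fitted values by construction, but they are always positively correlated with the observed response. The sketch below illustrates this on simulated data; the data, variable names, and the helper `plot_resid_vs_fitted` are hypothetical and are not part of the report's analysis.

```r
# Simulated data, for illustration only (not the body fat data)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = sim)

res <- residuals(fit)
fitted_vals <- fitted(fit)
r2 <- summary(fit)$r.squared

cor_fitted   <- cor(fitted_vals, res)  # zero by construction in OLS
cor_response <- cor(sim$y, res)        # equals sqrt(1 - R^2)
print(c(fitted = cor_fitted, response = cor_response,
        sqrt_1_minus_r2 = sqrt(1 - r2)))

# Hypothetical helper: residuals vs fitted values with a dashed zero
# line and a red lowess smoother
plot_resid_vs_fitted <- function(fit) {
  f <- fitted(fit)
  e <- residuals(fit)
  plot(f, e, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  lines(lowess(f, e), col = "red")
}
```

Calling `plot_resid_vs_fitted(model)` on the fitted model would draw the plot. A red line that stays close to the dashed line suggests no systematic departure from linearity.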
2. No autocorrelation
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
dwtest(model)
##
##  Durbin-Watson test
##
## data: model
## DW = 1.6786, p-value = [...]
## alternative hypothesis: true autocorrelation is greater than 0
To check for autocorrelation, we used the Durbin-Watson test. Because the value of DW (1.6786) is between
1.5 and 2.5, a common rule of thumb, we can state that there is little autocorrelation in the residuals.
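The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared successive differences of the residuals divided by their sum of squares, and it is approximately 2(1 - r1), where r1 is the lag-1 autocorrelation of the residuals. The sketch below shows this on simulated data with independent errors; the data and variable names are hypothetical.

```r
# Simulated data with independent errors, for illustration only
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)
e <- residuals(fit)

# Durbin-Watson statistic: values near 2 indicate little autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals and the usual approximation
r1 <- sum(e[-1] * e[-n]) / sum(e^2)
dw_approx <- 2 * (1 - r1)

print(c(DW = dw, approx = dw_approx))
```

The statistic always lies between 0 and 4. Values below 2 point to positive autocorrelation and values above 2 to negative autocorrelation.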
3. No multicollinearity
library(car)
## Loading required package: carData
vif(model)
##   Wrist  Height Abdomen     Hip
##   [...]   [...]   [...]   [...]
To check for multicollinearity we can use the variance inflation factor (VIF). As a rule of thumb, a VIF
value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. A
correlation table of the independent variables can serve the same purpose. As we see, the VIFs for the
independent variables are less than 4. Therefore, there is no evidence of multicollinearity.
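For a model with only numeric predictors, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on simulated data in which two predictors are deliberately correlated; the data and names are hypothetical.

```r
# Simulated predictors, for illustration only: x2 is built from x1,
# x3 is independent of both
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  aux <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

Here the correlated pair receives large VIFs while the independent predictor stays close to 1, which is the pattern the rule of thumb is meant to detect.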
4. Homoscedasticity
library(lmtest)
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 2.863, df = 4, p-value = 0.581
Non-constant variance of the error terms is called heteroskedasticity. It often arises in the presence of
outliers or extreme leverage values, which receive too much weight and thereby disproportionately influence
the model's performance. When this occurs, the confidence interval for out-of-sample prediction tends to be
unrealistically wide or narrow. The Breusch-Pagan test checks for heteroskedasticity. Because the p-value of
the test (0.581) is more than 0.05, we cannot reject the null hypothesis of constant variance, so there is
no evidence of heteroskedasticity in the model.
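The studentized Breusch-Pagan statistic can be sketched by hand: regress the squared residuals on the predictors, multiply the R² of that auxiliary regression by the number of observations, and compare the result with a chi-squared distribution whose degrees of freedom equal the number of predictors. The code below does this on simulated data and also recomputes the p-value from the reported statistic. The helper `bp_studentized` and the simulated data are hypothetical.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan statistic
bp_studentized <- function(fit) {
  e2 <- residuals(fit)^2
  Z <- model.matrix(fit)[, -1, drop = FALSE]  # predictors, no intercept
  aux <- lm(e2 ~ Z)                           # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared
  k <- ncol(Z)
  list(statistic = stat, df = k,
       p.value = pchisq(stat, df = k, lower.tail = FALSE))
}

# Simulated data, for illustration only
set.seed(4)
n <- 500
x <- runif(n, 1, 5)
y_const <- 1 + 2 * x + rnorm(n, sd = 1)  # constant error variance
y_het   <- 1 + 2 * x + rnorm(n, sd = x)  # error spread grows with x

bp_const <- bp_studentized(lm(y_const ~ x))
bp_het   <- bp_studentized(lm(y_het ~ x))
print(c(constant = bp_const$p.value, heteroskedastic = bp_het$p.value))

# p-value implied by the reported statistic (BP = 2.863, df = 4)
p_reported <- pchisq(2.863, df = 4, lower.tail = FALSE)
print(round(p_reported, 3))
```

The last line gives the same p-value as the test output above, 0.581.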
5. Normality
library(tseries)
## Warning: package 'tseries' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
jarque.bera.test(res)
##
## Jarque Bera Test
##
## data: res
## X-squared = 4.7211, df = 2, p-value = 0.09437
Normal distribution of error terms: If the error terms are not normally distributed, confidence intervals may
become too wide or too narrow. When the intervals are unreliable, inference about the least-squares
coefficient estimates becomes difficult. A non-normal distribution also suggests that there are a few unusual
data points which must be studied closely to make a better model. The Jarque-Bera test is a statistical
test for checking the normality of residuals. Because the p-value (0.09437) is more than 0.05, we cannot
reject the null hypothesis of normality, so the residuals are consistent with a normal distribution.
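The Jarque-Bera statistic combines the sample skewness S and kurtosis K as n/6 · (S² + (K − 3)²/4) and is compared with a chi-squared distribution with 2 degrees of freedom. The sketch below computes it by hand on simulated samples and recomputes the p-value from the reported statistic. The helper `jb_manual` and the simulated samples are hypothetical.

```r
# Hypothetical helper: Jarque-Bera statistic from sample moments
jb_manual <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m4 <- mean(d^4)
  skew <- m3 / m2^1.5   # 0 for a normal distribution
  kurt <- m4 / m2^2     # 3 for a normal distribution
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  list(statistic = stat,
       p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Simulated samples, for illustration only
set.seed(5)
jb_normal <- jb_manual(rnorm(300))  # drawn from a normal distribution
jb_skewed <- jb_manual(rexp(300))   # strongly right-skewed
print(c(normal = jb_normal$statistic, skewed = jb_skewed$statistic))

# p-value implied by the reported statistic (X-squared = 4.7211, df = 2)
p_reported <- pchisq(4.7211, df = 2, lower.tail = FALSE)
print(round(p_reported, 5))
```

The last line gives the same p-value as the test output above, 0.09437.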
Obtained results based on hypotheses:
1. Because the p-value for “wrist” is less than 0.05 and the absolute value of its t-statistic is more than
1.96, we can conclude that this factor has a significant effect on body fat.
2. Because the p-value for “height” is less than 0.05 and the absolute value of its t-statistic is more than
1.96, we can conclude that this factor has a significant effect on body fat.
3. Because the p-value for “abdomen” is less than 0.05 and the absolute value of its t-statistic is more than
1.96, we can conclude that this factor has a significant effect on body fat.
4. Because the p-value for “hip” is less than 0.05 and the absolute value of its t-statistic is more than
1.96, we can conclude that this factor has a significant effect on body fat.
Conclusion
The multiple regression model shows the relationship between the independent variables and body density.
The explanatory power of the independent variables was acceptable, with an adjusted R-squared of 0.7062.