Prediction of Insurance Costs
1
Prediction of Insurance Costs
2
# Loading data
library(readr)
insurance_data <- read_csv("insurance_data.csv")
## -- Column specification
## cols(
## age = col_double(),
## sex = col_character(),
## bmi = col_double(),
## children = col_double(),
## smoker = col_character(),
## region = col_character(),
## charges = col_double()
## )
##Understanding the Data
data <- insurance_data
head(data)
## # A tibble: 6 x 7
##
age sex
bmi children smoker region
##
## 1
19 female 27.9
0 yes
## 2
18 male 33.8
1 no
charges
southwest 16885.
southeast 1726.
3
## 3
28 male 33
3 no
southeast 4449.
## 4
33 male 22.7
0 no
northwest 21984.
## 5
32 male 28.9
0 no
northwest 3867.
## 6
31 female 25.7
0 no
southeast 3757.
str(data)
## spec_tbl_df [1,338 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age
: num [1:1338]- ...
## $ sex
: chr [1:1338] "female" "male" "male" "male" ...
## $ bmi
: num [1:1338]- ...
## $ children: num [1:1338]- ...
## $ smoker : chr [1:1338] "yes" "no" "no" "no" ...
## $ region : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num [1:1338]- ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
4
## .. charges = col_double()
## .. )
# Coverting sex, smoker ad region to data type factor
data$sex <- as.factor(data$sex)
data$smoker <- factor(data$smoker,
levels = c("no","yes"),
labels = c(0,1))
data$region <- as.factor(data$region)
Summary (data)
##
##
##
##
##
##
##
age
sex
Min. :18.00 female:662
1st Qu.:27.00 male :676
Median :39.00
Mean :39.21
3rd Qu.:51.00
Max. :64.00
bmi
Min. :15.96
1st Qu.:26.30
Median :30.40
Mean :30.66
3rd Qu.:34.69
Max. :53.13
##
##
##
##
##
##
##
region
northeast:324
northwest:325
southeast:364
southwest:325
charges
Min. : 1122
1st Qu.: 4740
Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
children
Min. :0.000
1st Qu.:0.000
Median :1.000
Mean :1.095
3rd Qu.:2.000
Max. :5.000
smoker
0:1064
1: 274
The dataset contains 1338 observations of 7 variables. The variable charges is the one we have to
predict using the following predictors: age, sex, bmi, children, smoker and region. The variable
age and bmi are continuous variables, the variables sex, smoker and region are categorical
5
variables. There are no missing variables in the dataset. The five number summary is also
displayed above.
Exploratory data Analysis
# Distribution of age in the dataset
boxplot(data$age, main = "Histogram of age")
Age is distributed normally as depicted by the boxplot. The lowest age is 18, with the highest
being 64.
# Distribution of bmi in the dataset
par(mfrow = c(1,2))
hist(data$bmi)
boxplot(data$bmi)
6
Bmi is approximately distributed normally. Majority of the bmi is between 20 to 40.
# Distribution of children in the dataset
hist(data$children,breaks=20)
7
The histogram for children against frequency is right skewed, showing that majority of the
people have no children.
# Distribution of charges in the dataset
par(mfrow = c(1,2))
hist(data$charges)
boxplot(data$charges)
8
Charges are right skewed in the data with many outliers. The outliers are values above a charge
of 30,000 depicted by the boxplot.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(data, aes(x=smoker, fill=sex)) +
geom_bar(position = "fill" )+
labs(y="Propation")
9
There is a slightly higher proportion of females who do not smoke than males who do not smoke.
There is a higher proportion of male smokers than female smokers. The rates between males and
females are approximately the same.
ggplot(data, aes(x=region, fill=smoker)) +
geom_bar(position = "fill") +
labs(y="Proportion")
10
There is a larger proportion of non-smokers than smokers in all the four regions.
ggplot(data, aes(smoker, charges)) +
geom_boxplot()
11
The median for non-smokers is lower than that of smokers. The non-smokers also have lower
insurance charges as compared to smokers. The smokers have above 30,000 while the nonsmokers have a median of less than 10,000.
ggplot(data, aes(smoker, bmi)) +
geom_boxplot()
The median for both smokers and non-smokers are approximately equal.
ggplot(data, aes(x=children, y=charges, color=smoker)) +
geom_jitter()
12
Majority of the people have children 2 or fewer children. Smokers have higher charges than nonsmokers. On average, people with more children pay higher charges as opposed to people with
no children.
ggplot(data, aes(x=age, y=charges)) +
geom_point() +
geom_smooth(se = F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
13
ggplot(data, aes(x=bmi, y=charges)) +
geom_point() +
geom_smooth(se = F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
14
data_corr <- pairs(data)
In the first plot, we see that there is a trend that with older age the charges increase. There are
also three groups/lines visible. In the second plot we see some sort of trend that with increasing
bmi the charges increase, however this is not very clear.
15
data_corr
## NULL
Regression Analysis
Splitting the data into a train set and test set
library(caTools)
## Warning: package 'caTools' was built under R version 4.0.5
set.seed(200)
sample <- sample.split(data$charges,
16
SplitRatio = 0.75)
train_data <- subset(data, sample == T)
test_data <- subset(data, sample == F)
dim(train_data)
## [1] 1003
7
dim(test_data)
## [1] 335 7
MODEL BUILDING
model_1 <- lm("charges ~ age + sex + bmi + children + region + smoker", data = data)
summary(model_1)
##
## Call:
## lm(formula = "charges ~ age + sex + bmi + children + region + smoker",
##
data = data)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -11304.9 -2848.1 -
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
17
## (Intercept)
-11938.5
## age
256.9
## sex
-131.3
## bmi
- < 2e-16 ***
332.9 -
339.2
## children
987.8 -12.086 < 2e-16 ***
475.5
- < 2e-16 ***- ***
## regionnorthwest -353.0
476.3 -
## regionsoutheast -1035.0
478.7 - *
## regionsouthwest -960.0
477.9 - *
## smoker1
23848.5
- < 2e-16 ***
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
r_sq_1 <- summary(model_1)$r.squared
r_sq_1 #-
## [1]-
predict_1 <- predict(model_1, newdata = test_data)
residuals_1 <- test_data$charges - predict_1
rmse_1 <- sqrt(mean(residuals_1^2))
rmse_1 #-
18
## [1]-
model_2 <- lm("charges ~ age + bmi + children + region + smoker", data = data)
summary(model_2)
##
## Call:
## lm(formula = "charges ~ age + bmi + children + region + smoker",
##
data = data)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -11367.2 -2835.4 -
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-
## age
256.97
- < 2e-16 ***
## bmi
338.66
- < 2e-16 ***
## children
978.76 -12.250 < 2e-16 ***
474.57
- ***
## regionnorthwest -352.18
476.12 -
## regionsoutheast -1034.36
478.54 - *
## regionsouthwest -959.37
477.78 - *
## smoker1
-
- < 2e-16 ***
19
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
r_sq_2 <- summary(model_2)$r.squared
r_sq_2 #-
## [1]-
predict_2 <- predict(model_2, newdata = test_data)
residuals_2 <- test_data$charges - predict_2
rmse_2 <- sqrt(mean(residuals_2^2))
rmse_2 #-
## [1]-
library(car)
## Warning: package 'car' was built under R version 4.0.5
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.3
# vif
vif(model_2)
20
##
GVIF Df GVIF^(1/(2*Df))
## age
-
-
## bmi
-
-
## children-
-
## region-
-
## smoker-
-
# tolerance
1/vif(model_2)
##
GVIF
Df GVIF^(1/(2*Df))
## age
-
-
## bmi
-
-
## children-
-
## region-
-
## smoker-
mean(vif(model_2))
## [1]-
par(mfrow = c(2,2))
plot(model_2)
-
21
Some variables from the model are not significant (sex), while others are significant (age, bmi,
smoker, children and region). Training the model will happen without non-significant variables
to find out if the model gets improved. After the training, the performance of the two models is
similar. Model_1 has an R square of- with a root mean square error of-.
Model_2 has an R square of- with a root mean square error of-.
Model_2 will get used since it is simpler than model_1.
To check the assumptions for regression, muticollinearity gets checked using the vif function, for
which the mean vif for model_2 is-. The smallest possible value of vif is 1; hence, there
is minimal correlation amongst the independent variables leading to a small amount of inflation.