"R" Linear Regression Data Visualization (converted to Word)
STA-3024-01Z Final Project
Kaleena Ann LaBerge
Install the packages and load the library data from the initial code chunk above. Take the install.packages lines out of comment mode by removing the “#”s, then click the green play button at the top right of the code chunk to install the packages and load the library data. Finally, place the “#”s back in front of the install.packages lines (so the packages won’t re-install when any following code chunks are run).
Introduction
Scope: The purpose of this report is to show significant data learned in this course, focusing on three primary points:
· Simple Linear Regression
· Multi-linear Regression
· ANOVA
Getting Started:
In R Markdown (RStudio), library(faraway) will be used as the primary data source.
dvisits {faraway}
Doctor visits in Australia
The data comes from the Australian Health Survey of 1977-78 and consists of 5190 single adults where young and old have been oversampled (Cameron, Trivedi, Milne, & Piggot, 1988).
A data frame with 5190 observations on 19 different variables.
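The setup described above can be sketched as a minimal code chunk (the install line stays commented out after the first run; this assumes the faraway package described in the instructions):

```r
# install.packages("faraway")   # uncomment for the first run, then re-comment
library(faraway)                # provides the dvisits data frame
data(dvisits)
dim(dvisits)                    # 5190 rows, 19 variables
```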
Simple Linear Regression
Simple Linear Regression is a basic statistical technique that allows us to model the linear relationship between a single predictor variable (x) and continuous response variable (y) (Çetinkaya-Rundel & Hardin, 2023).
1. Let’s examine the first six and last six rows of the data.
Using “head(dvisits)” in R Markdown gives the first 6 rows of relevant data, and ‘tail(dvisits)’ gives us the last 6 rows.
head(dvisits)
## sex age agesq income levyplus freepoor freerepa illness actdays hscore
## 1 - 0.55 1 0 0 1 4 1
## 2 - 0.45 1 0 0 1 2 1
## 3 - 0.90 0 0 0 3 0 0
## 4 - 0.15 0 0 0 1 0 0
## 5 - 0.45 0 0 0 2 5 1
## 6 - 0.35 0 0 0 5 1 9
## chcond1 chcond2 doctorco nondocco hospadmi hospdays medicine prescrib
## 1 0 0 1 0 0 0 1 1
## 2 0 0 1 0 0 0 2 1
## 3 0 0 1 0 1 4 2 1
## 4 0 0 1 0 0 0 0 0
## 5 1 0 1 0 0 0 3 1
## 6 1 0 1 0 0 0 1 1
## nonpresc
## 1 0
## 2 1
## 3 1
## 4 0
## 5 2
## 6 0
tail(dvisits)
## sex age agesq income levyplus freepoor freerepa illness actdays hscore
## 5185 - 1.50 0 0 0 0 0 0
## 5186 - 0.55 0 0 0 0 0 0
## 5187 - 1.30 0 0 0 0 0 1
## 5188 - 0.25 0 0 1 1 0 1
## 5189 - 0.65 0 0 0 0 0 0
## 5190 - 0.25 0 0 1 0 0 0
## chcond1 chcond2 doctorco nondocco hospadmi hospdays medicine prescrib
## 5185 0 0 0 0 0 0 0 0
## 5186 0 0 0 0 0 0 0 0
## 5187 0 0 0 0 0 0 3 0
## 5188 0 0 0 0 0 0 0 0
## 5189 0 0 0 0 0 0 0 0
## 5190 0 0 0 0 0 0 0 0
## nonpresc
## 5185 0
## 5186 0
## 5187 3
## 5188 0
## 5189 0
## 5190 0
The expression 'head(dvisits)' shows us the values of each variable for the first six rows.
Let’s look at an issue that impacts many areas of the world: income versus doctor visits. Is it true that lower income means fewer doctor visits, and higher income means more doctor visits?
2. Let’s perform Simple Linear Regression to model doctor visits as a function of income in Australia, 1977-78. Income is going to be the predictor.
lmod <- lm(doctorco ~ income, data = dvisits)
summary(lmod)
##
## Call:
## lm(formula = doctorco ~ income, data = dvisits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3987 -0.3405 -0.2906 -0.1826 8.6429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.39868 0.02067 19.292 < 2e-16 ***
## income      -0.16624    0.02995  -5.551 2.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7959 on 5188 degrees of freedom
## Multiple R-squared:  0.005904, Adjusted R-squared:  0.005712
## F-statistic: 30.81 on 1 and 5188 DF, p-value: 2.981e-08
Running the above code chunk gives us pertinent values. Let’s take a look at what this means:
lmod <- lm(doctorco ~ income, data = dvisits)
This says the linear model predicts ‘doctorco’ from ‘income’. (Search ‘dvisits’ in the Help tab at the lower right of the IDE: ‘doctorco’ is the number of consultations with a doctor within the past two weeks, recorded in 1977-78, and ‘income’ is a rescaled AUS dollar amount.)
data = dvisits just tells lm() to look up our chosen variables (doctorco and income) in the dataset ‘dvisits’.
summary(lmod)
This instructs R to show a summary of the linear model generated by lm().
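The numbers printed by summary() can also be pulled out programmatically. A small sketch on simulated data (the names x, y, and fit are illustrative, not part of dvisits):

```r
set.seed(2)
x <- 1:100
y <- 5 + 0.8 * x + rnorm(100, sd = 3)   # true intercept 5, true slope 0.8
fit <- lm(y ~ x)
coef(fit)                               # estimated intercept and slope
coef(summary(fit))[, "Pr(>|t|)"]        # the p-values from the summary table
confint(fit)                            # 95% confidence intervals
```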
The great thing about using R is that we can take this information and make it more understandable with visual aid. This is how to visualize the regression line on a scatterplot:
3. Run this code chunk to show the relevant scatterplot. This is not in the desired linear format…
plot(dvisits$income, dvisits$doctorco,
xlab = "Income", ylab = "Doctor Visits", main = "Income vs. Doctor Visits")
abline(lmod, col = "blue")
Let’s make sense of this:
plot(dvisits$income, dvisits$doctorco, xlab = "Income", ylab = "Doctor Visits", main = "Income vs. Doctor Visits")
Here we are plotting: dvisits$income refers to the income values and dvisits$doctorco refers to the Doctor Visits. ‘xlab’ and ‘ylab’ label your standard x and y axes, so you can tell the plot which title goes with which data (horizontal or vertical).
abline(lmod, col = "blue") actually adds the regression line to the plot. lmod is our linear model, which was previously created using that lm() function.
col = "blue" makes the regression line, in this case, blue.
4. The purpose of this exercise is to see a clean example of a Simple Linear Regression model. In this code chunk, income runs from 1000 to 5000 in steps of 50. A vector of random values, one per income point, is drawn from a normal distribution (with a mean of 0 and a Standard Deviation of 200) and added, which gives this visually nice representation of Simple Linear Regression.
set.seed(123)
income <- seq(from = 1000, to = 5000, by = 50)
doctorco <- 2.5 * income + rnorm(length(income), mean = 0, sd = 200)
lmod <- lm(doctorco ~ income)
plot(income, doctorco, xlab = "Income", ylab = "Doctor Visits", main = "Income Vs. Doctor Visits in Melbourne, AUS 1977-78", pch = 20, col = "purple")
abline(lmod, col = "blue")
Let’s examine what is done in the code block above to produce this output.
set.seed(123)
This line gives a ‘seed’ to the random number generator. With this fixed value, you will get the same “randomly” generated output every time you run the code, so the results are reproducible.
income <- seq(from = 1000, to = 5000, by = 50)
This line generates the income values from 1000 to 5000 in steps of 50, giving the points to plot about the regression line.
doctorco <- 2.5 * income + rnorm(length(income), mean = 0, sd = 200)
This defines a simple linear relationship between doctorco and income (multiplying income by 2.5 fixes the slope of the underlying line).
‘rnorm’ generates the random variability in the ‘doctorco ~ income’ relationship.
‘length(income)’ determines the number of random values to be generated, making sure that it matches the length of income (this keeps the two vectors aligned for plotting!).
mean = 0, sd = 200: Setting the mean to zero introduces no bias, a common assumption when simulating data for statistical models. A Standard Deviation of 200 gives a moderate amount of scatter relative to the chosen slope and range, which I think gives a fair view of a good Simple Linear relationship between variables.
lmod <- lm(doctorco ~ income)
This is the most important line for linear regression analysis. In the most general terms, it creates a linear model from our data, and assigns the results of that model to the variable called lmod. The basic layout of a linear model is:
lm(y ~ x, data=specific data frame)
The y and x are vectors of the same length. Usually they can be viewed as two columns of values in a rectangular array, like a spreadsheet.
The y variable is usually called the response variable, and the x variable is many times called the predictor variable.
The fitted model is then visualized with the plot(x, y) command given by
plot(income, doctorco, xlab = "Income", ylab = "Doctor Visits", main = "Income Vs. Doctor Visits in Melbourne, AUS 1977-78", pch = 20, col = "purple")
Which plots income against doctor visits, labels the x-axis, y-axis and provides a title to the plot.
pch = 20 means the plotted points are filled-in circles. col = "purple" makes those circles purple, in this case.
Finally, we have
abline(lmod,col='blue')
This draws the regression line stored in the variable lmod, which contains the intercept a = b0 and the slope b = b1 of the line y = b0 + b1x. The argument col='blue' makes the line blue.
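The line that abline() draws can be checked by hand: the least-squares slope is Sxy/Sxx and the intercept is the mean of y minus the slope times the mean of x. A sketch reusing the simulated setup above (fit, b0, and b1 are illustrative names):

```r
set.seed(123)
income <- seq(from = 1000, to = 5000, by = 50)
doctorco <- 2.5 * income + rnorm(length(income), mean = 0, sd = 200)
b1 <- cov(income, doctorco) / var(income)   # slope = Sxy / Sxx
b0 <- mean(doctorco) - b1 * mean(income)    # intercept = ybar - b1 * xbar
fit <- lm(doctorco ~ income)
all.equal(unname(coef(fit)), c(b0, b1))     # TRUE: lm() uses the same formulas
```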
------------------------
Multiple Regression / Multi-Linear Regression
Multi-Linear Regression is a basic statistical technique that allows us to model the linear relationship between two or more predictor variables (x) and a continuous response variable (y). Now that we have built on the ideas of one predictor, a multiple linear regression model can be fitted with two or more predictor variables. More complicated relationships between the predictors and the response can then be uncovered (Pruim, 2010).
Multiple linear regression analysis allows us to model the linear relationship between the response and more than one predictor variable. The predictor variables can be quantitative, binary (having just two values, often encoded as 0 and 1) and even categorical. It is easier to stick with quantitative predictor variables first, which is what we will do for this assignment.
The general linear equation for multiple linear regression, for which simple linear regression is a special case is,
Y = b0 + b1X1 + b2X2 + … + bnXn + ε
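A quick sketch of this equation in action, fitting two simulated predictors with known coefficients (all names here are illustrative, not from dvisits):

```r
set.seed(4)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)   # b0 = 1, b1 = 2, b2 = -3, plus noise
fit <- lm(y ~ x1 + x2)
coef(fit)                              # estimates should land near 1, 2, -3
```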
Unlike simple linear regression with just one predictor variable, there is no easy way to visualize the relationship with a simple scatterplot. A regression equation with two predictor variables requires a 3-dimensional plot, and a regression equation with 3 predictors would require a 4-dimensional plot. If we refer back to the Q1 code chunk, we see where multiple variables are named. Since we will only examine the numerical data contained in certain columns, we will create a new data frame with only this information. First we will see which columns we need using names(dataframe) and then create a new data frame.
5. Run the code chunk to see the available data.
names(dvisits)
## [1] "sex" "age" "agesq" "income" "levyplus" "freepoor"
## [7] "freerepa" "illness" "actdays" "hscore" "chcond1" "chcond2"
## [13] "doctorco" "nondocco" "hospadmi" "hospdays" "medicine" "prescrib"
## [19] "nonpresc"
After running the code chunk, the output starts off by showing [1] “sex”, “age”, “agesq”, and so on… The bracketed numbers mark the position of the first name on each row, so counting along each row tells us which variable corresponds to which column number.
Let’s list the Variables with Quantitative Data: [2] age, [3] agesq, [4] income, [8] illness, [9] actdays, [10] hscore, [13] doctorco, [14] nondocco, [15] hospadmi, [16] hospdays, [17] medicine, [18] prescrib, [19] nonpresc.
Here are the Variables with Binary/Categorical Data: [1] sex, [5] levyplus, [6] freepoor, [7] freerepa, [11] chcond1, [12] chcond2.
Hence dvisits[c(2,3,4,8,9,10,13,14,15,16,17,18,19)] will give us the age, agesq, income, illness, actdays, hscore, doctorco, nondocco, hospadmi, hospdays, medicine, prescrib, and nonpresc data.
Let’s call the new Dataset “newdvisits”.
6. Run the code chunk, which will show all of the variables’ columns with numerical values.
newdvisits <- dvisits[c(2,3,4,8,9,10,13,14,15,16,17,18,19)]
head(newdvisits)
## age agesq income illness actdays hscore doctorco nondocco hospadmi hospdays
##- 0.55 1 4 1 1 0 0 0
##- 0.45 1 2 1 1 0 0 0
##- 0.90 3 0 0 1 0 1 4
##- 0.15 1 0 0 1 0 0 0
##- 0.45 2 5 1 1 0 0 0
##- 0.35 5 1 9 1 0 0 0
## medicine prescrib nonpresc
## 1 1 1 0
## 2 2 1 1
## 3 2 1 1
## 4 0 0 0
## 5 3 1 2
## 6 1 1 0
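Hard-coding the column indices works, but the numeric columns can also be picked out by type, which is less fragile. A sketch on a toy data frame (not the real dvisits):

```r
df <- data.frame(sex    = c("F", "M", "F"),
                 age    = c(0.55, 0.45, 0.90),
                 income = c(1.3, 0.9, 0.2))
num_df <- df[sapply(df, is.numeric)]   # keep only the numeric columns
names(num_df)                          # "age" "income"
```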
7. Using our new Dataset, besides age and income, let’s take a look at ‘illness’, which counts illnesses and/or injuries in the past two weeks (5 or more are coded as a 5).
plot(income ~ age, data = newdvisits, subset = (illness == 5 & income < 20000 & income > 0), main = "Scatterplot of People Having 5(+) Illnesses", ylab = "Income", xlab = "Age")
lmod <- lm(income ~ age, data = newdvisits, subset = (illness == 5 & income < 20000 & income > 0))
summary(lmod)
##
## Call:
## lm(formula = income ~ age, data = newdvisits, subset = (illness ==
## 5 & income < 20000 & income > 0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60643 -0.14553 -0.08241 0.10201 1.16759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.78684 0.05646 13.936 < 2e-16 ***
## age         -0.63116    0.09765  -6.463 5.92e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2857 on 233 degrees of freedom
## Multiple R-squared: 0.152, Adjusted R-squared: 0.1484
## F-statistic: 41.78 on 1 and 233 DF, p-value: 5.922e-10
The Scatterplot is showing Income against Age for individuals having 5 or more illnesses.
The Income scale shows 0.0 through 1.5 (income appears to be recorded in units of AUD 10,000, so roughly $0 through $15,000 in 1977-78 dollars).
The Age scale shows values 0.2 through 0.7 (age in years divided by 100, so roughly ages 20 through 70).
Linear regression analysis has been done on this subset of the data. A summary of the regression model has been generated.
You should see Residuals:
Min is -0.60643
Max is 1.16759
Other data is also available, such as Coefficients, Std. Error, t-value, and p-value.
We can easily run a diagnostic to check these conditions by looking at a QQ plot of the residuals, via the R commands qqnorm() and qqline(), as well as a standard histogram to check whether the residuals look normally distributed. If a set of data is normally distributed, the points should be more or less randomly distributed above and below the QQ line for theoretical quantiles (for a normal distribution), with few departures (perhaps caused by outliers) and no pattern.
8. Here, we refit the model on the full data, then plot the Histogram and QQ Plot of the residuals.
lmod <- lm(income ~ age, data = newdvisits)
hist(lmod$residuals,xlab='Residuals',main='Histogram of Residuals')
qqnorm(lmod$residuals)
qqline(lmod$residuals)
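A numeric companion to these plots is the Shapiro-Wilk test (note shapiro.test() accepts at most 5000 values, so for the full dvisits residuals a random sample would be needed). A sketch on simulated data where the errors are normal by construction:

```r
set.seed(1)
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5)   # errors are truly normal here
fit <- lm(y ~ x)
shapiro.test(residuals(fit))            # p-value tests normality of the residuals
```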
For this next example, income will be the response variable, and let us designate the predictor variables as follows:
X1=age
X2=illness
X3=actdays
X4=hscore
X5=doctorco
X6=medicine
To fit a linear model with all of these variables as predictors in R is really easy. We just add a + sign between predictor variables. In this case we would have
lm(income ~ age + illness + actdays + hscore + doctorco + medicine, data=newdvisits)
As is customary, we will assign the output of our linear model to a variable, like lmod, so we can use the summary(lmod) function to see the main model results.
9. Run the linear model as indicated and view the model results using the appropriate summary() command such as summary(lmod).
lmod <- lm(income ~ age + illness + actdays + hscore + doctorco + medicine, data=newdvisits)
summary(lmod)
## (the printed coefficient table for this six-predictor model was garbled in the conversion to Word; re-run the chunk to regenerate it)
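Nested models like these can be compared formally with a partial F-test via anova(model1, model2). A sketch on simulated data (small, big, and the x names are illustrative):

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)    # x2 has no real effect on y
small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2)
anova(small, big)              # F-test: does adding x2 significantly help?
```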
-------------------------------------
ANOVA: Analysis of Variance
ANOVA assesses whether the predictors, as a group, significantly explain the variability in the response variable.
10. To perform ANOVA, you can use the anova() function:
lmod <- lm(income ~ age + illness + actdays + hscore + doctorco + medicine, data = newdvisits)
anova(lmod)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## age         1  51.89  51.890 416.690 < 2.2e-16 ***
## illness 1 6.41 6.409 51.4663 8.31e-13 ***
## actdays     1   0.00   0.005   0.040    0.8412    
## hscore      1   1.78   1.778  14.278 0.0001585 ***
## doctorco    1   0.38   0.378   3.036    0.0815 .  
## medicine    1   0.25   0.250   2.008    0.1565    
## Residuals 5183 645.43   0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output of anova(lmod) provides an Analysis of Variance Table, including Df, Sum Sq, Mean Sq, F value, and Pr(>F). Note that anova() on a single model tests the predictors sequentially: each row asks whether that predictor significantly reduces the residual variance, given the predictors listed above it. A small p-value (< 0.05) in a row indicates that predictor has a significant effect on the response variable in that sequential sense.
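The F value in each row is just that row's Mean Sq divided by the residual Mean Sq, which can be verified by hand. A sketch on simulated data (names illustrative):

```r
set.seed(5)
x <- rnorm(80)
y <- 1 + 0.5 * x + rnorm(80)
tab <- anova(lm(y ~ x))                      # one-predictor ANOVA table
f_by_hand <- tab["x", "Mean Sq"] / tab["Residuals", "Mean Sq"]
all.equal(f_by_hand, tab["x", "F value"])    # TRUE
```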
Conditions for an ANOVA analysis
There are three conditions we must check for an ANOVA analysis: all observations must be independent, the data in each group must be nearly normal, and the variance within each group must be approximately equal (Thulin, 2021).
• Independence. If the data are a simple random sample, this condition can be assumed to be satisfied. For processes and experiments, carefully consider whether the observations may fail to be independent (e.g., check that there is no pairing).
• Approximately Normal. As with one- and two-sample testing for means, the normality assumption is especially important when the sample sizes are quite small, which is ironically when it is most difficult to check for non-normality.
• Constant Variance. The last assumption is that the variance in the groups is about equal from one group to the next. This assumption can be checked by examining a side-by-side box plot of the outcomes
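These conditions are easiest to see in the classic one-way setting: several groups compared with aov(). A sketch on simulated groups (equal variance by construction; the boxplot line is the suggested visual check):

```r
set.seed(7)
g <- factor(rep(c("A", "B", "C"), each = 30))
y <- rnorm(90, mean = c(0, 1, 2)[g], sd = 1)   # three groups, common variance
fit <- aov(y ~ g)
summary(fit)         # F-test for any difference among the group means
# boxplot(y ~ g)     # side-by-side boxplots to check constant variance
```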
Diagnostics for an ANOVA Analysis
Independence is always important to an ANOVA analysis. The normality condition is very important when the sample sizes for each group are relatively small. The constant variance condition is especially important when the sample sizes differ between groups (Çetinkaya-Rundel & Hardin, 2023).
11. Let’s have income as the Predictor Variable. The Response Variable will be doctorco.
lmod <- lm(doctorco ~ income, data = dvisits)
anova(lmod)
## Analysis of Variance Table
##
## Response: doctorco
## Df Sum Sq Mean Sq F value Pr(>F)
## income       1   19.52 19.5168  30.809 2.981e-08 ***
## Residuals 5188 3285.97  0.6334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lmod)
##
## Call:
## lm(formula = doctorco ~ income, data = dvisits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3987 -0.3405 -0.2906 -0.1826 8.6429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.39868 0.02067 19.292 < 2e-16 ***
## income      -0.16624    0.02995  -5.551 2.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7959 on 5188 degrees of freedom
## Multiple R-squared:  0.005904, Adjusted R-squared:  0.005712
## F-statistic: 30.81 on 1 and 5188 DF, p-value: 2.981e-08
After performing an ANOVA and determining that there are significant differences among groups, post hoc tests are conducted to identify which specific groups differ significantly from each other.
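One standard post hoc test in R is Tukey's Honest Significant Difference, TukeyHSD(), which runs on an aov() fit. A sketch on simulated groups (names illustrative):

```r
set.seed(8)
g <- factor(rep(c("A", "B", "C"), each = 25))
y <- rnorm(75, mean = c(0, 0, 2)[g])   # only group C is shifted
fit <- aov(y ~ g)
TukeyHSD(fit)    # pairwise differences with family-wise adjusted p-values
```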
12. In this instance, we are testing whether the predictor variable income is significantly related to the response variable doctorco in the model lmod.
lmod <- lm(doctorco ~ income, data = newdvisits)
anova_results <- anova(lmod)
print(anova_results)
## Analysis of Variance Table
##
## Response: doctorco
## Df Sum Sq Mean Sq F value Pr(>F)
## income       1   19.52 19.5168  30.809 2.981e-08 ***
## Residuals 5188 3285.97  0.6334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Min. : 1 Min. : 19.52 Min. : 0.6334 Min. :30.81 Min. :0
## 1st Qu.:1298 1st Qu.: 836.13 1st Qu.: 5.3542 1st Qu.:30.81 1st Qu.:0
## Median :2594 Median :1652.74 Median :10.0751 Median :30.81 Median :0
## Mean :2594 Mean :1652.74 Mean :10.0751 Mean :30.81 Mean :0
## 3rd Qu.:3891 3rd Qu.:2469.36 3rd Qu.:14.7959 3rd Qu.:30.81 3rd Qu.:0
## Max. :5188 Max. :3285.97 Max. :19.5168 Max. :30.81 Max. :0
## NA's :1 NA's :1
Note that summary() applied to an anova table simply summarizes each numeric column of the table (hence the Min./1st Qu./Median rows above); print(anova_results) is the more informative display.
----------------------------------
References
· Cameron A., Trivedi P., Milne F. and Piggot J. (1988). A Microeconometric model of the demand for health care and health insurance in Australia, Review of Economic Studies 55, 85-106.
· Çetinkaya-Rundel, M., & Hardin, J. (2023). Introduction to Modern Statistics (1st ed.). OpenIntro Project. https://oercommons.org/courses/introduction-to-modern-statistics/view
· Pruim, R. (2010). Foundations and Application of Statistics: An Introduction Using R. (2nd edition). American Mathematical Society.
· Slinker, K. (2023). Statistics II for Data Scientists. STA-3024-01Z. Various use of examples out of R Markdown homework; Fall semester 2023.
· Thulin, M. (2021). Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling (1st ed.). Creative Commons CC BY-NC-SA 4.0 license. https://www.modernstatisticswithr.com/index.html