House Price Analysis
1
House Prices Analysis
2
House Prices Analysis
Introduction
When people look for a house, several factors determine its price. These include the location,
nearby businesses, security, social amenities within the area and the type of transportation
available. Such factors are crucial in determining the value of a house and are common to many
locations worldwide. Similar factors apply to house prices in Toronto. Therefore, this report
analyses how the number of licensed businesses, the number of child care spaces, the risk score
for missing three consecutive loan payments, local employment and the number of social
assistance recipients affect house prices in Toronto.
The value of a house is central to any real estate transaction. Therefore, it is important to
predict house prices without bias so that buyers and sellers can make sound decisions
(Wu, 2017). This analysis uses multiple regression since it is a reliable method of finding out
which independent variables have an effect on the dependent variable of interest
(Sarstedt & Mooi, 2014).
Previous literature on regression analysis identifies the following main benefits:
i) It can show whether the independent variables have a significant relationship with the
dependent variable.
ii) It can show the relative strength of each independent variable's effect on the
dependent variable.
iii) It can also assist in making predictions (Skiera et al., 2018).
Methods
The data are secondary data taken from the economics dataset for the year 2011 on the
www.toronto.ca website. The data were first examined for structure, summary statistics, trends
and patterns, and missing values. The data had eight variables, with the house price being the
dependent variable and the number of licensed businesses, number of child care spaces, risk
score for missing three consecutive loan payments (Debt Risk Score), local employment and
number of social assistance recipients being the independent variables.
The analysis used the multiple regression model since its results can be used both to describe
the relationship between variables and to predict the value of one variable from the values of
others. The model estimates the coefficients of the linear equation in the independent
variables that best predicts the value of the dependent variable. Linear regression then fits
the line (with several predictors, a hyperplane) that minimizes the sum of squared differences
between the predicted and the actual output values.
The multiple linear regression model used for the analysis was:
Y = b0 + b1x1 + b2x2 + … + bnxn + e
Where:
Y – dependent variable
b0 – the y-intercept
bi (i = 1, 2, …, n) – regression coefficients
xi (i = 1, 2, …, n) – independent variables
e – error term
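The equation above can be illustrated with a small least-squares fit. The report's own analysis was run in R; the sketch below uses plain Python purely for illustration, and the helper names (`solve_linear_system`, `fit_ols`) and the data are made up. It estimates b0, b1, …, bn by solving the normal equations, which is one standard way to obtain the least-squares coefficients and not necessarily how R computes them internally.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation updates both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimising the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1 carries b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from Y = 1 + 2*x1 + 3*x2 with no error term,
# so the fit should recover those three coefficients.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

With real data the error term e is not zero, so the recovered coefficients are estimates and not exact values.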
The data are split into a training set and a test set. The training set is the larger of the
two, taking approximately 75% of the data, while the test set takes the remaining 25%. Holding
out a test set allows the model to be evaluated on data that were not used to fit it, which
gives a better understanding of how the model behaves. The dependent and independent variables
are then entered into the multiple linear regression model and analysed. After the first round
of analysis, the insignificant independent variables are identified and a second model is
fitted without them in an attempt to find the best model.
The data analysis is conducted in the R programming language because of the quality of its
data visualization.
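A 75/25 split of this kind can be sketched as follows. This is a Python illustration, not the R code used for the report; the function name `train_test_split`, the seed and the use of 140 placeholder rows are assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder row identifiers standing in for the neighbourhood records.
neighbourhoods = list(range(140))
train, test = train_test_split(neighbourhoods)
print(len(train), len(test))
```

Shuffling before cutting matters: if the rows are stored in a meaningful order, taking the first 75% would give a training set that is not representative of the whole.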
Results
Explanatory Data Analysis
The data frame was checked for missing values and none were found; all 140 rows were
complete. Boxplots of the variables were then created to show their outliers. The Licensed
Businesses variable had the most outliers, 12, while the Debt Risk Score had the fewest, with
one. Scatter plots were created with ggplot to visualize the relationships between the
independent variables and the dependent variable. Afterwards, the correlation coefficients
were calculated, with the results showing that the Debt Risk Score had the highest positive
correlation coefficient of - while Social Assistance Recipients had the largest negative
correlation coefficient of -.
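The two checks described here, counting boxplot outliers and computing correlation coefficients, can be sketched in Python as below. The data are invented, the function names are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; R's boxplot uses its own hinge definition, so outlier counts can differ slightly between tools.

```python
import math
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values lying more than `whisker` IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - whisker * spread, q3 + whisker * spread
    return [v for v in values if v < low or v > high]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented values: one neighbourhood with far more businesses than the rest.
businesses = [10, 12, 11, 13, 12, 14, 11, 95]
print(iqr_outliers(businesses))

# Invented values: one predictor rising with price, one falling with it.
prices = [1, 2, 3, 4, 5]
debt = [2, 4, 6, 8, 10]
social = [50, 40, 30, 20, 10]
print(pearson(prices, debt), pearson(prices, social))
```

A coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.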
Multiple Linear Regression
The data were split into a training set (75%) and a test set (25%). The first model,
model1, was built with all the independent variables. The results were as follows:
             Estimate      Std. Error    T value    P value
Intercept    -2.640e+06    7.258e+05                - ***
Businesses   -e+01         6.139e+01     0.169
Child        -1.446e+02    2.490e+02
Debt         -e+03         9.671e+02     4.605      -e-06 ***
Local        -1.330e+00    2.005e+00     -0.663
Social       -4.194e+01    1.825e+01     -2.298     - *
Results from model1 of the regression
The multiple R squared was 0.4358 while the adjusted R squared was 0.4147.
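The adjusted R squared penalizes the multiple R squared for the number of predictors. A minimal Python sketch of the standard formula follows. The sample size n = 140 and the count of 5 predictors are assumptions taken from the figures mentioned elsewhere in this report, not values confirmed by the model output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed inputs: R^2 as reported for model1, n = 140 rows, p = 5 predictors.
print(round(adjusted_r_squared(0.4358, 140, 5), 4))
```

Under those assumptions the formula gives about 0.4147, in line with the reported adjusted R squared. A smaller n, such as a 75% training subset, would give a lower value.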
For the second model, model2, the results were as follows:
             Estimate      Std. Error    T value    P value
Intercept    -2.377e+06    6.435e+05                - ***
Debt         -e+03         8.416e+02     4.847      3.35e-06 ***
Social       -4.835e+01    1.601e+01     -3.020     - **
Results from model2 of the regression
The multiple R squared is 0.4295 while the adjusted R squared is 0.4211.
The ANOVA table for model2 shows that both Debt and Social are significant, with p-values of
2e-16 (***) and 0.00302 (**) respectively.
Checking Linear Regression Assumptions
Assumption of Independence
The Durbin-Watson test statistic is -. Since the statistic is less than 2, the residuals show
positive autocorrelation. A value within the range 1.5-2.5 is generally considered acceptable.
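For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below uses invented residuals, not the residuals of model2, and the function name is hypothetical.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values lie between 0 and 4."""
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return numerator / denominator


# Invented residuals that stay on one side for a while: positive autocorrelation.
print(durbin_watson([1, 1, 1, -1, -1, -1]))   # below 2
# Invented residuals that flip sign every step: negative autocorrelation.
print(durbin_watson([1, -1, 1, -1]))          # above 2
```

Values near 2 indicate little autocorrelation, values towards 0 indicate positive autocorrelation and values towards 4 indicate negative autocorrelation.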
Assumption of No Multicollinearity
For model2, the variance inflation factor (VIF) values are - and - for Debt and Social
respectively, and the mean VIF is -.
The VIF values are below 10, so the correlation between the predictors is at most moderate.
The tolerance is above 0.2, which indicates no multicollinearity problem. Since the mean VIF
is -, the correlation between the predictors is moderate and should introduce minimal bias.
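The variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors, and the tolerance is its reciprocal. With only two predictors, R² is simply the squared correlation between them, so both share the same VIF. The Python sketch below uses invented values for the two predictors, and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2), with R^2 from regressing one predictor on the rest."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must be in [0, 1)")
    return 1 / (1 - r_squared)


# Invented values for two predictors (not the report's data).
debt = [1, 2, 3, 4, 5]
social = [2, 1, 4, 3, 5]

r = pearson(debt, social)          # two-predictor case: R^2 is r squared
vif = vif_from_r_squared(r ** 2)
tolerance = 1 / vif
print(round(r, 3), round(vif, 3), round(tolerance, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others. Note that the two rules of thumb quoted above are not equivalent: a tolerance above 0.2 corresponds to a VIF below 5, which is stricter than a VIF below 10.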
Assumption of Homoscedasticity
Residuals vs Fitted plot
The residuals are scattered randomly around the line across the range of fitted values.
The variance of the residuals therefore appears to be roughly constant.
Assumption of Normality
The points on the Q-Q plot lie mostly along the dotted line, although they start to deviate
from it at both ends. The residuals therefore appear to be approximately normally distributed.
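A normal Q-Q plot pairs each sorted residual with the quantile a standard normal distribution would give at the same rank. The Python sketch below computes those pairs for invented residuals. The plotting position (i - 0.5) / n is one common convention and is an assumption here; R's `qqnorm` uses a slightly different one for small samples.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical quantile, observed residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Invented residuals, not those of model2.
points = qq_points([3.0, -1.0, 0.5, 2.0, -2.5])
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the residuals are roughly normal, these pairs fall close to a straight line. Departures at the two ends, as described above, point to tails that are heavier or lighter than a normal distribution's.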
Discussion
The main aim of this research was to find the model that best describes and predicts house
prices within the neighbourhoods of Toronto from their most significant factors. In model1,
Licensed Businesses, Child Care Spaces and Local Employment were insignificant. Model2 has
only Debt Risk Score and Social Assistance Recipients as independent variables. They were
selected because their p-values were below 0.05, the 5% significance level (95% confidence
level). The adjusted R squared of model2 is 0.4211, which means model2 explains about 42.11%
of the variance in house prices, whereas model1 has an adjusted R squared of 0.4147, meaning
it explains about 41.47% of the variance. Therefore, model2 is the better model for describing
and predicting house prices in Toronto. Model2 is useful since it relies only on the most
significant variables that affect house prices.
The main limitation of model2 is that the many outliers in the variables can distort the
estimated coefficients. Outliers can change the fitted model and make it less accurate,
leading to poor predictions of the dependent variable. Because they can mislead the model
during training, they can also cause poor performance on the test set.
References
Sarstedt, M., & Mooi, E. (2014). Regression Analysis. Springer Texts in Business and
Economics, 193–233. https://doi.org/10.1007/-_7
Skiera, B., Reiner, J., & Albers, S. (2018). Regression Analysis. Handbook of Market
Research, 1–29. https://doi.org/10.1007/-_17-1
Wu, J. Y. (2017). Housing Price Prediction Using Support Vector Regression. SJSU
ScholarWorks. https://doi.org/-/etd.vpub-6bgs
Appendix
Appendix 1: R Codes