Tony Nyumba Apindi

House Price Analysis

Introduction

When people look for a house, several factors determine its price. These include the location, nearby businesses, security, the social amenities within the area and the type of transportation available. These factors are crucial in determining the value of a house and are common to many locations worldwide, and several of them also apply to house prices in Toronto. This report therefore analyses how the number of licensed businesses, the number of child care spaces, the risk score for missing three consecutive loan payments, local employment and the number of social assistance recipients relate to house prices in Toronto.

The value of a house is central to any real estate transaction. Unbiased price predictions therefore help buyers and sellers to make sound decisions (Wu, 2017). This analysis uses multiple regression, since it is a reliable method for finding out which independent variables have an effect on the dependent variable of interest (Sarstedt & Mooi, 2014). Previous literature identifies the following main benefits of regression analysis (Skiera et al., 2018):

i) It shows whether the independent variables have a significant relationship with the dependent variable.
ii) It shows the relative strength of each independent variable's effect on the dependent variable.
iii) It assists in making predictions.

Methods

The data were collected from a secondary source, the www.toronto.ca website, and consist of economics data for the year 2011. The data were first examined for structure, summary statistics, trends and patterns, and missing values.
The data had eight variables. House price was the dependent variable, and the independent variables were the number of licensed businesses, the number of child care spaces, the risk score for missing three consecutive loan payments (Debt Risk Score), local employment and the number of social assistance recipients. Multiple regression was used because its results serve both to describe the relationship between the variables and to predict the dependent variable from the values of the independent variables. The model estimates the coefficients of a linear equation in the independent variables that best predicts the dependent variable; that is, it fits the line that minimizes the discrepancies between the predicted and the actual values. The multiple linear regression model used for the analysis was:

Y = b0 + b1x1 + b2x2 + … + bnxn + e

where:
Y is the dependent variable
b0 is the y-intercept
bi (i = 1, 2, …, n) are the regression coefficients
xi (i = 1, 2, …, n) are the independent variables
e is the error term

The data were split into a training set and a test set. The training set was the larger of the two, with approximately 75% of the data, while the test set took approximately 25%. Evaluating the model on data that was not used to fit it gives a better understanding of the characteristics of the model.

All the independent variables were entered into a first multiple linear regression model. The independent variables that were not significant were then identified, and a second model was fitted without them in an attempt to find the best model. The analysis was conducted in the R programming language because of its good-quality data visualization.

Results

Exploratory Data Analysis

The data frame was checked for missing values and had none; all 140 rows were complete.
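The report's analysis was run in R, and that code is not reproduced here. Purely as an illustration, the checks in this part of the analysis (missing values, boxplot-style outliers and correlation with the dependent variable) could be sketched in plain Python as follows. The function names and the toy numbers are hypothetical, and the quartiles use Python's "inclusive" method, which can differ slightly from the hinges that R's boxplot uses.

```python
import math
import statistics


def count_missing(column):
    """Number of missing entries (None) in one column."""
    return sum(1 for value in column if value is None)


def iqr_outliers(values):
    """Values outside the 1.5 * IQR fences, the usual boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy column, not the Toronto data.
toy_scores = [10, 12, 11, 13, 12, 11, 100]
print(count_missing(toy_scores), iqr_outliers(toy_scores))
```

Applied column by column, checks like these would give the missing-value count, the number of outliers per variable and each variable's correlation with the dependent variable.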
Boxplots of the variables were then created to show their outliers. The Licensed Businesses variable had the most outliers (12), while the Debt Risk Score had the fewest (one). Scatter plots were created, including with ggplot, to visualize the relationship between each independent variable and the dependent variable. Afterwards, the correlation coefficients were calculated. The Debt Risk Score had the highest positive correlation coefficient, of -, while Social Assistance Recipients had the largest negative correlation coefficient, of -.

Multiple Linear Regression

The data were split into a training set (75%) and a test set (25%). The first model, model1, was built with all the independent variables. The results were as follows:

             Estimate     Std. Error   T value   P value
Intercept    -2.640e+06   7.258e+05    -         - ***
Businesses   -e+01        6.139e+01    0.169     -
Child        -1.446e+02   2.490e+02    -         -
Debt         -e+03        9.671e+02    4.605     -e-06 ***
Local        -1.330e+00   2.005e+00    -0.663    -
Social       -4.194e+01   1.825e+01    -2.298    - *

Results from model1 of the regression

The multiple R squared was 0.4358 while the adjusted R squared was 0.4147.

For the second model, model2, the results were as follows:

             Estimate     Std. Error   T value   P value
Intercept    -2.377e+06   6.435e+05    -         - ***
Debt         -e+03        8.416e+02    4.847     3.35e-06 ***
Social       -4.835e+01   1.601e+01    -3.020    - **

Results from model2 of the regression

The multiple R squared is 0.4295 while the adjusted R squared is 0.4211. The ANOVA table shows that both Debt and Social are significant in model2, with p-values of 2e-16 (***) and 0.00302 (**) respectively.

Checking Linear Regression Assumptions

Assumption of Independence

The Durbin-Watson test statistic is -. Since the statistic is less than 2, there is positive autocorrelation in the residuals. The acceptable range is 1.5 to 2.5.

Assumption of No Multicollinearity

For model2, the VIF values are - and - for Debt and Social respectively. The mean VIF is -.
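As a hedged illustration of the two diagnostics just reported, the sketch below computes a Durbin-Watson statistic from a sequence of residuals, and the VIF for a model with exactly two predictors, where each predictor's VIF equals 1 / (1 - r^2) with r the correlation between the two predictors. It is plain Python rather than the R used in the report; the function names and inputs are hypothetical, and none of the numbers come from the report.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals.

    Values near 2 suggest no first-order autocorrelation; values below 2
    point to positive and values above 2 to negative autocorrelation.
    """
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)


def tolerance(vif):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 / vif


# Made-up inputs: alternating residuals and two uncorrelated predictors.
print(durbin_watson([1, -1, 1, -1]))
print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))
```

A VIF of 1 corresponds to uncorrelated predictors; it grows without bound as the correlation between them approaches 1 in absolute value.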
The VIF values are less than 10, so the variables are at most moderately correlated. The tolerance is more than 0.2, so it raises no concerns. The mean VIF of - likewise indicates moderate correlation and hence minimal bias.

Assumption of Homoscedasticity

In the Residuals vs Fitted plot, the residuals are randomly scattered around the horizontal line across the range of fitted values, so the variance appears to be roughly constant.

Assumption of Normality

The points on the Q-Q plot lie mostly along the dotted line. The residuals therefore appear to be approximately normally distributed, since the points follow a roughly straight line even though they start to deviate from it at the ends.

Discussion

The main aim of this research was to find the best model to describe and predict house prices in the neighbourhoods of Toronto from the most significant factors that affect them. In model1, Licensed Businesses, Child Care Spaces and Local Employment were not significant. Model2 has only Debt Risk Score and Social Assistance Recipients as independent variables; they were selected because their p-values were below the 0.05 significance level (a 95% confidence level). The adjusted R squared of model2 is 0.4211, meaning that model2 explains 42.11% of the variation in house prices, whereas model1 has an adjusted R squared of 0.4147, meaning that it explains 41.47%. Model2 is therefore the better model to describe and predict house prices in Toronto, and it uses only the most significant variables.

The main limitation of model2 is the large number of outliers in the variables, which can distort the estimates. Outliers can change the model and make it less accurate, leading to poor predictions of the dependent variable.
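To make the effect of outliers concrete, the sketch below fits a one-predictor least-squares line to a small made-up data set with and without a single extreme point. It is a plain-Python illustration (the report used R), and the function name and the data are hypothetical rather than taken from the analysis.

```python
def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


x = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]         # lies exactly on y = 2x
with_outlier = [2, 4, 6, 8, 30]  # last point replaced by an extreme value

b0_clean, b1_clean = fit_line(x, clean)
b0_out, b1_out = fit_line(x, with_outlier)
print(b1_clean, b1_out)  # slope 2.0 without the outlier, 6.0 with it
```

In this toy case one extreme point triples the fitted slope, so predictions for the remaining points are pulled well away from their true values.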
The outliers can also mislead the model during training, resulting in a less accurate model and poor performance on the test set.

References

Sarstedt, M., & Mooi, E. (2014). Regression Analysis. Springer Texts in Business and Economics, 193–233. https://doi.org/10.1007/-_7

Skiera, B., Reiner, J., & Albers, S. (2018). Regression Analysis. Handbook of Market Research, 1–29. https://doi.org/10.1007/-_17-1

Wu, J. Y. (2017). Housing Price Prediction Using Support Vector Regression. SJSU ScholarWorks. https://doi.org/-/etd.vpub-6bgs

Appendix

Appendix 1: R Codes
