Paper 4
In this research, we investigate two methods of multivariate analysis: the multiple regression model and exploratory factor analysis. Two different data sets are used to illustrate what each method can deliver, and the analysis is presented in detail in two corresponding parts.
Multiple Regression Model
As the name suggests, multiple regression is a statistical technique for modelling the relationship between one response (dependent) variable and several independent variables. From this definition it is clear why the method is needed: the occurrence of an event or phenomenon usually has several contributing factors. Multiple regression works by taking the values of the available independent variables and using them to predict the value of the single dependent variable.
Simple linear regression, although commonly used, is limited to just one independent variable and one dependent variable, so it cannot account for the joint influence of several factors on the response. Multiple regression is used precisely to overcome this limitation: it allows more than one independent variable to be analysed at the same time.
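Formally, a multiple regression model with k independent variables can be written as

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon,

where y is the response, the x_i are the independent variables, the \beta_i are the coefficients estimated from the data, and \varepsilon is the random error term.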
Real estate valuation is a process of using three methods (sales comparison, cost,
and income approaches) to determine the current value of a potential real estate
investment. This value helps compare investment opportunities with each other. The
value of a property may or may not be different from its price.
An accurate valuation or appraisal of a real estate property is crucial for making wise
investment decisions. Some real estate investment companies have entire teams
whose main purpose is to determine the value of new real estate opportunities and
the company’s current real estate assets.
The historical market data on real estate valuation were collected from Sindian District, New Taipei City, Taiwan. The data set contains six independent variables and one dependent variable. The independent variables are the transaction date, the house age, the distance to the nearest MRT station, the number of convenience stores within walking distance, the geographic coordinate (latitude), and the geographic coordinate (longitude). The response variable is the house price per unit area, which we want to predict from these factors. There are no missing values in the data set, so all of the relevant values are available. The data set was also checked for outliers: a few variables contain some outliers, but they are not numerous and do not affect the data set as a whole. The box plot of the response variable is shown below.
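A minimal sketch of these initial checks in Python is given below, assuming the data have been exported to a CSV file named real_estate.csv and that the column names used here match those in the file (both the file name and the column labels are illustrative assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Load the real estate valuation data (file name and column labels are assumed)
df = pd.read_csv("real_estate.csv")

# Check for missing values in every column
print(df.isna().sum())

# Box plot of the response variable to inspect for outliers
df["house_price_per_unit_area"].plot(kind="box")
plt.title("House price per unit area")
plt.show()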
Now we fit a multiple regression model with "house price" as the dependent variable. Before doing so, we check the assumptions of the linear model.
In the first plot, the red line is approximately horizontal and the residuals show no obvious pattern, so the linearity assumption is satisfied. The plot to the right is the Q-Q plot of the residuals: the points approximately follow the reference line, which supports the assumption of normally distributed residuals. The third plot again shows a relatively horizontal red line, which is a good indication of homoscedasticity. The plots suggest a few outliers in the data set, but because the sample size is large we can tolerate them and proceed to estimate the regression model.
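One way the model fit and the diagnostic plots could be reproduced in Python with statsmodels is sketched below; the file name and column names are the same illustrative assumptions as before:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("real_estate.csv")  # assumed file name

# Predictors (x2..x6) and response; column labels are assumed
X = sm.add_constant(df[["house_age", "distance_to_MRT", "convenience_stores", "latitude", "longitude"]])
y = df["house_price_per_unit_area"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, p-values, R-squared

# Residuals vs fitted values (linearity and homoscedasticity check)
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of residuals (normality check)
sm.qqplot(model.resid, line="s")
plt.show()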
There are five hypotheses in the model as follows:
H1: The house age has a significant effect on house prices.
H2: The distance to the nearest MRT station has a significant effect on house prices.
H3: The number of convenience stores has a significant effect on house prices.
H4: Latitude has a significant effect on house prices.
H5: Longitude has a significant effect on house prices.
The results of the model are shown in the table:
Coefficients (Model 1)

Variable     B          Std. Error   Beta     t        Sig.
(Constant)   -          -                     -.796    .426
x2           -.269      .039         -.225    -6.896   .000
x3           -.004      .001         -.395    -5.888   .000
x4           1.163      .190         .252     6.114    .000
x5           237.767    44.948       .217     5.290    .000
x6           -7.805     49.149       -.009    -.159    .874

(x2 = house age, x3 = distance to the nearest MRT station, x4 = number of convenience stores, x5 = latitude, x6 = longitude)
As can be seen, the results of the model largely confirm our expectations. House age is an important determinant of house price: its p-value is below 0.05, so the relationship is significant, and the negative coefficient indicates that older houses sell for less. The first hypothesis is therefore accepted. The p-value for "distance to the nearest MRT station" is below 0.05 and the coefficient is negative, meaning that greater distance lowers the house price; the second hypothesis is accepted. The p-value for "number of convenience stores" is below 0.05 and the coefficient is positive, so more convenience stores raise the house price; the third hypothesis is accepted. The p-value for "latitude" is below 0.05 with a positive coefficient, so latitude has a positive effect on house prices and the fourth hypothesis is accepted. The final hypothesis is different: longitude is not a significant factor, because its p-value is above 0.05, so we cannot conclude that it affects house prices. The fifth hypothesis is therefore rejected.
R-squared (R², the coefficient of determination) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variables. In other words, R-squared shows how well the data fit the regression model (the goodness of fit). R-squared can take any value between 0 and 1. Although this measure provides useful insight into the regression model, the user should not rely on it alone when assessing a statistical model: it does not disclose anything about causation between the independent and dependent variables, nor does it indicate whether the regression model is correctly specified. Therefore, conclusions about the model should always be drawn by analysing R-squared together with the other quantities in the statistical output. The R-squared of our model is 0.57, which means that the independent variables explain 57% of the variation in the dependent variable.
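Formally, R-squared compares the residual sum of squares with the total sum of squares:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},

where y_i are the observed values of the response, \hat{y}_i are the fitted values from the model, and \bar{y} is the mean of the response.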
EXPLORATORY FACTOR ANALYSIS
Exploratory factor analysis (EFA) is used to uncover the underlying structure of a relatively large set of observed variables. It identifies a smaller number of latent factors that account for the correlations among the observed variables, which makes it possible to reduce the dimensionality of the data and to interpret the constructs that lie behind a questionnaire.
This data set concerns the place and role of cultural industries in society. We want to reduce the dimensionality of the data and extract the latent variables. There are no missing values in the data set. Because the data come from a questionnaire, some outliers are certainly present, but they can be ignored in exploratory factor analysis. Before performing the analysis, we check the suitability of the data with two tests, the KMO measure and Bartlett's test of sphericity. Because the KMO value is above 0.7 (0.728) and the p-value of Bartlett's test is below 0.05, the sampling adequacy is appropriate. The exploratory factor analysis was then carried out and seven factors were extracted. The p-value of the test that seven factors are sufficient is above 0.05 (0.805), so seven factors adequately describe the data. The seven factors are shown here:
[Factor loading matrix: questionnaire items S1-S32 against the seven extracted factors.]
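A minimal sketch of how the adequacy tests and the factor extraction might be carried out in Python with the factor_analyzer package is shown below; the file name survey.csv and the use of a varimax rotation are illustrative assumptions, not details taken from the original analysis:

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

items = pd.read_csv("survey.csv")  # questionnaire items S1..S32 (assumed file name)

# Sampling adequacy: KMO should exceed 0.7 and Bartlett's p-value should be below 0.05
chi_square, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)
print("KMO =", kmo_total, "Bartlett p =", p_value)

# Extract seven factors (the rotation choice is an assumption)
fa = FactorAnalyzer(n_factors=7, rotation="varimax")
fa.fit(items)

print(fa.loadings_)              # factor loading matrix (32 items x 7 factors)
eigenvalues, _ = fa.get_eigenvalues()
print(eigenvalues)               # eigenvalues for the Kaiser criterion and scree plot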
All factors with an eigenvalue greater than 1 are retained as latent variables; there are seven such latent variables. The scree plot shows the separation of the seven factors.
We can name the seven factors as follows:
1. Cultural development
2. Political relationship
3. Protection of the identity of national and religious culture
4. Economic development
5. National production boom
6. Wealth creation
7. Employment development
In the end, we found the appropriate factors of the data set by exploratory factor analysis. A principal component analysis could also have been used to reduce the dimensionality of the data; however, with this data set the aim was to investigate latent factors that have already been studied in this field. Confirmatory factor analysis is the natural second stage to complete this work, and the findings can be extended into a structural equation model for examining the relationships among these seven factors.