Paper 4
In this research, we investigate two methods of multivariate analysis: the multiple regression model and exploratory factor analysis. Two different data sets are used to illustrate what each method can deliver, and the analysis is presented in detail in two corresponding parts.
Multiple Regression Model
As the name suggests, multiple regression is a statistical technique for modelling the relationship between one response (dependent) variable and several independent variables. From this definition it is clear why the method is needed: the occurrence of an event or phenomenon usually has several contributing factors. Multiple regression works by taking the values of the available independent variables and using them to predict the value of the single dependent variable.
Simple linear regression, although commonly used, is limited to just one independent variable and one dependent variable, so it cannot account for the joint influence of several factors on the response. Multiple regression is used precisely to overcome this limitation: it allows more than one independent variable to be analysed at the same time.
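Formally, a multiple regression model with k independent variables can be written as

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon,

where y is the response, the x_i are the independent variables, the \beta_i are the coefficients estimated from the data, and \varepsilon is the random error term.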
Real estate valuation is a process of using three methods (sales comparison, cost,
and income approaches) to determine the current value of a potential real estate
investment. This value helps compare investment opportunities with each other. The
value of a property may or may not be different from its price.
An accurate valuation or appraisal of a real estate property is crucial for making wise
investment decisions. Some real estate investment companies have entire teams
whose main purpose is to determine the value of new real estate opportunities and
the company’s current real estate assets.
The historical market data on real estate valuation were collected from Sindian District, New Taipei City, Taiwan. The data set contains six independent variables and one dependent variable. The independent variables are the transaction date, the house age, the distance to the nearest MRT station, the number of convenience stores within walking distance, the geographic coordinate (latitude), and the geographic coordinate (longitude). The response variable is the house price per unit area, which we want to predict from these factors. There are no missing values in the data set, so all of the relevant values are available. The data set was also checked for outliers: a few variables contain some outliers, but they are not numerous and do not affect the data set as a whole. The box plot of the response variable is shown below.
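A minimal sketch of these initial checks in Python is given below, assuming the data have been exported to a CSV file named real_estate.csv and that the column names used here match those in the file (both the file name and the column labels are illustrative assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Load the real estate valuation data (file name and column labels are assumed)
df = pd.read_csv("real_estate.csv")

# Check for missing values in every column
print(df.isna().sum())

# Box plot of the response variable to inspect for outliers
df["house_price_per_unit_area"].plot(kind="box")
plt.title("House price per unit area")
plt.show()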
Now we fit a multiple regression model with "house price" as the dependent variable. Before doing so, we check the assumptions of the linear model.
In the first plot, the red line is approximately horizontal and the residuals show no obvious pattern, so the linearity assumption is satisfied. The plot to the right is the Q-Q plot of the residuals: the points approximately follow the reference line, which supports the assumption of normally distributed residuals. The third plot again shows a relatively horizontal red line, which is a good indication of homoscedasticity. The plots suggest a few outliers in the data set, but because the sample size is large we can tolerate them and proceed to estimate the regression model.
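One way the model fit and the diagnostic plots could be reproduced in Python with statsmodels is sketched below; the file name and column names are the same illustrative assumptions as before:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("real_estate.csv")  # assumed file name

# Predictors (x2..x6) and response; column labels are assumed
X = sm.add_constant(df[["house_age", "distance_to_MRT", "convenience_stores", "latitude", "longitude"]])
y = df["house_price_per_unit_area"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, p-values, R-squared

# Residuals vs fitted values (linearity and homoscedasticity check)
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of residuals (normality check)
sm.qqplot(model.resid, line="s")
plt.show()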
There are five hypotheses in the model as follows:
H1: The house age has a significant effect on house prices.
H2: The distance to the nearest MRT station has a significant effect on house prices.
H3: The number of convenience stores has a significant effect on house prices.
H4: Latitude has a significant effect on house prices.
H5: Longitude has a significant effect on house prices.
The results of the model are shown in the table:
Coefficients (Model 1)

Variable     B          Std. Error   Beta     t        Sig.
(Constant)   -          -                     -.796    .426
x2           -.269      .039         -.225    -6.896   .000
x3           -.004      .001         -.395    -5.888   .000
x4           1.163      .190         .252     6.114    .000
x5           237.767    44.948       .217     5.290    .000
x6           -7.805     49.149       -.009    -.159    .874

(x2 = house age, x3 = distance to the nearest MRT station, x4 = number of convenience stores, x5 = latitude, x6 = longitude)
As can be seen, the results of the model largely confirm our expectations. House age is an important determinant of house price: its p-value is below 0.05, so the relationship is significant, and the negative coefficient indicates that older houses sell for less. The first hypothesis is therefore accepted. The p-value for "distance to the nearest MRT station" is below 0.05 and the coefficient is negative, meaning that greater distance lowers the house price; the second hypothesis is accepted. The p-value for "number of convenience stores" is below 0.05 and the coefficient is positive, so more convenience stores raise the house price; the third hypothesis is accepted. The p-value for "latitude" is below 0.05 with a positive coefficient, so latitude has a positive effect on house prices and the fourth hypothesis is accepted. The final hypothesis is different: longitude is not a significant factor, because its p-value is above 0.05, so we cannot conclude that it affects house prices. The fifth hypothesis is therefore rejected.
R-squared (R², the coefficient of determination) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variables. In other words, R-squared shows how well the data fit the regression model (the goodness of fit). R-squared can take any value between 0 and 1. Although this measure provides useful insight into the regression model, the user should not rely on it alone when assessing a statistical model: it does not disclose anything about causation between the independent and dependent variables, nor does it indicate whether the regression model is correctly specified. Therefore, conclusions about the model should always be drawn by analysing R-squared together with the other quantities in the statistical output. The R-squared of our model is 0.57, which means that the independent variables explain 57% of the variation in the dependent variable.
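Formally, R-squared compares the residual sum of squares with the total sum of squares:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},

where y_i are the observed values of the response, \hat{y}_i are the fitted values from the model, and \bar{y} is the mean of the response.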
EXPLORATORY FACTOR ANALYSIS
Exploratory factor analysis (EFA) is used to uncover the underlying structure of a relatively large set of observed variables. It identifies a smaller number of latent factors that account for the correlations among the observed variables, which makes it possible to reduce the dimensionality of the data and to interpret the constructs that lie behind a questionnaire.
This data set concerns the place and role of cultural industries in society. We want to reduce the dimensionality of the data and extract the latent variables. There are no missing values in the data set. Because the data come from a questionnaire, some outliers are certainly present, but they can be ignored in exploratory factor analysis. Before performing the analysis, we check the suitability of the data with two tests, the KMO measure and Bartlett's test of sphericity. Because the KMO value is above 0.7 (0.728) and the p-value of Bartlett's test is below 0.05, the sampling adequacy is appropriate. The exploratory factor analysis was then carried out and seven factors were extracted. The p-value of the test that seven factors are sufficient is above 0.05 (0.805), so seven factors adequately describe the data. The seven factors are shown here:
[Factor loading matrix: questionnaire items S1-S32 against the seven extracted factors.]
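A minimal sketch of how the adequacy tests and the factor extraction might be carried out in Python with the factor_analyzer package is shown below; the file name survey.csv and the use of a varimax rotation are illustrative assumptions, not details taken from the original analysis:

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

items = pd.read_csv("survey.csv")  # questionnaire items S1..S32 (assumed file name)

# Sampling adequacy: KMO should exceed 0.7 and Bartlett's p-value should be below 0.05
chi_square, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)
print("KMO =", kmo_total, "Bartlett p =", p_value)

# Extract seven factors (the rotation choice is an assumption)
fa = FactorAnalyzer(n_factors=7, rotation="varimax")
fa.fit(items)

print(fa.loadings_)              # factor loading matrix (32 items x 7 factors)
eigenvalues, _ = fa.get_eigenvalues()
print(eigenvalues)               # eigenvalues for the Kaiser criterion and scree plot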
All factors with an eigenvalue greater than 1 are retained as latent variables; there are seven such latent variables. The scree plot shows the separation of the seven factors.
We can name the seven factors as follows:
1. Cultural development
2. Political relationship
3. Protection of the identity of national and religious culture
4. Economic development
5. National production boom
6. Wealth creation
7. Employment development
In the end, we found the appropriate factors of the data set by exploratory factor analysis. A principal component analysis could also have been used to reduce the dimensionality of the data; however, with this data set the aim was to investigate latent factors that have already been studied in this field. Confirmatory factor analysis is the natural second stage to complete this work, and the findings can be extended into a structural equation model for examining the relationships among these seven factors.