Paper 9
QUANTITATIVE METHODS FOR INTERNATIONAL MARKETS
Psid_60
Exercise 1
We start by modeling individual income as a function of years of schooling and socio-demographic
factors. To this end, we estimate by OLS a regression model where the dependent variable is “income”
and the regressors are given by “educ”, “age”, “gender”, “married” and “nkids”.
a) Report your regression results and interpret the coefficient associated to “educ”. Is it
statistically significant?
The regression is jointly significant: the p-value of the overall F-test falls below the 5% significance
level. The p-value of "educ" is also below the 5% threshold, so this variable is statistically significant.
Its coefficient measures the expected change in income (in USD) associated with one additional year of
schooling, holding age, gender, marital status and number of children constant.
b) Estimate the previous model using ln(income) as the dependent variable.
Report your regression results and interpret the coefficient associated to “educ”. Is it
statistically significant?
The regression is again jointly significant, since the p-value of the overall F-test is below the 5%
significance threshold. The p-value of "educ" is below 5%, so the variable is statistically significant,
and so are all the other regressors (p-values below 0.05). Because the dependent variable is now
ln(income), the coefficient of "educ" multiplied by 100 gives the approximate percentage change in income
associated with one additional year of schooling, holding the other regressors constant.
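The reading of a log-level coefficient can be sketched as follows. This is only an illustration: the value
0.08 is a hypothetical placeholder, not the estimate obtained in this exercise.

```python
import math

def pct_effect(beta):
    """Return (approximate, exact) percentage change in income for a
    one-unit increase in a regressor of a model with ln(income) as outcome."""
    approx = 100.0 * beta                    # usual "times 100" reading
    exact = 100.0 * (math.exp(beta) - 1.0)   # exact change implied by the log form
    return approx, exact

beta_educ = 0.08  # hypothetical coefficient, for illustration only
approx, exact = pct_effect(beta_educ)
print(f"approximate: {approx:.2f}%, exact: {exact:.2f}%")
```

The two readings are close for small coefficients and diverge as the coefficient grows.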
c) Considering the model in b), include educ2 as an additional regressor.
Does this model provide a better fit with respect to the model in b)? Please consider both 𝑅2
and 𝑅2 adjusted in your comments.
Based on the results, the regression is jointly significant, as the p-value of the overall F-test is below
the 5% significance level. Except for "nkids", whose p-value exceeds 0.05, all regressors are
statistically significant with p-values below 0.05.
R-squared (R2) and adjusted R-squared both measure how well a model fits the data, that is, the share of
the variation in the dependent variable explained by the regressors.
R-squared never decreases when a regressor is added, so on its own it cannot be used to compare models
with a different number of regressors. Adjusted R-squared penalizes additional regressors: it increases
only when the new regressor improves the fit by more than the penalty (equivalently, when its t-statistic
exceeds 1 in absolute value). It is therefore the preferred measure for comparing the model in b) with the
model in c).
From the results, the adjusted R-squared of the model in c) is 0.2871, meaning that approximately 28.71% of
the variability in ln(income) is explained by the regressors after adjusting for their number. The model
in c) provides a better fit than the model in b) if this value exceeds the adjusted R-squared of the
model in b).
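The penalty built into adjusted R-squared can be illustrated with a short sketch. The R-squared values,
sample size and number of regressors below are hypothetical and are not taken from the regressions above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors
    (k excludes the intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical case: adding one regressor raises R-squared only slightly
before = adjusted_r2(r2=0.3000, n=1000, k=5)
after = adjusted_r2(r2=0.3005, n=1000, k=6)
print(round(before, 5), round(after, 5))
```

In this hypothetical case R-squared rises while adjusted R-squared falls, so the extra regressor would not
be judged to improve the fit.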
d) Considering the model in c), test the null hypothesis of no effect of the years of schooling on
income.
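Null hypothesis: Years of schooling have no effect on income, i.e. the coefficients of "educ" and "educ2"
are jointly equal to zero.
Alternative hypothesis: At least one of the two coefficients is different from zero.
The coefficient of "educ" has a t-statistic of 10.147, which exceeds 1.96, and a p-value below 0.05, so it
is individually significant. Because schooling enters the model in c) through both "educ" and "educ2", the
null is a joint restriction and is formally tested with an F-test on the two coefficients. The joint Wald
statistic is at least as large as the squared t-statistic of any single restricted coefficient, so the null
hypothesis of no effect of schooling on income is rejected.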
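A joint exclusion test of this kind can be computed from the R-squared of the model with and without the
schooling terms, provided both models have the same dependent variable. The sketch below uses hypothetical
inputs throughout.

```python
def f_stat_joint(r2_unrestricted, r2_restricted, q, n, k):
    """F-statistic for q exclusion restrictions.
    n: number of observations
    k: regressors in the unrestricted model (excluding the intercept)"""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator

# Hypothetical values, chosen only to illustrate the computation
f_value = f_stat_joint(r2_unrestricted=0.29, r2_restricted=0.20, q=2, n=1000, k=6)
print(round(f_value, 2))
```

The statistic is compared with the critical value of an F distribution with q and n - k - 1 degrees of
freedom.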
e) How much is an additional year of schooling worth when the individual has 12 years of
schooling? And when she has 16 years of schooling?
In the model in c), schooling enters through both "educ" and "educ2", so the value of an additional year
of schooling is not constant. The marginal effect of schooling on ln(income) is
beta_educ + 2 * beta_educ2 * educ
which, multiplied by 100, gives the approximate percentage change in income.
For an individual with 12 years of schooling:
beta_educ + 2 * beta_educ2 * 12 = beta_educ + 24 * beta_educ2
Likewise, for an individual with 16 years of schooling:
beta_educ + 2 * beta_educ2 * 16 = beta_educ + 32 * beta_educ2
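The dependence of the return to schooling on the level of schooling in a quadratic specification can be
sketched as below. Both coefficients are hypothetical placeholders, not the estimates of the model in c).

```python
def marginal_effect(b_educ, b_educ2, educ):
    """Derivative of ln(income) with respect to educ when the model
    contains both educ and educ squared."""
    return b_educ + 2.0 * b_educ2 * educ

b_educ, b_educ2 = 0.02, 0.003  # hypothetical coefficients
effect_12 = marginal_effect(b_educ, b_educ2, 12)
effect_16 = marginal_effect(b_educ, b_educ2, 16)
print(f"12 years: {100 * effect_12:.1f}%  16 years: {100 * effect_16:.1f}%")
```

With a positive coefficient on the squared term, as assumed here, the return increases with the years of
schooling; with a negative one it would decrease.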
f) Test the null hypothesis that marriage/civil partnership and having children have the same
effect on income.
Null Hypothesis: The coefficients of "married" and "nkids" are equal.
Alternative Hypothesis: The coefficients of "married" and "nkids" are different.
Taken individually, "married" is statistically significant (t-statistic of 10.652, above 1.96, and p-value
below 0.05), whereas "nkids" is not (t-statistic of 0.786, below 1.96, and p-value of 0.4321, above 0.05).
These individual tests, however, do not test whether the two coefficients are equal. The null is a single
linear restriction and is tested with a t-test on the difference between the two coefficients, whose
standard error depends on the covariance between the two estimates, or with the equivalent F-test. The
conclusion that there is not enough evidence to reject the null hypothesis must rest on this test rather
than on the individual significance of each variable.
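A test of equality between two coefficients can be sketched as a t-test on their difference. The
coefficients, variances and covariance below are hypothetical; in practice they come from the estimated
coefficient vector and its covariance matrix.

```python
import math

def t_equal_coefs(b1, b2, var1, var2, cov12):
    """t-statistic for H0: b1 = b2, using the variance of the difference
    Var(b1 - b2) = Var(b1) + Var(b2) - 2 Cov(b1, b2)."""
    se_diff = math.sqrt(var1 + var2 - 2.0 * cov12)
    return (b1 - b2) / se_diff

# Hypothetical estimates for "married" and "nkids"
t_value = t_equal_coefs(b1=0.30, b2=0.01, var1=0.0008, var2=0.0002, cov12=0.0001)
print(round(t_value, 3))
```

The null of equal effects is rejected at the 5% level when the absolute value of this statistic exceeds
1.96.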
Exercise 2
This part refers to the model estimated in 1.c).
a) Test the null hypothesis that the coefficients estimated using the subset of employees are equal
to the coefficients estimated using the subset of self-employed. What is your conclusion?
Null Hypothesis: The coefficients for employees are equal to the coefficients for the self-employed.
Alternative Hypothesis: At least one coefficient differs between employees and the self-employed.
The estimates differ between the two groups. In the employee group, all variables are statistically
significant, with p-values below 0.05. In the self-employed group, the p-values for "educ" and "married"
are both below 0.05 (0.007 and 0.0000). The formal test of equality is the Chow test, an F-test that
compares the pooled regression with the two group regressions. We reject the null hypothesis and conclude
that the coefficients for employees are not equal to those for the self-employed.
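The Chow statistic compares the sum of squared residuals (SSR) of the pooled regression with the SSR of
the two group regressions. A minimal sketch follows; all SSR values and the sample size are hypothetical.

```python
def chow_f(ssr_pooled, ssr_group1, ssr_group2, n, k):
    """Chow F-statistic for equality of all coefficients across two groups.
    n: total number of observations
    k: coefficients per regression, including the intercept"""
    ssr_separate = ssr_group1 + ssr_group2
    numerator = (ssr_pooled - ssr_separate) / k
    denominator = ssr_separate / (n - 2 * k)
    return numerator / denominator

# Hypothetical SSR values; k = 7 assumes an intercept plus six regressors
f_chow = chow_f(ssr_pooled=520.0, ssr_group1=300.0, ssr_group2=200.0, n=1000, k=7)
print(round(f_chow, 4))
```

The statistic is compared with the critical value of an F distribution with k and n - 2k degrees of
freedom.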
b) Using the approach by Breusch-Pagan, test the null hypothesis of homoskedasticity. What is
your conclusion?
Null Hypothesis: The errors are homoskedastic (constant variance).
Alternative Hypothesis: The errors are heteroskedastic (non-constant variance).
The studentized Breusch-Pagan test gives a test statistic of 23.253, with an associated p-value below
0.05. We therefore reject the null hypothesis of homoskedasticity: the test indicates that the errors of
the model are heteroskedastic.
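The logic of the studentized Breusch-Pagan test can be sketched for the simplest case of one regressor:
the squared OLS residuals are regressed on the regressor, and the statistic is n times the R-squared of
that auxiliary regression. The data below are artificial and built so that the error variance grows with
x; the function names are illustrative.

```python
import math

def ols_simple(x, y):
    """Intercept, slope and residuals of a regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid

def r_squared(x, y):
    """R-squared of a regression of y on x."""
    _, _, resid = ols_simple(x, y)
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sum(e * e for e in resid) / sst

def breusch_pagan_simple(x, y):
    """Studentized Breusch-Pagan statistic and p-value (one regressor)."""
    _, _, resid = ols_simple(x, y)
    squared_resid = [e * e for e in resid]
    lm = max(0.0, len(x) * r_squared(x, squared_resid))
    # With one regressor the statistic is chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Artificial data: the size of the disturbance increases with x
x = list(range(1, 21))
y = [2.0 + 0.5 * xi + 0.1 * xi * (-1) ** xi for xi in x]
lm, p_value = breusch_pagan_simple(x, y)
print(round(lm, 3), p_value)
```

With several regressors the auxiliary regression includes all of them and the degrees of freedom equal
their number.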
c) Considering the statistical evidence from 2.a) and 2.b), do you believe the model in 1.c)
provides reliable estimates of the parameters and that the t-ratio and F tests are valid? If not,
how would you proceed?
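The evidence from 2.a) and 2.b) indicates that the model in 1.c) is not fully reliable. The rejection of
equal coefficients in 2.a) means that a single pooled regression hides differences between employees and
the self-employed. The heteroskedasticity found in 2.b) does not by itself bias the OLS coefficients, but
it makes the conventional standard errors incorrect, so the usual t-ratio and F tests are not valid. To
proceed, one can estimate the model separately for the two groups (or interact the regressors with a
self-employment dummy) and use heteroskedasticity-robust standard errors.
For comparison, the R2 for Model (a) is 0.35, while the R2 for Models (b) and (c) is 0.27. These values
describe goodness of fit only and do not establish the validity of the tests.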
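The difference between conventional and heteroskedasticity-robust standard errors can be sketched for a
regression with a single regressor, using White's (HC0) formula. The data are artificial and the function
name is illustrative.

```python
import math

def slope_standard_errors(x, y):
    """Slope of y on x with conventional and White (HC0) standard errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Conventional formula: assumes the same error variance for every observation
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_conventional = math.sqrt(sigma2 / sxx)
    # HC0 formula: each squared residual is weighted by (x - mean_x)^2
    meat = sum(((xi - mean_x) ** 2) * e * e for xi, e in zip(x, resid))
    se_robust = math.sqrt(meat / sxx ** 2)
    return slope, se_conventional, se_robust

# Artificial data whose disturbance grows with x
x = list(range(1, 21))
y = [2.0 + 0.5 * xi + 0.1 * xi * (-1) ** xi for xi in x]
slope, se_conventional, se_robust = slope_standard_errors(x, y)
print(round(slope, 4), round(se_conventional, 4), round(se_robust, 4))
```

The slope estimate is the same in both cases; only the standard error, and therefore the t-ratio, changes.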
Exercise 3
a) Explain using your own words why the model considered so far may suffer from endogeneity
and the resulting consequences on the statistical properties of the OLS estimator. Please
indicate the endogenous variable(s) and the included exogenous variable(s).
Estimate the model in 1.b) by TSLS using the mother’s and father’s years of schooling as instrumental
variables.
Endogeneity occurs when a regressor is correlated with the error term, which violates a fundamental
assumption of the linear regression model. In this case the Ordinary Least Squares (OLS) estimator is
biased and inconsistent: the bias does not vanish as the sample size grows. Here the likely source is
"educ", because unobserved factors such as ability may affect both years of schooling and income and end
up in the error term.
The direction and magnitude of the bias depend on the nature of the endogeneity, so OLS may either
overestimate or underestimate the true effect. Statistical inference based on the OLS estimates is
therefore unreliable, and instrumental variables are needed to obtain consistent estimates.
Dependent variable: ln(income)
Endogenous variable: educ
Included exogenous variables: age, gender, married, nkids
TSLS requires at least as many instruments as endogenous regressors. With two instruments (the mother's
and the father's years of schooling) and one endogenous regressor, the model in 1.b) is overidentified and
can be estimated by TSLS.
The TSLS estimates suggest that education (educ) and gender are significant predictors of the log of
income once endogeneity is accounted for with the instruments. The model is jointly significant, as
indicated by the Wald test.
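The two stages of TSLS can be sketched in a deliberately simplified setting with one endogenous regressor,
one instrument and no other controls, whereas the exercise uses two instruments and several exogenous
regressors. The data and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def tsls_one_instrument(y, x, z):
    """Slope of y on x, instrumenting x with z."""
    # First stage: regress the endogenous regressor on the instrument
    pi1 = cov(z, x) / cov(z, z)
    pi0 = mean(x) - pi1 * mean(z)
    x_hat = [pi0 + pi1 * zi for zi in z]
    # Second stage: regress the outcome on the first-stage fitted values
    return cov(x_hat, y) / cov(x_hat, x_hat)

# Hypothetical data: parental schooling (z), own schooling (x), ln(income) (y)
z = [8, 10, 12, 12, 14, 16]
x = [10, 11, 12, 13, 14, 16]
y = [2.0, 2.1, 2.3, 2.2, 2.5, 2.7]
beta_tsls = tsls_one_instrument(y, x, z)
print(round(beta_tsls, 4))
```

In this just-identified case the result coincides with the ratio cov(z, y) / cov(z, x). Standard errors
taken from a manually run second stage are not correct, so dedicated IV routines should be used for
inference.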
b) Explain using your own words why we might expect the instruments to be both relevant and
exogenous.
For instrumental variable regression to be valid, the instruments must satisfy two conditions. First, they
must be relevant, i.e. correlated with the endogenous variable "educ". This is plausible here because
parents' schooling is typically associated with the schooling of their children. Second, they must be
exogenous, i.e. uncorrelated with the error term of the income equation. This is plausible if parents'
schooling affects the individual's income only through the individual's own schooling and not through
unobserved factors that influence income directly. When both conditions hold, the instruments isolate the
variation in "educ" that is unrelated to the error term, which addresses the endogeneity problem in the
regression model.
c) Using appropriate statistical tests, determine whether the instruments are relevant and exogenous.
Null Hypothesis: The instruments are jointly irrelevant, i.e. their coefficients in the first-stage
regression of "educ" are both equal to zero.
Alternative Hypothesis: At least one instrument is relevant in explaining the endogenous variable.
Relevance is checked with the F-test on the instruments in the first-stage regression. As a rule of thumb,
an F-statistic above 10 indicates that the instruments are not weak. The low R-squared of the income
equation does not affect the validity of this test, which concerns the first stage only.
Null Hypothesis: The instruments are jointly exogenous, i.e. uncorrelated with the error term.
Alternative Hypothesis: At least one instrument is correlated with the error term.
Because there are two instruments for one endogenous regressor, the exogeneity of the instruments can be
checked with an overidentification (Sargan) test. The Wu-Hausman test answers a different question, namely
whether "educ" is endogenous. Given its small p-value, we reject the null hypothesis that "educ" is
exogenous. This is evidence of endogeneity in the model and, as a result, the instrumental variable
estimates are favored over the OLS estimates.
d) Considering 3.b)-3.c), should we prefer the OLS or the TSLS estimates? If appropriate, use a
statistical test to support your answer.
The choice between Ordinary Least Squares (OLS) and Two-Stage Least Squares (TSLS) estimates depends on
whether "educ" is endogenous. Under the null hypothesis of the Wu-Hausman test, "educ" is exogenous and
OLS is consistent and efficient; under the alternative, only TSLS is consistent.
Since the p-value from the Wu-Hausman test is below 0.05, we reject the null hypothesis in favor of the
alternative. Provided the instruments are valid (see 3.c), the TSLS estimates are therefore preferred to
the OLS estimates.
e) Using your preferred estimates, i.e., OLS or TSLS, indicate and interpret the coefficient associated
with the regressor “gender”. Is it statistically significant?
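Null Hypothesis: The coefficient of "gender" is equal to zero.
Alternative Hypothesis: The coefficient of "gender" is different from zero.
The p-value is reported as 0.000, below the conventional significance level of 0.05, so we reject the null
hypothesis: gender is statistically significant. Since the dependent variable is ln(income), the
coefficient multiplied by 100 gives the approximate percentage difference in income between the two gender
groups, holding the other regressors constant.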
f) Suppose we have a panel of observations for 2015 and 2018. Also, note that the survey contains
information on adults who have completed their education by the time of the first survey. How
would you use the panel structure to eliminate endogeneity due to educ if the variable of interest
is “age”? Would such a methodology be helpful if the primary variable of interest is “gender”,
instead?
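With a panel of observations for 2015 and 2018, the endogeneity associated with "educ" (education) can be
removed when studying the effect of "age". Panel data allow us to follow the same individuals over time
and to control for individual-specific characteristics that do not change between the two surveys. Since
all individuals completed their education before the first survey, "educ" is constant over time. Taking
first differences between 2018 and 2015 (or, equivalently with two periods, using individual fixed
effects) removes "educ" together with any time-invariant unobserved factor, such as ability, that causes
the endogeneity.
This methodology is not helpful when the variable of interest is "gender": gender does not vary over time,
so it is also removed by the transformation and its coefficient cannot be estimated.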
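The effect of first-differencing on time-invariant variables can be seen in a small sketch. The panel
below is entirely hypothetical, and the coding of "gender" as 0/1 is an assumption.

```python
# Hypothetical two-period panel: one row per individual and survey year
panel = [
    {"id": 1, "year": 2015, "ln_income": 10.2, "age": 35, "educ": 12, "gender": 0},
    {"id": 1, "year": 2018, "ln_income": 10.4, "age": 38, "educ": 12, "gender": 0},
    {"id": 2, "year": 2015, "ln_income": 10.9, "age": 50, "educ": 16, "gender": 1},
    {"id": 2, "year": 2018, "ln_income": 11.0, "age": 53, "educ": 16, "gender": 1},
]

def first_differences(rows, variables, first_year=2015, last_year=2018):
    """Return, for each individual, the change in each variable between the two years."""
    by_id = {}
    for row in rows:
        by_id.setdefault(row["id"], {})[row["year"]] = row
    diffs = []
    for person_id, years in sorted(by_id.items()):
        first, last = years[first_year], years[last_year]
        change = {name: last[name] - first[name] for name in variables}
        diffs.append({"id": person_id, **change})
    return diffs

diffs = first_differences(panel, ["ln_income", "age", "educ", "gender"])
for row in diffs:
    print(row)
```

The differences of "educ" and "gender" are zero for every individual, so both variables drop out of the
differenced equation along with any other time-invariant characteristic.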
Exercise 4
Let us define a new dummy variable, “low_income”, equal to 1 if “income” is less than or equal to
9500 USD (the 10th percentile of the actual income distribution). Estimate a Probit model using the
regressors in 1.b).
a) Explain using your own words why we do not use the linear regression model and resort to
the Probit model?
The Probit model is used instead of the linear regression model because of the nature of the dependent
variable. Linear regression is well suited to continuous outcomes. With a binary outcome such as
"low_income", the fitted values of a linear regression are interpreted as probabilities but can fall
outside the valid range of 0 to 1, and the errors are heteroskedastic by construction.
The Probit model, in contrast, is designed for binary outcomes. It passes the linear index through the
standard normal cumulative distribution function, which keeps the predicted probabilities between 0 and 1.
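The contrast between the two models can be illustrated by taking a few values of the linear index and
comparing their direct reading as probabilities with the Probit transformation. The index values are
hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical values of the linear index x'beta
for index in (-0.4, 0.3, 1.3):
    valid_as_probability = 0.0 <= index <= 1.0
    print(f"index {index:+.1f}: valid if read as a probability? {valid_as_probability}; "
          f"Probit probability {normal_cdf(index):.4f}")
```

Read directly, as in a linear regression, the values -0.4 and 1.3 are not valid probabilities, whereas
their Probit transformations lie strictly between 0 and 1.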
b) What is the estimated coefficient of “married”? Is it statistically significant?
the estimated coefficient of “married”: -
Null Hypothesis: The coefficient of “married” is equal to zero.
Alternative Hypothesis: The coefficient of “married” is different from zero.
Because the p-value is less than 0.05, we can reject the null hypothesis. Therefore, “married” is
statistically significant. In a Probit model the coefficient indicates the direction of the effect on the
probability of low income, not its size.
c) Compute the predicted probability of being a low-income earner for a single 40-year-old male
with 12 years of schooling and no kids and for a single 40-year-old female with 12 years of
schooling and no kids.
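Predicted probabilities in a Probit model are obtained by evaluating the standard normal cumulative
distribution function at the linear index. The sketch below shows the computation for the two profiles
in the question. Every coefficient is a hypothetical placeholder rather than an estimate from this
exercise, and the coding gender = 1 for female is an assumption.

```python
from statistics import NormalDist

# Hypothetical Probit coefficients, for illustration only
coefs = {"const": 0.5, "educ": -0.10, "age": -0.02,
         "gender": 0.30, "married": -0.40, "nkids": 0.05}

def predicted_probability(coefs, profile):
    """P(low_income = 1) = Phi(x'beta) for one individual."""
    index = coefs["const"] + sum(coefs[name] * value for name, value in profile.items())
    return NormalDist().cdf(index)

# Single, 40 years old, 12 years of schooling, no kids; gender = 1 assumed to mean female
male = {"educ": 12, "age": 40, "gender": 0, "married": 0, "nkids": 0}
female = {"educ": 12, "age": 40, "gender": 1, "married": 0, "nkids": 0}
p_male = predicted_probability(coefs, male)
p_female = predicted_probability(coefs, female)
print(f"male: {p_male:.4f}  female: {p_female:.4f}")
```

Replacing the placeholders with the estimated coefficients gives the predicted probabilities for the two
individuals.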
d) Test the null hypothesis that gender and the number of children are jointly irrelevant for being
a “low_income” worker.
Null Hypothesis: The coefficients of "gender" and "nkids" are jointly equal to zero.
Alternative Hypothesis: At least one of the coefficients of "gender" and "nkids" is not equal to zero.
We use a Wald test to check the hypothesis.
Because the p-value is less than 0.05, we reject the null hypothesis: at least one of the coefficients of
"gender" and "nkids" is not equal to zero, so the two variables are jointly relevant for being a
"low_income" worker.
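The Wald statistic for two joint zero restrictions can be sketched as below. The coefficients and their
covariance matrix are hypothetical; in practice they are taken from the estimated Probit model.

```python
import math

def wald_two_restrictions(b1, b2, var1, var2, cov12):
    """Wald statistic and p-value for H0: b1 = 0 and b2 = 0.
    The statistic is b' V^{-1} b, with V the 2x2 covariance matrix."""
    det = var1 * var2 - cov12 ** 2
    wald = (b1 ** 2 * var2 - 2.0 * b1 * b2 * cov12 + b2 ** 2 * var1) / det
    # Chi-squared with 2 degrees of freedom: P(X > w) = exp(-w / 2)
    p_value = math.exp(-wald / 2.0)
    return wald, p_value

# Hypothetical estimates for "gender" and "nkids"
wald, p_value = wald_two_restrictions(b1=0.30, b2=0.05,
                                      var1=0.01, var2=0.0025, cov12=0.001)
print(round(wald, 4), round(p_value, 4))
```

The closed-form p-value used here holds only for two restrictions; with a different number of
restrictions the chi-squared distribution with the matching degrees of freedom applies.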