Paper
Exercise 1
We start by modeling individual income as a function of years of schooling and socio-demographic
factors. To this end, we estimate by OLS a regression model where the dependent variable is “income”
and the regressors are given by “educ”, “age”, “gender”, “married” and “nkids”.
a) Report your regression results and interpret the coefficient associated to “educ”. Is it
statistically significant?
The regression is jointly significant: the p-value of the overall F-test is below the 5 % significance
level. The coefficient on “educ” measures the change in income, in the units of “income”, associated
with one additional year of schooling, holding age, gender, marital status and number of children
constant. Its p-value is below 5 %, so “educ” is statistically significant.
b) Estimate the previous model using ln(income) as the dependent variable.
Report your regression results and interpret the coefficient associated to “educ”. Is it
statistically significant?
The regression is jointly significant: the p-value of the overall F-test is below the 5 % significance
level. Because the dependent variable is ln(income), the coefficient on “educ” multiplied by 100 is
approximately the percentage change in income associated with one additional year of schooling, holding
the other regressors constant. The p-value of “educ” is below 5 %, so the variable is statistically
significant. All the other independent variables are also statistically significant, with p-values below 0.05.
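The log-level interpretation can be illustrated with a short sketch. The coefficient value below is hypothetical and is not taken from the estimated model; the function name is also only illustrative. It compares the usual approximation 100*beta with the exact percentage change 100*(exp(beta) - 1).

```python
import math

def pct_effect(beta: float) -> tuple[float, float]:
    """Return (approximate, exact) percentage change in income for a
    one-unit increase in a regressor of a log-level model."""
    approx = 100.0 * beta
    exact = 100.0 * (math.exp(beta) - 1.0)
    return approx, exact

# Hypothetical coefficient on educ, NOT taken from the estimated model.
beta_educ = 0.08
approx, exact = pct_effect(beta_educ)
print(f"approximate: {approx:.2f}%  exact: {exact:.2f}%")
```

For small coefficients the two numbers are close, which is why the approximation is commonly used.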
c) Considering the model in b), include educ^2 as an additional regressor.
Does this model provide a better fit with respect to the model in b)? Please consider both R2
and adjusted R2 in your comments.
Based on the results, it is clear that the regression model is significant because the p-value is less than
the significance level of 5 %. Also, it is observed that all the independent variables are statistically
significant based on the p-values that are less than 0.05.
R-squared (R2) and adjusted R-squared are statistical measures used in regression analysis to assess the
goodness of fit of a model. They provide information about how well the independent variable(s) in a
regression equation explain the variability of the dependent variable.
In summary, while R-squared gives a measure of how well the model fits the data, adjusted R-squared
penalizes the number of predictors and provides a more realistic assessment of the model's goodness of
fit, especially when dealing with multiple predictors. R-squared never decreases when a regressor is
added, so adjusted R-squared is preferred when comparing models with different numbers of predictors.
If the adjusted R-squared does not increase when a regressor is added, the additional regressor does not
improve the fit enough to justify the extra parameter.
The adjusted R-squared is equal to 0.2849. It indicates that approximately 28.49% of the variability in
the dependent variable is explained by the independent variables in the model, taking into account the
number of predictors. The fit of this model is approximately the same as that of the model in b), so
adding educ^2 does not noticeably improve the fit.
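The adjustment behind the adjusted R-squared can be written as a small function. This is a minimal sketch: the sample size and the R-squared value used in the example are hypothetical, since the regression output is not reproduced here, and k denotes the number of regressors excluding the intercept.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with n observations and
    k regressors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: same R-squared, one extra regressor.
n_obs = 1000
print(adjusted_r2(0.28, n_obs, 5))  # model without the extra regressor
print(adjusted_r2(0.28, n_obs, 6))  # model with the extra regressor
```

If R-squared does not rise when a regressor is added, the adjusted R-squared falls, which is the sense in which it penalizes extra predictors.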
d) Considering the model in c), test the null hypothesis of no effect of the years of schooling on
income.
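Null hypothesis: The years of schooling have no effect on income (the coefficients on “educ” and educ^2
are both zero)
Alternative hypothesis: The years of schooling have an effect on income (at least one of the two
coefficients is different from zero)
The coefficient on “educ” has a t-statistic of 11.069, which exceeds the 5 % critical value of 1.96, and a
p-value below 0.05, so it is individually significant. Because schooling also enters through educ^2, the
null is a joint restriction on both coefficients and is properly tested with an F-test. The strong
individual significance of “educ” suggests that the joint null hypothesis is rejected as well.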
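The link between the reported t-statistic and its p-value can be sketched as follows. The sketch uses the large-sample normal approximation to the t distribution, which is an assumption; the value 11.069 is the t-statistic reported above.

```python
import math

def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

print(two_sided_p(1.96))    # close to 0.05, the usual critical value
print(two_sided_p(11.069))  # t-statistic reported for "educ"
```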
e) How much is an additional year of schooling worth when the individual has 12 years of
schooling? And when she has 16 years of schooling?
In the model in c), schooling enters through both “educ” and educ^2, so the value of an additional year
of schooling is not constant. The marginal effect on ln(income) is Beta1 + 2*Beta2*educ, where Beta1 is
the coefficient on “educ” and Beta2 is the coefficient on educ^2. Multiplied by 100, it is the
approximate percentage change in income.
When the individual has 12 years of schooling:
The marginal effect is Beta1 + 24*Beta2.
When the individual has 16 years of schooling:
The marginal effect is Beta1 + 32*Beta2.
The coefficient is equal to-*- =-*- =-
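When the model contains both “educ” and its square, the return to one more year of schooling depends on the level of schooling. The sketch below evaluates the derivative of ln(income) with respect to educ at 12 and 16 years. The two coefficient values are hypothetical placeholders, not the estimates from the regression.

```python
def marginal_return(b_educ: float, b_educ2: float, educ: float) -> float:
    """Derivative of ln(income) with respect to educ in a model with
    educ and educ^2: b_educ + 2 * b_educ2 * educ."""
    return b_educ + 2.0 * b_educ2 * educ

# Hypothetical coefficients, NOT the estimated ones.
b1, b2 = 0.04, 0.002
for years in (12, 16):
    effect = marginal_return(b1, b2, years)
    print(f"educ = {years}: about {100 * effect:.1f}% per extra year")
```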
f) Test the null hypothesis that marriage/civil partnership and having children have the same
effect on income.
Null Hypothesis: Marriage/civil partnership and having children have the same effect on income.
Alternative Hypothesis: Marriage/civil partnership and having children have different effects on income.
Based on the results, “married” is individually significant: its p-value is less than 0.05 and its
t-statistic of 10.428 exceeds 1.96. The p-value for “nkids” is equal to- and less than 0.05, and its
t-statistic of 2.456 exceeds 1.96. Both variables are therefore statistically significant, but individual
significance does not test the null hypothesis, which is the restriction that the two coefficients are
equal. That restriction is tested with a t-test (or F-test) on the difference between the coefficient on
“married” and the coefficient on “nkids”, using the standard error of the difference. The estimated
coefficients differ in magnitude; the test shows whether that difference is statistically significant.
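A test that two coefficients are equal uses the difference between them and the standard error of that difference, which depends on both variances and on the covariance of the two estimates. A minimal sketch follows; all numeric inputs are hypothetical, and the p-value relies on the normal approximation.

```python
import math

def equality_t_stat(b1: float, b2: float,
                    var1: float, var2: float, cov12: float) -> float:
    """t-statistic for H0: b1 = b2, given the estimated variances of
    the two coefficients and their covariance."""
    se_diff = math.sqrt(var1 + var2 - 2.0 * cov12)
    return (b1 - b2) / se_diff

# Hypothetical estimates for "married" and "nkids".
t = equality_t_stat(0.30, 0.10, 0.01, 0.01, 0.0)
p = math.erfc(abs(t) / math.sqrt(2.0))  # normal approximation
print(f"t = {t:.3f}, two-sided p = {p:.3f}")
```

The variances and the covariance would be read from the estimated coefficient covariance matrix of the regression.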
Exercise 2
This part refers to the model estimated in 1.c).
a) Test the null hypothesis that the coefficients estimated using the subset of employees are equal
to the coefficients estimated using the subset of self-employed. What is your conclusion?
Null Hypothesis: The coefficients for employees are equal to the coefficients for self-employed.
Alternative Hypothesis: The coefficients are different between employees and self-employed.
Based on the results, the coefficients are not the same between the two groups. In the employee group,
all the variables are statistically significant because all the p-values are less than 0.05. In the
self-employed group, only the p-values for “educ”, “gender” and “married” are less than 0.05 (0.007,
0.01, 0.0005). It can be concluded that the null hypothesis is rejected. In other words, the coefficients
for employees are different from the coefficients for self-employed. The formal test of this null
hypothesis is the Chow test, which compares the pooled regression with the two subsample regressions.
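The equality of coefficients across two subsamples is usually tested with a Chow F-statistic built from the residual sums of squares of the pooled and the separate regressions. The sketch below shows the formula only; the numeric inputs are hypothetical, and k counts all estimated parameters including the intercept.

```python
def chow_f(ssr_pooled: float, ssr_1: float, ssr_2: float,
           n: int, k: int) -> float:
    """Chow F-statistic for equal coefficients in two subsamples.
    n: total number of observations; k: parameters per regression."""
    ssr_unrestricted = ssr_1 + ssr_2
    numerator = (ssr_pooled - ssr_unrestricted) / k
    denominator = ssr_unrestricted / (n - 2 * k)
    return numerator / denominator

# Hypothetical sums of squared residuals.
f_stat = chow_f(ssr_pooled=120.0, ssr_1=50.0, ssr_2=50.0, n=114, k=7)
print(f"Chow F = {f_stat:.3f} with ({7}, {114 - 2 * 7}) degrees of freedom")
```

The statistic is compared with the F(k, n - 2k) critical value; a large value rejects equal coefficients.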
b) Using the approach by Breusch-Pagan, test the null hypothesis of homoskedasticity. What is
your conclusion?
Null Hypothesis: The errors are homoskedastic (constant variance).
Alternative Hypothesis: The errors are heteroskedastic (non-constant variance).
We used the studentized Breusch-Pagan test for homoskedasticity. The test statistic is 43.626 and the
corresponding p-value is less than 0.05. Therefore, we reject the null hypothesis: there is significant
evidence that heteroskedasticity is present in the model.
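The mechanics of the studentized Breusch-Pagan test can be sketched for the simplest case of one regressor: regress the squared OLS residuals on the regressor and use n times the R-squared of that auxiliary regression as the test statistic. This is an illustration on artificial data, not the model estimated above; with a single regressor the statistic is chi-squared with one degree of freedom, which allows an exact p-value with the standard library.

```python
import math

def ols_simple(x, y):
    """Fit y = a + b*x by OLS; return (residuals, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - sum(e * e for e in resid) / sst
    return resid, r2

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test with one regressor.
    Returns (LM statistic, p-value from chi-squared with 1 df)."""
    resid, _ = ols_simple(x, y)
    squared = [e * e for e in resid]
    _, r2_aux = ols_simple(x, squared)
    lm = len(x) * r2_aux
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Artificial data whose error size grows with x (heteroskedastic).
x = [float(i) for i in range(1, 21)]
y = [2.0 + 3.0 * xi + ((-1) ** i) * xi for i, xi in enumerate(x, start=1)]
lm, p = breusch_pagan(x, y)
print(f"LM = {lm:.2f}, p-value = {p:.5f}")
```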
c) Considering the statistical evidence from 2.a) and 2.b), do you believe the model in 1.c)
provides reliable estimates of the parameters and that the t-ratio and F tests are valid? If not,
how would you proceed?
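No. The test in 2.a) rejects the equality of the coefficients for employees and self-employed, so a single
pooled regression imposes restrictions that the data reject. The model should be estimated separately for
the two groups, or with a self-employment dummy interacted with the regressors. The test in 2.b) rejects
homoskedasticity: OLS remains unbiased and consistent under heteroskedasticity alone, but the usual
standard errors are invalid, so the t-ratios and F tests are not reliable unless heteroskedasticity-robust
standard errors are used. R2 does not settle the question: the R2 of model (a) is equal to 0.35 and the R2
of models (b) and (c) is equal to 0.28, but model (a) has income rather than ln(income) as the dependent
variable, so these values are not comparable.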
Exercise 3
a) Explain using your own words why the model considered so far may suffer from endogeneity
and the resulting consequences on the statistical properties of the OLS estimator. Please
indicate the endogenous variable(s) and the included exogenous variable(s).
Estimate the model in 1.b) by TSLS using the mother’s and father’s years of schooling as instrumental
variables.
Endogeneity arises when a regressor is correlated with the error term, violating a key assumption of OLS.
Here this is plausible for “educ”: unobserved factors such as ability affect both schooling and income
and end up in the error term. As a result, the OLS estimator is biased and inconsistent. An endogenous
regressor is correlated with the error term, while an exogenous regressor is not. Instrumental variables
techniques address endogeneity and give more reliable estimates.
When an endogenous variable is present, the OLS estimator is biased. The direction and magnitude of
the bias depend on the nature of the endogeneity. It could lead to either overestimation or
underestimation of the true relationships between variables. This bias can affect the reliability of the
statistical inferences drawn from the regression analysis.
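Dependent variable: ln(income)
Endogenous variable: educ
Included exogenous variables: age, gender, married, nkids
TSLS requires at least as many instruments as endogenous regressors. Model 1.b) has one endogenous
regressor, “educ”, and two instruments, the mother’s and the father’s years of schooling, so it is
overidentified and can be estimated by TSLS.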
In summary, the model suggests that education (educ) and gender are significant predictors of the log of
income after accounting for potential endogeneity issues using instrumental variables. The overall model
fit is statistically significant, as indicated by the Wald test.
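The idea behind instrumental variables estimation can be shown in its simplest form, with one endogenous regressor and one instrument, where the IV slope is cov(z, y) / cov(z, x). This is a simplification of the TSLS setup described above, which uses two instruments and additional exogenous regressors; the data are artificial and constructed so that the true slope is 2 and the error is correlated with x but not with z.

```python
def cov(a, b):
    """Sample covariance (divisor n; the divisor cancels in ratios)."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / n

def ols_slope(y, x):
    return cov(x, y) / cov(x, x)

def iv_slope(y, x, z):
    """IV estimator with a single instrument z for regressor x."""
    return cov(z, y) / cov(z, x)

# Artificial data: z is the instrument, u the unobserved error.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
u = [1.0, 0.0, -2.0, 0.0, 1.0]          # uncorrelated with z by construction
x = [zi + ui for zi, ui in zip(z, u)]   # x is correlated with u: endogenous
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

print("OLS slope:", ols_slope(y, x))    # biased away from 2
print("IV slope: ", iv_slope(y, x, z))  # recovers the true slope
```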
b) Explain using your own words why we might expect the instruments to be both relevant and
exogenous.
To ensure the validity and reliability of instrumental variable regression, the instruments need to meet
two criteria. Firstly, they should be relevant, that is, correlated with the endogenous variable. Parental
schooling is likely to be relevant because more educated parents tend to have more educated children, so
the mother’s and father’s years of schooling are correlated with “educ”. Secondly, the instruments must be
exogenous, meaning they are uncorrelated with the error term in the regression equation. Parental
schooling is plausibly exogenous if it affects the individual’s income only through the individual’s own
schooling and not directly or through unobserved factors such as ability or family connections. When both
conditions hold, the instruments isolate the variation in the endogenous variable that is unrelated to the
error term, which solves the endogeneity issue in the regression model.
c) Using appropriate statistical tests, determine whether the instruments are relevant and exogenous.
Null Hypothesis: The instruments are not jointly relevant in explaining the variation in the endogenous
variable.
Alternative Hypothesis: The instruments are jointly relevant in explaining the variation in the
endogenous variable.
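Relevance is checked with the F-test of the joint significance of the two instruments in the first-stage
regression of “educ” on the instruments and the included exogenous variables. A common rule of thumb treats
a first-stage F-statistic above 10 as evidence that the instruments are not weak. The R squared of the
income equation is not informative about instrument relevance, so a low R squared does not invalidate
this test.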
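The first-stage F-statistic compares the first-stage regression with and without the instruments. A minimal sketch of the formula follows; the sums of squared residuals, the sample size and the number of parameters are hypothetical, and the threshold of 10 is the usual rule of thumb rather than a formal critical value.

```python
def first_stage_f(ssr_restricted: float, ssr_unrestricted: float,
                  q: int, n: int, k_unrestricted: int) -> float:
    """F-statistic for the joint significance of q excluded instruments.
    k_unrestricted: parameters in the first stage including intercept."""
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k_unrestricted)
    return numerator / denominator

# Hypothetical first-stage results with two instruments.
f_stat = first_stage_f(150.0, 100.0, q=2, n=105, k_unrestricted=5)
print(f"first-stage F = {f_stat:.1f}, above 10: {f_stat > 10}")
```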
Null Hypothesis: The instruments are exogenous, that is, uncorrelated with the error term of the income
equation.
Alternative Hypothesis: At least one instrument is correlated with the error term of the income
equation.
Because there are two instruments and one endogenous regressor, the model is overidentified and the
exogeneity of the instruments can be checked with a test of overidentifying restrictions (Sargan test).
The Wu-Hausman test answers a different question, namely whether “educ” is endogenous. With such a small
p-value, we reject the null hypothesis that “educ” is exogenous. This indicates that there is evidence of
endogeneity in the model, and the instrumental variable estimates are preferred over the OLS estimates.
d) Considering 3.b)-3.c), should we prefer the OLS or the TSLS estimates? If appropriate, use a
statistical test to support your answer.
The preference between OLS and TSLS estimates depends on the diagnostic tests and results from the
instrumental variable (IV) analysis. The Wu-Hausman test indicates evidence of endogeneity (rejecting
the null hypothesis of consistent and efficient OLS estimates).
Because the p-value from the Hausman test is less than 0.05, we reject the null hypothesis in favor of the
alternative. This implies that the TSLS estimates are preferred over OLS due to the presence of
endogeneity.
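The logic of the Hausman comparison can be sketched for a single coefficient: the squared difference between the IV and OLS estimates is scaled by the difference of their variances and compared with a chi-squared distribution with one degree of freedom. All numbers below are hypothetical; the regression-based Wu-Hausman statistic reported by software is computed differently but tests the same null.

```python
import math

def hausman_scalar(b_iv: float, b_ols: float,
                   var_iv: float, var_ols: float):
    """Hausman statistic and p-value for one coefficient.
    Under H0 (regressor exogenous) the statistic is chi-squared(1)."""
    var_diff = var_iv - var_ols
    if var_diff <= 0:
        raise ValueError("var_iv must exceed var_ols for this form of the test")
    h = (b_iv - b_ols) ** 2 / var_diff
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value

# Hypothetical estimates of the coefficient on educ.
h, p = hausman_scalar(b_iv=0.15, b_ols=0.08, var_iv=0.0009, var_ols=0.0004)
print(f"H = {h:.2f}, p-value = {p:.4f}")
```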
e) Using your preferred estimates, i.e., OLS or TSLS, indicate and interpret the coefficient associated
with the regressor “gender”. Is it statistically significant?
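Null Hypothesis: Gender has no effect on income (the coefficient on “gender” is zero)
Alternative Hypothesis: Gender has an effect on income (the coefficient on “gender” is different from zero)
Since the dependent variable is ln(income), the coefficient on “gender” multiplied by 100 is approximately
the percentage difference in income between the two gender groups, holding the other regressors constant.
Because the p-value is equal to 0.007 and less than 0.05, we can reject the null hypothesis. Therefore,
gender is statistically significant.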
f) Suppose we have a panel of observations for 2015 and 2018. Also, note that the survey contains
information on adults who have completed their education by the time of the first survey. How
would you use the panel structure to eliminate endogeneity due to educ if the variable of interest
is “age”? Would such a methodology be helpful if the primary variable of interest is “gender”,
instead?
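The panel of observations for 2015 and 2018 offers a way to remove the endogeneity associated with “educ”
when studying the effect of “age”. Because all individuals completed their education before the first
survey, “educ” does not change between 2015 and 2018. Taking first differences between the two years (or
using individual fixed effects) therefore removes “educ” together with every time-invariant unobserved
trait, such as ability, that is correlated with education. The effect of “age” is then estimated from
changes within individuals over time. With only two waves, age increases by the same three years for
everyone, so a linear age effect cannot be separated from a common time effect unless age enters
non-linearly. The methodology is not helpful if the variable of interest is “gender”: gender is also
time-invariant, so it is differenced out and its coefficient cannot be estimated.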
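First differencing can be illustrated on a single hypothetical individual observed in both waves. The values and variable names in the example are made up; the point is only that variables that do not change between the two surveys, such as completed education and gender, difference out to zero.

```python
def first_difference(wave_0: dict, wave_1: dict, variables: list) -> dict:
    """Change in each variable between two survey waves."""
    return {name: wave_1[name] - wave_0[name] for name in variables}

# Hypothetical individual observed in 2015 and 2018.
person_2015 = {"ln_income": 10.20, "age": 40, "educ": 12, "gender": 1}
person_2018 = {"ln_income": 10.35, "age": 43, "educ": 12, "gender": 1}

diff = first_difference(person_2015, person_2018,
                        ["ln_income", "age", "educ", "gender"])
print(diff)  # educ and gender are 0: they drop out of the differenced model
```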
Exercise 4
Let us define a new dummy variable, “low_income”, equal to 1 if “income” is less than or equal to
9500 USD (the 10th percentile of the actual income distribution). Estimate a Probit model using the
regressors in 1.b).
a) Explain using your own words why we do not use the linear regression model and resort to
the Probit model?
The decision to use the Probit model instead of the linear regression model is grounded in the nature of
the dependent variable. Linear regression is well-suited for continuous outcomes, where the response
variable can take any value. However, when dealing with binary outcomes, like success or failure, the
linear regression model may lead to predictions outside the valid probability range of 0 to 1. In contrast,
the Probit model is specifically designed for binary outcomes. It employs a probit link function, which
ensures that predicted probabilities remain within the correct range, making it more appropriate for
modeling situations where the dependent variable is binary.
b) What is the estimated coefficient of “married”? Is it statistically significant?
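The estimated coefficient of “married” is -0.11886. The negative sign means that, holding the other
regressors constant, being married lowers the probability of being a low-income earner. In a Probit model
the coefficient is not itself the change in probability; it is the change in the index inside the normal
distribution function.
Null Hypothesis: “married” has no effect on the probability of low income
Alternative Hypothesis: “married” has an effect on the probability of low income
Because the p-value is less than 0.05, we can reject the null hypothesis. Therefore, “married” is
statistically significant.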
c) Compute the predicted probability of being a low-income earner for a single 40-year-old male
with 12 years of schooling and no kids and for a single 40-year-old female with 12 years of
schooling and no kids.
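In a Probit model the predicted probability is the standard normal distribution function evaluated at the linear index. The sketch below shows the calculation for the two profiles. Only the coefficient on “married” is the one reported in this paper; the other coefficients are hypothetical placeholders, and coding “gender” as 1 for female is an assumption.

```python
import math

def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_prob(coefs: dict, profile: dict) -> float:
    """Predicted probability: Phi(const + sum of coef * value)."""
    index = coefs["const"] + sum(coefs[name] * value
                                 for name, value in profile.items())
    return norm_cdf(index)

# Hypothetical coefficients, except "married" (reported estimate).
coefs = {"const": -0.5, "educ": -0.05, "age": -0.01,
         "gender": 0.2, "married": -0.11886, "nkids": 0.03}

# gender = 1 is ASSUMED to mean female.
male = {"educ": 12, "age": 40, "gender": 0, "married": 0, "nkids": 0}
female = {"educ": 12, "age": 40, "gender": 1, "married": 0, "nkids": 0}

print("P(low_income | male):  ", probit_prob(coefs, male))
print("P(low_income | female):", probit_prob(coefs, female))
```

Replacing the placeholder coefficients with the estimated ones gives the two probabilities asked for in the question.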
d) Test the null hypothesis that gender and the number of children are jointly irrelevant for being
a “low_income” worker.
Null Hypothesis: The coefficients of "gender" and "nkids" are jointly equal to zero.
Alternative Hypothesis: At least one of the coefficients of "gender" and "nkids" is not equal to zero.
We use a Wald test to check the hypothesis.
Because the p-value is less than 0.05, we can reject the null hypothesis. Therefore, at least one of the
coefficients of "gender" and "nkids" is not equal to zero.
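The Wald statistic for two joint zero restrictions is the quadratic form of the two coefficients in the inverse of their covariance matrix, and under the null it follows a chi-squared distribution with two degrees of freedom, whose upper tail probability is exp(-W/2). A minimal sketch with hypothetical coefficients, variances and covariance:

```python
import math

def wald_two(b1: float, b2: float,
             v11: float, v22: float, v12: float):
    """Wald statistic and p-value for H0: b1 = 0 and b2 = 0.
    v11, v22: variances of the estimates; v12: their covariance."""
    det = v11 * v22 - v12 ** 2
    # Quadratic form b' V^{-1} b written out for the 2x2 case.
    w = (b1 * b1 * v22 - 2.0 * b1 * b2 * v12 + b2 * b2 * v11) / det
    p_value = math.exp(-w / 2.0)  # chi-squared(2) upper tail
    return w, p_value

# Hypothetical estimates for "gender" and "nkids".
w, p = wald_two(0.20, 0.03, 0.0025, 0.0004, 0.0001)
print(f"W = {w:.2f}, p-value = {p:.4f}")
```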