Rizki Mayandi Hasibuan

Paper

Exercise 1

We start by modeling individual income as a function of years of schooling and socio-demographic factors. To this end, we estimate by OLS a regression model where the dependent variable is “income” and the regressors are “educ”, “age”, “gender”, “married” and “nkids”.

a) Report your regression results and interpret the coefficient associated to “educ”. Is it statistically significant?

The regression model is jointly significant because the p-value of the overall F-test is below the 5% significance level. The p-value of “educ” is also below 5%, so this variable is statistically significant. Since the dependent variable is income in levels, the coefficient on “educ” is the estimated change in income associated with one additional year of schooling, holding the other regressors constant.

b) Estimate the previous model using ln(income) as the dependent variable. Report your regression results and interpret the coefficient associated to “educ”. Is it statistically significant?

The regression model is again jointly significant because the p-value of the overall F-test is below 5%. The p-value of “educ” is also below 5%, so this variable is statistically significant. With ln(income) as the dependent variable, the coefficient on “educ” multiplied by 100 gives the approximate percentage change in income associated with one additional year of schooling, holding the other regressors constant. All the other independent variables are also statistically significant, with p-values below 0.05.

c) Considering the model in b), include educ² as an additional regressor. Does this model provide a better fit with respect to the model in b)? Please consider both R² and adjusted R² in your comments.

The regression model is jointly significant because the p-value of the overall F-test is below 5%, and all the independent variables are statistically significant, with p-values below 0.05. R-squared (R²) and adjusted R-squared are statistical measures used in regression analysis to assess the goodness of fit of a model.
They provide information about how well the independent variables in a regression equation explain the variability of the dependent variable. R-squared never decreases when a regressor is added, whereas adjusted R-squared penalizes the number of predictors and increases only if the added regressor has a t-statistic larger than 1 in absolute value. Adjusted R-squared is therefore preferred when comparing models with different numbers of predictors. The adjusted R-squared is equal to 0.2849: approximately 28.49% of the variability in the dependent variable is explained by the independent variables, after adjusting for the number of predictors. The fit of this model is approximately the same as that of the model in b), so including educ² does not noticeably improve the fit.

d) Considering the model in c), test the null hypothesis of no effect of the years of schooling on income.

Null Hypothesis: Years of schooling have no effect on income.
Alternative Hypothesis: Years of schooling have an effect on income.

The coefficient on “educ” has a p-value below 0.05 and a t-statistic of 11.069, which exceeds the critical value of 1.96, so the null hypothesis is rejected. Because the model in c) also contains educ², the complete version of this test is a joint F-test that the coefficients on “educ” and educ² are both zero; the individual result for “educ” already points towards rejection.

e) How much is an additional year of schooling worth when the individual has 12 years of schooling? And when she has 16 years of schooling?

We are interested in the coefficients on “educ” (β1) and educ² (β2). Because the model in c) contains both terms, the estimated effect of one additional year of schooling on ln(income) is not constant: it equals β1 + 2·β2·educ.

When the individual has 12 years of schooling: the estimated effect of an additional year is β1 + 24·β2.

When the individual has 16 years of schooling: the estimated effect of an additional year is β1 + 32·β2.

The coefficient is equal to-*- =-*- =-

f) Test the null hypothesis that marriage/civil partnership and having children have the same effect on income.

Null Hypothesis: Marriage/civil partnership and having children have the same effect on income.
Alternative Hypothesis: Marriage/civil partnership and having children have different effects on income.

“married” has a significant effect on income: its p-value is less than 0.05 and its t-statistic is 10.428, which exceeds 1.96. The p-value for “nkids” is equal to- and less than 0.05, and its t-statistic is 2.456, which also exceeds 1.96. Both variables are therefore individually significant. Individual significance, however, does not show whether the two effects are equal: that requires testing the restriction that the coefficient on “married” equals the coefficient on “nkids”, for example with a t-test on their difference or an F-test. The estimated magnitudes of the two effects are different.

Exercise 2

This part refers to the model estimated in 1.c).

a) Test the null hypothesis that the coefficients estimated using the subset of employees are equal to the coefficients estimated using the subset of self-employed. What is your conclusion?

Null Hypothesis: The coefficients for employees are equal to the coefficients for self-employed.
Alternative Hypothesis: The coefficients are different between employees and self-employed.

The results indicate that the coefficients are not the same in the two groups. In the employee group, all the variables are statistically significant because all the p-values are less than 0.05. In the self-employed group, only the p-values for “educ”, “gender” and “married” are less than 0.05 (0.007, 0.01, 0.0005). We conclude that the null hypothesis is rejected. In other words, the coefficients for employees are not equal to the coefficients for self-employed.
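The hypothesis in 2.a) is usually tested formally with a Chow test, which compares the residual sum of squares (SSR) of the pooled regression with the sum of the SSRs from the two subsample regressions. The following is only a minimal sketch of that statistic in Python. The function name `chow_f_statistic` and all numerical inputs are hypothetical placeholders, not values from the regression output discussed here; the parameter count of 7 assumes the model in 1.c) with a constant.

```python
def chow_f_statistic(ssr_pooled, ssr_group1, ssr_group2, n_obs, n_params):
    """Chow F statistic for H0: all coefficients are equal across two groups.

    ssr_pooled  : SSR of the regression on the full sample (restricted model)
    ssr_group1/2: SSRs of the separate regressions (unrestricted model)
    n_obs       : total number of observations
    n_params    : number of coefficients per regression, including the constant
    """
    ssr_unrestricted = ssr_group1 + ssr_group2
    numerator = (ssr_pooled - ssr_unrestricted) / n_params
    denominator = ssr_unrestricted / (n_obs - 2 * n_params)
    return numerator / denominator


# Hypothetical inputs for illustration only (NOT taken from the paper).
f_stat = chow_f_statistic(
    ssr_pooled=1200.0,   # pooled regression
    ssr_group1=700.0,    # employees
    ssr_group2=400.0,    # self-employed
    n_obs=1014,
    n_params=7,          # constant, educ, educ^2, age, gender, married, nkids
)
print(f"Chow F statistic: {f_stat:.3f}")
```

Under the null hypothesis the statistic follows an F distribution with k and n - 2k degrees of freedom (k coefficients, n observations), so it would be compared with the corresponding critical value.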
b) Using the approach by Breusch-Pagan, test the null hypothesis of homoskedasticity. What is your conclusion?

Null Hypothesis: The errors are homoskedastic (constant variance).
Alternative Hypothesis: The errors are heteroskedastic (non-constant variance).

We used the studentized Breusch-Pagan test. The test statistic is 43.626 and the corresponding p-value is less than 0.05, so we reject the null hypothesis: there is significant evidence that heteroskedasticity is present in the model.

c) Considering the statistical evidence from 2.a) and 2.b), do you believe the model in 1.c) provides reliable estimates of the parameters and that the t-ratio and F tests are valid? If not, how would you proceed?

The evidence suggests that the model in 1.c) is not fully reliable. The rejection in 2.a) indicates that a single set of coefficients does not describe both employees and self-employed, so the pooled estimates mix two different relationships. The rejection in 2.b) means that the usual OLS standard errors are not valid, and therefore neither are the t-ratios and F-tests based on them, although heteroskedasticity alone does not bias the coefficient estimates. To proceed, we would estimate the model separately for the two groups (or add interactions with an employment-type dummy) and use heteroskedasticity-robust standard errors. The R² of model a) is equal to 0.35 and the R² of models b) and c) is equal to 0.28, but these values are not directly comparable because the dependent variable is income in a) and ln(income) in b) and c).

Exercise 3

a) Explain using your own words why the model considered so far may suffer from endogeneity and the resulting consequences on the statistical properties of the OLS estimator. Please indicate the endogenous variable(s) and the included exogenous variable(s). Estimate the model in 1.b) by TSLS using the mother’s and father’s years of schooling as instrumental variables.

Endogeneity arises when a regressor is correlated with the error term, which violates a key assumption of OLS. Here, unobserved factors such as ability or motivation plausibly affect both years of schooling and income, so “educ” is likely to be correlated with the error term. In that case the OLS estimator is biased and inconsistent. It is therefore crucial to identify which variables are endogenous (correlated with the error term) and which are exogenous (uncorrelated with it). Instrumental variable techniques address endogeneity and give more reliable estimates. When an endogenous variable is present, the OLS estimator is biased.
The direction and magnitude of the bias depend on the nature of the endogeneity: it can lead to either overestimation or underestimation of the true relationships between variables, which affects the reliability of the statistical inferences drawn from the regression.

Endogenous variable: educ
Included exogenous variables: age, gender, married, nkids
Instruments: mother’s and father’s years of schooling

TSLS requires at least as many instruments as endogenous regressors. With two instruments and one endogenous regressor (“educ”), the model in 1.b) is overidentified and can be estimated by TSLS. In summary, the TSLS estimates suggest that education (“educ”) and gender are significant predictors of the log of income after accounting for potential endogeneity with instrumental variables. The overall model is statistically significant, as indicated by the Wald test.

b) Explain using your own words why we might expect the instruments to be both relevant and exogenous.

For instrumental variable regression to be valid, the instruments must meet two criteria. First, they must be relevant, that is, correlated with the endogenous variable, so that they capture meaningful variation in it. Second, they must be exogenous, that is, uncorrelated with the error term in the regression equation, so that they are not related to unobserved factors affecting the dependent variable. Parents’ schooling is plausibly relevant because more educated parents tend to have more educated children, and it is plausibly exogenous if it affects the child’s income only through the child’s own schooling. When both conditions hold, the instruments isolate the exogenous variation in the endogenous variable and address the endogeneity problem in the regression model.

c) Using appropriate statistical tests, determine whether the instruments are relevant and exogenous.

Null Hypothesis: The instruments are not jointly relevant in explaining the variation in the endogenous variable.
Alternative Hypothesis: The instruments are jointly relevant in explaining the variation in the endogenous variable.

We use an F-test to check relevance. The test is carried out in the first-stage regression of “educ” on the instruments and the included exogenous variables, and a common rule of thumb treats a first-stage F-statistic above 10 as evidence that the instruments are not weak. The low R-squared of the income equation does not by itself invalidate this test.

Null Hypothesis: The instruments are jointly exogenous (uncorrelated with the error term).
Alternative Hypothesis: At least one instrument is not exogenous.

Because there are two instruments for one endogenous regressor, instrument exogeneity can be checked with a test of overidentifying restrictions (Sargan test). We also use the Wu-Hausman test, whose null hypothesis is that “educ” is exogenous, so that OLS is consistent. With such a small p-value, we reject this null hypothesis. This indicates that there is evidence of endogeneity in the model, and the instrumental variable estimates are preferred over the OLS estimates.

d) Considering 3.b)-3.c), should we prefer the OLS or the TSLS estimates? If appropriate, use a statistical test to support your answer.

The choice between OLS and TSLS depends on the diagnostic tests from the instrumental variable (IV) analysis. The null hypothesis of the Wu-Hausman test is that the OLS estimates are consistent and efficient. Because the p-value of the test is less than 0.05, we reject the null hypothesis in favor of the alternative. This implies that the TSLS estimates are preferred over OLS due to the presence of endogeneity.

e) Using your preferred estimates, i.e., OLS or TSLS, indicate and interpret the coefficient associated with the regressor “gender”. Is it statistically significant?

Null Hypothesis: Gender doesn’t have a significant effect on income.
Alternative Hypothesis: Gender has a significant effect on income.

Because the p-value is equal to 0.007 and less than 0.05, we can reject the null hypothesis. Therefore, gender is statistically significant. Since the dependent variable is ln(income), the coefficient on “gender” measures the approximate proportional difference in income between the two gender groups, holding the other regressors constant.

f) Suppose we have a panel of observations for 2015 and 2018.
Also, note that the survey contains information on adults who have completed their education by the time of the first survey. How would you use the panel structure to eliminate endogeneity due to educ if the variable of interest is “age”? Would such a methodology be helpful if the primary variable of interest is “gender”, instead?

Having observations for the same individuals in 2015 and 2018 offers a way to mitigate the endogeneity associated with “educ” when studying the effect of “age”. Panel data make it possible to study changes within individuals over time and so to account for individual-specific traits that could be linked to both education and income. Because education is completed by the first survey, “educ” is constant within each individual, so taking first differences between 2018 and 2015 (or using individual fixed effects) removes “educ” together with any time-invariant unobserved characteristics correlated with it. The same transformation also removes “gender”, which does not change over time, so this methodology would not be helpful if “gender” were the variable of interest.

Exercise 4

Let us define a new dummy variable, “low_income”, equal to 1 if “income” is less than or equal to 9500 USD (the 10th percentile of the actual income distribution). Estimate a Probit model using the regressors in 1.b).

a) Explain using your own words why we do not use the linear regression model and resort to the Probit model?

The choice of the Probit model over the linear regression model is grounded in the nature of the dependent variable. Linear regression is well suited to continuous outcomes, where the response variable can take any value. With a binary outcome such as “low_income”, however, the linear regression model can produce predicted probabilities outside the valid range of 0 to 1. The Probit model is designed for binary outcomes: it passes the linear index through the standard normal distribution function, which keeps the predicted probabilities between 0 and 1.

b) What is the estimated coefficient of “married”? Is it statistically significant?
The estimated coefficient of “married” is -0.11886.

Null Hypothesis: “married” doesn’t have a significant effect on low income.
Alternative Hypothesis: “married” has a significant effect on low income.

Because the p-value is less than 0.05, we can reject the null hypothesis. Therefore, “married” is statistically significant.

c) Compute the predicted probability of being a low-income earner for a single 40-year-old male with 12 years of schooling and no kids and for a single 40-year-old female with 12 years of schooling and no kids.

d) Test the null hypothesis that gender and the number of children are jointly irrelevant for being a “low_income” worker.

Null Hypothesis: The coefficients of “gender” and “nkids” are jointly equal to zero.
Alternative Hypothesis: At least one of the coefficients of “gender” and “nkids” is not equal to zero.

We use a Wald test to check the hypothesis. Because the p-value is less than 0.05, we can reject the null hypothesis. Therefore, at least one of the coefficients of “gender” and “nkids” is not equal to zero.
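The predicted probabilities asked for in c) and the joint test in d) can be illustrated with a short Python sketch. It assumes that the Probit probability is Φ(x'β) and that the Wald statistic for two restrictions is compared with a chi-square distribution with 2 degrees of freedom. Apart from the “married” coefficient reported in b), every coefficient, variance and covariance below is a hypothetical placeholder, and the coding gender = 1 for female is an assumption, so the resulting numbers are not results of this analysis.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_probability(coefs, x):
    """P(low_income = 1 | x) = Phi(x'beta) for a Probit model."""
    index = sum(coefs[name] * value for name, value in x.items())
    return std_normal_cdf(index)


def wald_two_restrictions(b1, b2, var1, var2, cov12):
    """Wald statistic and p-value for H0: b1 = 0 and b2 = 0."""
    det = var1 * var2 - cov12 ** 2
    # Quadratic form b' V^{-1} b with the 2x2 inverse written out by hand.
    stat = (b1 * b1 * var2 - 2.0 * b1 * b2 * cov12 + b2 * b2 * var1) / det
    # For 2 degrees of freedom the chi-square survival function is exp(-W/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical coefficients; only "married" comes from the answer to b).
coefs = {"const": 0.5, "educ": -0.1, "age": -0.01,
         "gender": 0.3, "married": -0.11886, "nkids": 0.05}

# Single, 40 years old, 12 years of schooling, no kids.
male = {"const": 1, "educ": 12, "age": 40, "gender": 0, "married": 0, "nkids": 0}
female = dict(male, gender=1)  # assumes gender = 1 denotes female

p_male = probit_probability(coefs, male)
p_female = probit_probability(coefs, female)
print(f"P(low_income | male)   = {p_male:.4f}")
print(f"P(low_income | female) = {p_female:.4f}")

# Hypothetical variances and covariance of the gender and nkids estimates.
wald_stat, wald_p = wald_two_restrictions(
    b1=coefs["gender"], b2=coefs["nkids"], var1=0.01, var2=0.0004, cov12=0.0)
print(f"Wald statistic = {wald_stat:.2f}, p-value = {wald_p:.5f}")
```

With the estimated coefficients and their covariance matrix in place of the placeholders, the same two functions would give the probabilities for c) and the test for d).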
