---
title: "Econometrics PS3"
author: "Cheryl Lim"
date: "2024-10-16"
output:
  pdf_document:
    latex_engine: xelatex
    df_print: paged
geometry: margin = 0.3in
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = TRUE, error = TRUE)
library(haven)
library(sandwich)
library(lmtest)
library(dplyr)
library(ggplot2)
library(stargazer)
library(broom)
nichh <- read_dta("nichh.dta", encoding="utf-8")
```
## Exercise A: "Legal Determinants of World Cup Success" Paper (by Mark West)
### 1a. Critique of the econometric strategy used in “Legal determinants of World Cup Success” by Mark West
In this article, we detect some potential violations of the Gauss-Markov Assumptions.
The paper's model specification for each country is
\begin{center} $\text{FIFA Ranking} = \beta_0 + \beta_1 \times \text{Rule of Law} + \beta_2 \times \text{Antidirector Rights} + \beta_3 \times \text{Origin} + \beta_4 \times \text{NumProPlayers} + u$ \end{center}
The proposed model aims to explain a country's FIFA ranking mainly through the rule of law, together with other independent variables such as the strength of antidirector rights and the number of professional players. While the model appears linear, certain violations of the Gauss-Markov assumptions may lead to biased or inefficient estimates, which would compromise the validity of the conclusions drawn in the paper.
**1. Violation of Random Sampling (Independence of Observations):**
One key Gauss-Markov assumption is that the sample consists of independently and identically distributed (i.i.d.) observations. However, the dataset covers only 21 or 49 countries. Countries that reached the World Cup finals are an inherently non-random selection, and the factors behind that selection likely influence their success beyond the legal determinants being analyzed. For example, the sampled countries may have not only strong legal frameworks but also a rich footballing culture and a history of international success. These traits are likely correlated with factors such as football infrastructure or investment in youth development, which are omitted from the model.
**2. Linearity**: The model assumes a linear relationship between the explanatory variables (e.g., Rule of Law, French Origin) and FIFA points, but the true relationship could be more complex. For instance, the effect of the number of professional soccer players per capita on FIFA points may exhibit diminishing returns or threshold effects. If the relationship is non-linear but the model assumes linearity, the estimated coefficients could be biased and inconsistent, leading to incorrect inferences about how legal variables affect soccer success.
**3. Homoscedasticity**: The variance of the error term might not be constant across all levels of the independent variables. For example, wealthier countries with strong football programs may have more predictable FIFA rankings, while poorer countries could see more fluctuation due to unstable investment in their soccer programs. This would produce heteroscedasticity, where errors for established soccer nations (e.g., Brazil or Germany) are much smaller than those for less developed soccer nations.
- A country with weaker football investment or institutions can have unpredictable performance (due, for instance, to a once-in-a-lifetime naturally talented team), which leads to non-constant error variance. Under heteroscedasticity, OLS is no longer efficient and the usual standard errors are invalid.
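The heteroscedasticity concern can be checked formally. The sketch below uses simulated data, not the paper's dataset: `investment` and `fifa_points` are hypothetical variables whose error variance is built to shrink as investment grows, and the test is a hand-computed Breusch-Pagan statistic in base R.

```r
# Simulated illustration of heteroscedasticity (hypothetical data, base R only)
set.seed(3)
n <- 500
investment  <- runif(n, min = 1, max = 10)        # hypothetical football investment index
u           <- rnorm(n, sd = 5 / investment)      # error spread shrinks as investment grows
fifa_points <- 10 + 2 * investment + u

fit <- lm(fifa_points ~ investment)

# Breusch-Pagan test (studentized form): regress squared residuals on the regressors
aux       <- lm(resid(fit)^2 ~ investment)
bp_stat   <- n * summary(aux)$r.squared           # LM statistic, chi-squared with 1 df under H0
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), " p-value:", signif(bp_pvalue, 3), "\n")
```

A small p-value rejects the null of constant error variance. With the `sandwich` and `lmtest` packages loaded in the setup chunk, `coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))` would then report heteroscedasticity-robust standard errors.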
**4. Omitted Variable Bias**: The assumption of a zero conditional mean implies that the error term $u$ should have an expected value of zero, conditional on the independent variables. However, in this model, the error term may capture omitted variables that influence FIFA rankings but are correlated with the included regressors. For instance, cultural factors or the popularity of football in a country could significantly affect a nation’s FIFA ranking. These cultural factors may also be correlated with the rule of law or the number of professional players which leads to omitted variable bias. If football is deeply embedded in a country's national identity, it could result in better training systems, greater player investment, and supportive legal structures. All of this creates a correlation with both the rule of law and the number of professional players. Another important omitted variable is population size. Countries with larger populations will likely have a larger talent pool, increasing the chances of developing top professional players. If this is not accounted for, it would likely be absorbed into the error term, which could be correlated with variables like "NumProPlayers," introducing bias.
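The omitted-variable mechanism can be illustrated with a small simulation. This is a sketch on made-up data, not the paper's dataset: `culture`, `rule_of_law` and `fifa_points` are hypothetical, and the coefficients are chosen arbitrarily so that the true effect of `rule_of_law` is 0.5.

```r
# Simulated illustration of omitted variable bias (all names and values are hypothetical)
set.seed(1)
n <- 5000
culture     <- rnorm(n)                           # unobserved football culture
rule_of_law <- 0.6 * culture + rnorm(n)           # regressor correlated with the omitted factor
fifa_points <- 1 + 0.5 * rule_of_law + 2 * culture + rnorm(n)

short_fit <- lm(fifa_points ~ rule_of_law)             # culture omitted
long_fit  <- lm(fifa_points ~ rule_of_law + culture)   # culture included

b_short <- unname(coef(short_fit)["rule_of_law"])
b_long  <- unname(coef(long_fit)["rule_of_law"])

cat("Coefficient with culture omitted: ", round(b_short, 3), "\n")
cat("Coefficient with culture included:", round(b_long, 3), "\n")
```

When `culture` is left out, its effect loads onto `rule_of_law` and the estimate moves well away from 0.5. Including the omitted factor brings it back close to the true value.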
**5. Multicollinearity**: While perfect multicollinearity is unlikely, there could be high correlations between some of the independent variables in the model, which can inflate standard errors and reduce the precision of the coefficient estimates. Variables like “Rule of Law”, “Antidirector Rights” and “Origin” all reflect the quality of a country’s institutions, making it plausible that they are highly correlated. For example, countries with a strong rule of law are also likely to have strong antidirector rights, and these factors could be influenced by a shared legal origin (e.g., civil law versus common law). Additionally, certain legal origins (such as French or Scandinavian) might be associated with high Rule of Law scores, creating collinearity between these variables.
**6. Endogeneity**: There could be reverse causality between soccer success and some of the predictors. For example, strong legal systems might not only affect soccer success but also be influenced by a country’s global stature, including soccer performance.
### 1b. Analysis of estimations related to the determinants of World Cup success
The variability of the results across specifications leads us to question whether the estimates are robust. Given the small sample size used for each specification, we should also not rely entirely on their outcomes. Moreover, the paper makes no attempt at a causal interpretation, so the model tells us little about what determines World Cup success, and the conclusions are limited to correlations.
However, if we had to select one specification as a starting point, the one with the highest $R^2$ might be the best candidate for establishing the relation between legal origin and soccer performance. Within the limitations of the data, it explains the most variance in FIFA ranking points among the tested models, which could provide insights into potential correlations.
Given the limitations identified above, a better model of World Cup ranking would include variables capturing sports-related legal institutions, cultural practices, and investment in infrastructure, to address potential sources of endogeneity and omitted variable bias.
### 2. Effect of missing observations on estimates
**True** - Missing observations can affect both the precision and bias of estimates.
- `Precision`: Whether they are missing at random or not, missing observations reduce the sample size, which in turn reduces the precision of the estimates. Because the standard error is inversely related to the square root of the sample size, fewer observations mean larger standard errors: the sampling variance of the estimator rises, so the estimate is less reliable. This is reflected in wider confidence intervals and less certainty about the estimated coefficients.
- `Bias`: Whether missing observations lead to bias depends on why the data are missing. If the data are missing completely at random, so that the likelihood of a data point being missing is unrelated to the data itself, the estimates remain unbiased (although less precise, because of the smaller sample). If observations are missing based on the value of an independent variable (i.e. we only observe a subpopulation), the estimates also remain unbiased, as long as there is enough variation in the independent variables within this subpopulation. However, if the data are missing not at random, with missingness systematically related to the dependent variable, the sample is no longer representative of the population and the estimates are biased: the expected value of the estimator differs from the true parameter. For example, in the context of the link between the quality and origin of the legal system and success in the World Cup, if countries that did not make it to the finals are systematically missing from the dataset, the dataset does not represent the full range of variation in the global population. Similarly, if countries with weaker legal institutions are missing “rule of law” data for reasons that are also related to their soccer performance, this would introduce bias. In both cases the missing observations are not random; they are systematically related to the outcome variable of interest (FIFA ranking) and lead to biased estimates.
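The three missingness cases above can be compared in a short simulation. This is a sketch on artificial data with a true slope of 2; the variable names and selection rules are assumptions made purely for illustration.

```r
# Simulated illustration of how the missingness mechanism affects the OLS slope (hypothetical data)
set.seed(4)
n <- 20000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                         # true slope is 2

# Helper: OLS slope using only the observations flagged in `keep`
slope <- function(keep) unname(coef(lm(y[keep] ~ x[keep]))[2])

slope_full     <- slope(rep(TRUE, n))             # no missing data
slope_mcar     <- slope(runif(n) < 0.5)           # half the sample dropped completely at random
slope_select_x <- slope(x > 0)                    # selection on the independent variable
slope_select_y <- slope(y > 1)                    # selection on the dependent variable

round(c(full = slope_full, mcar = slope_mcar,
        select_x = slope_select_x, select_y = slope_select_y), 3)
```

The first three slopes stay close to 2 (with larger standard errors in the reduced samples), while selecting on the outcome pulls the slope toward zero.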
## Computer Exercise
```{r log-linear-model, echo=FALSE, results='asis'}
reg <- lm(log(constot) ~ adultm + adultf + headage + headeduc + healthkm + schoolkm + landarea, data = nichh)
reg_results <- tidy(reg)
stargazer(reg, type = "latex",
single.row = TRUE,
no.space = TRUE,
column.sep.width = "3pt",
title = "Regression Results for Log-linear Model",
covariate.labels = c("Male Adults", "Female Adults", "Head's Age", "Head's Education",
"Distance to Health Center (km)", "Distance to School (km)", "Land Owned"),
dep.var.labels = "Log of Total Household Consumption",
digits = 3,
header = FALSE,
out = "regression1_results.html") # saves output to a file for github
```
```{r, include=FALSE}
education_p_value <- summary(reg)$coefficients["headeduc", "Pr(>|t|)"] # p-value of headeduc, selected by name (row 4 is headage)
```
**1. Minimum significance level to reject null hypothesis**
- $H_0:$ Household head’s education has no effect on consumption.
- $H_1:$ Household head’s education affects consumption.
The minimum significance level at which we can reject the null hypothesis that the education level of the household head does not affect total household consumption is, by definition, the p-value of the coefficient on the household head's education: we reject $H_0$ at any significance level above the p-value and fail to reject at any level below it. Estimating the log-linear regression model, we find the p-value for Head's Education to be `r education_p_value`. As also shown in Table 1, the coefficient is highly statistically significant, with a p-value below any conventional significance level (e.g. 0.05, 0.01, or 0.001). Thus, there is strong evidence that the household head's education affects total household consumption, controlling for other variables such as age, number of male and female adults, land owned, and distances to school and health center. The minimum significance level at which we can reject the null hypothesis is therefore `r education_p_value`.
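The link between the p-value and the rejection decision can be made concrete. The sketch below uses simulated data rather than `nichh`: `educ` and `log_cons` are hypothetical stand-ins, and the coefficient 0.05 is arbitrary.

```r
# Simulated illustration: the p-value is the smallest alpha at which H0 is rejected (hypothetical data)
set.seed(5)
n <- 300
educ     <- rnorm(n, mean = 6, sd = 3)            # hypothetical years of schooling
log_cons <- 8 + 0.05 * educ + rnorm(n)            # hypothetical log consumption

fit    <- lm(log_cons ~ educ)
t_educ <- summary(fit)$coefficients["educ", "t value"]
p_educ <- summary(fit)$coefficients["educ", "Pr(>|t|)"]

# The reported p-value is the two-sided tail probability of the t distribution
p_manual <- 2 * pt(-abs(t_educ), df = df.residual(fit))

# H0 is rejected exactly at those significance levels that exceed the p-value
alphas <- c(0.10, 0.05, 0.01, 0.001)
data.frame(alpha = alphas, reject_H0 = p_educ < alphas)
```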
\newpage
**2. Test of significance**
We need to test a hypothesis involving two coefficients from the same model. We state the null hypothesis and the (one-sided) alternative hypothesis as follows:
\begin{center} $H_0: \beta_m \leq \beta_f \Longleftrightarrow \beta_m - \beta_f \leq 0$ \end{center}
\begin{center} $H_1: \beta_m > \beta_f$ (one-sided) \end{center}
To test this, define $\theta = \beta_m - \beta_f$, so that the null hypothesis becomes $\theta \leq 0$, with the boundary case $\theta = 0$ used to compute the test statistic. Substituting $\beta_m = \theta + \beta_f$ into the regression gives
\begin{center} $\log(constot) = \beta_0 + \beta_m\, adultm + \beta_f\, adultf + ... + \epsilon$ \end{center}
\begin{center} $\log(constot) = \beta_0 + (\theta+\beta_f)\, adultm + \beta_f\, adultf + ... + \epsilon$ (substituting $\beta_m$) \end{center}
\begin{center} $\log(constot) = \beta_0 + \theta\, adultm + \beta_f\, (adultm + adultf) + ... + \epsilon$ \end{center}
Thus, we rerun the regression
```{r}
reg2 <- lm(log(constot) ~ adultm + I(adultm + adultf) + headage + headeduc + healthkm + schoolkm + landarea, nichh) # transformed regression
reg2pvalue <- summary(reg2)$coefficients["adultm", "Pr(>|t|)"] # two-sided p-value for theta = beta_m - beta_f
```
To test whether the coefficients for male and female adults are statistically different from each other (or, in this case, whether the male coefficient is greater than the female one), we use a t-test for the difference between the two coefficients. The test statistic is the estimated coefficient for male adults minus the estimated coefficient for female adults, divided by the standard error of that difference. Because both estimates come from the same regression, this standard error must account for their covariance:
\begin{center} $t = \frac{\hat{\beta}_{\text{male}} - \hat{\beta}_{\text{female}}}{\sqrt{\text{Var}(\hat{\beta}_{\text{male}}) + \text{Var}(\hat{\beta}_{\text{female}}) - 2\,\text{Cov}(\hat{\beta}_{\text{male}}, \hat{\beta}_{\text{female}})}}$ \end{center}
- The calculated t-statistic is -2.25, so the estimated coefficient for female adults is actually greater than the coefficient for male adults (the opposite of the alternative hypothesis). The transformed regression delivers this kind of statistic directly, as the t-statistic on `adultm`. We still need to check what this implies for our hypotheses by computing the p-value.
- The test statistic follows a t-distribution, which approaches the normal distribution when the sample size is large (as is the case here, with 567 degrees of freedom). The further the test statistic lies from zero, the stronger the evidence that the two coefficients differ. The two-sided p-value reported by the regression, `r reg2pvalue`, is the probability of observing a test statistic at least as far from zero as -2.25 if $\theta = 0$, so the two coefficients are statistically different at the 5% significance level. For the one-sided alternative $\beta_m > \beta_f$, however, only large positive values of the statistic count as evidence against $H_0$, and the p-value is the upper-tail probability of the t-distribution, obtained from its cumulative distribution function. Since the statistic is negative, this one-sided p-value equals one minus half the two-sided p-value, i.e. `r 1 - reg2pvalue/2`, so we cannot reject $H_0$: there is no evidence that the returns to male adults exceed those to female adults, and the significant difference runs in the opposite direction.
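The equivalence between the direct formula (with the covariance term) and the reparametrized regression can be verified numerically. The sketch below uses simulated data, not `nichh`: the sample size, the Poisson household counts and the coefficients 0.10 and 0.20 are assumptions chosen only for illustration.

```r
# Simulated check: t-test on beta_m - beta_f, computed two equivalent ways (hypothetical data)
set.seed(7)
n <- 575
adultm   <- rpois(n, 1.5)                         # hypothetical number of male adults
adultf   <- rpois(n, 1.5)                         # hypothetical number of female adults
log_cons <- 8 + 0.10 * adultm + 0.20 * adultf + rnorm(n, sd = 0.5)

fit <- lm(log_cons ~ adultm + adultf)
b   <- coef(fit)
V   <- vcov(fit)                                  # estimated covariance matrix of the coefficients

# Direct approach: the standard error of the difference includes the covariance term
diff_hat <- b["adultm"] - b["adultf"]
se_diff  <- sqrt(V["adultm", "adultm"] + V["adultf", "adultf"] - 2 * V["adultm", "adultf"])
t_direct <- unname(diff_hat / se_diff)

# Reparametrized approach: the coefficient on adultm is now theta = beta_m - beta_f
fit_theta <- lm(log_cons ~ adultm + I(adultm + adultf))
t_theta   <- summary(fit_theta)$coefficients["adultm", "t value"]

# One-sided p-value for H1: beta_m > beta_f (upper tail of the t distribution)
p_one_sided <- pt(t_direct, df = df.residual(fit), lower.tail = FALSE)

cat("t (direct):", round(t_direct, 4), " t (theta):", round(t_theta, 4), "\n")
cat("one-sided p-value:", signif(p_one_sided, 4), "\n")
```

Both routes give the same t-statistic, which is why rerunning the regression with `I(adultm + adultf)` is a convenient shortcut for obtaining the standard error of the difference.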
**3. Joint significance of the 2 distance variables**
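To test a joint hypothesis that imposes several restrictions at once, a single t-test is not enough; we need an F-test. The formula is as follows:
\begin{center} $F \equiv \frac{(R^2_{ur}-R^2_r)/q}{(1-R^2_{ur})/(n-k-1)}$ (reject for large values of $F$) \end{center}
Here $R^2_{ur}$ and $R^2_r$ are the $R^2$ of the unrestricted and restricted models, $q$ is the number of restrictions, and $n-k-1$ is the residual degrees of freedom of the unrestricted model.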
```{r, include=FALSE}
reg3 <- lm(log(constot) ~ adultm + adultf + headage + headeduc + landarea, nichh) # restricted regression
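F_statistic <- ((summary(reg)$r.squared - summary(reg3)$r.squared)/2) / ((1-summary(reg)$r.squared)/df.residual(reg)) # F = 2.51 with (2, 567) degrees of freedom
p_value_2 <- pf(F_statistic, 2, df.residual(reg), lower.tail = FALSE) # upper-tail probability; df() would return the density, not a p-value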
```
- $H_0:$ The coefficients on the two distance variables (distance to school and to the health center) are both equal to zero, i.e. $\beta_{healthkm} = \beta_{schoolkm} = 0$.
- $H_1:$ At least one of the two coefficients is different from zero.
We can use an F-test to test whether the two variables are jointly significant, i.e. whether their coefficients are significantly different from zero when considered together. We construct a restricted regression model, which removes the distance variables, and compare it to the original model to see whether the distance variables add meaningful explanatory power. We compute the $R^2$ of both models and plug them into the formula for the F-statistic, with $q = 2$ restrictions: $F \equiv \frac{(R^2_{ur}-R^2_r)/q}{(1-R^2_{ur})/(n-k-1)}$ = `r F_statistic`.
Then, calculating the p-value, we find it to be `r p_value_2`: if the null hypothesis is true, there is a `r p_value_2` probability of observing an F-statistic at least as large as `r F_statistic`. The variables are therefore jointly significant at the 10% level: at this significance level we can reject the null hypothesis and conclude that at least one of the distance variables affects household consumption. Overall, isolation has an impact on household welfare at the 10% level, but not at stricter conventional levels such as 5%.
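The $R^2$ form of the F-statistic can be cross-checked against R's built-in model comparison. The sketch below runs on simulated data, not `nichh`: `x1`, `healthkm`, `schoolkm` and all coefficients are hypothetical. Note that `pf(..., lower.tail = FALSE)` returns the upper-tail probability, whereas `df()` returns the density of the F distribution.

```r
# Simulated check: R-squared form of the F-test versus anova() (hypothetical data)
set.seed(8)
n <- 575
x1       <- rnorm(n)
healthkm <- rexp(n, rate = 0.2)                   # hypothetical distance to health center
schoolkm <- rexp(n, rate = 0.5)                   # hypothetical distance to school
log_cons <- 8 + 0.3 * x1 - 0.01 * healthkm - 0.02 * schoolkm + rnorm(n)

unrestricted <- lm(log_cons ~ x1 + healthkm + schoolkm)
restricted   <- lm(log_cons ~ x1)                 # imposes both distance coefficients = 0

q     <- 2                                        # number of restrictions
r2_ur <- summary(unrestricted)$r.squared
r2_r  <- summary(restricted)$r.squared
df_ur <- df.residual(unrestricted)                # n - k - 1

F_manual <- ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)
p_manual <- pf(F_manual, q, df_ur, lower.tail = FALSE)   # p-value: upper tail of F(q, df_ur)

F_anova <- anova(restricted, unrestricted)$F[2]   # same test via nested-model comparison

cat("F (manual):", round(F_manual, 4), " F (anova):", round(F_anova, 4),
    " p-value:", signif(p_manual, 4), "\n")
```

The two statistics coincide because both models share the same dependent variable and sample, so the $R^2$ form and the residual-sum-of-squares form of the test are algebraically identical.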
**4. Correlation between household's educ and age**
```{r, echo=FALSE, warning=FALSE}
corr_educ_age <- cor(nichh$headeduc, nichh$headage, use = "complete.obs") # Pearson correlation between head's education and age
```
```{r, corrplot, echo=FALSE, warning=FALSE, fig.align='center'}
# correlation plot
corplot <- ggplot(nichh, aes(x = headeduc, y = headage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "headeduc", y = "headage") +
  # annotate() draws the label once; geom_text() would redraw it for every row
  annotate("text", x = max(nichh$headeduc, na.rm = TRUE), y = max(nichh$headage, na.rm = TRUE),
           label = paste("Correlation:", round(corr_educ_age, 3)),
           hjust = 1, vjust = 1)
# save the plot as a PNG file
ggsave("headeduc_headage_plot.png", plot = corplot, width = 6, height = 4, dpi = 300)
```
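Calculating the correlation between the household head's education and age, we obtain a value of `r corr_educ_age`.
There is a moderate negative correlation between the household head's age and education. This correlation has two implications:
- Multicollinearity: because the two regressors are correlated, their coefficients are estimated less precisely (larger standard errors), although the estimates remain unbiased.
- Omitted variable bias: if one of the two variables were left out of the regression, the coefficient on the other would pick up part of its effect.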
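The size of the multicollinearity cost can be gauged with a variance inflation factor. The arithmetic below is a rough sketch that treats `headage` and `headeduc` as if they were the only two regressors and plugs in the correlation of -0.3117732 shown on the plot. In the full model, the VIF would instead use the $R^2$ from regressing one variable on all the other regressors.

```r
# Back-of-the-envelope variance inflation factor from the pairwise correlation
# (assumption: only two regressors, so the auxiliary R-squared equals r^2)
r <- -0.3117732                                   # correlation between headeduc and headage
vif          <- 1 / (1 - r^2)                     # factor by which the sampling variance is inflated
se_inflation <- sqrt(vif)                         # corresponding factor for the standard error

cat("VIF:", round(vif, 4), " SE inflation:", round(se_inflation, 4), "\n")
```

Under this simplification the standard errors are only about 5% larger than they would be with uncorrelated regressors, so the efficiency loss from this correlation is modest.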
**5. Coefficient Test**
We need to test a hypothesis on a single regression coefficient:
\begin{center} $H_0: \beta_{educ} = 0.06$ \end{center}
\begin{center} $H_1: \beta_{educ} \neq 0.06$ (two-sided) \end{center}
```{r}
summary(reg)$coefficients["headeduc", "Estimate"] # beta_hat
summary(reg)$coefficients["headeduc", "Std. Error"] # standard error
```
\begin{center} $t = \frac{\hat{\beta}_{educ}-\beta_{H_0}}{se(\hat{\beta}_{educ})}= \frac{0.06574516-0.06}{0.004904345}=1.171443$
\end{center}
```{r, t-test, echo=FALSE}
education_coef <- coef(reg)["headeduc"]
education_se <- summary(reg)$coefficients["headeduc", "Std. Error"]
# define hypothesised value (H0: β_education = 0.06)
beta_0 <- 0.06
# t-statistic manual calculation
t_stat2 <- unname((education_coef - beta_0) / education_se)
# residual degrees of freedom (n - k), based on the observations actually used in the fit
n <- nobs(reg) # number of observations used in the regression
k <- length(coef(reg)) # number of coefficients (including the intercept)
df_resid <- n - k # named df_resid to avoid masking stats::df()
# two-sided p-value from the t-distribution, computed from t_stat2 rather than a hard-coded value
p_value <- 2 * pt(abs(t_stat2), df = df_resid, lower.tail = FALSE)
cat("t-statistic:", t_stat2, "\n")
cat("p-value:", p_value, "\n")
```
We can perform a t-test to test whether the coefficient on education is different from 0.06. We find a test statistic of `r t_stat2` and a p-value for a two-sided test of `r p_value`. With this p-value, we fail to reject the null hypothesis even at the 10% level.
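The same conclusion can be read off a confidence interval. The sketch below reuses the estimate and standard error reported in the formula above; the 567 residual degrees of freedom are taken from the earlier discussion and should be treated as an assumption about this regression.

```r
# Worked arithmetic for H0: beta_educ = 0.06, using the reported estimate and standard error
beta_hat <- 0.06574516                            # estimated coefficient on headeduc (from the text)
se_hat   <- 0.004904345                           # its standard error (from the text)
beta_h0  <- 0.06                                  # hypothesised value
df_resid <- 567                                   # assumed residual degrees of freedom

t_stat      <- (beta_hat - beta_h0) / se_hat
p_two_sided <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: H0 is rejected at the 5% level only if 0.06 lies outside it
ci_95 <- beta_hat + c(-1, 1) * qt(0.975, df = df_resid) * se_hat

cat("t-statistic:", round(t_stat, 4), " p-value:", round(p_two_sided, 4), "\n")
cat("95% CI: [", round(ci_95[1], 4), ",", round(ci_95[2], 4), "]\n")
```

Since 0.06 falls inside the interval, the two-sided test does not reject $H_0$, in line with the t-statistic of about 1.17.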
**6. Re-estimation of model**
```{r, echo=FALSE, results='asis', fig.align='center'}
# filter the data to exclude households with more than 100 hectares of land
nichh_filtered <- nichh %>% filter(landarea <= 100)
reestimation <- ggplot(nichh_filtered, aes(x=landarea)) +
geom_histogram(aes(y = after_stat(density)),
fill = "blue",
alpha = 0.3) +
geom_density(color="blue") +
theme_bw()
ggsave("reestimation_less100.png", plot = reestimation, width = 6, height = 4, dpi = 300)
# re-estimate the model excluding households with more than 100 ha
model_filtered <- lm(log(constot) ~ headage + headeduc + adultm + adultf + landarea + healthkm + schoolkm, data = nichh_filtered)
# table with results of the regression
model_filtered_results <- tidy(model_filtered)
# Extract coefficient of the land area variable from the filtered model.
landarea_model_3_coeff <- model_filtered_results %>%
  filter(term == "landarea") %>%
  pull(estimate)
# Extract p-value of the land area variable from the filtered model
# (taken from model_filtered_results, not from the original model's results).
landarea_model_3_pvalue <- model_filtered_results %>%
  filter(term == "landarea") %>%
  pull(p.value)
# Extract coefficient of the land area variable from the original model.
landarea_model_1_coeff <- reg_results %>%
  filter(term == "landarea") %>%
  pull(estimate)
# Extract p-value of the land area variable from the original model.
landarea_model_1_pvalue <- reg_results %>%
  filter(term == "landarea") %>%
  pull(p.value)
stargazer(model_filtered, type = "latex",
single.row = TRUE, # to put coefficients and standard errors on same line
no.space = TRUE, # to remove the spaces after each line of coefficients
column.sep.width = "3pt", # to reduce column width
title = "Regression Results after removing land area over 100ha ",
covariate.labels = c("Head's Age", "Head's Education", "Male Adults", "Female Adults", "Land Owned", "Distance to Health Center (km)", "Distance to School (km)"), # same order as the model formula
dep.var.labels = "Log of Total Household Consumption",
digits = 3,
header = FALSE,
out = "regression2_results.html")
```
Looking at the summary statistics and Figure 1, the distribution of households' land area is right-skewed: most households hold around 25ha, with a long tail of larger holdings.
As seen in Table 2, the estimated coefficient for the land area variable (after removing households with more than 100ha) is `r landarea_model_3_coeff`. On the other hand, the estimated land area coefficient for the first model, which keeps these households, is `r landarea_model_1_coeff`. These results show that the coefficient estimate is higher for the model that excludes households with more than 100ha of land. This implies that land area has a larger effect on household consumption among the remaining households, and that the largest landowners may have been pulling down the estimated effect of land area on household consumption. Importantly, the p-value of the estimated land area coefficient in the new regression becomes `r landarea_model_3_pvalue`, meaning it is significant at the 5% level, whereas in the original model, which keeps households with more than 100ha, the p-value is `r landarea_model_1_pvalue` and the coefficient is significant at the 1% level.
Land area is a proxy for the wealth of a household. Once we remove households with very large land area (6 households own more than 100ha), the coefficient estimate of land area increases. This suggests that owning land may not contribute to household income proportionally: the additional income associated with an extra hectare falls as land area increases, i.e. land has diminishing returns. Households have different consumption patterns and, controlling for household size, there may be an upper threshold beyond which consumption no longer increases as quickly with land area.
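The diminishing-returns reading can be illustrated with a simulation in which consumption is concave in land. Everything below is hypothetical: the data are generated rather than taken from `nichh`, and the functional form `log1p(landarea)` and the six very large holdings are assumptions chosen to mimic the situation described above.

```r
# Simulated illustration: a few very large holdings flatten a linear slope (hypothetical data)
set.seed(6)
n <- 600
landarea <- c(runif(n - 6, 0, 100), runif(6, 300, 1000))     # six holdings far above 100 ha
log_cons <- 8 + 0.3 * log1p(landarea) + rnorm(n, sd = 0.3)   # concave (diminishing) returns to land
sim <- data.frame(landarea, log_cons)

# Linear fit on the full sample versus the sample restricted to holdings of at most 100 ha
slope_all  <- unname(coef(lm(log_cons ~ landarea, data = sim))["landarea"])
slope_trim <- unname(coef(lm(log_cons ~ landarea,
                             data = subset(sim, landarea <= 100)))["landarea"])

cat("Slope, all households:    ", signif(slope_all, 3), "\n")
cat("Slope, holdings <= 100 ha:", signif(slope_trim, 3), "\n")
```

When returns are concave, the high-leverage large holdings drag the fitted line down, so trimming them raises the linear slope, which is the pattern observed in Table 2.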
**7. Additional variable**
An additional variable could be access to electricity (a proxy for infrastructure development). In general, infrastructure development is a critical factor in household welfare and consumption, as it can significantly improve quality of life and economic productivity. It would therefore likely affect household consumption positively, and could reduce the coefficients of other endowment variables by explaining more of the variance in consumption. For example, households with a larger land endowment may have higher consumption due to their ability to generate more income. However, part of that consumption may actually be driven by whether the household has access to electricity. Including such a variable therefore provides a more complete understanding of how households convert their resources into welfare, as infrastructure plays a crucial role in that process.