---
title: "OLS Assumptions_Detection and Solution"
output:
word_document: default
html_document: default
pdf_document: default
---
# OLS assumptions
1. Correct specification of the model.
2. Model has to be linear in parameters.
3. Number of observations is greater than number of parameters.
4. The variance of each independent variable is not zero (not all observations have the same value for the independent variable).
5. Independent variables are deterministic.
6. No perfect multicollinearity
7. Homoskedasticity: constant variance of the errors across values of the independent variables.
8. No correlation between the errors: for two given, different values of X, the errors are uncorrelated.
9. The covariance between X and the error is zero.
10. The mean of the errors for a given X is zero.
11. The errors are normally distributed.
# OLS properties
If the OLS assumptions hold, then the OLS estimator is the best linear unbiased estimator (BLUE); the small simulation below illustrates the first two of the following properties.
- **unbiased**, i.e. the expected value of the estimator equals the true parameter value (the estimates are correct on average).
- **consistent**, i.e. the estimates converge towards their true value with an increasing number of observations (their variance/standard error shrinks as the sample grows).
- **efficient**, i.e. it has the lowest variance (standard error) among all linear unbiased estimators.
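As a hedged illustration (simulated data, not part of any data set used below): averaging many OLS slope estimates recovers the true coefficient (unbiasedness), and the spread of the estimates shrinks as the sample size grows (consistency).
```{r}
# Illustrative simulation: the true slope is 2
set.seed(123)
true_beta <- 2
sim_slope <- function(n) {
  x <- rnorm(n)
  y <- 1 + true_beta * x + rnorm(n)
  coef(lm(y ~ x))[2]               # OLS slope estimate for one sample
}
slopes_small <- replicate(500, sim_slope(20))    # many small samples
slopes_large <- replicate(500, sim_slope(500))   # many large samples
c(mean_small = mean(slopes_small), sd_small = sd(slopes_small),
  mean_large = mean(slopes_large), sd_large = sd(slopes_large))
```
Both averages are close to the true value of 2, while the standard deviation of the estimates is clearly smaller in the larger samples.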
# Assumption violation: Detection & Solutions
## Multicollinearity
Imagine we run a regression including revenue and total assets as explanatory variables in our model. OLS will not be able to distinguish between the effects of the two correctly. We will then have high [multicollinearity]{.underline}, which [results in]{.underline}
**less precise parameter estimates =\> increased standard errors of the estimates =\> lower t-values**
### Detection:
- [High R-squared]{.underline} but [few significant parameters]{.underline}
- [High pairwise correlations]{.underline} between independent variables $|corr|>0.8$
- [High variance inflation factor]{.underline}: $VIF \geq 5$
### Solutions:
- Usage of [more and/or better data]{.underline}
- [Exclusion of]{.underline} one or more [variables]{.underline} (particularly in the case of perfect multicollinearity), but at the risk of specification errors
- Other methods such as [factor analysis]{.underline}; a sketch of the last two options follows this list
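As a minimal sketch of those two options (simulated data, not the MarketPower data used below), one can either drop one of the collinear regressors or replace them by their first principal component:
```{r}
# Simulated example: x2 is almost a copy of x1, so the two are highly collinear
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 1 + 2 * x1 + rnorm(n)

# Option 1: exclude one of the collinear variables
summary(lm(y ~ x1))

# Option 2: replace x1 and x2 by their first principal component
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]
summary(lm(y ~ pc1))
```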
### Example
Let us load the data "MarketPower.xlsx".
```{r echo=FALSE}
library(readxl)
mpow <- read_excel("D:/data/Empirical Research/MarketPower.xlsx")
summary(mpow)
```
Let us run an OLS regression.
```{r}
OLSbase = lm(Markup~RevGR+eqshare+FCR+age+TotalassetsthEUR, data = mpow)
summary(OLSbase)
```
The "high R-squared, few significant parameters" phenomenon is not observed in our model: there are indeed only one or two significant parameters, but $R^{2}=0.05 \ll 1$. Thus, according to this criterion, we would conclude that there is no multicollinearity issue in this linear model.
As a second detection method, let us calculate the pairwise correlation coefficients.
```{r}
indepvar = cbind(mpow[,5], mpow[,18], mpow[,20], mpow[,22:23])
cor(indepvar)
```
Again, we observe that no pairwise correlation coefficient has an absolute value greater than 0.8. According to this criterion, there is no multicollinearity in our model.
Finally, let us apply the VIF detection method.
```{r}
library(car)
vif(OLSbase)
```
As no VIF is equal to or higher than 5, we conclude once more that there is no multicollinearity issue among the selected independent variables of this linear model.
## Heteroskedasticity
The variance of the errors varies across observations, leading to distorted standard errors =\> t-tests become inaccurate (they usually indicate higher significance than warranted). (Suppose you regress consumption of a common product on income =\> for large incomes, the errors will be larger, i.e. their variance increases.)
### Detection
- [Goldfeld/Quandt test]{.underline}: equality of the error variance across subsamples is tested (F-test); this and the Glejser method are sketched after this list.
- [Method of Glejser]{.underline}: Regress the absolute values of the residuals on the independent variables.
- [Breusch-Pagan test]{.underline}: Regress the squared residuals on the independent variables. Conduct a Chi-squared test with $k-1$ degrees of freedom (k = number of estimated parameters in the auxiliary regression, including the intercept), where $BP=n{R^{2}}$
- [White test:]{.underline} Regress the squared residuals on the independent variables, their squares and their cross products. Conduct a Chi-squared test with $k-1$ degrees of freedom (k = number of estimated parameters in the auxiliary regression, including the intercept), where $W=n{R^{2}}$
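The Goldfeld/Quandt and Glejser approaches are not repeated in the example below, so here is a minimal sketch of both, applied to the "OLSbase" model from the multicollinearity example above. It assumes the lmtest package is installed and, purely for illustration, orders the observations by TotalassetsthEUR.
```{r message=FALSE, warning=FALSE}
library(lmtest)
# Goldfeld-Quandt: order the observations by a variable suspected to drive the
# error variance, split the sample and compare the error variances (F-test)
gqtest(Markup ~ RevGR + eqshare + FCR + age + TotalassetsthEUR,
       order.by = ~ TotalassetsthEUR, data = mpow)

# Glejser: regress the absolute residuals on the independent variables;
# significant coefficients point towards heteroskedasticity
mpow$absres <- abs(OLSbase$residuals)
summary(lm(absres ~ RevGR + eqshare + FCR + age + TotalassetsthEUR, data = mpow))
```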
### Solutions
- Solve specification errors ([omitted variables]{.underline})
- Use [other estimators]{.underline}, such as [weighted least squares]{.underline} (see the sketch after this list)
- [Transform the error term]{.underline} ([generalized least squares]{.underline})
- Use [heteroskedasticity robust standard errors]{.underline} (most popular)
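As a minimal self-contained sketch of weighted least squares (simulated data, not the MarketPower data): if the standard deviation of the errors grows proportionally with a regressor $x$, weighting each observation by $1/x^{2}$ restores a constant variance of the weighted errors.
```{r}
set.seed(42)
x <- runif(300, 1, 10)
y <- 1 + 0.5 * x + rnorm(300, sd = x)    # error standard deviation grows with x
summary(lm(y ~ x, weights = 1 / x^2))    # WLS with variance-proportional weights
```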
### Example
Let us consider again the data "MarketPower.xlsx" and the same linear model "OLSbase".
We start with the Breusch-Pagan test for heteroskedasticity.
```{r}
mpow$sqres = OLSbase$residuals^2
BPreg = lm(sqres~RevGR+eqshare+FCR+age+TotalassetsthEUR, data = mpow)
BP = nrow(mpow)*summary(BPreg)$r.squared
BP
BPpv = pchisq(BP, length(BPreg$coefficients)-1,lower.tail=FALSE)
BPpv
```
As the p-value is far below any conventional significance level, a value of $BP = 57.907$ is extremely unlikely under the null hypothesis H0: 'there is no heteroskedasticity'. Thus we reject the null hypothesis in favor of heteroskedasticity.
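The same test is also available as a ready-made function in the lmtest package (assuming it is installed); its default studentized (Koenker) version should coincide with the $nR^{2}$ statistic computed manually above.
```{r message=FALSE, warning=FALSE}
library(lmtest)
bptest(OLSbase)   # Breusch-Pagan test (studentized nR^2 form) on the fitted model
```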
Alternatively, we can conduct a White test.
```{r}
WTreg = lm(sqres~RevGR+eqshare+FCR+age+TotalassetsthEUR+I(RevGR*RevGR)+RevGR*eqshare+RevGR*FCR
+RevGR*age+RevGR*TotalassetsthEUR+I(eqshare*eqshare)+eqshare*FCR+eqshare*age
+eqshare*TotalassetsthEUR+I(FCR*FCR)+FCR*age+FCR*TotalassetsthEUR
+I(age*age)+age*TotalassetsthEUR+I(TotalassetsthEUR*TotalassetsthEUR), data = mpow)
summary(WTreg)
WT = nrow(mpow)*summary(WTreg)$r.squared
WT
WTpv = pchisq(WT, length(WTreg$coefficients)-1,lower.tail=FALSE)
WTpv
```
With this test the p-value is also far below any conventional significance level: a value of $W = 126.49$ is extremely unlikely under the null hypothesis H0: 'there is no heteroskedasticity'. Thus we again reject the null hypothesis in favor of heteroskedasticity.
One way to resolve the issue of heteroskedastic errors is to use robust standard errors, which are typically wider (and thus more realistic) than the ones from the simple model.
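A minimal sketch using the sandwich and lmtest packages (assuming both are installed); the coefficient estimates stay the same, only the standard errors and t-values are recomputed.
```{r message=FALSE, warning=FALSE}
library(lmtest)
library(sandwich)
# Coefficient tests with heteroskedasticity-robust (HC1) standard errors
coeftest(OLSbase, vcov = vcovHC(OLSbase, type = "HC1"))
```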
## Autocorrelation
The residuals are correlated over time. Hence, autocorrelation is mostly relevant when working with time series or panel data. Autocorrelation typically leads to downward-biased standard errors.
### Detection
- Look at the [scatterplot]{.underline} of the [residuals from t against]{.underline} those [from t-1]{.underline}.
- [Durbin-Watson test]{.underline}: The test statistic is calculated as $d = \frac{\sum_{t=2}^{T}\left(\hat{u}_{t}-\hat{u}_{t-1}\right)^{2}}{\sum_{t=1}^{T}\hat{u}_{t}^{2}}$, where $d$ lies between 0 and 4 (values near 2 indicate no first-order autocorrelation); the lower and upper critical values are looked up from the table.
### Solutions
- Mostly changing the model specification (e.g. including lagged variables, as in the example below).
### Example
We consider the time-series data "solardat.csv"
```{r}
library(readr)
soldat <- read.csv("D:/data/Empirical Research/solardat.csv",sep=";")
head(soldat)
```
Let us run a linear model
```{r}
OLSsol = lm(CV_daily~PV_daily_MWh, data = soldat)
summary(OLSsol)
```
Obtain the residuals and their lagged values.
```{r message=FALSE, warning=FALSE}
library(Hmisc)
soldat$solres= OLSsol$residuals
soldat$lagsolres = Lag(soldat$solres)
```
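A quick visual check, corresponding to the scatterplot mentioned under Detection, plots the residuals against their lagged values; a clear pattern along a line would indicate autocorrelation.
```{r}
plot(soldat$lagsolres, soldat$solres,
     xlab = "Residual in t-1", ylab = "Residual in t",
     main = "Residuals against lagged residuals")
```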
Generate the test statistic
```{r}
soldat$difres = (soldat$solres-soldat$lagsolres)^2
soldat$sqres = soldat$solres^2
dtest = sum(soldat$difres, na.rm=TRUE)/sum(soldat$sqres, na.rm=TRUE)
dtest
```
We can also use the predefined function from the car package to perform the Durbin-Watson test.
```{r}
library(car)
durbinWatsonTest(OLSsol)
```
A possible way to resolve the issue of autocorrelation is to change the model specification by including the lagged dependent variable as an additional regressor:
```{r}
soldat$lagcv = Lag(soldat$CV_daily)
autoco = lm(CV_daily~PV_daily_MWh+lagcv, data = soldat)
summary(autoco)
durbinWatsonTest(autoco)
```
As we can observe, the autocorrelation is no longer statistically significant.