# Regression Models
In this chapter, we will cover ordinary linear regression and a few more advanced regression methods. The linear combination of variables seems simple compared to many of today's machine learning models. However, many advanced models use linear combinations of variables as one of their major components or steps. For example, for each neuron in a deep neural network, the input signals are first linearly combined before being fed into a non-linear activation function. To understand many of today's machine learning models, it is helpful to understand the key ideas across different modeling frameworks.
First, we will introduce multivariate linear regression (i.e., the typical least squares regression), which is one of the simplest supervised learning methods. Even though it is simple, the general ideas and procedures of fitting a regression model apply to a broader scope. Having a solid understanding of the basic linear regression model enables us to learn more advanced models easily. For example, we will introduce two "shrinkage" versions of linear regression: ridge regression and LASSO regression. The parameters are still estimated by minimizing a least squares criterion, but an extra penalty term effectively shrinks the model parameters towards zero. It mitigates overfitting and maintains the robustness of the model when the data size is small compared to the number of explanatory variables. We first introduce the basic knowledge of each model and then provide R code to show how to fit it. We only cover the major properties of these models; the listed references provide more in-depth discussions.
We will use the clothing company data as an example. We want to answer business questions such as "which variables are the driving factors of total revenue (both online and in-store purchases)?" The answer to this question can help the company decide where to invest (such as design, quality, etc.). Note that a driving factor here does not guarantee a causal relationship. Linear regression models reveal correlation rather than causation. For example, if a survey on car purchases shows a positive correlation between price and customer satisfaction, does it suggest the car dealer should increase the price? Probably not! It is more likely that customer satisfaction is driven by quality, and a higher-quality car tends to be more expensive. Causal inference is much more difficult to establish, and we have to be very careful when interpreting regression model results.
Load the R packages first:
```{r, message = FALSE, warning=FALSE, results='hide'}
# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'lattice',
              'elasticnet', 'lars', 'corrplot',
              'pls')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
```
## Ordinary Least Square
For a typical linear regression with $p$ explanatory variables, we have a linear combination of these variables:
$$f(\mathbf{X})=\mathbf{X}\mathbf{\beta}=\beta_{0}+\sum_{j=1}^{p}\mathbf{x_{.j}}\beta_{j}$$
where $\mathbf{\beta}$ is the parameter vector with length $p+1$. Here we use $\mathbf{x_{.j}}$ for a column vector and $\mathbf{x_{i.}}$ for a row vector. Least squares is the method of finding a set of values $\mathbf{\beta^{T}}=(\beta_{0},\beta_{1},...,\beta_{p})$ that minimizes the residual sum of squares (RSS):
$$RSS(\beta)=\sum_{i=1}^{N}(y_{i}-f(\mathbf{x_{i.}}))^{2}=\sum_{i=1}^{N}(y_{i}-\beta_{0}-\sum_{j=1}^{p}x_{ij}\beta_{j})^{2}$$
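For reference, when $\mathbf{X}^{T}\mathbf{X}$ is invertible, minimizing the RSS has the well-known closed-form solution:
$$\hat{\mathbf{\beta}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$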
In practice, we do not need to solve this minimization by hand; it is implemented in R's `lm()` function. Now let's load the data:
```{r}
dat <- read.csv("http://bit.ly/2P5gTw4")
```
Before fitting the model, we need to clean the data, for example by removing data points that are not logically possible (e.g., negative expense).
```{r}
dat <- subset(dat, store_exp > 0 & online_exp > 0)
```
We use the 10 survey question variables as our explanatory variables.
```{r}
modeldat <- dat[, grep("Q", names(dat))]
```
The response variable is the sum of in-store spending and online spending.
```{r}
# total expense = in store expense + online expense
modeldat$total_exp <- dat$store_exp + dat$online_exp
```
To fit a linear regression model, let us first check if there are any missing values or outliers:
```{r}
par(mfrow = c(1, 2))
hist(modeldat$total_exp, main = "", xlab = "total_exp")
boxplot(modeldat$total_exp)
```
There is no missing value in the response variable, but there are outliers. Outliers are usually best identified using domain knowledge of the problem itself, i.e., knowing which values are not possible in practice. We can also use a statistical threshold to remove extremely large or small values from the data. Here we use the Z-score method described in section \@ref(outliers) to find and remove outliers; readers can refer to that section for more detail.
```{r}
y <- modeldat$total_exp
# Find data points with Z-score larger than 3.5
zs <- (y - mean(y))/mad(y)
modeldat <- modeldat[-which(zs > 3.5), ]
```
We will not perform log-transformation for the response variable at this stage. Let us first check the correlation among explanatory variables:
```{r corplotlm, fig.cap= "Correlation Matrix Plot for Explanatory Variables", out.width="80%", fig.asp=.75, fig.align="center"}
correlation <- cor(modeldat[, grep("Q", names(modeldat))])
corrplot::corrplot.mixed(correlation, order = "hclust", tl.pos = "lt",
upper = "ellipse")
```
As shown in figure \@ref(fig:corplotlm), there are some highly correlated variables. Let us use the method described in section \@ref(collinearity) to find potential highly correlated explanatory variables to remove with a threshold of 0.75:
```{r}
highcor <- findCorrelation(correlation, cutoff = 0.75)
```
```{r}
modeldat <- modeldat[, -highcor]
```
The dataset is now ready to fit a linear regression model. The standard way to define a regression formula in R is:
(1) the response variable is on the left side of `~`
(2) the explanatory variables are on the right side of `~`
(3) if all the variables in the dataset except the response variable are included in the model, we can use `.` on the right side of `~`
(4) if we want to consider the interaction between two variables such as Q1 and Q2, we can add an interaction term `Q1*Q2`
(5) transformations of variables can be applied directly inside the formula, such as `log(total_exp)` (a couple of illustrative calls are sketched after this list).
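For instance, the following hypothetical variations only illustrate the formula syntax using columns of `modeldat`; they are not evaluated here:
```{r, eval=FALSE}
# interaction between Q1 and Q2 in addition to their main effects
lm(total_exp ~ Q1 * Q2, data = modeldat)
# log-transform the response directly inside the formula
lm(log(total_exp) ~ Q1 + Q3, data = modeldat)
```
Now we fit the model with all remaining explanatory variables and a log-transformed response: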
```{r}
lmfit <- lm(log(total_exp) ~ ., data = modeldat)
summary(lmfit)
```
The `summary(lmfit)` output presents a summary of the model fit. It shows the point estimate of each explanatory variable's coefficient (the `Estimate` column), the corresponding standard error (the `Std. Error` column), the t value (`t value`), and the p value (`Pr(>|t|)`).
### The Magic P-value
Let us pause a little for a short discussion about the p-value. Misuse of the p-value is common in many research fields. There have been heated discussions about the p-value in the past. Siegfried commented in his 2010 Science News article: \index{P-value}
> "It's science's dirtiest secret: The scientific method of testing hypotheses by statistical analysis stands on a flimsy foundation."
The American Statistical Association (ASA) released an official statement on p-values in 2016 [@ASA_P]. It was the first organization-level announcement about the p-value. ASA stated that the goal of releasing this guidance was to
> "improve the conduct and interpretation of quantitative science and inform the growing emphasis on reproducibility of science research."
The statement also noted that
> "the increased quantification of scientific research and a proliferation of large, complex data sets has expanded the scope for statistics and the importance of appropriately chosen techniques, properly conducted analyses, and correct interpretation."
The statement’s six principles, many of which address misconceptions and misuse of the P-value, are the following:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The $p = 0.05$ threshold is not based on any scientific calculation but is an arbitrary number. It means that practitioners can use a different threshold if they think it better fits the problem at hand. We do not promote the p-value in this book. However, the p-value is hard to avoid in classical statistical inference. In practice, when making classical statistical inferences, we recommend reporting confidence intervals whenever possible instead of p-values.
The Bayesian paradigm is an alternative to the classical paradigm. A Bayesian can state probabilities about the parameters, which are treated as random variables; this is not possible in the classical paradigm. In our work, we use hierarchical (generalized) linear models in practice instead of classical linear regression. Hierarchical models pool information across clusters (for example, you can treat each customer segment as a cluster). This pooling tends to improve the estimates for each cluster, especially when sampling is imbalanced. Because the models automatically cope with the differing uncertainty introduced by sampling imbalance (a bigger cluster has a smaller variance), it prevents over-sampled clusters from unfairly dominating inference.
This book does not cover the Bayesian framework. The best applied Bayesian book is [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/) by Richard McElreath [@rethinking2020]. The book provides outstanding conceptual explanations and a wide range of models from simple to advanced with detailed, repeatable code. The text uses R, but there are code examples for Python and Julia on the book website.
Now let us come back to our example. We will not spend too much time on p-values; instead, we will focus on the confidence interval of the parameter estimate for each explanatory variable. In R, the function `confint()` can produce the confidence interval for each parameter:
```{r}
confint(lmfit, level = 0.9)
```
The above output is for a 90% confidence level, as indicated by `level = 0.9` in the function call. We can change the confidence level by adjusting the `level` argument.
Fitting a linear regression is so easy using R that many analysts directly write reports without thinking about whether the model is meaningful. On the other hand, we can easily use R to check model assumptions. In the following sections, we will introduce a few commonly used diagnostic methods for linear regression to check whether the model assumptions are reasonable.
### Diagnostics for Linear Regression
In linear regression \index{linear regression}, we would like the Ordinary Least Square (OLS) \index{Ordinary Least Square (OLS)} estimate to be the Best Linear Unbiased Estimate (BLUE)\index{Best Linear Unbiased Estimate (BLUE)}. In other words, we hope the expected value of the estimate equals the actual parameter value (i.e., unbiased) and that the estimate has the smallest variance among all linear unbiased estimates (i.e., best). Based on the Gauss-Markov theorem, the OLS \index{OLS} estimate is BLUE \index{BLUE} under the following conditions:
1. Explanatory variables ($\mathbf{x_{.j}}$) and random error ($\mathbf{\epsilon}$) are independent: $cov(\mathbf{x_{.j}}, \mathbf{\epsilon})=0$ for all $j = 1, \dots, p$.
2. The expected value of random error is zero: $E(\mathbf{\epsilon|X})=0$
3. Random errors are uncorrelated with each other, and the variance of the random error is constant: $Var(\mathbf{\epsilon})=\sigma^{2}I$, where $\sigma$ is positive and $I$ is an $n \times n$ identity matrix.
We will introduce four graphic diagnostics for the above assumptions.
(1) Residual plot \index{residual plot}
It is a scatter plot with residual on the Y-axis and fitted value on the X-axis. We can also put any of the explanatory variables on the X-axis. Under the assumption, residuals are randomly distributed, and we need to check the following:
- Are residuals centered around zero?
- Are there any patterns in the residual plots (such as residuals with x-values farther from $\bar{x}$ have greater variance than residuals with x-values closer to $\bar{x}$)?
- Are the variances of the residuals constant across the range of fitted values?
Please note that even if the variance is not constant, the point estimates of the regression parameters are still unbiased. However, the variance estimates are not. Because the significance tests for the regression parameters rely on the distribution of the random error, these tests are no longer valid when the variance is not constant.
(2) Normal quantile-quantile Plot (Q-Q Plot) \index{Q-Q Plot}
The Q-Q plot is used to check the normality assumption for the residuals. For normally distributed residuals, the data points should follow a straight line in the Q-Q plot. The further the points depart from a straight line, the further the residuals depart from a normal distribution.
(3) Standardized residuals plot \index{Standardized residuals plot}
Standardized residual is the residual normalized by an estimate of its standard deviation. Like the residual plot, the X-axis is still the fitted value, but the y-axis is now standardized residuals. Because of the normalization, the y-axis shows the number of standard deviations from zero. A value greater than 2 or less than -2 indicates observations with large standardized residuals. The plot is useful because when the variance is not consistent, it can be difficult to detect the outliers using the raw residuals plot.
(4) Cook's distance \index{Cook's distance}
Cook's distance can be used to check for influential points in OLS-based linear regression models. In general, we need to pay attention to data points with Cook's distance > 0.5.
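If we want the numeric values behind this diagnostic rather than the plot, base R provides an extractor; a minimal sketch, assuming the `lmfit` object fitted above:
```{r, eval=FALSE}
# Cook's distance for every observation in the fitted model
cd <- cooks.distance(lmfit)
# flag observations above the rule-of-thumb threshold of 0.5
which(cd > 0.5)
```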
In R, these diagnostic graphs are built into the `plot()` function.
```{r lmdiagnostic, fig.cap= "Linear Regression Diagnostic Plots: residual plot (top left), Q-Q plot (top right), standardized residuals plot (lower left), Cook's distance (lower right)", out.width="80%", fig.asp=.75, fig.align="center"}
par(mfrow = c(2, 2))
plot(lmfit, which = 1)
plot(lmfit, which = 2)
plot(lmfit, which = 3)
plot(lmfit, which = 4)
```
The above diagnostic plot examples show:
- Residual plot: residuals are generally distributed around the $y=0$ horizontal line. There are no significant trends or patterns in this residual plot (there are two slight bumps, but they do not seem too severe). So the assumption of a linear relationship between the response variable and the explanatory variables is reasonable.
- Q-Q plot: data points lie mostly along the diagonal line, indicating no significant departure from the normality assumption for the residuals. Because we simulated the data, we know the response variable before the log transformation follows a normal distribution. The shape of the distribution does not deviate from a normal distribution too much after the log transformation.
Note that the Gauss-Markov theorem does not require normality. We need the normal assumption to look for significant factors or define a confidence interval. However, as Andrew Gelman pointed out in section 3.6 of his book [@linear2006], normality and equal variance are typically minor concerns.
- Standardized residual plot: if the constant variance assumption is valid, then the plot's data points should be randomly distributed around the horizontal line. We can see there are three outliers on the plot. Let us check those points:
```{r}
modeldat[which(row.names(modeldat) %in% c(960, 678, 155)), ]
```
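As a side note, instead of reading the row labels off the plot, we could flag large standardized residuals programmatically; a minimal sketch using base R's `rstandard()`:
```{r, eval=FALSE}
# standardized residuals from the fitted model
std_res <- rstandard(lmfit)
# observations more than 2 standard deviations away from zero
modeldat[abs(std_res) > 2, ]
```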
It is not easy to see why those records are outliers from the raw values alone. It becomes clear when we condition on the independent variables (`Q1`, `Q2`, `Q3`, `Q6`, and `Q8`). Let us examine the value of `total_exp` for samples with the same Q1, Q2, Q3, Q6, and Q8 answers as the 3rd row above.
```{r}
datcheck = modeldat %>%
filter(Q1 ==2 & Q2 == 1 & Q3 == 1 & Q6 == 1 & Q8 == 3)
nrow(datcheck)
```
There are 87 samples with the same values of the independent variables. The distribution of the response variable (`total_exp`) is:
```{r}
summary(datcheck$total_exp)
```
Now it is easy to see why row 960 with `total_exp = 658.3` is an outlier. All the other 86 records with the same survey responses have a much higher total expense!
- Cook's distance: the maximum Cook's distance is around 0.05. Even though no point has a Cook's distance larger than 0.5, we can still spot some outliers from the graph.
The graphs suggest a few outliers, but what to do with them is our decision. We can either remove them or investigate them further. If the values are not due to any data error, we should keep them in our analysis.
## Principal Component Regression and Partial Least Square
In real-life applications, explanatory variables are usually related to each other and contain similar information. For example, in the previous chapter, we used expenditure variables to predict consumer income. In that model, store expenditure (`store_exp`), online expenditure (`online_exp`), number of store transactions (`store_trans`), and number of online transactions (`online_trans`) are correlated to a certain extent, especially the number of transactions and the expenditure. If there is a strong correlation among explanatory variables, then the least squares based linear regression model may not be robust. If the number of observations is less than the number of explanatory variables, the standard least squares method cannot provide a unique set of coefficient estimates. For such cases, we can perform data preprocessing, such as removing highly correlated variables with a preset threshold. However, this approach cannot guarantee that linear combinations of the remaining variables have low correlation with other variables, in which case the standard least squares method will still not be robust. We need to be aware that removing highly correlated variables does not always guarantee a robust solution. We can also apply feature engineering procedures to the explanatory variables, such as principal component analysis (PCA). \index{principal component analysis (PCA)} By using principal components, we can ensure they are uncorrelated with each other. However, the drawback of using PCA is that these components are linear combinations of the original variables, and it is difficult to explain the fitted model. Principal component regression (PCR) \index{Principal component regression (PCR)} is described in more detail in [@Massy1965]. It can be used when there are strong correlations among variables or when the number of observations is less than the number of variables.
In theory, we can use PCR to reduce the number of variables used in a linear model, but the result is not always good, because the first few principal components may not be good candidates for modeling the response variable. PCA is unsupervised learning: the entire process does not consider the response variable and focuses only on the variability of the explanatory variables. When the independent variables and the response variable are related, PCR can identify the systematic relationship between them well. However, when there are independent variables not associated with the response variable, they will undermine PCR's performance. We also need to be aware that PCR does not perform feature selection, and each principal component is a combination of the original variables.
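To make the idea concrete, here is a minimal PCR-by-hand sketch in base R, assuming a predictor matrix `xtrain` and a response `ytrain` as defined later in this section (the `caret` workflow below handles these steps internally):
```{r, eval=FALSE}
# run PCA on the standardized predictors
pca <- prcomp(xtrain, center = TRUE, scale. = TRUE)
# keep the scores of the first three principal components
scores <- as.data.frame(pca$x[, 1:3])
# regress the response on the retained component scores
pcr_fit <- lm(ytrain ~ ., data = scores)
summary(pcr_fit)
```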
Partial least squares regression (PLS) \index{Partial least square regression (PLS)} is the supervised version of PCR. Similar to PCR, PLS can also reduce the number of variables in the regression model.\index{PCR}\index{PLS} As PLS is also related to the variables' variance, we usually standardize or normalize the variables before PLS. Suppose we have a list of explanatory variables $\mathbf{X}=[X_{1},X_{2},...,X_{p}]^{T}$, and their variance-covariance matrix is $\Sigma$. Like PCR, PLS transforms the original variables into new uncorrelated variables $(Z_{1}, Z_{2}, \ldots, Z_{m})$ using linear combinations. When $m=p$, the result of PLS is the same as OLS. The main difference between PCR and PLS is the process of creating the new variables: PLS considers the response variable.
PLS comes from Herman Wold's Nonlinear Iterative Partial Least Squares (NIPALS) algorithm [@wold1973; @wold1982] \index{Nonlinear Iterative Partial Least Squares (NIPALS)}. NIPALS was later applied to regression problems, and the resulting method was called PLS \index{PLS}. PLS is a method of linearizing nonlinear relationships through hidden (latent) variables. It is similar to PCR, except that PCR does not take the dependent variable into account when selecting the components: PCR's purpose is to find the linear combinations (i.e., unsupervised) that capture the most variance of the independent variables, while PLS finds the linear combinations that are most strongly related to the response variable. In the current case, the more complicated PLS does not perform better than simple linear regression. We will not discuss the details of the PLS algorithm; the references mentioned above provide a more detailed description.
We focus on using the R library `caret` to fit PCR and PLS models. Let us use the 10 survey questions (`Q1`-`Q10`) as the explanatory variables and income as the response variable. First, load and preprocess the data:
```{r}
library(lattice)
library(caret)
library(dplyr)
library(elasticnet)
library(lars)
# Load Data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
ymad <- mad(na.omit(sim.dat$income))
# Calculate Z values
zs <- (sim.dat$income - mean(na.omit(sim.dat$income)))/ymad
# which(na.omit(zs>3.5)) find outlier
# which(is.na(zs)) find missing values
idex <- c(which(na.omit(zs > 3.5)), which(is.na(zs)))
# Remove rows with outlier and missing values
sim.dat <- sim.dat[-idex, ]
```
Now let's define the explanatory variable matrix (`xtrain`) by selecting the 10 survey question columns, and define the response variable (`ytrain`):
```{r}
xtrain = dplyr::select(sim.dat, Q1:Q10)
ytrain = sim.dat$income
```
We also set the random seed and specify 10-fold cross-validation:
```{r}
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)
```
Fit the PLS model using the number of components as the hyper-parameter to tune. As there are at most 10 explanatory variables in the model, we set the tuning range to be 1 to 10:
```{r}
plsTune <- train(xtrain, ytrain,
                 method = "pls",
                 # set hyper-parameter tuning range
                 tuneGrid = expand.grid(.ncomp = 1:10),
                 trControl = ctrl)
plsTune
```
From the result, we can see that the optimal number of components is 7. However, if we pay attention to the RMSE improvement, we will find only minimal improvement after three components. In practice, we could choose the model with three components if the extra improvement does not make a practical difference and we would rather have a simpler model. \index{RMSE}
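If we do settle on three components, one way to refit the final model is to fix the tuning grid at a single value; a minimal sketch reusing the objects defined above:
```{r, eval=FALSE}
# refit PLS with the number of components fixed at 3
plsFinal <- train(xtrain, ytrain,
                  method = "pls",
                  tuneGrid = expand.grid(.ncomp = 3),
                  trControl = ctrl)
plsFinal
```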
We can also find the relative importance of each variable during the PLS model tuning process using the following code: \index{PLS}
```{r}
plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 10, scales = list(y = list(cex = 0.95)))
```
The above plot shows that Q1, Q2, Q3, and Q6 are more important than the other variables. Now let's fit a PCR model with the number of principal components as the hyper-parameter: \index{PCR}
```{r}
# Set random seed
set.seed(100)
pcrTune <- train(x = xtrain, y = ytrain,
                 method = "pcr",
                 # set hyper-parameter tuning range
                 tuneGrid = expand.grid(.ncomp = 1:10),
                 trControl = ctrl)
pcrTune
```
From the output, the default recommendation is ten components. However, if we pay attention to the RMSE improvement as components are added, we will find little difference after three components. Again, in practice, we could keep the model with three components.
Now let’s compare the hyper-parameter tuning process for PLS and PCR:
```{r}
# Save PLS model tuning information to plsResamples
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
# Save PCR model tuning information to pcrResamples
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
# Combine both output for plotting
plsPlotData <- rbind(plsResamples, pcrResamples)
# Leverage xyplot() function from lattice library
xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))
```
The plot confirms our choice of using a model with three components for both PLS and PCR.