-
Notifications
You must be signed in to change notification settings - Fork 3
/
09_variable_selection.Rmd
553 lines (382 loc) · 25.5 KB
/
09_variable_selection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
# Application 1: Variable selection {#ch-08}
## Introduction
In Chapter \@ref(ch-07), we noted that the coefficient of determination $R^2$ may increase even when uninformative predictors are added to a model. This will ascribe some predictive power to an uninformative predictor that is in fact misguided by its (random) correlation with the target variable. Often, we start out formulating models, not knowing beforehand what predictors should be considered and we are tempted to use them all because the full model will always yield the best $R^2$. In such cases, we're prone to building overconfident models that perform well on the training set, but will not perform well when predicting to new data.
In this application session, we'll implement an algorithm that sequentially searches the best additional predictor to be included in our model, starting from a single one. This is called *stepwise-forward* regression (see definition below). There is also *stepwise-backward* regression where predictors are sequentially removed from a model that includes them all. The challenge is that we often lack the possibility to confidently assess generalisability. The effect of spuriously increasing $R^2$ by adding uninformative predictors can be mitigated, as we noted in Chapter \@ref(ch-07), by considering alternative metrics that penalize the number of predictors in a model. They balance the tradeoff between model complexity (number of variables in the linear regression case) and goodness of fit. Such metrics include the *adjusted-*$R^2$,the Akaike Information Criterion (AIC), or the Bayesian Information Criterion (BIC). In cases, where sufficient data is available, also cross-validation can be used for assessing the generalisability of alternative models. Here, we'll assess how these different metrics behave for a sequence of linear regression models with an increasing number of predictors. You'll learn to write code that implements an algorithm determining the order in which variables enter the model, starting from one and going up to fourteen predictors. You'll write your own stepwise-forward regression code.
Let's get started!
## Application
### Warm-up 1: Nested for-loop
Given a matrix A and a vector B (see below), do the following tasks:
- Replace the missing values (`NA`) in the first row of A by the largest value of B. After using that element of B for imputing A, drop that element from the vector B and proceed with imputing the second row of A, using the (now) largest value of the updated vector B, and drop that element from B after using it for imputing A. Repeat the same procedure for all four rows in A.
- After imputing (replacing) in each step, calculate the mean of the remaining values in B and record it as a single-row data frame with two columns `row_number` and `avg`, where `row_number` is the row number of A where the value was imputed, and `avg` is the mean of remaining values in B. As the algorithm proceeds through rows in A, sequentially bind the single-row data frame together so that after completion of the algorithm, the data frame contains four rows (corresponding to the number of rows in A).
```{r}
A <- matrix(c(6, 7, 3, NA, 15, 6, 7,
8, 9, 12, 6, 11, NA, 3,
9, 4, 7, 3, 21, NA, 6,
7, 19, 6, NA, 15, 8, 10),
nrow = 4, byrow = TRUE)
B <- c(8, 4, 12, 9, 15, 6)
```
Before implementing these tasks, try to write down a pseudo code. This is code-like text that may not be executable, but describes the structure of real code and details where and how major steps are implemented. Next, you'll need to write actual R code. For this, you will need to find answers to the following questions:
+ How to go through each of the element in matrix?
+ How to detect NA value?
+ How to drop an element of a given value from a vector?
+ How to add a row to an existing data frame?
**Solution**:
```{r message=FALSE}
library(tidyverse)
summ <- data.frame()
for (i in 1:nrow(A)){
for (j in 1:ncol(A)){
if (is.na(A[i,j])){
A[i,j] <- max(B)
}
}
B <- B[-which(B == max(B))] # update the B vector removing the biggest values
summ <- bind_rows(summ, data.frame(row_number = i, avg = mean(B)))
}
summ
```
### Warm-up 2: Find the best single predictor
**The math behind forward stepwise regression:**
1. Let $\mathcal{M_0}$ denote the null model, which contains no predictors.
2. For $k=0,..,p-1$:
(a) Consider all $p − k$ models that augment $\mathcal{M}_k$ with one additional predictor.
(b) Choose the best model among these $p − k$ models, and call it $\mathcal{M}_{k+1}$. Here _best_ is defined as having the highest $R^2$ .
3. Select a single best model from among $\mathcal{M}_0$, . . . , $\mathcal{M}_p$ using cross-validated prediction error, AIC, BIC, or adjusted $R^2$.
The first step of a stepwise forward regression is to find the single most powerful predictor in a univariate linear regression model for the target variable `GPP_NT_VUT_REF` among all fourteen available predictors in our data set (all except those of type `date` or `character`). Implement this first part of the search, using the definition of the stepwise-forward algorithm above. Remove all rows with at least one missing value before starting the predictor search.
- Which predictor achieves the highest $R^2$?
- What value is the $R^2$?
- Visualise $R^2$ for all univariate models, ordered by their respective $R^2$ values.
- Do you note a particular pattern? Which variables yield similar $R^2$? How do you expect them to be included in multivariate models of the subsequent steps in the stepwise forward regression?
_Hints_:
+ Model structure:
- The "counter" variables in the for loop can be provided as a vector, and the counter will sequentially take on the value of each element in that vector. For example: `for (var in all_predictors){ ... }`.
+ Algorithm:
- To record $R^2$ values for the different models, you may start by creating an empty vector (`vec <- c()`) before the loop and then sequentially add elements to that vector inside the loop (`vec <- c(vec, new_element)`). Alternatively, you can do something similar, but with a data frame (initialising with `df_rsq <- data.frame()` before the loop, and adding rows by `df_rsq <- bind_rows(df_rsq, data.frame(pred = predictor_name, rsq = rsq_result))` inside the loop).
- A clever way how to construct formulas dynamically is described, for example in [this stackoverflow post](https://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables).
+ Value retrieving:
- Extract the $R^2$ from the linear model object: `summary(fit_lin)[["r.squared"]]`
+ Visualising:
- Search for solutions for how to change the order of levels to be plotted yourself.
```{r message=F, warning=F}
library(tidyverse)
## read CSV and drop missing values in one step
df <- read_csv("./data/ddf_for_08_application.csv") %>%
drop_na()
## specify target variable
target <- 'GPP_NT_VUT_REF'
## determine predictors as all except site ID, timestamp, and the target (should be 14)
preds <- df %>%
dplyr::select(-siteid, -TIMESTAMP, -GPP_NT_VUT_REF) %>%
names()
## initialise an empty data frame (necessary, because otherwise we cannot use bind_rows() below)
df_rsq <- data.frame()
# rsq_list <- c() # alternative for vector
for (var in preds){
## create formula dynamically
forml <- as.formula(paste(target, "~", var))
## fit linear model
fit_lin <- lm(forml, data = df)
## extract R2 from linear model
rsq <- summary(fit_lin)[["r.squared"]]
## add a row to the data frame that holds the results
df_rsq <- bind_rows(df_rsq, data.frame(pred = var, rsq = rsq))
# rsq_list <- c(rsq_list,rsq) # alternative with vector
}
## print best single predictor and its R2
df_rsq %>%
arrange(-rsq) %>% # arrange by R2 in descending order (highest R2 on top)
slice(1) # print only the first row (with highest R2)
```
```{r}
library(ggplot2)
## alternative: determine the first variable to enter into our model
# preds[which.max(rsq_list)]
## use the data frame that holds the results for plotting
df_rsq %>%
ggplot(aes(x = reorder(pred, rsq), y = rsq)) +
geom_bar(stat = "identity") +
labs(y = expression(italic(R)^2), x = "Variable") +
coord_flip()
```
**Solution**:
*`PPFD_IN` achieves the highest $R^2$ value, and the corresponding $R^2$ values is 0.5276123.*
*We could see from the above plot that `LW_IN_F_MDS` and `LW_IN_F`, `VPD_F` and `VPD_F_MDS`, `SW_IN_F` and `SW_IN_F_MDS` share basically the same $R^2$ values. Those variables are highly correlated and are explaining the same amount of variance in the target variable. In the subsequent steps of the stepwise forward regression, since each of the two variables are highly correlated with each other, it's likely that only one of them will be selected into our final model.*
### Full stepwise regression
Now, we take it to the next level and implement a full stepwise forward regression as described above. For each step (number of predictors $k$), record the following metrics: $R^2$, *adjusted-*$R^2$, the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and the 5-fold cross-validation $R^2$ and RMSE.
- Write pseudo-code for how you plan to implement the algorithm first.
- Implement the algorithm in R, run it and display the order in which predictors enter the model.
- Display a table with the metrics of all $k$ steps, and the single variable, added at each step.
_Hints_:
+ Model structure:
- Recall what you learned in the breakout session, you may use the same idea on this task. Try to think of the blueprint (*pseudo-code*) first: How to go through different models in each forward step? How to store predictors added to the model and how to update candidate predictors?
+ Algorithm:
- A complication is that the set of predictors is sequentially complemented at each step of the search through $k$. You may again use `vec <- list()` to create an empty vector, and then add elements to that vector by `vec <- c(vec, new_element)`.
- It may be helpful to explicitly define a set of "candidate predictors" that may potentially be added to the model as a vector (e.g., `preds_candidate`), and define predictors retained in the model from the previous step in a separate vector (e.g., `preds_retained`). In each step, search through `preds_candidate`, select the best predictor, add it to `preds_retained` and remove it from `preds_candidate`.
- At each step, record the metrics and store them in a data frame for later plots. As in the first "warm-up" exercise, you may record metrics at each step as a single-row data frame and sequentially stack (bind) them together.
- (As above) A clever way how to construct formulas dynamically is described, for example in [this stackoverflow post](https://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables).
- The metrics for the $k$ models are assessed *after* the order of added variables is determined. To be able to determine the metrics, the $k$ models can be saved by constructing a list of models and sequentially add elements to that list (`mylist[[ name_new_element ]] <- new_element`). You can also fit the model again after determining which predictor worked best.
- Your code will most certainly have bugs at first. To debug efficiently, write code first in a simple R script and use the debugging options in RStudio (see [here](https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio)).
+ Value retrieving
- To get AIC and BIC values for a given model, use the base-R functions `AIC()` and `BIC()`.
- To get the cross-validated $R^2$ and RMSE, use the caret function `train()` with RMSE as the loss function, and `method = "lm"` (to fit a linear regression model). Then extract the values by `trained_model$results$Rsquared` and `trained_model$results$RMSE`.
+ Displaying:
- To display a table nicely as part of the RMarkdown html output, use the function `knitr::kable()`
- To avoid reordering of the list of variable names in plotting, change the type of variable names from "character" to "factor" by `pred <- factor(pred, levels = pred)`
<!-- - remove element by index number from a vector `vec[-2]` -->
<!-- - concatenate string vectors with certain separations: `paste(c("predictor_1","predictor_2","predictor_3"),collapse = '+')` gives `"predictor_1 + predictor_2 + predictor_3"` -->
<!-- - fit a linear regression with three predictors: `fit_lin <- lm(target ~ predictor_1 + predictor_2 + predictor_3, data = df)` -->
<!-- - execute a string-like expression: `eval(parse(text = <string>))`, for example: `eval(parse(text='mean(c(2,3,4))'))=3` -->
<!-- - get AIC and BIC from a linear regression model object: `AIC(fit_lin)`, `BIC(fit_lin)` -->
```{r message=FALSE}
library(caret) # need train() for cross-validation
## specify target variable (as above)
target <- 'GPP_NT_VUT_REF'
## determine predictors as all except site ID, timestamp, and the target (should be 14) (as above)
preds <- df %>%
dplyr::select(-siteid, -TIMESTAMP, -GPP_NT_VUT_REF) %>%
names()
# This is the vector of candidate predictors to be added in the model. To begin with, consider all as candidates.
preds_candidate <- preds
# predictors retained in the model from the previous step. To begin with, is empty.
preds_retained <- c()
## work with lists as much as possible (more flexible!)
df_metrics <- data.frame()
## outer loop for k predictors
for (k_index in 1:length(preds)){
# rsq_candidates <- c()
df_rsq_candidates <- data.frame()
linmod_candidates <- list()
## inner loop for single additional predictor
for (ipred in preds_candidate){
# variable vector (new variable + retained variables) used in regression
pred_add <- c(preds_retained, ipred)
# define formulate with newly-added predictor
forml <- as.formula(paste( target, '~', paste(pred_add, collapse = '+')))
# fit linear model
fit_lin <- lm(forml, data = df)
# add model object to list, and name the element according to the added variable
linmod_candidates[[ ipred ]] <- fit_lin
# record metrics for all candidates
rsq <- summary(fit_lin)[["r.squared"]]
df_rsq_candidates <- bind_rows(df_rsq_candidates, data.frame(pred = ipred, rsq = rsq)) # when storing R2 in a data frame
# rsq_candidates <- c(rsq_candidates, rsq) # when storing R2 as a vector
}
## get name of candidate predictor that achieved the highest R2.
pred_add <- df_rsq_candidates %>% # when storing R2 in a data frame
arrange(desc(rsq)) %>%
slice(1) %>%
pull(pred) %>%
as.character()
# pred_add <- preds_candidate[ which.max(rsq_candidates) ] # when storing R2 as a vector
## add best predictors to retained predictors
preds_retained <- c(preds_retained, pred_add)
## get the cross-validation R2
forml <- as.formula(paste( target, '~', paste(preds_retained, collapse = '+')))
control <- trainControl(method = "cv", number = 5)
set.seed(123)
# cv_modl <- train(forml, data = df, method = "lm", , metric = "Rsquared", trControl = control)
cv_modl <- train(forml, data = df, method = "lm", metric = "RMSE", trControl = control)
# record AIC and BIC and adjusted-R2 of the respective model
df_metrics <- df_metrics %>%
bind_rows(
data.frame( pred = pred_add,
rsq = summary(linmod_candidates[[ pred_add ]])[["r.squared"]],
adj_rsq = summary(linmod_candidates[[ pred_add ]])[["adj.r.squared"]],
cv_rsq = cv_modl$results$Rsquared,
cv_rmse = cv_modl$results$RMSE,
aic = AIC(linmod_candidates[[ pred_add ]]),
bic = BIC(linmod_candidates[[ pred_add ]])
)
)
# remove the selected variable from the candidate variable list
preds_candidate <- preds_candidate[-which(preds_candidate == pred_add)]
# preds_candidate <- setdiff(preds_candidate,pred_add) # alternative
}
data.frame(df_metrics$pred) # order in which variables enter the model
df_metrics %>% knitr::kable()
```
- Visualise all metrics as a function of the number of predictors (add labels for the variable names of the added predictor). Highlight the best-performing model based on the respective metric. How many predictors are in the best performing model, when assessed based on each metric?
```{r eval = FALSE}
library(ggplot2)
df_metrics$pred <- factor(df_metrics$pred, levels = df_metrics$pred)
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = rsq)) +
geom_point(data = filter(df_metrics, rsq == max(rsq)), aes(x = pred, y = rsq), color = "red") +
labs(title = expression(italic(R)^2)) +
coord_flip()
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = adj_rsq)) +
geom_point(data = filter(df_metrics, adj_rsq == max(adj_rsq)), aes(x = pred, y = adj_rsq), color = "red") +
labs(title = expression(paste("Adjusted-", italic(R)^2))) +
coord_flip()
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = cv_rsq)) +
geom_point(data = filter(df_metrics, cv_rsq == max(cv_rsq)), aes(x = pred, y = cv_rsq), color = "red") +
labs(title = expression(paste("Cross-validated ", italic(R)^2))) +
coord_flip()
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = cv_rmse)) +
geom_point(data = filter(df_metrics, cv_rmse == min(cv_rmse)), aes(x = pred, y = cv_rmse), color = "red") +
labs(title = expression(paste("Cross-validated ", RMSE))) +
coord_flip()
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = aic)) +
geom_point(data = filter(df_metrics, aic == min(aic)), aes(x = pred, y = aic), color = "red") +
labs(title = "AIC")+
coord_flip()
ggplot() +
geom_point(data = df_metrics, aes(x = pred, y = bic)) +
geom_point(data = filter(df_metrics, bic == min(bic)), aes(x = pred, y = bic), color = "red") +
labs(title = "BIC")+
coord_flip()
```
- Observe which predictors are selected into the final model and which not. Why?
*Solution:*
Take the best model selected by BIC as an example, `VPD_F` is selected into the model and `VPD_F_MDS` is not. They won’t enter the model both because they are highly correlated and when one is in, the other doesn’t add information, which proves our expectation in the previous question.
- Whether it matters to the order of variables entering into the model which metric (i.e. AIC, BIC or $R^2$) is used? Why?
*Solution:*
No, it doesn't matter. For each inner loop, when deciding which variable is sequentially chosen, we compare models of the same level (i.e. same number of variables). In this sense, model complexities for all the models are the same, and AIC/BIC only compares the goodness of fit part, which works the same as $R^2$.
- Discuss your results. Which metric do you trust most? Why? Try out your algorithm without evaluating the cross-validated $R^2$. Do you get a computational efficiency gain? Which metric comes closest to the cross-validated $R^2$ in terms of selecting the same number predictors?
*Solution:*
Cross-validated $R^2$ is the "gold-standard" for assessing generalisability. However, it is computationally costly and may not provide robust results when the data set is small. Since AIC and BIC both penalise additional predictors, they both don't select the full model with 14 variables as the best model (However, it could happen that the best model selected by AIC/BIC is the full model). The BIC selects for the same number of predictors as the cross-validated $R^2$ and RMSE, but without the disadvantage of the computational costs. It provides a conservative and apparently robust estimate for the model generalisability. Note that the order of added predictors is determined by the simple $R^2$ and is therefore independent of the final metrics we use for choosing the best $k$.
### Bonus: Stepwise regression out-of-the-box
In R, you can also conduct the above variable selection procedures automatically. `regsubsets()` from leaps package provides a convenient way to do so. It does model selection by exhaustive search, including forward or backward stepwise regression.
- Do a stepwise forward regression using `regsubsets()`.
- Create a data frame with the same metrics as above (except cross-validated $R^2$).
- Visualise metrics as above.
- Are your result consistent?
_Hints_:
- Specify stepwise *forward* by setting `method = "forward"`.
- Specify the number of predictors to examine, that is all fourteen, by setting `nvmax = 14`.
- AIC values of each step are stored in `summary(regfit.fwd)$cp` and BIC values of each step are in `summary(regfit.fwd)$bic`, etc.
- Get order in which predictors $k$ are added (which corresponds to values returned by `summary(regfit.fwd)$bic`), by `all_predictors[regfit.fwd$vorder[2:15]-1]`. `vorder` is the order of variables entering into model. Note that `Intercept` is counted here as the first to enter into the model and should be removed.
- To avoid reordering of the list of variable names in plotting, change the type of variable names from "character" to "factor" by `preds_enter <- factor(preds_enter, levels = preds_enter)`
```{r}
library(leaps)
regfit.fwd <- regsubsets(as.formula(paste(target, "~", paste(preds,collapse=" + "))), data = df, method = "forward", nvmax = 14)
reg.summary <- summary(regfit.fwd)
# get variables in their order added to the model
preds_enter <- preds[regfit.fwd$vorder[2:15]-1]
## create metrics data frame
df_metrics_auto <- data.frame(
preds = factor(preds_enter, levels = preds_enter),
rsq = reg.summary$rsq,
adj_rsq = reg.summary$adjr2,
aic = reg.summary$cp,
bic = reg.summary$bic
)
ggplot() +
geom_point(data = df_metrics_auto, aes(x = preds, y = rsq)) +
geom_point(data = filter(df_metrics_auto, rsq == max(rsq)), aes(x = preds, y = rsq), color = "red") +
labs(title = expression(italic(R)^2)) +
coord_flip()
ggplot() +
geom_point(data = df_metrics_auto, aes(x = preds, y = adj_rsq)) +
geom_point(data = filter(df_metrics_auto, adj_rsq == max(adj_rsq)), aes(x = preds, y = adj_rsq), color = "red") +
labs(title = expression(paste("Adjusted-", italic(R)^2))) +
coord_flip()
ggplot() +
geom_point(data = df_metrics_auto, aes(x = preds, y = aic)) +
geom_point(data = filter(df_metrics_auto, aic == min(aic)), aes(x = preds, y = aic), color = "red") +
labs(title = "AIC")+
coord_flip()
ggplot() +
geom_point(data = df_metrics_auto, aes(x = preds, y = bic)) +
geom_point(data = filter(df_metrics_auto, bic == min(bic)), aes(x = preds, y = bic), color = "red") +
labs(title = "BIC")+
coord_flip()
```
### Bonus: Best Subset Selection - Not included in application for students (too long...)
First we generate our data.
```{r}
set.seed(41)
#data size
n = 500
#predictors
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
#generate output
y = 4*x1 + 3*x3 + rnorm(n)
df_data = data.frame(y = y, x1 = x1,x2 = x2, x3 = x3)
head(df_data)
```
The true underline model uses as predictors the variables x1 and x3. However in practise we do not know this information. What we really see is the df_data file. For this exercise we want to find those variables that are most informatives. Since we have 3 available predictors we can create 7 different models using all possible combinations of predictors. This type of variable selection is reffered as _Best Subset Selection_.
```{r}
formulas = c('y~x1','y~x2','y~x3','y~x1+x2','y~x1+x3','y~x2+x3','y~x1+x2+x3')
cat("All possible models \n",paste(formulas,collapse = ' , '),'\n')
```
In order to find the optimal model we will use two methods:
a. BIC
b. Cross Validated MSE
We note here that the BIC criterion uses all the available data while the MSE uses the cross validated test data.
#### BIC
```{r}
#create vector that stores the BIC results for all the models
bic_s = rep(NA,length(formulas))
#iterate over the models
for (i in 1:length(formulas)){
#construct formula
formula = as.formula(formulas[i])
#fit linear model
lm_fit = lm(formula,data = df_data)
#calculate bic
bic_s[i] = BIC(lm_fit)
}
```
find the optimal model according to BIC
```{r}
#find the optimal model according to BIC
cat("Optimal Model BIC: ",formulas[which.min(bic_s)])
```
#### Cross Validated MSE
```{r}
#shuffle the data
shuffled_id = sample(x = 1:n, size = n)
df_data = df_data[shuffled_id,]
#specify the number of folds
n_folds = 5
#folds assignments (in which folder each point is assigned)
folds = cut(1:n,breaks = n_folds,labels = FALSE)
#create matrix to store the MSE from each test folder for all models : dimension 7 x n_folds
mse_s = matrix(NA,nrow = length(formulas),ncol = n_folds)
#iterate over the models
for (i in 1:length(formulas)){
#construct formula
formula = as.formula(formulas[i])
#cross validation for the i-th model
for(j in 1:n_folds){
#take train indixes for all folders except j
train_ind = (folds != j)
#take test indexes for folder j
test_ind = (folds == j)
#train data that contains all the points that do not belong in folder j
train_data = df_data[train_ind,]
#test data that contains all the points that belong to folder j
test_data = df_data[test_ind,]
#fit linear model using the training data
lm_fit = lm(formula, data = train_data)
# calculate mse on test data
## make predictions
y_pred = predict(lm_fit,test_data)
#calculate mse of test folder --> jth folder
mse = mean((y_pred - test_data$y)^2)
mse_s[i,j] = mse
}
}
```
Check the MSE error for each test folder for every model.
```{r}
mse_s
```
Calculate Cross Validated MSE for every model.
```{r}
cv_mse = rowMeans(mse_s)
cv_mse
```
Find the optimal model according to Cross Validated MSE.
```{r}
#find the optimal model according to CV
cat("Optimal Model CV_MSE: ",formulas[which.min(cv_mse)])
```