---
editor_options:
  markdown:
    wrap: 72
---
# Supervised Machine Learning II {#ch-07}
## Introduction
### Learning objectives
After this learning unit, you will be able to ...
- Explain the effect of hyperparameter tuning.
- Assess the generalisability of a trained model.
- Understand the purpose of resampling.
- Avoid data leakage during model training.
### Key points from the lecture
**Training data** is the data that is used to train our model - "to fit
a curve to the data points".
**Testing data** is set aside at the initial split and not "touched"
during model training. It is key for testing a model's predictive power
and assessing whether it is overfitted.
**Validation data** is used for determining the loss during model
training. The reason for distinguishing between testing and validation
data is to ensure that model training is not misled by peculiarities of
the training data, and that we get an assessment of generalisability
based on data that was not seen during model training. This distinction
might be somewhat confusing for now; have a look at this [blog
post](https://machinelearningmastery.com/difference-test-validation-datasets/)
for additional explanations.
**Model training** minimises the loss. In other words, it optimises the
agreement between predicted and observed values. The loss is most
commonly measured by the root mean square error (RMSE).
To tune a model, you can set *hyperparameters*, which determine the
model structure or how its coefficients are calibrated. The *k* in KNN
is such a hyperparameter.
**Generalisability** refers to the model's performance on data not seen
during training - the testing data. To avoid overfitting,
generalisability must already be kept in check during model training.
One method to guide model training accordingly is **cross-validation**.
**Data leakage** is when information from the testing dataset "creeps"
into the training data. To avoid this, the testing set must be left
completely untouched!
## Tutorial
### Model formulation using `train()`
In the previous tutorial you learned how to wrangle, split and
pre-process data (remember the recipes) and how to formulate a linear
model in base R. To learn more about model creation, let us first repeat
the data preparation from the previous tutorial:
```{r message=FALSE, warning=FALSE}
## Load data and extract relevant subset
library(tidyverse)
ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") %>%
  select(starts_with("TIMESTAMP"),
         ends_with("_F"),
         GPP_NT_VUT_REF,
         NEE_VUT_REF_QC,
         -contains("JSB")) %>%
  mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) %>%
  na_if(-9999) %>%
  select(-ends_with("_QC"), NEE_VUT_REF_QC) %>%
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) %>%
  select(-NEE_VUT_REF_QC) %>%
  drop_na()

## Split data into testing and training data
library(rsample)
set.seed(1982)  # for reproducibility
split <- initial_split(ddf, prop = 0.7)
ddf_train <- training(split)
ddf_test <- testing(split)

## Base R formulation for linear model
lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = ddf)
```
Ideally, the way we formulate a model should be independent of the
algorithm, or *engine*, that takes care of fitting $f(X)$. As mentioned
in the previous tutorial, the R package
[**caret**](https://topepo.github.io/caret/) provides a unified
interface for using different ML algorithms implemented in separate
packages. In other words, **caret** acts as a *wrapper* for multiple
different model fitting, or ML, algorithms. This has the advantage that
it unifies the interface (the way arguments are provided). **caret**
also provides implementations of a set of commonly used tools for data
processing, model training, and evaluation. We'll use **caret** for
model training with the function `train()`. The same linear regression
model as specified above with the base-R `lm()` function can be
formulated with **caret** in a generalised form as:
```{r message=F, warning=F}
library(caret)
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
  data = ddf,
  method = "lm"
)
```
Of course, this is overkill compared to just writing `lm(...)`. But
the advantage of the unified interface is that we can simply replace the
`method` argument to use a different ML algorithm. For example, to use a
random forest model implemented by the **ranger** package, we can write:
```{r eval=F}
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
  data = ddf,
  method = "ranger",
  ...
)
```
The `...` hints at the fact that there are a few more arguments to be
specified for training a random forest model with ranger. We learn more
about these arguments below in Chapter \@ref(training).
Another useful aspect of using **caret** to formulate models is that it
takes recipes from the **recipes** package as input. This way, we can
make sure that exactly the same pre-processing steps are applied to the
training data, the testing data and, as will be introduced later, the
validation data.
```{r eval=FALSE, message=FALSE, warning=FALSE}
library(recipes)
pp <- recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = ddf) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
```
The object `pp` can then be supplied to `train()` as its first argument:
```{r, eval=FALSE}
train(
  pp,
  data = ddf,
  method = "ranger",
  ...
)
```
### Model training
Model training in supervised ML is guided by the match (or mismatch)
between the predicted and observed target variable(s), that is, between
$\hat{Y}$ and $Y$. The *loss* function quantifies this mismatch
($L(\hat{Y}, Y)$), and the algorithm takes care of progressively
reducing the loss during model training.
Let's say the ML model contains two parameters and predictions can be
considered a function of the two ($\hat{Y}(w_1, w_2)$). $Y$ doesn't
change as a function of $w_1$ and $w_2$. Thus, the loss is effectively a
function $L(w_1, w_2)$, and we can consider model training as a search
of the parameter space of the machine learning model $(w_1, w_2)$ for
the minimum of the loss. Common loss functions are the root mean square
error (RMSE), the mean square error (MSE), and the mean absolute error
(MAE). Loss minimisation is a general feature of ML model training.
```{r, echo = FALSE, fig.cap = "Visualization of a loss function as a plane spanned by the two parameters $w_1$ and $w_2$."}
knitr::include_graphics("figures/loss_plane.png")
```
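To make the notion of a loss surface concrete, the following sketch (an illustration only; it assumes the `ddf` data frame loaded above, and the parameter ranges are chosen ad hoc) evaluates the RMSE of a simple univariate linear model - slope $w_1$ and intercept $w_2$, predicting `GPP_NT_VUT_REF` from `SW_IN_F` - on a grid of candidate parameter values and visualises the resulting loss surface.

```{r eval=FALSE}
## sketch: RMSE loss evaluated on a grid of candidate parameters
## (w_1: slope, w_2: intercept; ranges are ad hoc for illustration)
loss_surface <- expand.grid(
  w_1 = seq(0, 0.06, length.out = 50),
  w_2 = seq(-4, 6, length.out = 50)
) %>%
  rowwise() %>%
  mutate(rmse = sqrt(mean((ddf$GPP_NT_VUT_REF - (w_1 * ddf$SW_IN_F + w_2))^2))) %>%
  ungroup()

## parameter combination with the smallest loss
loss_surface %>% arrange(rmse) %>% slice(1)

## visualise the loss surface
ggplot(loss_surface, aes(x = w_1, y = w_2, fill = rmse)) +
  geom_raster() +
  labs(x = expression(w[1]), y = expression(w[2]), fill = "RMSE")
```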
Model training is implemented in R for different algorithms in different
packages. Some algorithms are even implemented by multiple packages
(e.g., `nnet` and `neuralnet` for artificial neural networks). Again,
**caret** comes to the rescue as it offers a large selection of ML model
implementations from different packages (see
[here](https://topepo.github.io/caret/available-models.html) for an
overview of available models). The `train()` function takes an argument
`metric`, which specifies the loss function and defaults to the RMSE for
regression models and to accuracy for classification (see the section on
metrics below).
#### Hyperparameter tuning
All ML algorithms have some "knobs" to turn in order to achieve
efficient model training and predictive performance. Such "knobs" are
called *hyperparameters* and their meaning and effect depend on the ML
algorithm.
For k-nearest neighbours (KNN), it is `k` - the number of nearest
neighbours whose observed target values are averaged to form a
prediction. There is always an optimum $k$ between the extremes.
Obviously, if $k = n$, we consider all observations as neighbours and
each prediction is simply the mean of all observed target values $Y$,
irrespective of the predictor values. This cannot be optimal, and such a
model is likely underfit. On the other extreme, with $k = 1$, the model
will be strongly affected by the noise in the single nearest neighbour
and its generalisability will suffer. This should be reflected in a poor
performance on the validation data.
For random forests from the **ranger** package, hyperparameters are:
- `mtry`: the number of variables randomly sampled as candidates at each
  split, often taken as $p/3$, where $p$ is the number of predictors.
- `min.node.size`: the minimum number of data points in a terminal node
  (at the "bottom") of each decision tree.
- `splitrule`: the rule applied to the data in each branch of a tree to
  determine the goodness of a candidate split.
Hyperparameters usually have to be "tuned" and their optimal values
depend on the data. Therefore they cannot be known *a priori* but must
be found on-the-go. In **caret**, hyperparameter tuning is implemented
as part of the `train()` function. Values of hyperparameters to consider
are to be specified by the argument `tuneGrid`, which takes a data frame
with column(s) named according to the name(s) of the hyperparameter(s)
and rows for each combination of hyperparameters to consider. Specifying
the three hyperparameters of a random forest model looks like this:
```{r eval=F}
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
  data = ddf,
  method = "ranger",
  tuneGrid = expand.grid(.mtry = floor(6 / 3),
                         .min.node.size = c(3, 5, 9, 15, 30),
                         .splitrule = c("variance", "maxstat")),
  ...
)
```
Here, `expand.grid()` is used to generate a data frame with all possible
combinations of the values provided for the individual hyperparameters
(e.g., 3, 5, 9, 15, 30 for the minimum node size). For a KNN, it
suffices to pass a data frame with a single column of numbers to
`tuneGrid` (e.g.,
`data.frame(k = c(2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100))`).
The `train()` function will then fit a KNN model for each of these
values to find the optimal $k$.
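To see what `expand.grid()` produces, here is a quick illustration (values as in the chunk above):

```{r eval=FALSE}
## every combination of the supplied hyperparameter values, one row per combination
hp_grid <- expand.grid(.mtry          = floor(6 / 3),
                       .min.node.size = c(3, 5, 9, 15, 30),
                       .splitrule     = c("variance", "maxstat"))
hp_grid
nrow(hp_grid)  # 1 x 5 x 2 = 10 candidate combinations evaluated by train()
```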
#### Resampling
The goal of model training is to achieve the best possible model
generalisability. That is, the best possible model performance when
predicting to data that was not used for training - the test data.
Resampling mimics the comparison of predictions to the test data.
Instead of using all training data at once, the training data is
*resampled* into a number of further splits - pairs of training and
*validation* data.
Model training is then guided by minimising the average loss determined
on each resample of the validation data. Having multiple resamples
(multiple *folds* of training-validation splits) avoids the loss
minimization from being misguided by random peculiarities in the
training and/or validation data.
A common resampling method is *k-fold cross validation*, where the
training data is split into *k* equally sized subsets (*folds*). Then,
there will be *k* iterations, where each fold is used for validation
once (while the remaining folds are used for training). An extreme case
is *leave-one-out cross validation*, where *k* corresponds to the number
of data points.
```{r cv, echo = FALSE, fig.cap = "Cross-validation resampling (figure from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/process.html))."}
knitr::include_graphics("./figures/cv.png")
```
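Just to illustrate what such folds look like, here is a small sketch using **caret**'s `createFolds()` (for illustration only; in the next step, `train()` handles the folds internally):

```{r eval=FALSE}
library(caret)

## split the training rows into 5 folds; each list element holds the row
## indices of one validation fold
folds <- createFolds(ddf_train$GPP_NT_VUT_REF, k = 5)
sapply(folds, length)  # each fold contains roughly 1/5 of the training rows
```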
To do a k-fold cross-validation during model training in R, we don't
have to implement the loops around the folds ourselves. The resampling
procedure can be specified in the **caret** function `train()` with the
argument `trControl`. The object that this argument takes is the output
of a function call to `trainControl()`, which can either be created
separately or specified directly inside the call to `train()`. For
example, to do a 10-fold cross-validation, we can write:
```{r eval=F}
train(
  pp,
  data = ddf_train,
  method = "ranger",
  tuneGrid = expand.grid(.mtry = floor(6 / 3),
                         .min.node.size = c(3, 5, 9, 15, 30),
                         .splitrule = c("variance", "maxstat")),
  trControl = trainControl(method = "cv", number = 10),
  ...
)
```
In certain cases, data points stem from different "groups", and
generalisability across groups is critical. In such cases, data from a
given group must not be used both in the training and validation sets.
Instead, splits should be made along group delineations. The **caret**
function `groupKFold()` offers a solution for this case, as sketched
below.
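A sketch of how this could look, assuming (purely for illustration) that the year of measurement defines the groups:

```{r eval=FALSE}
## grouping variable: year of measurement (an illustrative choice)
group <- lubridate::year(ddf_train$TIMESTAMP)

## groupKFold() returns, for each fold, the row indices used for training;
## rows of a given group never end up in both training and validation sets
group_folds <- groupKFold(group, k = 5)

## pass the folds to train() via trainControl()
tc_grouped <- trainControl(method = "cv", index = group_folds)
```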
### Model evaluation
In previous chapters, you have already encountered different metrics
that can be used to quantify the agreement between predicted ($\hat{y}$)
and observed ($y$) values (e.g., the root mean square error). These
metrics are essential to guide model training and inform the evaluation.
Different metrics measure different aspects of the model-data agreement
and different metrics are used for regression and classification models.
#### Metrics for regression models
Different metrics measure, for example, the correlation between modeled
and observed values or the magnitude of the errors. To get an intuitive
understanding of their different abilities, compare the scatterplots in
Figure \@ref(fig:metrics) and how different aspects of the model-data
agreement are measured by different metrics. Their definitions will
follow below.
```{r metrics, echo = FALSE, fig.cap = "Comparison of model metrics on different data sets."}
knitr::include_graphics("./figures/correlation_error.png")
```
- **MSE**: The mean squared error is defined, as its name suggests,
as: $$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2
$$It measures accuracy, i.e., the magnitude of the errors, and is
minimized during model training when used as a loss function. Note
that since it scales with the square of the errors, the MSE is
particularly sensitive to large errors in single points (including
outliers). You may notice the difference from the error variance
$\widehat{\sigma}^2$, which was defined in Chapter \@ref(ch-06)
similarly to the MSE above, but with $n-p$ in the
denominator instead of $n$. The denominator $n-p$ corresponds to the
*degrees of freedom* (size of our sample minus the number of
parameters to estimate). Here we want to compute the mean of the
errors, hence we divide by $n$.
- **RMSE**: The root mean squared error is, as its name suggests, the
root of the MSE: $$
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2}
$$Like the MSE, the RMSE also measures accuracy (the magnitude of
the errors) and is minimized during model training. By taking the
square root of the mean squared errors, the RMSE is in the same units
as the data $y$ and is less sensitive to outliers than the MSE.
**Checkpoint**
Implement the formula for RMSE using simple "low-level" functions like
`sqrt()` and `mean()`. Confirm that the function `rmse()` from the
yardstick package computes the RMSE the same way.
**Solution**
```{r}
## generate random data
df_demo <- tibble(x = rnorm(100)) %>%
  mutate(y_obs = x + rnorm(100),
         y_pred = x)

yardstick::rmse(df_demo, y_obs, y_pred) %>% pull(.estimate)
sqrt(mean((df_demo$y_pred - df_demo$y_obs)^2))
```
- $R^2$, also called the *coefficient of determination*, describes the
proportion of variation in $y$ that is captured by modelled values
$\hat{y}$. In this case, the goal is to maximize the metric, thus
trying to explain as much variation as possible. In contrast to the
MSE and RMSE, $R^2$ measures consistency, or correlation, or
goodness of fit, and not accuracy. It is traditionally defined as:
$$
R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2}
$$A perfect fit is quantified by $R^2 = 1$ and uninformative
estimates have an $R^2$ approaching zero when compared to
observations.
- **Pearson's** $r^2$: The linear association between two variables
    (here $y$ and $\hat{y}$) is measured by the *Pearson correlation
    coefficient* $r$. Its square is closely related to the coefficient
    of determination and, in common cases of prediction-observation
    comparisons, almost identical. $$
    r = \frac{\sum_i (y_i - \bar{y}) (\hat{y}_i - \bar{\hat{y}}) }{\sqrt{ \sum_i(y_i-\bar{y})^2 \; \sum_i (\hat{y}_i-\bar{\hat{y}})^2 } }
    $$
The distinction between uppercase and lowercase nomenclature is often
not consistent in the literature. The uppercase $R^2$ is commonly used
in the context of comparing observed and predicted values with the
coefficient of determination. When the *correlation* between two
different variables in a sample is quantified, the lowercase $r^2$ is
commonly used. In a linear regression with an estimated intercept, the
coefficient of determination and the square of the Pearson's correlation
coefficient are equal. However, when comparing estimated and observed
values, the coefficient of determination can return negative values for
uninformative estimates.
Metrics for correlation should not be used as a loss function because
they do not penalise biased models. This is also illustrated in the
plots above.
The yardstick library implements the definition of the coefficient of
determination with its function `rsq_trad()`.
```{r}
## generate random data
df_demo <- tibble(x = rnorm(100)) %>%
  mutate(y_obs = x + rnorm(100),
         y_pred = x)

## the equation given above for the coefficient of determination corresponds to 'rsq_trad()'
1 - sum((df_demo$y_pred - df_demo$y_obs)^2) / sum((df_demo$y_obs - mean(df_demo$y_obs))^2)
yardstick::rsq_trad(df_demo, y_obs, y_pred) %>% pull(.estimate)
```
The square of the Pearson's correlation coefficient, as defined above,
is implemented by the yardstick function `rsq()`, and corresponds also
to the value reported for `Multiple R-squared` by `summary()` on a
linear model object, or simply to `cor(...)^2`.
```{r}
## the equation given above for the squared Pearson's correlation coefficient
## corresponds to 'rsq()', 'cor()^2', and 'summary()$r.squared'
(sum((df_demo$y_pred - mean(df_demo$y_pred)) * (df_demo$y_obs - mean(df_demo$y_obs))))^2 /
  (sum((df_demo$y_obs - mean(df_demo$y_obs))^2) * sum((df_demo$y_pred - mean(df_demo$y_pred))^2))
yardstick::rsq(df_demo, y_obs, y_pred) %>% pull(.estimate)
summary(lm(y_obs ~ y_pred, data = df_demo))$r.squared
cor(df_demo$y_obs, df_demo$y_pred)^2
```
The $R^2$ generally increases when predictors are added to a model, even
if predictors are not informative. This is particularly critical in the
context of machine learning when we compare alternative models that
differ by their number of predictors. In other words, the $R^2$ of a
model with a large number of predictors tends to give an overconfident
estimate of its predictive power.
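A small demonstration of this effect (a sketch; `x_noise` is an invented, purely random predictor added for illustration):

```{r eval=FALSE}
set.seed(42)
df_noise <- ddf %>% mutate(x_noise = rnorm(n()))

## R^2 without and with the uninformative predictor: it never decreases
summary(lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = df_noise))$r.squared
summary(lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F + x_noise, data = df_noise))$r.squared
```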
Above, we have seen how cross-validation can be used to assess
generalisability (model performance on the validation set). This is the
"gold-standard". But when the number of data points is small, cross
validation estimates may not be robust. Without resorting to cross
validation, the effect of spuriously improving the evaluation metric by
adding uninformative predictors can also be mitigated by penalising by
the number of predictors $p$. Different metrics are available.
- **Adjusted** $R^2$: The adjusted $R^2$ discounts values by the
number of predictors as $$
\bar{R}^2 = 1 - (1-R^2) \; \frac{n-1}{n-p-1} \;,
$$where $n$ (as before) is the number of observations and $p$ the
number of predictors. As for $R^2$, the goal is to maximize it. For
a fitted model in R `modl`, it is returned by
`summary(modl)$adj.r.squared`.
- **AIC**: Akaike's Information Criterion is defined as $$
    \text{AIC} = n \log \Big(\frac{\text{SSE}}{n}\Big) + 2(p+2)
    $$where $n$ is the number of observations used for estimation, $p$
    is the number of predictors in the model, and SSE is the sum of
    squared errors (SSE $= \sum_i (y_i-\hat{y_i})^2$). Here, too, the
    goal is to minimise the metric, and the model with the minimum AIC
    is often the best model for forecasting. For large values of $n$,
    minimising the AIC is equivalent to minimising the cross-validated
    MSE.
- **AIC**$_c$: For small values of $n$, the AIC tends to select too
    many predictors. A bias-corrected version of the AIC is defined as:
    $$
    \text{AIC}_c = \text{AIC} + \frac{2(p + 2)(p + 3)}{n-p-3}
    $$The AIC$_c$ is likewise minimised to select the best predictive
    model.
- **BIC**: the Schwarz's Bayesian Information Criterion is defined as
$$
\text{BIC} = n \log \Big(\frac{\text{SSE}}{n}\Big) + (p+2) \log(n)
    $$Again, the goal is to minimise the BIC. This metric has the
    feature that, if there is a true underlying model, the BIC will
    select that model given enough data. The BIC tends to select models
    with fewer predictors than the AIC does.
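Continuing the sketch from above (with the invented `x_noise` predictor), these penalised metrics can be obtained directly in R. Note that base-R `AIC()` and `BIC()` use the likelihood-based definitions, which differ from the SSE-based formulas above only by an additive constant for a given dataset, so model rankings are unaffected.

```{r eval=FALSE}
mod_small <- lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = df_noise)
mod_large <- lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F + x_noise, data = df_noise)

## adjusted R^2 does not reward the uninformative predictor
summary(mod_small)$adj.r.squared
summary(mod_large)$adj.r.squared

## lower AIC/BIC indicates the preferred model
AIC(mod_small); AIC(mod_large)
BIC(mod_small); BIC(mod_large)
```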
#### Metrics for classification
In the examples for this course, we have thus far focused on regression
models. For classification, different metrics for measuring the
agreement between predicted and observed values are used. They will be
introduced in a later chapter. If you're curious already now, good
overviews are provided in the following links:
- [Hands-On Machine Learning with R, Boehmke &
Greenwell](https://bradleyboehmke.github.io/HOML/process.html#model-eval)
for brief definitions.
- [Wikipedia on Confusion
Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
- [Blogpost by M. Sunasra on
Medium](https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b)
#### Residual analysis
Quantifying metrics is one part of model evaluation. The other part is
getting an intuitive understanding of the model-observation agreement
and of where and why the model fails. Getting there is an integral part
of exploratory data analysis. One of the first steps after obtaining
results from your initial model is to investigate its residuals (the
differences between observed and predicted values) and their pattern.
If you can detect a clear pattern or trend in your residuals, then your
model has room for improvement.
Our example data set contains time series of multiple variables. Above,
we have not used time as a predictor but values in the data frame are
ordered by time along rows. An obvious first step is to look at
residuals and their relationship with time (or simply row number in our
case).
A handy function to add predictions and residuals from a fitted model to
the data (must contain the variables used in the model), is
[`augment`](https://broom.tidymodels.org/reference/data.frame_tidiers.html)
from the tidyverse [*broom*](https://broom.tidymodels.org/index.html)
package [@R-broom]. It adds information about observations to a data
set.
Let's fit a linear regression model `GPP_NT_VUT_REF ~ SW_IN_F` and look
at the residuals.
```{r}
library(broom)
linmod1 <- lm(GPP_NT_VUT_REF ~ SW_IN_F, data = ddf)
df1 <- augment(linmod1, data = ddf)

df1 %>%
  mutate(id = row_number()) %>%
  ggplot(aes(id, .resid)) +
  geom_point(size = 1, alpha = .4) +
  xlab("Row number") +
  ylab("Residuals") +
  labs(title = "Univariate linear regression", subtitle = "GPP_NT_VUT_REF ~ SW_IN_F")
```
There is a clear pattern in the residuals, with apparent
autocorrelation (the residual of one data point is correlated with the
residual of the preceding data point). This suggests that there is
temporal information that we did not account for in our model. This
could either stem from additional predictors, not included here, that
have a temporal pattern (note that we only included one predictor in
this model), or from the inherent dynamics of the system itself, which
are not captured by the predictors (e.g., memory effects).
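One way to quantify this (a sketch, using the `df1` data frame created above and assuming its rows are in temporal order):

```{r eval=FALSE}
## lag-1 autocorrelation of the residuals: values close to 1 indicate strong
## temporal dependence between consecutive residuals
cor(df1$.resid[-1], df1$.resid[-nrow(df1)])

## autocorrelation at multiple lags
acf(df1$.resid, main = "Autocorrelation of residuals")
```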
We can also plot residuals versus additional, potentially influential
but not included predictors to guide the revision of the model. However,
it is often not possible to determine effects by eye, especially if
interactions between variables are important. We learn about methods to
sequentially add predictors to a linear regression model in the
'Application' session (see Chapter \@ref(ch-08)).
### Bonus: Model interpretation
Until now, we have seen how to build and train models, tune the
hyperparameters and evaluate models. ML models work so well because they
can effectively make use of large amounts of data and are flexible
enough to model non-linear relationships and interactions, and predict
rare and faint phenomena. This is their great advantage over classical
statistical methods. However, this flexibility is underpinned by a high
model complexity and a large number of parameters. This complexity also
contrasts with traditional statistical methods, e.g., linear regression
where fitted coefficients ($\beta_i$ in Chapter \@ref(ch-06)) can
directly be interpreted and yield information about the sensitivity of
the target variable to the different predictors and even about the
statistical significance of their effect (we haven't looked at that
part). In contrast, the complexity of ML models renders their
interpretation difficult. The ML algorithms' predictive power comes at
the cost of reduced *model interpretability*. ML models often appear to
be "black boxes", which may limit their usefulness in typical research
applications, where we're often not only interested in
*predicting*, but also in *understanding* how the model arrived at its
predictions. We are often interested in identifying patterns in the data
that would otherwise not be visible, but that the algorithm apparently
identified and learned.
In order to interpret a ML model and understand its inner workings, we
have to "ask specific questions" that can be translated into an
evaluation of the trained model that then yields the answer we want.
There are basically two types of questions we can ask:
- How "important" is each predictor variable in our model? This is
answered by *variable importance analysis*.
- What's the functional relationship between the target variable and
each predictor? This is addressed by *partial dependence analysis*.
#### Variable importance
**Model specific approach (for linear regression)**
Some approaches for evaluating feature importance are *model-specific*.
For instance, in the case of linear models, the absolute value of the
*t*-statistic can be used as a measure of feature importance. Such
model-specific interpretation tools are limited to their respective
model classes. There can be advantages to using these model-specific
approaches, as they are more closely linked to the model and its
performance and can directly use the parameters (coefficients) of the
fitted model.
For linear regression models, we can quantify the significance of the
coefficients $\beta$ by testing the null hypothesis that a given
coefficient is, in reality, zero. As an example, let's assume that our data are
generated from the following linear model $$
y = \beta_0 + \beta_1 x + \epsilon
$$ where we assume that our errors $\epsilon$ are independent and come
from a normal distribution. Given only one realization of our data, our
aim is to estimate $\beta_0$ and $\beta_1$. Let's generate random data
for a univariate linear relationship where the true coefficients are
$\beta_1 = 3$ and $\beta_0 =2$. Note that in practice we do not know the
real coefficients and we don't know what the true predictors are. Let's
consider an additional predictor that is available in our data but does
not contribute to the actual data generation process (`x2`). We include
it in the model and want to know whether the respective coefficient it
yields is significant. How can we determine information about its
*importance*? As we've seen before, the `summary()` function generates a
human-readable output. What information does it provide about variable
importance?
```{r}
## generate random data
library(dplyr)
set.seed(123)
n <- 100  # sample size
b_1 <- 3
b_0 <- 2
df_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(y = b_1 * x1 + b_0 + rnorm(n, mean = 0, sd = 1))  # no x2 here

## fit model
linmod_demo <- lm(y ~ x1 + x2, data = df_demo)  # x2 here because it's in the data - is it significant?
summary(linmod_demo)
```
You see that the linear regression fit yields estimates of the
coefficients $\beta_0$ and $\beta_1$ that are relatively close to the
real ones (2.13507 for $\beta_0$ instead of 2, 2.86683 for $\beta_1$
instead of 3). We also see that the estimate (`Estimate`) for (the fake)
coefficient `x2` is close to zero. The standard error of the coefficient
estimate `Std. Error` is an estimate of the standard deviation of
coefficient estimates we would get if the above random data generation
process was repeated many times. It decreases with the sample size (`n`
in the code above). The *t*-statistic (`t value`) is
$\frac{\mbox{Estimate}}{\mbox{Std.Error}}$. Assuming a t-distribution of
coefficient estimates, the *p*-value (`Pr(>|t|)`) quantifies the
probability of obtaining a coefficient estimate at least as extreme (in
terms of its *t*-statistic) if the true coefficient were zero - our null
hypothesis. In our example, the *p*-value for the `x2` estimate is 0.81.
This is not significant at any of the usual significance levels
(significance levels are indicated by the `*` symbols to the right of
the reported p-value; the key is given under `Signif. codes`). In
contrast, the estimates for coefficients $\beta_0$ and $\beta_1$ are
highly significant. Their very low *p*-values indicate that it is highly
unlikely that their true values are zero.
This is illustrated below.
```{r warning=F, message=F}
n <- 100
grid <- seq(-7, 7, length.out = 1000)
t_dist <- dt(grid, df = n - 3)  # 3 is the number of estimated coefficients

ggplot() +
  geom_line(aes(x = grid, y = t_dist, color = 'Null Distribution'), lwd = 1.5) +
  geom_vline(aes(xintercept = 22.208, linetype = ' t value \n for intercept'), color = 'red') +
  scale_color_manual(name = '', values = 'black') +
  scale_linetype_manual(name = '', values = 2) +
  xlim(c(-5, 25)) +
  labs(x = 't value', y = 'Density', title = "Student's t-distribution") +
  theme(legend.position = 'top') +
  guides(color = guide_legend(order = 1),
         linetype = guide_legend(order = 2))

ggplot() +
  geom_line(aes(x = grid, y = t_dist, color = 'Null Distribution'), lwd = 1.5) +
  geom_vline(aes(xintercept = 0.241, linetype = ' t value \n for x2'), color = 'red') +
  scale_color_manual(name = '', values = 'black') +
  scale_linetype_manual(name = '', values = 2) +
  xlim(c(-5, 25)) +
  labs(x = 't value', y = 'Density', title = "Student's t-distribution") +
  theme(legend.position = 'top') +
  guides(color = guide_legend(order = 1),
         linetype = guide_legend(order = 2))
```
Variable importance in a fitted model can be assessed in R using the
**vip** package [@R-vip]. If we provide a linear regression model as
argument, it automatically uses the absolute value of the *t*-statistic
for quantifying the variable importance.
```{r}
library(vip)
linmod3 <- lm(GPP_NT_VUT_REF ~ ., data = ddf %>% dplyr::select(-TIMESTAMP))
vip(linmod3)
```
**Checkpoint**
Verify that the bars plotted in the vip plot correspond to the
t-statistic of the linear regression model.
**Solution**
```{r}
summary(linmod3)  # compare with the values in the `t value` column
```
The results of the feature importance plot using `vip()` agree with the
statistics obtained from the `summary()` function. The variables
`LW_IN_F` and `VPD_F` have the largest $|t|$-statistic values (their t
values lie far in the tails of the t-distribution), which results in
small p-values, indicating that these are the most important features.
On the other end, the variable `P_F` has the smallest $|t|$-statistic,
which results in a large p-value, indicating that this variable is not
as important.
To sum this up, variable importance can be measured for linear
regression models using the *t*-statistic.
While *t*-statistic values provide information for feature importance
analysis that is straightforward to quantify and interpret,
corresponding information has to be extracted by other means for other
ML algorithms (e.g., accuracy degradation in random forests). This makes
it difficult to compare results across different classes of ML models.
**Model-Agnostic approaches**
Model-agnostic approaches offer a solution to this comparability
problem for feature importance analysis. These treat the ML algorithm as
a "black-box", and the separation of the specific model types from their
interpretability measure enables us to compare the feature importance
across different models and ML algorithms. *Permutation-based feature
importance* is such an algorithm. Its idea is that if we *permute*
(re-shuffle) the values of a feature in the training data, we
effectively destroy the relationship between that particular feature and
the target variable. Thus, if the feature is important, the model
performance should degrade substantially. In contrast, for unimportant
features, permuting their values should not degrade the model
performance much.
Algorithmically speaking, this works as follows:

- We first compute the loss for the original model to establish a
  reference loss.
- For each predictor $i$ in $\{1, 2, \dots, p\}$:
    - Randomly permute the values of predictor $i$ in the dataset.
    - Obtain predictions from the already fitted model on the permuted
      data.
    - Compute the loss.
    - Compute the feature importance, e.g., as the ratio or difference
      of the loss with the permuted data and the reference loss.
    - Store this as the feature importance of feature $i$.
- Sort the variables by their feature importance.
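Below is a minimal manual sketch of this procedure for a single predictor (here `TA_F`), using the fitted `linmod3` from above and the RMSE as loss. It is for illustration only; the packaged implementation is used next.

```{r eval=FALSE}
## helper: RMSE of a fitted model on a given data set
rmse_of <- function(mod, data) {
  sqrt(mean((predict(mod, newdata = data) - data$GPP_NT_VUT_REF)^2))
}

df_eval  <- ddf %>% dplyr::select(-TIMESTAMP)
loss_ref <- rmse_of(linmod3, df_eval)  # reference loss of the original model

set.seed(1)
df_perm   <- df_eval %>% mutate(TA_F = sample(TA_F))  # permute TA_F, destroying its link to the target
loss_perm <- rmse_of(linmod3, df_perm)

## importance of TA_F as difference or ratio of permuted and reference loss
loss_perm - loss_ref
loss_perm / loss_ref
```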
In practice, this can be easily implemented using the **vip** package
along with a few extra arguments. The arguments to `vip()` for
permutation-based feature importance are:
- `linmod3`: the fitted model object (the first argument)
- `train`: the training data used to compute the feature importances
- `method`: `"permute"` for the permutation-based method
- `target`: the name of the target variable in the training data
- `metric`: `"RMSE"`, the root mean squared error, for a regression task
- `nsim`: the number of times the permutation is repeated; the result is averaged over all simulations. Choose larger values to increase the stability of the estimates.
- `sample_frac`: the proportion of the training data used in each simulation. The default is `NULL`, i.e., all of the training data is used.
- `pred_wrapper`: a prediction function that takes two arguments, `object` and `newdata`. Its output should match the metric being used: a numeric vector of predicted outcomes for a regression task, or a vector of predicted class labels (or a matrix of predicted class probabilities) for a classification task.
- `type`: a character string specifying how to compare the baseline and permuted performance metrics; the default is `'difference'`, the other option is `'ratio'`.
*Note: In the case of large datasets, `nsim` and `sample_frac` can be
used to reduce the execution time of the algorithm, as permutation-based
approaches can become slow as the number of predictors grows.*
```{r}
## inline function defined to output the predictions for newdata given a particular model object
pfun <- function(object, newdata) predict(object, newdata = newdata)

## variable importance plot using the permute method, with type == 'ratio'
vip(
  linmod3,
  train = ddf %>% dplyr::select(-TIMESTAMP),
  method = "permute",
  target = "GPP_NT_VUT_REF",
  metric = "RMSE",
  nsim = 5,
  sample_frac = 0.8,
  pred_wrapper = pfun,
  type = 'ratio'
)

## variable importance plot using the permute method, with type == 'difference'
vip(
  linmod3,
  train = ddf %>% dplyr::select(-TIMESTAMP),
  method = "permute",
  target = "GPP_NT_VUT_REF",
  metric = "RMSE",
  nsim = 5,
  sample_frac = 0.8,
  pred_wrapper = pfun,
  type = 'difference'
)
```
Now we can compare the importance plots obtained from the
model-specific t-statistic approach and the model-agnostic
permutation-based method. The feature importances obtained from the two
approaches differ a little because of the underlying differences in how
importance is evaluated. Despite these differences, both methods
robustly identify the key features and largely agree on their ranking.
**Checkpoint**
Compute the variable importance for the best-fit KNN model, which we
obtained earlier in the tutorial, using the `vip()` function in default
mode. Repeat the same with the permutation-based method, with `type` set
to `'difference'` and `'ratio'`. Do the approaches return the same top-4
features? Verify.
**Solution**
```{r}
## Data wrangling done above

## Set up recipe
library(recipes)
pp <- recipe(GPP_NT_VUT_REF ~ ., data = ddf_train %>% dplyr::select(-TIMESTAMP)) %>%
  step_center(all_numeric(), -all_outcomes()) %>%  # centre numeric data to have a mean of zero
  step_scale(all_numeric(), -all_outcomes())       # scale numeric data to unit standard deviation

## Set up knn with repeated cross-validation
my_cv <- trainControl(
  method = "repeatedcv",  # the resampling method, e.g. 'boot', 'none', 'cv', 'repeatedcv', ...
  number = 10,            # number of folds or resampling iterations
  repeats = 5             # number of complete sets of folds (repeated k-fold cross-validation only)
)

hyper_grid <- data.frame(k = c(2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100))

set.seed(123)
knn_fit <- train(
  pp,
  data = ddf_train %>% dplyr::select(-TIMESTAMP),
  method = "knn",
  trControl = my_cv,
  tuneGrid = hyper_grid,
  metric = "RMSE"
)

## variable importance plot without the permute method
vip(knn_fit)

## variable importance plot using the permute method, with type == 'difference'
vip(
  knn_fit,
  train = ddf %>% dplyr::select(-TIMESTAMP),
  method = "permute",
  target = "GPP_NT_VUT_REF",
  metric = "RMSE",
  nsim = 5,
  sample_frac = 0.8,
  pred_wrapper = pfun,
  type = 'difference'
)

## variable importance plot using the permute method, with type == 'ratio'
vip(
  knn_fit,
  train = ddf %>% dplyr::select(-TIMESTAMP),
  method = "permute",
  target = "GPP_NT_VUT_REF",
  metric = "RMSE",
  nsim = 5,
  sample_frac = 0.8,
  pred_wrapper = pfun,
  type = 'ratio'
)
```
#### Partial dependence plots
While variable importance analysis is helpful for understanding which
predictors help us predict, *partial dependence plots* (PDPs) yield
information on what the relationship between an individual predictor
and the target variable looks like. A PDP quantifies how the average
prediction of the target variable changes when we vary one particular
predictor while keeping all remaining predictors at their observed
values. If this is still a little muddy, the pseudo-code of this
algorithm should make things clear (see
[this chapter](https://bradleyboehmke.github.io/HOML/iml.html#partial-dependence)
by @boehmke_hands-machine_2019).
```{r eval=F}
For a selected predictor (x)
1. Construct a grid of j evenly spaced values across the range of x: {x1, x2, ..., xj}
2. For i in {1, ..., j} do
     | Copy the training data and replace the original values of x with the constant xi
     | Apply the fitted ML model to obtain vector of predictions for each data point.
     | Average predictions across all data points.
   End
3. Plot the averaged predictions against x1, x2, ..., xj
```
As the steps explain, PDPs plot the change in the average predicted
value ($\hat{y}$) as a function of a specified feature. As you will see
in later chapters, PDPs become more useful when non-linear relationships
are present. However, PDPs of linear models help illustrate how a fixed
change in $x_i$ relates to a fixed linear change in $\hat{y_i}$ while
taking into account the average effect of all the other features in the
model. For linear regression models, the slope of the PDP is equal to
the corresponding coefficient of the model.
Considering Figure \@ref(fig:pdp), to compute the PDP of the target
variable with respect to `Gr_Liv_Area`, we modify the values of this
variable for all rows in the data set, setting it to each value [j1,
j2, ..., j20] of the grid in turn. We feed each modified dataset to the
pre-trained model and compute the mean of the predicted target variable
(the mean taken over all rows). At the end of the algorithm we thus have
20 mean responses of the target variable for 20 different values of
`Gr_Liv_Area`. Finally, we plot the mean response of the target variable
against the varied variable `Gr_Liv_Area` to get the partial dependence
plot.
```{r pdp, echo = FALSE, fig.cap = "Visualisation of a partial dependence plot (from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/images/pdp-illustration.png))."}
knitr::include_graphics("./figures/pdp-illustration.png")
```
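Before turning to the packaged implementation, a minimal manual sketch of this algorithm for `linmod3` and `SW_IN_F` (with an arbitrarily chosen grid size of 20) could look like this:

```{r eval=FALSE}
## grid of 20 evenly spaced values across the range of SW_IN_F
grid_sw <- seq(min(ddf$SW_IN_F), max(ddf$SW_IN_F), length.out = 20)
df_base <- ddf %>% dplyr::select(-TIMESTAMP)

## for each grid value: set SW_IN_F to that constant for all rows, predict,
## and average the predictions over all rows
yhat_mean <- sapply(grid_sw, function(x) {
  df_mod <- df_base
  df_mod$SW_IN_F <- x
  mean(predict(linmod3, newdata = df_mod))
})

## the manual partial dependence curve
ggplot(tibble(SW_IN_F = grid_sw, yhat = yhat_mean), aes(SW_IN_F, yhat)) +
  geom_line() +
  labs(y = "Mean predicted GPP_NT_VUT_REF")
```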
Luckily for us, this algorithm has already been implemented in the
**pdp** package in R [@R-pdp]. We use the `partial()` function to plot
the partial dependence for a single predictor variable, `SW_IN_F` or
`VPD_F`, or for a subset of both of these variables. Since our model is
linear, we expect the PDP curves to be linear as well. The first curve
shows a positive linear relationship between `SW_IN_F` and the predicted
target (`GPP_NT_VUT_REF`), i.e., as `SW_IN_F` increases, the prediction
increases. The second curve shows a negative linear dependence between
`VPD_F` and `GPP_NT_VUT_REF`. The final plot shows a two-dimensional
partial dependence plot of `SW_IN_F` and `VPD_F`, and this follows a
linear relationship too. As expected, we get the highest value of the
predicted target variable in the lower right corner and the smallest
value in the upper left corner.
```{r}
library(pdp)
partial(linmod3, "SW_IN_F", plot = TRUE)
partial(linmod3, "VPD_F", plot = TRUE)
partial(linmod3, c("SW_IN_F", "VPD_F"), plot = TRUE)
```
The `pdp` package works with various model classes. For classes that are
not available, a prediction wrapper function has to be specified as
described for example
[here](https://bradleyboehmke.github.io/HOML/iml.html#implementation-6)
[@boehmke_hands-machine_2019]. If you want to learn more about PDPs,
have a look at the [online
book](https://christophm.github.io/interpretable-ml-book/) by
@molnar_interpretable_2019.
Note: The computational demand of partial dependence analysis rapidly
increases as the number of search grid points (`j` in the algorithm
description above) rises.
*For the interested: if you do not want to average the predictions
$\hat{y}$ over all rows for each value of $x_i$, look into individual
conditional expectation (ICE) curves, which, rather than averaging the
predicted values across all observations, plot the individual
observation-level prediction curves.*
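A sketch using the **pdp** package (its `partial()` function accepts an `ice` argument, see `?partial`; note that plotting one curve per observation can be slow on large data sets):

```{r eval=FALSE}
## individual conditional expectation (ICE) curves for SW_IN_F
partial(linmod3, "SW_IN_F", ice = TRUE, plot = TRUE)
```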
## Exercise
### Setup
The following exercise builds on the exercise of the previous tutorial.
So let's load the work done there first:
```{r}
## Load data and extract relevant subset
library(tidyverse)
ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") %>%
  select(starts_with("TIMESTAMP"),
         ends_with("_F"),
         GPP_NT_VUT_REF,
         NEE_VUT_REF_QC,
         -contains("JSB")) %>%
  mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) %>%
  na_if(-9999) %>%
  select(-ends_with("_QC"), NEE_VUT_REF_QC) %>%
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) %>%
  select(-NEE_VUT_REF_QC) %>%
  drop_na()
## Split data into testing and training data