# Let's boost the models {#modeling}
Besides the XGBoost model, a random forest model will also be fitted here for comparison. The whole modeling approach and procedure closely follows the principles outlined in the **great** book *Tidy Modeling with R* by Max Kuhn and Julia Silge.
For the modeling, three additional packages are required.[@xgboost_package; @ranger_package; @vipPack]
```{r loadModelPackages, message=FALSE, warning=FALSE}
library(xgboost) # for the xgboost model
library(ranger) # for the random forest model
# + the vip package but it will not be loaded into
# the namespace
```
For parallel computations the `doParallel` package is used.[@doParallel_package]
This is most useful for the tuning part.
```{r registerCLuster, eval=TRUE, warning=FALSE, message=FALSE}
library(doParallel)
# Create a cluster object and then register:
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
```
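When all computations are done, the cluster should be stopped again. A sketch, not evaluated here since the workers are still needed for the tuning below:
```{r stopClusterSketch, eval=FALSE}
# run this once all tuning below has finished: stop the workers
# and re-register the sequential backend
stopCluster(cl)
registerDoSEQ()
```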
First, one has to set a **metric** for the evaluation of the final performance. As the EDA showed that not a lot of outliers are present in the data, one can keep the default optimization metric of XGBoost for regression, the root mean squared error (RMSE), which basically corresponds to the $L_2$ loss. Besides that, the mean absolute error (MAE), which corresponds to the $L_1$ loss, will also be reported, but the optimization and tuning will focus on the RMSE.
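For reference, for predictions $\hat{y}_i$ of the true values $y_i$, $i = 1, \dots, n$, the two metrics are given by:
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.
$$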
## Burnout data
### Baseline models
The first thing is to set up the trivial **baseline models**.
```{r}
# the trivial intercept only model:
bout_predict_trivial_mean <- function(new_data) {
rep(mean(burnout_train$burn_rate), nrow(new_data))
}
# the trivial scoring of mental fatigue score (if missing intercept model)
bout_predict_trivial_mfs <- function(new_data) {
# these two scoring parameters correspond to the
# simple linear regression model containing only one
# predictor, i.e. the mental fatigue score (mfs)
pred <- new_data[["mental_fatigue_score"]] * 0.097 - 0.1
pred[is.na(pred)] <- mean(burnout_train$burn_rate)
pred
}
```
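As a quick sanity check, both functions return exactly one prediction per row of the supplied data; a sketch, not evaluated here, using the test split from the previous chapter:
```{r baselineSanityCheck, eval=FALSE}
head(bout_predict_trivial_mean(burnout_test))
head(bout_predict_trivial_mfs(burnout_test))
```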
The predictions of these baseline models on the test data set will be compared with the predictions of the tree-based models that will be constructed.
### Model specification
```{r modelSpecburn}
# Models:
# Random forest model for comparison
bout_rf_model <- rand_forest(trees = tune(),
mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("regression")
# XGBoost model
bout_boost_model <- boost_tree(trees = tune(),
learn_rate = tune(),
loss_reduction = tune(),
tree_depth = tune(),
mtry = tune(),
sample_size = tune(),
stop_iter = 10) %>%
set_engine("xgboost") %>%
set_mode("regression")
```
```{r workflowSpecburn}
# Workflows: model + recipe
# Random Forest workflow
bout_rf_wflow <-
workflow() %>%
add_model(bout_rf_model) %>%
add_recipe(burnout_rec_rf)
# XGBoost workflow with the untransformed target
bout_boost_wflow <-
workflow() %>%
add_model(bout_boost_model) %>%
add_recipe(burnout_rec_boost)
# XGBoost workflow with the transformed target (empirical logit)
bout_boost_wflow_trans <-
workflow() %>%
add_model(bout_boost_model) %>%
add_recipe(burnout_rec_boost_trans)
```
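For clarity: judging from the back-transformation applied to the predictions later on, the empirical logit transformation of the target in `burnout_rec_boost_trans` (defined in the preprocessing chapter) and its inverse should be
$$
y^\ast = \log\left(\frac{y + 0.5}{1.5 - y}\right), \qquad y = \frac{2}{1 + e^{-y^\ast}} - \frac{1}{2},
$$
i.e. the usual empirical logit with offset $0.5$, which stays finite at the boundary values $0$ and $1$.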
### Tuning
For the hyperparameter tuning one needs validation sets to monitor the models on unseen data. To this end, 5-fold cross-validation (CV) is used here.
```{r resamplingObjects}
# Create Resampling object
set.seed(2)
burnout_folds <- vfold_cv(burnout_train, v = 5)
```
Now adjust or check the hyperparameter ranges that the tuning will use, starting with the random forest model.
```{r}
# Have a look at the hyperparameters that have to be tuned and finalize them
bout_rf_params <- bout_rf_wflow %>%
parameters()
bout_rf_params
```
This shows that the `mtry` hyperparameter has to be finalized depending on the data. Moreover, with the `dials::update()` function one can manually set the ranges that should be used for tuning.
```{r}
# default range for tuning is 1 up to 2000 for the trees argument
# set the lower bound of the range to 100. Then finalize the
# parameters using the training data.
bout_rf_params <- bout_rf_params %>%
update(trees = trees(c(100, 2000))) %>%
finalize(burnout_train)
bout_rf_params
bout_rf_params %>% pull_dials_object("mtry")
bout_rf_params %>% pull_dials_object("trees")
```
Now this parameter object is ready for tuning and the same steps have to be performed on the main boosting workflow. The parameter set for the untransformed target can then also be used for the transformed one.
```{r}
bout_boost_params <- bout_boost_wflow %>%
parameters()
bout_boost_params
```
```{r}
# first a look at the default ranges
trees()
tree_depth()
learn_rate()
loss_reduction()
sample_size()
```
So `sample_size` must also be finalized. Again, the lower bound on the number of trees is raised to 100. The other default ranges are sensible and will be left as is.
```{r}
bout_boost_params <- bout_boost_params %>%
update(trees = trees(c(100, 2000))) %>%
finalize(burnout_train)
bout_boost_params
bout_boost_params %>%
pull_dials_object("sample_size")
bout_boost_params %>%
pull_dials_object("mtry")
# use the same object for the tuning of the boosted model
# with the transformed target
bout_boost_params_trans <- bout_boost_params
```
Now also these parameter objects are ready to be used. The next step is to define the metrics used for evaluation while tuning.
```{r}
# define a metrics set used for evaluation of the hyperparameters
regr_metrics <- metric_set(rmse, mae)
```
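Such a metric set can be applied like a single `yardstick` metric and returns one row per metric in a tidy format; a minimal sketch with hypothetical toy values:
```{r metricSetSketch, eval=FALSE}
tibble(truth = c(0.2, 0.5, 0.9),
       estimate = c(0.25, 0.40, 0.85)) %>%
  regr_metrics(truth = truth, estimate = estimate)
```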
Now the actual tuning begins. The models will be tuned by a space filling grid search. Given the defined ranges for each parameter, one can use different algorithms to construct a grid of parameter combinations that tries to cover the defined ranges well by random sampling, even in a high dimensional setting. In the following, the function `dials::grid_latin_hypercube` is used for this. Each combination in the grid is then fit on each fold, and the performance metrics are computed on the corresponding hold-out fold. These metrics are aggregated across folds for each combination, and with the resulting performance estimates one can choose the final set of hyperparameters. The more hyperparameters the model has, the more tuning rounds might be needed to refine the grid. Besides the classical grid search there are also iterative methods for tuning; these are mainly suited to tuning a single hyperparameter at a time rather than a whole bunch of them simultaneously, so they will not be used in the following.
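To make this concrete, here is a minimal sketch of such a space filling grid for just two of the hyperparameters above; each of the five rows is one candidate combination to be evaluated on every fold:
```{r latinHypercubeSketch, eval=FALSE}
set.seed(1)
# five points that try to cover the joint range of both parameters
grid_latin_hypercube(
  trees(c(100, 2000)),
  tree_depth(c(1, 10)),
  size = 5
)
```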
The **random forest model** goes first with the tuning.
As seen above, there are only two hyperparameters to be tuned, so 30 combinations should suffice.
```{r tuneRFburnout, cache=TRUE, eval=FALSE}
# took roughly 30 minutes
system.time({
set.seed(2)
bout_rf_tune <- bout_rf_wflow %>%
tune_grid(
resamples = burnout_folds,
grid = bout_rf_params %>%
grid_latin_hypercube(size = 30, original = FALSE),
metrics = regr_metrics
)
})
# visualization of the tuning results (snapshot of the output below)
autoplot(bout_rf_tune) + theme_light()
# this function shows the best combinations w.r.t. the rmse metric among
# all the combinations in the grid
show_best(bout_rf_tune, metric = "rmse")
```
```{r rfburntuneplot,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Result of a spacefilling grid search for the random forest model."}
knitr::include_graphics("_pictures/rf_burn_tune_plot.png")
```
The visualization alongside the best performing results suggests that a value of 3 for `mtry` and 1000 `trees` should give good results. Thus one can finalize and fit this model.
```{r}
final_bout_rf_wflow <-
bout_rf_wflow %>%
finalize_workflow(tibble(
trees = 1000,
mtry = 3
)) %>%
fit(burnout_train)
```
Now the main **boosting model**.
First, tune only the number of trees in order to detect a number of
trees that is 'large enough'. Then tune the tree-specific arguments.
If one tunes all parameters at the same time, the grid grows too large. Moreover, tuning the trees argument together with the others could encourage overfitting. See the book *Hands-on Machine Learning with R* for a detailed explanation of the tuning strategies.[@HandsOnMLwithR]
The first tuning round:
To display the discussed impact of different maximum tree depths, the first grid search for a good number of trees will be performed not only for the quasi-standard value of 6 but also for regression stumps, i.e. a maximum tree depth of 1.
```{r}
# tuning grid just for the #trees one with max tree depth 6 and one with
# only stumps for comparison
first_grid_boost_burn_depth6 <- crossing(
trees = seq(250, 2000, 250),
mtry = 9,
tree_depth = 6,
loss_reduction = 0.000001,
learn_rate = 0.01,
sample_size = 1
)
first_grid_boost_burn_stumps <- crossing(
trees = seq(250, 2000, 250),
mtry = 9,
tree_depth = 1,
loss_reduction = 0.000001,
learn_rate = 0.01,
sample_size = 1
)
```
```{r tuneBoostBurnfirststumps, eval=FALSE, cache=TRUE}
# took roughly 1 minute
system.time({
set.seed(2)
bout_boost_tune_first_stumps <- bout_boost_wflow %>%
tune_grid(
resamples = burnout_folds,
grid = first_grid_boost_burn_stumps,
metrics = regr_metrics
)
})
autoplot(bout_boost_tune_first_stumps) + theme_light()
show_best(bout_boost_tune_first_stumps)
```
```{r tuneBoostBurnfirst6, eval=FALSE, cache=TRUE}
# took roughly 1 minute
system.time({
set.seed(2)
bout_boost_tune_first_depth6 <- bout_boost_wflow %>%
tune_grid(
resamples = burnout_folds,
grid = first_grid_boost_burn_depth6,
metrics = regr_metrics
)
})
# plot output is shown in the figure below
autoplot(bout_boost_tune_first_depth6) + theme_light()
show_best(bout_boost_tune_first_depth6)
```
```{r boostburntuneplot1,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Result of the first grid search for the XGBoost model (max depth 6)."}
knitr::include_graphics("_pictures/burn_boost_tune_first.png")
```
Compare the two different first tuning rounds:
```{r, eval=FALSE}
# output of this code shown in the figure below
bout_boost_tune_first_stumps %>%
collect_metrics() %>%
mutate(tree_depth = 1) %>%
bind_rows(
bout_boost_tune_first_depth6 %>%
collect_metrics() %>%
mutate(tree_depth = 6)
) %>%
ggplot(aes(x = trees, y = mean, col = factor(tree_depth))) +
geom_point() +
geom_line() +
geom_linerange(aes(ymin = mean - std_err, ymax = mean + std_err)) +
labs(col = "Max tree depth", y = "", x = "Number of trees",
title = "") +
scale_color_viridis_d(option = "C") +
facet_wrap(~ .metric) +
theme_light()
```
```{r bouttuningstumpsVs6,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Comparison of the initial grid search for the number of trees w.r.t. the maximum tree depth, with tiny standard error bands around the estimates."}
knitr::include_graphics("_pictures/bout_tuning_stumpsVs6.png")
```
This shows that the model with just stumps converges much more slowly and may even never reach the low error level achieved with deeper trees. One reason for this is that stumps do not account for interactions, and of course the same number of deeper trees encodes much more information than the same number of stumps.
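More precisely, a boosted ensemble of stumps is purely additive in the individual features, since every stump splits on exactly one predictor:
$$
f(x) = \sum_{k=1}^{K} t_k(x) = \sum_{l=1}^{p} f_l(x_l),
$$
so no interaction effects can be represented at all, while a maximum tree depth of $d$ in principle allows interactions up to order $d$.[@elements]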
So 1500 trees should suffice here. Thus the workflow and model will be adjusted accordingly.
```{r}
# fix the number of trees by redefining the boosted model with a
# fixed number of trees.
bout_boost_model <- boost_tree(trees = 1500,
learn_rate = tune(),
loss_reduction = tune(),
tree_depth = tune(),
mtry = tune(),
sample_size = tune(),
stop_iter = 10) %>%
set_engine("xgboost") %>%
set_mode("regression")
# update the workflow
bout_boost_wflow <-
bout_boost_wflow %>%
update_model(bout_boost_model)
bout_boost_params <- bout_boost_wflow %>%
parameters() %>%
finalize(burnout_train)
# reduced hyperparameter space
bout_boost_params
```
Now perform the first major grid search over all the other hyperparameters.
```{r tuneBoostBurnsecond, eval=FALSE, cache=TRUE}
# now tune all the remaining hyperparameters with a large space filling grid
# took roughly 1.5 hours
system.time({
set.seed(2)
bout_boost_tune_second <- bout_boost_wflow %>%
tune_grid(
resamples = burnout_folds,
grid = bout_boost_params %>%
grid_latin_hypercube(
size = 200),
metrics = regr_metrics
)
})
show_best(bout_boost_tune_second, metric = "rmse")
```
Now refine the grid, i.e. the parameter space, according to the results of the last grid search and perform a third one.
```{r tuneBoostBurnthird, eval=FALSE, cache=TRUE}
# now tune all the remaining hyperparameters with a refined space filling grid
# took roughly 2 hours
system.time({
set.seed(2)
bout_boost_tune_third <- bout_boost_wflow %>%
tune_grid(
resamples = burnout_folds,
grid = bout_boost_params %>%
update(
mtry = mtry(c(5,9)),
tree_depth = tree_depth(c(4,5)),
learn_rate = learn_rate(c(-1.7, -1.3)),
loss_reduction = loss_reduction(c(-8,-3)),
sample_size = sample_prop(c(0.4, 0.9))
) %>%
grid_latin_hypercube(
size = 200),
metrics = regr_metrics
)
})
show_best(bout_boost_tune_third, metric = "rmse")
```
With this final grid search one is ready to finalize the model.
The results of the three grid searches suggest that no column subsampling should be applied (`mtry = 9`), the maximum tree depth should be 4, the learning rate should be small but not extremely small ($\sim 0.02$), the loss reduction regularization is not needed here (very small value) and the sample size for each tree should be set to 0.8.
```{r}
final_bout_boost_wflow <-
bout_boost_wflow %>%
finalize_workflow(tibble(
mtry = 9,
tree_depth = 4,
learn_rate = 0.02,
loss_reduction = 0.0000003,
sample_size = 0.8
)) %>%
fit(burnout_train)
```
Now tune the **transformed outcome boosting model**.
Again start with the `trees`.
```{r}
first_grid_boost_burn_trans <- crossing(
trees = seq(250, 2000, 250),
mtry = 9,
tree_depth = 6,
loss_reduction = 0.000001,
learn_rate = 0.01,
sample_size = 1
)
```
```{r tuneBoostBurnfirsttrans, eval=FALSE, cache=TRUE}
# took roughly 1 minute
system.time({
set.seed(2)
bout_boost_tune_first_trans <- bout_boost_wflow_trans %>%
tune_grid(
resamples = burnout_folds,
grid = first_grid_boost_burn_trans,
metrics = regr_metrics
)
})
# plot output in the figure below
autoplot(bout_boost_tune_first_trans) + theme_light()
show_best(bout_boost_tune_first_trans)
```
```{r burnboosttunefirsttrans,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Result of the first grid search for the XGBoost model with transformed outcome."}
knitr::include_graphics("_pictures/burn_boost_tune_first_trans.png")
```
Again 1500 trees should easily suffice.
```{r}
# fix the number of trees by redefining the boosted model with a
# fixed number of trees.
bout_boost_model_trans <- boost_tree(trees = 1500,
learn_rate = tune(),
loss_reduction = tune(),
tree_depth = tune(),
mtry = tune(),
sample_size = tune(),
stop_iter = 10) %>%
set_engine("xgboost") %>%
set_mode("regression")
# update the workflow
bout_boost_wflow_trans <-
bout_boost_wflow_trans %>%
update_model(bout_boost_model_trans)
bout_boost_params_trans <- bout_boost_wflow_trans %>%
parameters() %>%
finalize(burnout_train)
# reduced hyperparameter space
bout_boost_params_trans
```
Now perform the major grid search over all the other hyperparameters (the second search overall).
```{r tuneBoostBurnsecondtrans, eval=FALSE, cache=TRUE}
# now tune all the remaining hyperparameters with a large space filling grid
# took roughly 2 hours
system.time({
set.seed(2)
bout_boost_tune_second_trans <- bout_boost_wflow_trans %>%
tune_grid(
resamples = burnout_folds,
grid = bout_boost_params_trans %>%
grid_latin_hypercube(
size = 200),
metrics = regr_metrics
)
})
show_best(bout_boost_tune_second_trans, metric = "rmse")
```
From the tuning results one can conclude that in this setting the same hyperparameter values as for the raw target variable seem appropriate. So one finalizes the workflow with the same hyperparameters and fits the model.
```{r}
final_bout_boost_wflow_trans <-
bout_boost_wflow_trans %>%
finalize_workflow(tibble(
mtry = 9,
tree_depth = 4,
learn_rate = 0.02,
loss_reduction = 0.0000003,
sample_size = 0.8
)) %>%
fit(burnout_train)
```
Now all models w.r.t. the burnout data set are fitted.
### Evaluate and understand the model
First a visualization of the variable importance. The variable importance is basically calculated by measuring the gain w.r.t. the loss for each feature across the individual trees and splits.[@elements]
**For a single tree**
The feature importance for the tree $t$ and the feature $X_l$ is given by:
$$
\mathcal{I}_l^2(t) = \sum_{j \in \{\text{internal nodes}\}} \hat{i}_j^2 \; I(\text{splitting variable} = X_l)
$$
where $\hat{i}_j^2$ is the estimated improvement in squared error risk achieved by the split at node $j$, compared to not splitting at this node.
**For the full ensemble**
$$
\mathcal{I}_l^2 = \frac{1}{K} \sum_{k = 1}^K \mathcal{I}_l^2(t_k)
$$
```{r, echo=TRUE, message=FALSE}
vip_burn <- final_bout_boost_wflow %>%
pull_workflow_fit() %>%
vip::vip() +
theme_light() +
labs(title = "Variable importance of the XGBoost model")
vip_burn
ggsave("_pictures/vip_burn.png",plot = vip_burn)
remove(vip_burn)
```
This shows that indeed the `mental_fatigue_score` is by far the most influential predictor, followed by the `resource_allocation` and `designation` features. These results are not at all surprising, as the EDA came to exactly these conclusions. In particular, the very few appearances of the features connected to `company_type` and `date_of_joining` are most likely just some minor overfitting.
Now compare the variable importance of the model with the raw and the transformed target variable.
```{r, echo=FALSE}
raw_importance <- final_bout_boost_wflow %>%
pull_workflow_fit() %>%
vip::vip() +
theme_light() +
labs(title = "Raw target")
trans_importance <- final_bout_boost_wflow_trans %>%
pull_workflow_fit() %>%
vip::vip() +
theme_light() +
labs(title = "Transformed target")
raw_importance + trans_importance + plot_layout(guides = "collect")
remove(raw_importance, trans_importance)
```
One can observe the exact same ordering. The only difference seems to be a higher influence of the `designation` variable for the transformed target. This is not surprising, as the resource allocation and designation variables are highly correlated.
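This claim can be verified directly on the training data; a sketch (not evaluated here), assuming the snake case column names from the EDA and using complete cases because of the missing values:
```{r corSketch, eval=FALSE}
cor(burnout_train$resource_allocation, burnout_train$designation,
    use = "complete.obs")
```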
Now have a look at the performance of the main boosting model w.r.t. missing values in the most influential variable.
```{r, echo=TRUE, message=FALSE}
miss_vals_burn_no <- burnout_test %>%
mutate(boost_pred = predict(final_bout_boost_wflow,
new_data = .)[[".pred"]]) %>%
filter(!is.na(mental_fatigue_score)) %>%
ggplot(aes(x = burn_rate, y = boost_pred)) +
geom_point(col = plasma(1), alpha = 0.1, na.rm = TRUE) +
geom_abline(slope = 1, intercept = 0, size = 1) +
xlim(0,1) + ylim(0,1) +
theme_light() +
labs(y = "XGBoost predictions",
x = "True score",
title = "XGBoost performance for non missing mental fatigue score") +
coord_equal() +
theme(plot.title = ggtext::element_markdown(size = 11))
miss_vals_burn_no
miss_vals_burn_yes <- burnout_test %>%
mutate(boost_pred = predict(final_bout_boost_wflow,
new_data = .)[[".pred"]]) %>%
filter(is.na(mental_fatigue_score)) %>%
ggplot(aes(x = burn_rate, y = boost_pred)) +
geom_point(col = plasma(1)) +
geom_abline(slope = 1, intercept = 0, size = 1) +
xlim(0,1) + ylim(0,1) +
theme_light() +
labs(y = "XGBoost predictions",
x = "True score",
title = "XGBoost performance for missing mental fatigue score") +
coord_equal() +
theme(plot.title = ggtext::element_markdown(size = 11))
miss_vals_burn_yes
```
```{r, echo=FALSE, include=FALSE}
ggsave("_pictures/miss_vals_burn1.png", plot = miss_vals_burn_no)
ggsave("_pictures/miss_vals_burn2.png", plot = miss_vals_burn_yes)
remove(miss_vals_burn_no, miss_vals_burn_yes)
```
The performance for the observations with a missing mental fatigue score is indeed less precise but still quite good, and at least no huge outlier is detectable here. While at first glance this may not seem astonishing, the natural handling of messy data, including missing and sparse values, really is one of the core strengths of the model. So there is no need for imputation or for a second fallback model for missing values.
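This native handling of missing values is a property of the underlying `xgboost` library itself: at every split a default direction for missing entries is learned. A minimal stand-alone sketch with hypothetical toy data:
```{r xgbMissingSketch, eval=FALSE}
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)   # toy feature matrix
x[sample(length(x), 20)] <- NA      # inject missing values
y <- rnorm(100)                     # toy target
# fits and predicts without any imputation: missing entries simply
# follow the learned default direction at each split
fit <- xgboost(data = x, label = y, nrounds = 10,
               objective = "reg:squarederror", verbose = 0)
head(predict(fit, x))
```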
Now it is interesting to check which model performed the best on the test data set. The results can be viewed below in tabular and graphical form.
```{r perfBurn, echo=TRUE}
# calculate the mae and the rmse for all models (random forest and xgboost)
bout_test_perf <- burnout_test %>%
mutate(rf_pred = predict(final_bout_rf_wflow,
new_data = .)[[".pred"]],
boost_pred = predict(final_bout_boost_wflow,
new_data = .)[[".pred"]],
boost_trans_pred = predict(final_bout_boost_wflow_trans,
new_data = .)[[".pred"]],
# transform the predictions back to [0, 1] (inverse empirical logit)
boost_trans_pred = (2 / (exp(-boost_trans_pred) + 1)) - 0.5,
intercept_pred = bout_predict_trivial_mean(.),
mfs_scored_pred = bout_predict_trivial_mfs(.)) %>%
mutate(across(ends_with("pred"), function(col) {
case_when(
col < 0 ~ 0,
col > 1 ~ 1,
TRUE ~ col
)
})) %>%
select(burn_rate, rf_pred, boost_pred, boost_trans_pred,
intercept_pred, mfs_scored_pred) %>%
pivot_longer(-burn_rate, names_to = "model") %>%
group_by(model) %>%
summarise(
mae = mae_vec(
truth = burn_rate,
estimate = value
),
rmse = rmse_vec(
truth = burn_rate,
estimate = value
)
) %>%
arrange(rmse)
knitr::kable(bout_test_perf, digits = 4,
booktabs = TRUE,
caption = "Performance on the test data")
```
```{r, echo=FALSE, message=FALSE}
final_perf_burn <- bout_test_perf %>%
pivot_longer(-model) %>%
mutate(model = as.factor(model),
model = fct_recode(model,
"XGBoost" = "boost_pred",
"XGBoost transformed" = "boost_trans_pred",
"Random Forest" = "rf_pred",
"Baseline scaled" = "mfs_scored_pred",
"Intercept only" = "intercept_pred"),
model = fct_reorder(model,-value)) %>%
ggplot(aes(x = value, y = model, col = name)) +
geom_point(shape = 16, size = 3) +
geom_segment(aes(y = model, yend = model,
x = 0, xend = value),
size = 1) +
scale_color_viridis_d(option = "C") +
labs(x = "",y = "", title = "Performance on the test data") +
theme_light() +
theme(legend.position = "None") +
facet_wrap(~name)
final_perf_burn
ggsave("_pictures/final_perf_burn.png",plot = final_perf_burn)
remove(final_perf_burn)
```
It can be observed that even the really simple baseline model that just scales the mental fatigue score (so it reflects a single linear influence) already has a really low MAE and RMSE. Both the random forest and the boosting model can still improve on these metrics, but the huge linear influence of the mental fatigue score obviously leaves not much room for improvement. Incidentally, the simple linear model with just this one predictor already has an $R^2$ of more than 0.92. XGBoost manages to achieve a slightly better performance on the test data set, but this would only matter in a machine learning competition; in real life this difference would be negligible. Actually, for a real life application of this use case, a very easily explainable linear model might even be preferable to a complex model like XGBoost, as the difference in performance is not too big.

Nevertheless, in my opinion the predictor `mental_fatigue_score` should be treated with extreme care, as in a real life situation the collection of this score could be as costly or hard as the collection of the outcome itself. There might even be latent variables that are highly linearly correlated with both scores. But this data was not intended to be used in a real life application; it was shared for a machine learning competition, and many of the best submissions there actually used XGBoost models.

The transformation of the outcome variable did not change the predictive power of the model, as the models with the raw and the transformed target performed equally well. Below one can see a scatterplot comparing the individual predictions of these two models on the test set, which perfectly underlines the hypothesis that the models are basically the same, as there is almost no variation around the identity slope.
```{r, echo=FALSE, message=FALSE}
bout_raw_vs_trans <- burnout_test %>%
mutate(boost_pred = predict(final_bout_boost_wflow,
new_data = .)[[".pred"]],
boost_trans_pred = predict(final_bout_boost_wflow_trans,
new_data = .)[[".pred"]],
boost_trans_pred = (2 / (exp(-boost_trans_pred) + 1)) - 0.5) %>%
mutate(across(ends_with("pred"), function(col) {
case_when(
col < 0 ~ 0,
col > 1 ~ 1,
TRUE ~ col
)
})) %>%
select(boost_pred, boost_trans_pred) %>%
ggplot(aes(x = boost_pred, y = boost_trans_pred)) +
geom_point(col = plasma(1), alpha = 0.1) +
geom_abline() +
xlim(0,1) + ylim(0,1) +
coord_equal() +
labs(x = "Raw target model",
y = "Transformed target model",
title = "Predictions on the test data set") +
theme_light()
bout_raw_vs_trans
ggsave("_pictures/bout_raw_vs_trans.png",plot = bout_raw_vs_trans)
remove(bout_raw_vs_trans)
```
So, all in all, for this data set the core strength of the tree-based gradient boosting model was not a vastly better predictive performance (unless one is in an artificial machine learning setup), but rather the minimal pre-processing and exploratory work that was needed (for example, no interactions had to be detected manually or tested for significance) and the natural handling of missing values. This of course came at a quite high computational price. The famous predictive performance could probably be displayed better on a data set with more complex non-linear patterns.
Below one can find a comparison between the models presented here and the models that other participants of the seminar proposed. The respective model formulas and model types will not be discussed in detail at this point.
```{r, warning=FALSE, message=FALSE}
######################################################
# Visualize the results of the other seminar
# participants in order to compare the predictive results
# Burnout data set
######################################################
# omit missing values here in order to have the exact same test data set
bout_test_perf_on_nonmissing <- burnout_test %>%
drop_na() %>%
mutate(rf_pred = predict(final_bout_rf_wflow,
new_data = .)[[".pred"]],
boost_pred = predict(final_bout_boost_wflow,
new_data = .)[[".pred"]],
boost_trans_pred = predict(final_bout_boost_wflow_trans,
new_data = .)[[".pred"]],
# transform the predictions back
boost_trans_pred = (2 / (exp(-boost_trans_pred) + 1)) - 0.5,
intercept_pred = bout_predict_trivial_mean(.),
mfs_scored_pred = bout_predict_trivial_mfs(.)) %>%
mutate(across(ends_with("pred"), function(col) {
case_when(
col < 0 ~ 0,
col > 1 ~ 1,
TRUE ~ col
)
})) %>%
select(burn_rate, rf_pred, boost_pred, boost_trans_pred,
intercept_pred, mfs_scored_pred) %>%
pivot_longer(-burn_rate, names_to = "model") %>%
group_by(model) %>%
summarise(
mae = mae_vec(
truth = burn_rate,
estimate = value
),
rmse = rmse_vec(
truth = burn_rate,
estimate = value
)
) %>%
arrange(rmse)
bout_test_perf_on_nonmissing %>%
select(-mae) %>%
# here the calculated rmse can be just appended as the same train
# test split was used, the models below were proposed by Ferdinand
# Buchner
bind_rows(
tribble(~model, ~rmse,
"Simple beta regression", 0.07017522,
"Beta regression", 0.06394935,
"Simple 0-1-inflated Beta regression",0.06475547,
"0-1-inflated Beta regression", 0.05739319,
"Reduced 0-1-inflated Beta regression", 0.05763967)
) %>%
mutate(model = as.factor(model),
model = fct_recode(model,
"XGBoost" = "boost_pred",
"XGBoost transformed" = "boost_trans_pred",
"Random forest" = "rf_pred",
"Simple linear regression" =
"mfs_scored_pred",
"Intercept only" = "intercept_pred"),
model = fct_reorder(model, -rmse)) %>%
filter(model != "Intercept only") %>%
arrange(rmse) %>%
ggplot(aes(x = rmse, y = model)) +
geom_point(shape = 16, size = 3, col = plasma(1)) +
geom_segment(aes(y = model, yend = model,
x = min(rmse), xend = rmse),
size = 1, col = plasma(1)) +
scale_color_viridis_d(option = "C") +
labs(x = "RMSE",y = "", title = "Performance on the test data",
subtitle = "Missing values omitted here") +
theme_light() +
theme(legend.position = "None")
```
This finishes the analysis of the burnout data set. But no worries, there is still one data set and boosting model left to explore, namely the insurance data set.
## Insurance data
### Baseline models
The first thing is to set up the trivial **baseline models**.
```{r}
# the trivial intercept only model:
ins_predict_trivial_mean <- function(new_data) {
rep(mean(ins_train$log10_charges), nrow(new_data))
}
# the trivial linear model without any interactions (as there are
# no missing values here, nothing has to be imputed)
ins_baseline_lm <- lm(log10_charges ~ age + sex + bmi +
children + smoker + region,
data = ins_train %>%
mutate(across(where(is.character), as.factor)))
summary(ins_baseline_lm)
```
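Predictions of this linear baseline on new data can later be obtained in the usual way; a sketch (not evaluated here), assuming the test split is named `ins_test` analogously to `ins_train`:
```{r insBaselineSketch, eval=FALSE}
# one prediction per row, on the log10 scale of the charges
head(predict(ins_baseline_lm,
             newdata = ins_test %>%
               mutate(across(where(is.character), as.factor))))
```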
The predictions of these baseline models on the test data set will be compared with the predictions of the tree-based models that will be constructed.
### Model specification
```{r modelSpecIns}
# Models:
# Random forest model for comparison
ins_rf_model <- rand_forest(trees = tune(),
mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("regression")
# XGBoost model
ins_boost_model <- boost_tree(trees = tune(),
learn_rate = tune(),
loss_reduction = tune(),
tree_depth = tune(),
mtry = tune(),
sample_size = tune(),
stop_iter = 10) %>%
set_engine("xgboost") %>%
set_mode("regression")
```
```{r workflowSpecIns}
# Workflows: model + recipe
ins_rf_wflow <-
workflow() %>%
add_model(ins_rf_model) %>%
add_recipe(ins_rec_rf)
ins_boost_wflow <-
workflow() %>%
add_model(ins_boost_model) %>%
add_recipe(ins_rec_boost)
```
### Tuning
For the hyperparameter tuning one needs validation sets to monitor the models on unseen data. To this end, 5-fold cross-validation (CV) is used here.
```{r resamplingObjectsIns}
# Create Resampling object
set.seed(2)
ins_folds <- vfold_cv(ins_train, v = 5)
```
Now the parameter objects will be fixed for the grid searches analogously to above.
```{r}
ins_rf_params <- ins_rf_wflow %>%
parameters() %>%
update(trees = trees(c(100, 2000))) %>%
finalize(ins_train)
ins_rf_params %>% pull_dials_object("trees")
ins_rf_params %>% pull_dials_object("mtry")
```
```{r}
ins_boost_params <- ins_boost_wflow %>%
parameters() %>%
update(trees = trees(c(100, 2000))) %>%
finalize(ins_train)
ins_boost_params
```
The **random forest model** goes first with the tuning.
As seen above, there are only two hyperparameters to be tuned, so 30 combinations should suffice.
```{r tuneRFins, cache=TRUE, eval=FALSE}
# took roughly 2 minutes
system.time({
set.seed(2)
ins_rf_tune <- ins_rf_wflow %>%
tune_grid(
resamples = ins_folds,
grid = ins_rf_params %>%
grid_latin_hypercube(size = 30, original = FALSE),
metrics = regr_metrics
)
})
# visualization of the tuning results (snapshot of the output below)
autoplot(ins_rf_tune) + theme_light()
# this function shows the best combinations w.r.t. the rmse metric among
# all the combinations in the grid
show_best(ins_rf_tune, metric = "rmse")
```
```{r rfInstuneplot,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Result of a spacefilling grid search for the random forest model."}
knitr::include_graphics("_pictures/rf_ins_tune_plot.png")
```
The visualization alongside the best performing results suggests that a value of 4 for `mtry` and 1000 `trees` should give good results. Thus one can finalize and fit this model.
```{r}
final_ins_rf_wflow <-
ins_rf_wflow %>%
finalize_workflow(tibble(
trees = 1000,
mtry = 4
)) %>%
fit(ins_train)
```
Now the **boosting model**.
Again one starts by tuning over the number of trees first (this time three different tree depths are used to visualize the impact of this parameter as well).
```{r}
first_grid_boost_ins <- crossing(
trees = seq(250, 2000, 250),
mtry = 8,
tree_depth = c(1, 4, 6),
loss_reduction = 0.000001,
learn_rate = 0.01,
sample_size = 1
)
```
```{r tuneBoostInsfirst, eval=FALSE, cache=TRUE}
# took roughly half a minute
system.time({
set.seed(2)
ins_boost_tune_first <- ins_boost_wflow %>%
tune_grid(
resamples = ins_folds,
grid = first_grid_boost_ins,
metrics = regr_metrics
)
})
show_best(ins_boost_tune_first)
```
```{r, eval=FALSE}
# visualize the tuning results (output in the figure below)
ins_boost_tune_first %>%
collect_metrics() %>%
ggplot(aes(x = trees, y = mean, col = factor(tree_depth))) +
geom_point() +
geom_line() +
geom_linerange(aes(ymin = mean - std_err, ymax = mean + std_err)) +
labs(col = "Max tree depth", y = "", x = "Number of trees",
title = "") +
scale_color_viridis_d(option = "C") +
facet_wrap(~ .metric) +
theme_light()
```
```{r boostInstuneplot1,echo=FALSE,fig.height=4, out.width='70%', fig.align='center', fig.cap="Comparison of the initial grid search for the number of trees w.r.t. the maximum tree depth, with tiny standard error bands around the estimates."}