# Knowing Your Model {#sec-knowing}
<!--
TODO: Consider more consistency of examples - we use several, but maybe it's not too bad.
TODO: Maybe split 'knowing model', vs. 'knowing features'; the split appears to be roughly 60/40.
TODO: totally okay with a screenshot of Redd here to go with the quote.
TODO: cache the saving of data/models
-->
```{r}
#| label: setup-knowing
#| include: false
library(reticulate)
use_condaenv("book-of-models")
```
```{r}
#| label: setup-knowing-r-load-objects
#| include: false
load("data/models/knowing_models/model_train_reg.RData")
load("data/models/knowing_models/model_train_class.RData")
load("data/models/knowing_models/model_reviews_3feat.RData")
load("data/models/knowing_models/knowing_model_data.RData")
df_reviews = read_csv("data/movie_reviews_processed.csv")
```
```{python}
#| label: setup-knowing-py
#| include: false
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.smpickle import load_pickle
training_data = pd.read_csv("data/models/knowing_models/knowing_model_data_train.csv")
testing_data = pd.read_csv("data/models/knowing_models/knowing_model_data_test.csv")
model_train_reg = load_pickle("data/models/knowing_models/model_train_reg.pkl")
model_train_class = load_pickle("data/models/knowing_models/model_train_class.pkl")
model_reviews_3feat = load_pickle("data/models/knowing_models/model_reviews_3feat.pkl")
df_reviews = pd.read_csv("data/movie_reviews_processed.csv")
# misc helpful objects
features = [
"review_year_0",
"release_year_0",
"age_sc",
"length_minutes_sc",
"total_reviews_sc",
"word_count_sc",
"genre",
]
```
In addition to giving the world one of the greatest television show theme songs -- Quincy Jones' *The Streetbeater* -- *Sanford & Son* gave us an insightful quote for offering criticism: "You big dummy." While we don't advocate for swearing at or denigrating your model, how do you know if your model is performing up to your expectations? It is easy to look at your coefficients, *t*-values, and an adjusted $R^2$, and say, "Wow! Look at this great model!" Your friends will be envious of such terrific *p*-values, and all of the strangers that you see at social functions will be impressed. What happens if that model falls apart on new data, though? What if a stakeholder wants to know exactly how a prediction was made for a specific business decision? Sadly, all of the stars that you gleefully pointed towards in your console will not offer you any real answers.
Instead of falling in immediate love with your model, you should ask some hard questions of it. How does it perform on different slices of data? Do the predictions make sense? Is your classification cut-point appropriate? In other words, you should criticize your model before you decide it can be used for its intended purposes. Remember that it is **data modeling**, not **data truthing**. Put bluntly, you should always be prepared to call your model a "big dummy".
## Key Ideas {#sec-knowing-key}
- Metrics can help you assess how well your model is performing, and they can also help you compare different models.
- Different metrics can be used depending on the goals of your model.
- Visualizations can help you understand how your model is making predictions and which features are important.
- Feature importance is very difficult to ascertain even in the simplest of models, but there are tools to help you understand how much each feature contributes to a prediction.
### Why this matters {#sec-knowing-why}
It's never good enough to simply get model results. You need to know how well your model is performing and how it is making predictions. You also should be comparing your model to other alternatives. Doing so provides more confidence in your model and helps you to understand how it is working, and just as importantly, where it fails. This is actionable knowledge.
### Good to know {#sec-knowing-good}
This chapter draws on material covered in the chapters on linear models and machine learning, so we'd suggest having the linear model basics down pretty well.
## Understanding the Model {#sec-knowing-model-metrics}
A first step in understanding our model is to look at summary statistics of its performance, typically called **metrics**. Regression and classification have different metrics for assessing model performance. We want to give you a sample of some of the more common ones, while acknowledging that there are many more you can use, any of which might be useful. We would always recommend looking at a few different metrics for any given model to get a better sense of how it is performing.
@tbl-performance-metrics illustrates some of the most commonly used performance metrics. Just because these are popular or applicable to your situation doesn't mean they are the only ones you can or even should use. Nothing keeps you from using more than one metric for assessment, and in fact, it is often a good idea to do so. You should have a working knowledge of these.
<!-- TODO: check table for pdf -->
\newpage
<!--
this option appears to make the whole book do left-right margining.
\KOMAoptions{paper=landscape,pagesize}
\recalctypearea
-->
\scriptsize
\blandscape
```{r}
#| echo: false
#| label: tbl-performance-metrics
#| tbl-cap: Commonly used performance metrics in machine learning.
performance_metrics = tribble(
~Problem_Type, ~Metric, ~Description, ~`Other Names/Notes`,
"Regression", "RMSE", "Root mean squared error", 'MSE (before square root)',
"Regression", "MAE", "Mean absolute error", '',
"Regression", "MAPE", "Mean absolute percentage error", '',
"Regression", "RMSLE", "Root mean squared log error", '',
"Regression", "R-squared", "Amount of variance shared by predictions and target", 'Coefficient of determination',
"Regression", "Deviance/AIC", "Generalization of sum of squared error ", 'Also "deviance explained" for similar R-sq interpretation',
"Classification", "Accuracy", "Proportion correct", 'Error rate is 1 - Accuracy',
"Classification", "Precision", "Proportion of positive predictions that are correct", 'Positive Predictive Value',
"Classification", "Recall", "Proportion of positive samples that are predicted correctly", 'Sensitivity, True Positive Rate',
"Classification", "Specificity", "Proportion of negative samples that are predicted correctly", 'True Negative Rate',
'Classification', 'Negative Predictive Value', 'Proportion of negative predictions that are correct', '',
"Classification", "F1", "Harmonic mean of precision and recall", 'F-Beta',
"Classification", "AUC", "Area under the ROC curve", '',
"Classification", 'False Positive Rate', 'Proportion of negative samples that are predicted incorrectly', 'Type I Error, alpha',
"Classification", 'False Negative Rate', 'Proportion of positive samples that are predicted incorrectly', 'Type II Error, beta, Power is 1 - beta',
"Classification", 'Phi', 'Correlation between predicted and actual', "Matthews Correlation",
"Classification", "Log loss", "Negative log likelihood of the predicted probabilities", ''
)
performance_metrics %>%
group_by(Problem_Type) %>%
gt() %>%
tab_header(
title = "",
subtitle = ""
) |>
tab_footnote(
footnote = "Beta = 1 for F1",
locations = cells_body(columns = vars(`Other Names/Notes`), rows = Metric == 'F1')
) |>
# tab_footnote(
# footnote = "Not sure which fields refer to Matthews Correlation outside of CS and bioinformatics, since 'phi' had already been widely used for over 60 years before Matthews forgot to cite it in his paper, and phi was literally the first correlation coefficient devised by Pearson.[phi == mcc](https://en.wikipedia.org/wiki/Phi_coefficient#Machine_learning)",locations = cells_body(columns = vars(`Other Names/Notes`), rows = Metric == 'Phi')
# ) |>
tab_options(
footnotes.font.size = 10,
)
```
\elandscape
\newpage
<!-- \KOMAoptions{paper=portrait,pagesize}
\recalctypearea -->
\normalsize
### Regression Metrics {#sec-knowing-reg-metrics}
The primary goal of our endeavor is to come up with a predictive model. The closer our model predictions are to the observed target values, the better our model is performing. As we saw in the above table, when we have a numeric target there are quite a few metrics that help us understand prediction-target correspondence, so let's look at some of those.
But before we create a model to get us started, we are going to read in our data and then create two different splits within our data: a **training** set and a **test** set. In other words, we are going to **partition** our data so that we can train a model and then see how well that model performs with data it hasn't seen[^pyvr-splits]. For more on this process and the reasons why we do it see -@sec-ml-generalization and -@sec-ml-cv. For now, we just need to know that assessing prediction error on the test set will give us a better estimate of our metric of choice.
[^pyvr-splits]: For anyone comparing Python to R results, the data splits are not the same so outputs likewise will not be identical, though they should be very similar.
:::{.callout-note title='Splitting Data' collapse="true"}
This basic split is the foundation of **cross-validation**. Cross-validation is a method for partitioning data into training and non-training sets in a way that allows you to better understand the model's performance. You'll find more explicit demonstration of how to do this in the machine learning chapter @sec-ml-common-models.
:::
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-import-split
df_reviews = read.csv(
"data/movie_reviews_processed.csv"
)
initial_split = sample(
x = 1:nrow(df_reviews),
size = nrow(df_reviews) * .75,
replace = FALSE
)
training_data = df_reviews[initial_split, ]
testing_data = df_reviews[-initial_split, ]
```
##### Python
```{python}
#| label: python-regression-import-split
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df_reviews = pd.read_csv("data/movie_reviews_processed.csv")
training_data, testing_data = train_test_split(
df_reviews,
test_size = 0.25,
random_state = 123
)
```
:::
```{r}
#| label: r-knowing-save-model-data
#| echo: false
save(
training_data,
testing_data,
file = "data/models/knowing_models/knowing_model_data.RData"
)
```
```{python}
#| label: python-knowing-save-model-data
#| echo: false
training_data.to_csv("data/models/knowing_models/knowing_model_data_train.csv", index = False)
testing_data.to_csv("data/models/knowing_models/knowing_model_data_test.csv", index = False)
```
You'll notice that we created training data with 75% of our data and we will use the other 25% to test our model. With training data in hand, let's produce a model to predict review rating. We'll use the standardized (scaled `_sc`) versions of several features, and use the 'year' features starting at year 0, which is the earliest year in our data. Finally, we also include the genre of the movie as a categorical feature.
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-model-train
model_train_reg = lm(
rating ~
review_year_0
+ release_year_0
+ age_sc
+ length_minutes_sc
+ total_reviews_sc
+ word_count_sc
+ genre
,
training_data
)
```
##### Python
```{python}
#| label: python-regression-model-train
import statsmodels.api as sm
import statsmodels.formula.api as smf
# we'll use 'features' later also
features = [
"review_year_0",
"release_year_0",
"age_sc",
"length_minutes_sc",
"total_reviews_sc",
"word_count_sc",
"genre",
]
model = 'rating ~ ' + " + ".join(features)
model_train_reg = smf.ols(
formula = model,
data = training_data
).fit()
```
:::
```{r}
#| label: r-regression-model-save
#| echo: false
save(
model_train_reg,
file = "data/models/knowing_models/model_train_reg.RData"
)
```
```{python}
#| label: python-regression-model-save
#| echo: false
model_train_reg.save("data/models/knowing_models/model_train_reg.pkl")
```
Now that we have a model on our training data, we can use it to make predictions on our test data:
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-model-predict
predictions = predict(model_train_reg, newdata = testing_data)
```
##### Python
```{python}
#| label: python-regression-model-predict
predictions = model_train_reg.predict(testing_data)
```
:::
The goal now is to find out how closely our predictions match reality. Let's look at them first:
```{r}
#| label: fig-pred-vs-obs
#| echo: false
p_dat = tibble(
observed = testing_data$rating,
predicted = predictions
)
p1 = p_dat |>
ggplot(aes(observed, predicted)) +
geom_point(alpha = 0.36, size = 2, position = position_jitter(width = .1)) +
geom_abline(
intercept = 0,
slope = 1,
color = okabe_ito[2],
linewidth = 1
) +
lims(x = c(1, 5), y = c(1, 5)) +
labs(
x = "Observed",
y = "Predicted",
# title = "Observed vs. Predicted Ratings"
)
p2 = p_dat |>
select(observed, predicted) |>
pivot_longer(everything()) |>
ggplot(aes(x = value)) +
# geom_density(aes(color = name, fill = name), alpha = 0.15) +
stat_ecdf(geom = "borderline", aes(color = name, fill = name), linewidth = 1) +
scale_fill_manual(values = c(okabe_ito[1], okabe_ito[2]), aesthetics = c("fill", "color")) +
labs(y = "Cumulative Density", x = "Rating")
p1 + p2
```
Obviously, our points do not form a perfect line on the left, which is what perfect prediction would look like. Also, the distribution of our values suggests we're over-predicting on the lower end and under-predicting on the higher end of the target's range. But we'd like to quantify how far off we are in a general sense. There are a number of metrics that can be used to measure this, and we'll go through a few of them here.
#### R-squared {#sec-knowing-metrics-r2}
Anyone that has done linear regression has come across the $R^2$ value. It is a measure of how well the model explains the variance in the target. One way to calculate it is as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values. The $R^2$ value is a measure of how much variance in the target (the denominator) is attributable to the model's predictions (numerator). It is a value between 0 and 1, with 1 indicating that the model explains all of the variance in the target.
More simply, $R^2$ is the squared correlation of our predicted values and the target. In that sense, it can almost always be useful as a *descriptive* measure, just like we use means and standard deviations in exploratory data analysis. However, it is not so great at telling us about predictive quality. Why? Take your predictions from our rating model and add 10 to them, or make them all negative. In both cases your predictions would be ridiculous, but your $R^2$ (at least when computed as the squared correlation) will be the same. Another problem is that for training data, $R^2$ will always increase as you add more features to your model, whether they are useful or pure noise! This is why we use other metrics to assess predictive quality.
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-r2
1 - sum((testing_data$rating - predictions)^2) / sum((testing_data$rating - mean(testing_data$rating))^2)
yardstick::rsq_trad_vec(testing_data$rating, predictions)
# conceptually identical, but slight difference due to how internal calculations are done
cor(testing_data$rating, predictions)^2
yardstick::rsq_vec(testing_data$rating, predictions)
```
##### Python
```{python}
#| label: python-regression-r2
from sklearn.metrics import r2_score
1 - np.sum((testing_data.rating - predictions)**2) / np.sum((testing_data.rating - np.mean(testing_data.rating))**2)
r2_score(testing_data.rating, predictions)
# conceptually identical, but slight difference due to how calculations are done
np.corrcoef(testing_data.rating, predictions)[0, 1]**2
```
:::
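To make the descriptive-versus-predictive point concrete, here is a minimal sketch (in R, reusing the objects created above): shifting every prediction by a constant makes the predictions useless, but the squared correlation is unchanged.
```{r}
#| label: r-regression-r2-shift-demo
# ruin the predictions by shifting them all up by 10
shifted_predictions = predictions + 10
# the squared correlation ignores the shift entirely
cor(testing_data$rating, predictions)^2
cor(testing_data$rating, shifted_predictions)^2
```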
:::{.callout-tip title='R-squared variants' }
There are different versions of R-squared. 'Adjusted' R-squared is a common one, and it penalizes the model for adding features that don't really explain the target variance. This is a nice sentiment, but its difference from the standard R-squared will only be noticeable for very small datasets (or very many features). Some have also attempted to come up with R-squared values that are more appropriate for GLMs for count, binary, and other outcomes. Unfortunately, these 'pseudo-R-squared' values are not as interpretable as the original R-squared, and generally suffer from several issues of their own.
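For reference, with $n$ observations and $p$ features, the adjustment is just a degrees-of-freedom rescaling of the unexplained variance:
$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$$
Unless $n$ is small relative to $p$, this barely budges the value.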
:::
#### Mean Squared Error {#sec-knowing-metrics-mse}
One of the most common *performance* metrics for numeric targets is the mean squared error (MSE) and its square root, root mean squared error (RMSE). The MSE is the average of the squared differences between the predicted and actual values. It is calculated as follows:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
MSE penalizes large errors more heavily: since errors are squared, the larger the error, the larger the penalty. As mentioned, the root mean squared error (RMSE) is just the square root of the MSE. Like MSE, RMSE penalizes large errors, but if you want a metric in the same units as the original target, RMSE is the metric for you. It is calculated as follows:
$$\text{RMSE} = \sqrt{\text{MSE}}$$
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-mse
mean((testing_data$rating - predictions)^2)
yardstick::rmse_vec(testing_data$rating, predictions)^2
sqrt(mean((testing_data$rating - predictions)^2))
yardstick::rmse_vec(testing_data$rating, predictions)
```
##### Python
```{python}
#| label: python-regression-mse
from sklearn.metrics import mean_squared_error
np.mean((testing_data.rating - predictions)**2)
mean_squared_error(testing_data.rating, predictions)
np.sqrt(np.mean((testing_data.rating - predictions)**2))
mean_squared_error(testing_data.rating, predictions, squared = False)
```
:::
#### Mean Absolute Error {#sec-knowing-metrics-mae}
The mean absolute error (MAE) is the average of the absolute differences between the predicted and observed values. It is calculated as follows:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
MAE is a great metric when all you really want to know is how far off your predictions are from the observed values. It is not as sensitive to large errors as the MSE.
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-mae
mean(abs(testing_data$rating - predictions))
yardstick::mae_vec(testing_data$rating, predictions)
```
##### Python
```{python}
#| label: python-regression-mae
from sklearn.metrics import mean_absolute_error
np.mean(abs(testing_data.rating - predictions))
mean_absolute_error(testing_data.rating, predictions)
```
:::
#### Mean Absolute Percentage Error {#sec-knowing-metrics-mape}
The mean absolute percentage error (MAPE) is the average of the absolute differences between the predicted and observed values, expressed as a percentage of the observed values. It is calculated as follows:
$$\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{y_i} \times 100$$
MAPE is a great metric when you want to know how far off your predictions are from the observed values, but you want to express that difference as a percentage of the observed value. It is not as sensitive to large errors as the MSE.
:::{.panel-tabset}
##### R
```{r}
#| label: r-regression-mape
mean(
abs(testing_data$rating - predictions) /
testing_data$rating
) * 100
yardstick::mape_vec(testing_data$rating, predictions)
```
##### Python
```{python}
#| label: python-regression-mape
from sklearn.metrics import mean_absolute_percentage_error
np.mean(
abs(testing_data.rating - predictions) /
testing_data.rating
) * 100
mean_absolute_percentage_error(testing_data.rating, predictions) * 100
```
:::
#### Which Regression Metric Should I Use?
In the end, it won't hurt to look at a few of these metrics to get a better idea of how well your model is performing. You will *always* be using these metrics to compare different models, so use a few of them to get a better sense of how well your models are performing relative to one another. Does adding a feature help drive down RMSE, indicating that it helps to reduce large errors? In other words, does adding complexity to your model provide a meaningful reduction in error? If adding a feature doesn't help reduce error, do you really need to include it in your model?
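As a quick illustration, here is a minimal sketch (in R, reusing the objects created above) that compares our full regression model against a hypothetical reduced model on the test set; the particular features in the reduced model are chosen only for demonstration.
```{r}
#| label: r-regression-metric-compare
# a smaller model to compare against (features chosen only for illustration)
model_train_reg_small = lm(rating ~ word_count_sc + genre, training_data)
predictions_small = predict(model_train_reg_small, newdata = testing_data)
tibble(
    model = c("full", "reduced"),
    rmse  = c(
        yardstick::rmse_vec(testing_data$rating, predictions),
        yardstick::rmse_vec(testing_data$rating, predictions_small)
    ),
    mae   = c(
        yardstick::mae_vec(testing_data$rating, predictions),
        yardstick::mae_vec(testing_data$rating, predictions_small)
    )
)
```
If the extra features in the full model don't buy you much on these metrics, that's a good sign they may not be earning their keep.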
### Classification Metrics {#sec-knowing-class-metrics}
Whenever we are classifying outcomes, we don't have the same ability to compare a predicted score to an observed score -- instead, we are going to use the predicted probability of an outcome, establish a cut-point for that probability, convert everything below that cut-point to 0, and everything at or above that cut-point to 1. We can then compare a table of predicted versus target **classes**, typically called a **confusion matrix**[^whysoconfused].
[^whysoconfused]: The origin of the term "confusion matrix" is a bit muddled, and it's not clear why it's not just called a classification table/matrix (as it actually is from time to time). If you call it a classification table, probably everyone will know exactly what you mean, but if you call it a confusion matrix, few outside of data science (or domains that use it) will likely know what you're talking about.
Let's start with a model to predict whether a review is "good" or "bad". We will use the same training and testing data that we created above.
:::{.panel-tabset}
##### R
```{r}
#| label: r-classification-model-train
model_train_class = glm(
rating_good ~
review_year_0
+ release_year_0
+ age_sc
+ length_minutes_sc
+ total_reviews_sc
+ word_count_sc
+ genre
,
training_data,
family = binomial
)
summary(model_train_class)
# for later
y_target_testing_bin = ifelse(testing_data$rating_good == "good", 1, 0)
```
##### Python
```{python}
#| label: python-classification-model-train
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = 'rating_good ~ ' + " + ".join(features)
model_train_class = smf.glm(
formula = model,
data = training_data,
family = sm.families.Binomial()
).fit()
model_train_class.summary()
```
:::
```{r}
#| label: r-classification-model-save
#| echo: false
save(
model_train_class,
file = "data/models/knowing_models/model_train_class.RData"
)
```
```{python}
#| label: python-classification-model-save
#| echo: false
model_train_class.save("data/models/knowing_models/model_train_class.pkl")
```
Now that we have our model trained, we can use it to get the predicted probabilities for each observation[^predict_proba].
[^predict_proba]: Most machine learning libraries in Python will have a `predict_proba` method that will give you the probability of each class, while the `predict` method will give you the predicted class.
:::{.panel-tabset}
##### R
```{r}
#| label: r-classification-model-predict
predicted_prob = predict(
model_train_class,
newdata = testing_data,
type = "response"
)
```
##### Python
```{python}
#| label: python-classification-model-predict
predicted_prob = model_train_class.predict(testing_data)
```
:::
We are going to take those probability values and convert everything at or above .5 to the positive class (a "good" review) and everything below it to the negative class. The .5 cut-point is a bold assumption, but one that we will make at first!
:::{.panel-tabset}
##### R
```{r}
#| label: r-classification-predictions
predicted_class = ifelse(predicted_prob >= .5 , 1, 0)
```
##### Python
```{python}
#| label: python-classification-predictions
predicted_class = np.where(predicted_prob >= .5, 1, 0)
predicted_class = pd.Series(predicted_class)
```
:::
#### Confusion Matrix {#sec-knowing-metrics-confusion}
The confusion matrix is a table that shows the number of correct and incorrect predictions made by the model. As we'll see, it's easy enough to get one from scratch, but we recommend using a function that will give you a nice table and all of the metrics you need along with it. To get us started, here is a confusion matrix for our (R) model, but the Python table is similar.
```{r}
#| echo: false
#| label: r-classification-confusion-matrix
rating_cm = mlr3measures::confusion_matrix(
factor(testing_data$rating_good),
factor(predicted_class),
positive = "1"
)
```
```{python}
#| echo: false
#| label: python-classification-confusion-matrix
from sklearn.metrics import confusion_matrix
rating_cm = confusion_matrix(testing_data.rating_good, predicted_class)
```
```{r}
#| echo: false
#| label: pretty-confmat
new_cm = rating_cm$matrix
new_cm['0','0'] = paste0("TN: ", new_cm['0','0'])
new_cm['1','1'] = paste0("TP: ", new_cm['1','1'])
new_cm['0', '1'] = paste0("FN: ", new_cm['0', '1'])
new_cm['1', '0'] = paste0("FP: ", new_cm['1', '0'])
new_cm = new_cm[2:1,2:1]
class(new_cm) = "matrix" # strip table class
new_cm |>
as.data.frame() |>
rownames_to_column(var = " ") |>
rename(`True 0` = `0`, `True 1` = `1`) |>
mutate(` ` = glue('Predicted {` `}')) |>
gt()
```
- **TN**: A True Negative is an outcome where the model correctly predicts the negative class -- the model correctly predicted that the review was not good.
- **FN**: A False Negative is an outcome where the model incorrectly predicts the negative class -- the model incorrectly predicted that the review was not good.
- **FP**: A False Positive is an outcome where the model incorrectly predicts the positive class -- the model incorrectly predicted that the review was good.
- **TP**: A True Positive is an outcome where the model correctly predicts the positive class -- the model correctly predicted that the review was good.
In an ideal world, we would have all of our observations fitting nicely in the diagonal of that table. Unfortunately, we don't live in that world and the more values we have in the off diagonal (i.e., in the FN and FP spots), the worse our model is at classifying outcomes.
Let's look at some metrics that will help us see whether we've got a suitable model or not. We'll describe each, then show them all afterwards.
<!-- TODO: Maybe just table these up? But then we have two tables -->
#### Accuracy
Accuracy's allure lies in its simplicity, and in the fact that we use it for so many things in our everyday affairs. Accuracy is the proportion of correct predictions made by the model. Of all the metrics for assessing classification quality, accuracy is the easiest to cheat. If you have any **class imbalance** (i.e., one class within the target has far more observations than the other), you can get a high accuracy by simply predicting the majority class all of the time! To get around the false sense of confidence that accuracy alone can promote, we can look at a few other metrics.
:::{.callout-warning title='Accuracy is not enough' collapse="true"}
Accuracy is the first thing you see and the last thing that you trust! Seriously, accuracy alone should not be your sole performance metric unless you have a perfectly even split in the target! If you find yourself in a meeting where people are presenting their classification models and they only talk about accuracy, you should be very skeptical of their model; this is especially true when those accuracy values seem too good to be true. At the very least, always be ready to compare it to the baseline rate, or prevalence of the majority class.
:::
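To see how cheap a 'good-looking' accuracy can be, here is a minimal sketch (in R, reusing the objects created above) comparing a model-free baseline that always predicts the majority class against our model's predictions; recall that `rating_good` is coded 0/1.
```{r}
#| label: r-accuracy-baseline
# baseline: always predict whichever class is most common in the training data
majority_class = as.numeric(names(which.max(table(training_data$rating_good))))
mean(testing_data$rating_good == majority_class)  # baseline (no-model) accuracy
mean(testing_data$rating_good == predicted_class) # our model's accuracy
```
If the second number isn't comfortably above the first, the model isn't adding much beyond the prevalence of the majority class.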
#### Sensitivity/Recall/True Positive Rate {#sec-knowing-metrics-sensitivity}
**Sensitivity**, also known as **recall** or the **true positive rate**, is the proportion of observed positives that are correctly predicted by the model. If your focus is on the positive class above all else, sensitivity is the metric for you.
#### Specificity/True Negative Rate {#sec-knowing-metrics-specificity}
**Specificity**, also known as the **true negative rate**, is the proportion of observed negatives that are correctly predicted as such. If you want to know how well your model will work with the negative class, specificity is a great metric.
#### Precision/Positive Predictive Value {#sec-knowing-metrics-precision}
The **precision** is the proportion of positive predictions that are correct, and is often a key metric in many business use cases. While similar to sensitivity, precision focuses on positive *predictions*, whereas sensitivity focuses on observed positive *cases*.
#### Negative Predictive Value {#sec-knowing-metrics-npv}
The **negative predictive value** is the proportion of negative predictions that are correct; it is the counterpart to precision for the negative class.
Let's see how we'd do this ourselves. We'll create a basic confusion matrix then extract the values to create the metrics we need.
:::{.panel-tabset}
##### R
```{r}
#| label: r-cm-basic
confusion_matrix = table(
predicted_class,
testing_data$rating_good
)
TN = confusion_matrix[1, 1]
TP = confusion_matrix[2, 2]
FN = confusion_matrix[1, 2]
FP = confusion_matrix[2, 1]
acc = (TP + TN) / sum(confusion_matrix) # accuracy
tpr = TP / (TP + FN) # sensitivity, true positive rate, recall
tnr = TN / (TN + FP) # specificity, true negative rate
ppv = TP / (TP + FP) # precision, positive predictive value
npv = TN / (TN + FN) # negative predictive value
```
##### Python
```{python}
#| label: python-cm-basic
confusion_matrix = pd.crosstab(
predicted_class,
testing_data.rating_good.reset_index(drop=True),
).to_numpy()
TN = confusion_matrix[0, 0]
TP = confusion_matrix[1, 1]
FN = confusion_matrix[0, 1]
FP = confusion_matrix[1, 0]
acc = (TP + TN) / np.sum(confusion_matrix) # accuracy
tpr = TP / (TP + FN) # sensitivity, true positive rate, recall
tnr = TN / (TN + FP) # specificity, true negative rate
ppv = TP / (TP + FP) # precision, positive predictive value
npv = TN / (TN + FN) # negative predictive value
```
:::
Now that we have a sense of some metrics, let's get a confusion matrix and stats using packages that will give us a lot of these metrics at once. In both cases we have a 0/1 integer target, where 0 is a rating of "bad" and 1 is "good".
:::{.panel-tabset}
##### R
We use [mlr3verse]{.pack} in the ML chapters, so we'll use it here too. Though our predictions are 0/1 integers, we need to convert them to factors for this function.
```{r}
#| label: r-cm-summary
# note the columns and rows are reversed from our previous confusion matrix
cm = mlr3measures::confusion_matrix(
factor(testing_data$rating_good), # requires a factor
factor(predicted_class),
positive = "1" # class 1 is 'good'
)
tibble(
metric = c('ACC', 'TPR', 'TNR', 'PPV', 'NPV'),
ours = c(acc, tpr, tnr, ppv, npv),
package = cm$measures[c('acc', 'tpr', 'tnr', 'ppv', 'npv')]
)
```
##### Python
We find [pycm]{.pack} to be a great package for this purpose; practically every metric you can think of is available. You can also use [sklearn.metrics]{.pack} and its corresponding `classification_report` function.
```{python}
#| label: python-cm-summary
from pycm import ConfusionMatrix
cm = ConfusionMatrix(
testing_data.rating_good.to_numpy(),
predicted_class.to_numpy(),
digit = 3
)
# print(cm) # lots of stats!
package_result = [
cm.class_stat[stat][1] # get results specific to class 1
for stat in ['ACC', 'TPR', 'TNR', 'PPV', 'NPV']
]
pd.DataFrame({
'metric':['ACC', 'TPR', 'TNR', 'PPV', 'NPV'],
'ours': [acc, tpr, tnr, ppv, npv],
'package': package_result
})
```
:::
Some additional measures we might look at include the following; those with an asterisk are also in @tbl-performance-metrics. A few of them are computed in the sketch following the list.
- Kappa: A measure of how much better the model is than random guessing.
- Prevalence: The proportion of actual positives in the data. If you don't know this, accuracy is fairly meaningless.
- Balanced Accuracy: The average of the sensitivity (TPR) and specificity (TNR). Typically a better measure than accuracy when there is class imbalance.
- F1*: The harmonic mean of precision and recall.
- AUC*: The area under the ROC curve. We'll talk about this next.
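Here is a minimal sketch (in R, reusing the hand-computed values from above) for a few of these:
```{r}
#| label: r-additional-class-metrics
# prevalence of the positive class in the test data (rating_good is 0/1)
prevalence = mean(testing_data$rating_good)
# balanced accuracy: average of sensitivity and specificity
balanced_accuracy = (tpr + tnr) / 2
# F1: harmonic mean of precision and recall
f1 = 2 * (ppv * tpr) / (ppv + tpr)
c(prevalence = prevalence, balanced_accuracy = balanced_accuracy, f1 = f1)
```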
#### Ideal Decision Points for Classification {#sec-knowing-ideal-cut-point}
Earlier, when we obtained the predicted class and then all the metrics based on it, we used a predicted probability value of 0.5 as the cutoff for a 'good' vs. a 'bad' rating, and this is usually the default if we don't specify otherwise. Thinking that this is best for a given situation is actually a pretty bold assumption on our part, and we should probably make sure that the cut-off value we choose is going to offer us the best result given the modeling context.
But what is the 'best' result? That's going to depend on the situation. If we are predicting whether a patient has a disease, we might want to minimize false negatives, since if we miss the diagnosis, the patient could be in serious trouble. Meanwhile if we are predicting whether a transaction is fraudulent, we might want to minimize false positives, since if we flag a transaction as fraudulent when it isn't, we could be causing a lot of trouble for the customer. In other words, we might want to maximize sensitivity or specificity, respectively.
Whatever we decide, we ultimately are just shifting the metrics around relative to one another. As an easy example, if we were to classify all of our observations as 'good', we would have a sensitivity of 1 because all good ratings would be classified correctly. However, our positive predictive value would not be 1, and we'd have a specificity of 0. No matter which cutpoint we choose, we are going to have to make a tradeoff between these metrics.
Where this comes into play is with **model selection**, where we choose a model based on a particular metric (something we will talk about very soon). If we are comparing models based on accuracy, we might choose a different model than if we are comparing based on sensitivity. And even using the same metric, we might choose a different model at one threshold than we would at another.
To help us with the task of choosing a threshold, we will start by creating what's called a **Receiver Operating Characteristic** (ROC) curve. This curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The **area under the curve** (AUC) is a measure of how well the model is able to distinguish between the two classes. The closer the AUC is to 1, the better the model is at distinguishing between the two classes. The AUC is a very popular metric because it is not sensitive to our threshold, and actually concerns two metrics we are often interested in[^prerecallcurve].
[^prerecallcurve]: The **precision-recall curve** is a very similar approach that visualizes the tradeoff between precision and recall. The area under the precision-recall curve is its corresponding metric.
:::{.panel-tabset}
##### R
```{r}
#| results: hide
#| label: r-roc-auc
#| fig-show: 'hide'
roc = performance::performance_roc(model_train_class, new_data = testing_data)
roc
# requires the 'see' package
plot(roc)
```
##### Python
```{python}
#| results: hide
#| label: python-roc-auc
#| fig-show: 'hide'
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
fpr, tpr, thresholds = roc_curve(
testing_data.rating_good,
predicted_prob
)
RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
auc(fpr, tpr)
```
:::
```{r}
#| echo: false
#| label: fig-pretty-roc-auc
#| fig-cap: ROC curve and AUC value
auc_value = bayestestR::area_under_curve(
roc$Specificity,
roc$Sensitivity
)
roc |>
as_tibble() |>
ggplot(aes(1-Specificity, Sensitivity)) +
geom_path(linewidth = 1.25) +
geom_abline(intercept = 1, slope = 1, linetype = "dashed", color = 'gray75') +
annotate(
"text",
x = 0.5,
y = 0.75,
label = paste("AUC:", round(auc_value, 2)),
size = 5,
fontface = "bold",
color = 'gray33'
) +
scale_x_reverse()
```
With ROC curves and AUC values in hand, we can now find the ideal cut-point for balancing the TPR and FPR. There are different ways to do this, but one common approach is Youden's J statistic (sensitivity + specificity - 1, or equivalently TPR - FPR), which is what both of the following do.
:::{.panel-tabset}
##### R
<!-- TODO: this result seems bad, but checked with
roc_$thresholds[which.max(roc_$sensitivities - (1-roc_$specificities))] and it gives same result
-->
```{r}
#| label: r-ideal-cut-point
roc_ = pROC::roc(testing_data$rating_good, predicted_prob) # produces the same value as before
threshold = pROC::coords(roc_, "best", ret = "threshold")
predictions = ifelse(
predict(model_train_class, testing_data, type='response') >= threshold$threshold,
1,
0
)
cm_new = mlr3measures::confusion_matrix(
factor(testing_data$rating_good),
factor(predictions),
positive = "1"
)
tibble(
threshold = threshold,
TNR = cm_new$measures['tnr'],
TPR = cm_new$measures['tpr']
)
```
##### Python
```{python}
#| label: python-ideal-cut-point
cut = thresholds[np.argmax(tpr - fpr)]
pd.DataFrame({
'threshold': [cut],
'TNR': [1 - fpr[np.argmax(tpr - fpr)]],
'TPR': [tpr[np.argmax(tpr - fpr)]]
})
```
:::
The result is a "best" decision cut-point for converting our predicted probabilities to classes, though again, there are different and equally valid ways of going about this. The take-home point is that instead of naively setting our probability cutoff to .5, this gives a cut-point that leads to a more balanced approach, one that recognizes metrics beyond accuracy. We will leave it to you to take that ideal cut-point value and update your metrics to see how much of a difference it makes.
Note that this only changes the values relative to one another, not the overall performance of the model (the actual predicted probabilities don't change, after all). For example, accuracy may go down while recall increases. You'll need to match these metrics to your use case to see if the change is worth it. Whether it is a meager, modest, or meaningful improvement is going to vary from situation to situation, as will how you determine if your model is "good" or "bad". Is this a good model? Are you more interested in correctly identifying the positive class, or the negative class? Are you more interested in avoiding false positives or false negatives? These are all questions that you will need to answer depending on the modeling context.
### Model Selection & Comparison {#sec-knowing-model-compare}