# Linear discriminant analysis {#sec-lda}
\index{classification!linear discriminant analysis (LDA)}
Linear discriminant analysis (LDA) dates to the early 1900s. It is one of the most elegant and simple techniques for modeling the separation between groups and, as a bonus, it produces a low-dimensional representation of the differences between the groups. LDA makes two strong assumptions: each group is a sample from a multivariate normal distribution, and all groups have the same variance-covariance matrix. If the latter assumption is relaxed, a slightly less elegant solution results: quadratic discriminant analysis.
Useful explanations can be found in @VR02 and @Ri96. A good general treatment of parametric methods for supervised classification can be found in @JW02 or a similar multivariate analysis textbook. It is also useful to know that hypothesis testing for a difference in multivariate means using multivariate analysis of variance (MANOVA) rests on similar assumptions to LDA, and that model-based clustering, which assumes each cluster arises from a multivariate normal distribution, is also related to LDA. The methods described here can be used to check the assumptions when applying those methods, too.\index{classification!multivariate analysis of variance (MANOVA)} \index{cluster analysis!model-based}
::: {.content-visible when-format="html"}
::: info
Because LDA is a parametric model, it is important to check that these assumptions are reasonably satisfied:
- the shape of each cluster is elliptical.
- the spread of the observations is the same in each group.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{Because LDA is a parametric model, it is important to check that these assumptions are reasonably satisfied:
\begin{itemize} \itemsep 0in
\item the shape of each cluster is elliptical.
\item the spread of the observations is the same in each group.
\end{itemize}
}
:::
<!-- Algorithmic methods have overtaken parametric methods in the practice of supervised classification. A parametric method such as linear discriminant analysis (LDA) yields a set of interpretable output parameters, so it leaves a clear trail helping us to understand what was done to produce the results. An algorithmic method, on the other hand, is more or less a black box, with various input parameters that are adjusted to tune the algorithm. The algorithm's input and output parameters do not always correspond in any obvious way to the interpretation of the results. Because of its simplicity, if the assumptions (approximately) hold, LDA can provide an elegant classification solution.-->
## Extracting the key elements of the model
LDA builds the model on the between-group sum-of-square matrix
$$B=\sum_{k=1}^g n_k(\bar{X}_k-\bar{X})(\bar{X}_k-\bar{X})^\top$$
which measures the differences between the class means,
compared with the overall data mean $\bar{X}$ and the within-group sum-of-squares matrix,
$$
W =
\sum_{k=1}^g\sum_{i=1}^{n_k}
(X_{ki}-\bar{X}_k)(X_{ki}-\bar{X}_k)^\top
$$
which measures the variation of values around each class mean. The linear discriminant space is generated by computing the eigenvectors (canonical coordinates) of $W^{-1}B$; this is the $(g-1)$-D space in which the group means are most separated relative to the pooled variance-covariance. For each class we compute
\index{classification!discriminant space}
$$
\delta_k(x) = x^\top W^{-1}\mu_k - \frac{1}{2}\mu_k^\top W^{-1}\mu_k + \log \pi_k
$$
where $\pi_k$ is a prior probability for class $k$ that might be based on unequal sample sizes, or cost of misclassification. The LDA classifier rule is to *assign a new observation to the class with the largest value* of $\delta_k(x)$.
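To make these formulas concrete, here is a small hand computation of $B$, $W$, the eigenvectors of $W^{-1}B$, and the $\delta_k(x)$ rule, checked against `MASS::lda()`. This is a sketch only; it uses the built-in `iris` data rather than the penguins so that it runs stand-alone.

```r
# Hand computation of B, W, the discriminant directions, and the
# classification rule, checked against MASS::lda() on the iris data.
library(MASS)

X  <- as.matrix(iris[, 1:4])
cl <- iris$Species
n  <- nrow(X); g <- nlevels(cl)
xbar <- colMeans(X)

B  <- matrix(0, 4, 4)
W  <- matrix(0, 4, 4)
mu <- matrix(0, g, 4, dimnames = list(levels(cl), colnames(X)))
for (k in seq_len(g)) {
  Xk <- X[cl == levels(cl)[k], , drop = FALSE]
  mu[k, ] <- colMeans(Xk)
  B <- B + nrow(Xk) * tcrossprod(mu[k, ] - xbar)   # between-group SS
  W <- W + crossprod(sweep(Xk, 2, mu[k, ]))        # within-group SS
}

# The discriminant directions are the eigenvectors of W^{-1}B; they
# match lda()'s scaling up to a scale factor (|correlation| near 1).
fit <- lda(Species ~ ., data = iris)
ev  <- Re(eigen(solve(W) %*% B)$vectors)
abs(cor(ev[, 1], fit$scaling[, "LD1"]))

# The rule: assign x to the class maximising delta_k(x), using the
# pooled covariance W/(n - g) and the sample-proportion priors.
S_inv <- solve(W / (n - g))
delta <- X %*% S_inv %*% t(mu)
delta <- sweep(delta, 2, 0.5 * diag(mu %*% S_inv %*% t(mu)))
delta <- sweep(delta, 2, log(as.vector(table(cl) / n)), FUN = "+")
manual <- levels(cl)[max.col(delta)]
mean(manual == as.character(predict(fit)$class))
```

The class assignments from the hand-computed rule agree with `predict()` on the fitted model, which is the point of the check.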
We can fit an LDA model using the `lda()` function from the `MASS` package. Here we have used the `penguins` data, assuming equal prior probability, to illustrate.
```{r}
#| message: false
#| code-fold: false
# Code to fit the model
library(dplyr)
library(mulgar)
library(MASS)
load("data/penguins_sub.rda")
p_lda <- lda(species~bl+bd+fl+bm,
             data=penguins_sub,
             prior=c(1/3, 1/3, 1/3))
options(digits=2)
# p_lda
```
Because there are three classes, the discriminant space is 2D. We can easily extract the group means from the model.
```{r}
#| code-fold: false
# Extract the sample means
p_lda$means
```
The coefficients that project the data into the discriminant space, that is, the eigenvectors of $W^{-1}B$, are:
```{r}
#| code-fold: false
# Extract the discriminant space
p_lda$scaling
```
and the predicted values, which include the class predictions and the coordinates in the discriminant space, are generated as:
```{r}
#| code-fold: false
# Extract the fitted values
p_lda_pred <- predict(p_lda, penguins_sub)
```
The best separation between classes can be viewed from this object, which can be shown to match the original data projected using the `scaling` component of the model object (see @fig-p-lda).
```{r}
#| label: fig-p-lda
#| warning: false
#| fig-cap: "Penguins projected into the 2D discriminant space, done two ways: (a) using the predicted values, (b) directly projecting using the model component. The scale is not quite the same but the projected data is identical in shape."
#| message: false
#| fig-width: 8
#| fig-height: 4
#| code-summary: "Code to generate LDA plots"
# Check calculations from the fitted model, and equations
library(colorspace)
library(ggplot2)
library(ggpubr)
# Using the predicted values from the model object
p_lda_pred_x1 <- data.frame(p_lda_pred$x)
p_lda_pred_x1$species <- penguins_sub$species
p_lda1 <- ggplot(p_lda_pred_x1,
                 aes(x=LD1, y=LD2, colour=species)) +
  geom_point() +
  xlim(-6, 8) + ylim(-6.5, 5.5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.title = element_blank())
# matches the calculations done manually
p_lda_pred_x2 <- data.frame(as.matrix(penguins_sub[,1:4]) %*%
                              p_lda$scaling)
p_lda_pred_x2$species <- penguins_sub$species
p_lda2 <- ggplot(p_lda_pred_x2,
                 aes(x=LD1, y=LD2, colour=species)) +
  geom_point() +
  xlim(-6, 8) + ylim(-7, 5.5) +
  scale_color_discrete_divergingx("Zissou 1") +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.title = element_blank())
ggarrange(p_lda1, p_lda2, ncol=2,
          common.legend = TRUE, legend = "bottom")
```
The $W$ and $B$ matrices cannot be extracted from the model object, so we need to compute them separately. In fact, we only need $W$. It is useful to think of it as the pooled variance-covariance matrix: because LDA assumes that the population group variance-covariances are identical, we estimate this by computing the variance-covariance of each class and then averaging them. It's laborious, but easy.
```{r}
#| code-fold: false
# Compute pooled variance-covariance
p_vc_pool <- mulgar::pooled_vc(penguins_sub[,1:4],
                               penguins_sub$species)
p_vc_pool
```
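The pooled variance-covariance calculation can be written out in a few lines of base R, which makes the definition explicit: average the group sample covariances, weighted by their degrees of freedom. This sketch uses the built-in `iris` data so that it runs stand-alone; `pooled_cov` is our own name for the helper, not a `mulgar` function.

```r
# Pooled variance-covariance: weighted average of the group sample
# covariances, with weights n_k - 1 (the degrees of freedom).
pooled_cov <- function(X, cl) {
  X  <- as.matrix(X)
  cl <- factor(cl)
  S  <- matrix(0, ncol(X), ncol(X))
  for (k in levels(cl)) {
    Xk <- X[cl == k, , drop = FALSE]
    S  <- S + (nrow(Xk) - 1) * cov(Xk)
  }
  S / (nrow(X) - nlevels(cl))
}

S_pool <- pooled_cov(iris[, 1:4], iris$Species)
S_pool
```

The result is a single symmetric matrix, the same shape as each group's own variance-covariance.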
This can be used to draw an ellipse corresponding to the pooled variance-covariance that is used by the LDA model.
```{r}
#| echo: false
#| eval: false
ginv(as.matrix(p_vc_pool))
```
## Checking assumptions
This LDA approach is widely applicable, but it is useful
to check the underlying assumptions on which it depends: (1) that the cluster structure corresponding to each class forms an ellipse, consistent with a sample from a multivariate normal distribution, and (2) that the spread of values around each mean is nearly the same. @fig-lda-assumptions1 and @fig-lda-assumptions2 illustrate two datasets, of which only one is consistent with these assumptions. Other parametric models, such as quadratic discriminant analysis or logistic regression, also depend on assumptions about the data which should be validated. \index{classification!quadratic discriminant analysis (QDA)} \index{classification!logistic regression}
::: {.content-visible when-format="html"}
::: info
To check the equal and elliptical variance-covariance assumption, generate points on the surface of an ellipse corresponding to the variance-covariance of each group. When watching these ellipses in a tour, they should look similar in all projections.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{To check the equal and elliptical variance-covariance assumption, generate points on the surface of an ellipse corresponding to the variance-covariance of each group. When watching these ellipses in a tour, they should look similar in all projections.
}
:::
\index{classification!variance-covariance}
\index{variance-covariance}
```{r}
# Generate ellipses for each group's variance-covariance
p_ell <- NULL
for (i in unique(penguins_sub$species)) {
  x <- penguins_sub %>% dplyr::filter(species == i)
  e <- gen_xvar_ellipse(x[,1:2], n=150, nstd=1.5)
  e$species <- i
  p_ell <- bind_rows(p_ell, e)
}
```
```{r echo=knitr::is_html_output()}
#| label: fig-lda-assumptions1
#| fig-cap: "Scatterplot of flipper length by bill length of the penguins data, and corresponding variance-covariance ellipses. There is a small amount of difference between the ellipses, but they are similar enough to be confident in assuming the population variance-covariances are equal."
#| message: false
#| fig-width: 8
#| fig-height: 4
#| code-summary: "Code for penguins data and ellipse plots"
#| code-fold: true
lda1 <- ggplot(penguins_sub, aes(x=bl, y=bd,
                                 colour=species)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(-2.5, 3) + ylim(-2.5, 2.5) +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
lda2 <- ggplot(p_ell, aes(x=bl, y=bd,
                          colour=species)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(-2.5, 3) + ylim(-2.5, 2.5) +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
ggarrange(lda1, lda2, ncol=2,
          common.legend = TRUE, legend = "bottom")
```
```{r echo=knitr::is_html_output()}
#| label: fig-lda-assumptions2
#| fig-cap: "Scatterplot of distance to cfa and road for the bushfires data, and corresponding variance-covariance ellipses. There is a lot of difference between the ellipses, so it cannot be assumed that the population variance-covariances are equal."
#| message: false
#| fig-width: 8
#| fig-height: 4
#| code-summary: "Code for bushfires data and ellipse plots"
#| code-fold: true
# Now repeat for a data set that violates assumptions
data(bushfires)
lda3 <- ggplot(bushfires, aes(x=log_dist_cfa,
                              y=log_dist_road,
                              colour=cause)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(6, 11) + ylim(-1, 10.5) +
  ggtitle("(a)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
b_ell <- NULL
for (i in unique(bushfires$cause)) {
  x <- bushfires %>% dplyr::filter(cause == i)
  e <- gen_xvar_ellipse(x[,c(57, 59)], n=150, nstd=2)
  e$cause <- i
  b_ell <- bind_rows(b_ell, e)
}
lda4 <- ggplot(b_ell, aes(x=log_dist_cfa,
                          y=log_dist_road,
                          colour=cause)) +
  geom_point() +
  scale_color_discrete_divergingx("Zissou 1") +
  xlim(6, 11) + ylim(-1, 10.5) +
  ggtitle("(b)") +
  theme_minimal() +
  theme(aspect.ratio = 1)
ggarrange(lda3, lda4, ncol=2,
          common.legend = TRUE, legend = "bottom")
```
\index{data!bushfires}
::: {.content-visible when-format="html"}
::: insight
The equal and elliptical variance-covariance assumption is reasonable for the penguins data because the ellipse shapes roughly match the spread of the data. It is not a suitable assumption for the bushfires data, because the spread is not elliptically-shaped and varies in size between groups.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{The equal and elliptical variance-covariance assumption is reasonable for the penguins data because the ellipse shapes roughly match the spread of the data. It is not a suitable assumption for the bushfires data, because the spread is not elliptically-shaped and varies in size between groups.}
:::
This approach extends to any dimension. We would use the same projection sequence to view both the data and the variance-covariance ellipses, as in `r ifelse(knitr::is_html_output(), '@fig-penguins-lda-ellipses-html', '@fig-penguins-lda-ellipses-pdf')`. There is some difference in the shape and size of the ellipses between species in some projections, and also in the spread of points in the projected data. However, the differences are small, so it would be safe to assume that the population variance-covariances are equal.
```{r echo=knitr::is_html_output()}
#| message: false
#| eval: false
#| code-summary: "Code for making animated gifs"
#| code-fold: true
library(tourr)
p_ell <- NULL
for (i in unique(penguins_sub$species)) {
  x <- penguins_sub %>% dplyr::filter(species == i)
  e <- gen_xvar_ellipse(x[,1:4], n=150, nstd=1.5)
  e$species <- i
  p_ell <- bind_rows(p_ell, e)
}
p_ell$species <- factor(p_ell$species)
load("data/penguins_tour_path.rda")
animate_xy(p_ell[,1:4], col=factor(p_ell$species))
render_gif(penguins_sub[,1:4],
           planned_tour(pt1),
           display_xy(half_range=0.9, axes="off",
                      col=penguins_sub$species),
           gif_file="gifs/penguins_lda1.gif",
           frames=500,
           loop=FALSE)
render_gif(p_ell[,1:4],
           planned_tour(pt1),
           display_xy(half_range=0.9, axes="off",
                      col=p_ell$species),
           gif_file="gifs/penguins_lda2.gif",
           frames=500,
           loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-penguins-lda-ellipses-html layout-ncol=2}
![Data](gifs/penguins_lda1.gif){#fig-lda-4D-assumptions1 fig-alt="Animation showing a tour of the penguins data, with colour indicating species. The spread of points in each group is reasonably similar regardless of projection." width=300}
![Variance-covariance ellipses](gifs/penguins_lda2.gif){#fig-lda-4D-assumptions2 fig-alt="Animation showing a tour of the ellipses corresponding to variance-covariance matrices for each species. The shape of the ellipse for each group is reasonably similar regardless of projection." width=300}
Checking the assumption of equal variance-covariance matrices for the 4D penguins data. Each ellipse corresponds to the sample variance-covariance for each species.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-penguins-lda-ellipses-pdf layout-ncol=2}
![Data](images/penguins_lda1.png){#fig-lda-4D-assumptions1 fig-alt="Single frame from a tour of the penguins data, with colour indicating species. Each group has slightly negatively elliptical shaped spread."}
![Variance-covariance ellipses](images/penguins_lda2.png){#fig-lda-4D-assumptions2 fig-alt="Single frame from a tour of the ellipses corresponding to variance-covariance matrices for each species. Each ellipse has a north-west to south-east orientation. The red group is slightly smaller and more elliptical than the other two groups."}
Checking the assumption of equal variance-covariance matrices for the 4D penguins data. Each ellipse corresponds to the sample variance-covariance for each species.
:::
:::
As a further check, we could generate three ellipses corresponding to the pooled variance-covariance matrix, as used in the model, centered at each of the means. Overlay these with the data, as done in `r ifelse(knitr::is_html_output(), '@fig-penguins-lda-pooled-html', '@fig-penguins-lda-pooled-pdf')`. Now compare the spread of the observations in the data with the elliptical shape of the pooled variance-covariance. If they match reasonably well, we can safely use LDA. This can also be done group by group, when multiple groups make it difficult to view everything together.
::: {.content-visible when-format="html"}
::: info
To check the fit of the equal variance-covariance assumption, simulate points on the ellipse corresponding to the **pooled sample variance-covariance matrix**. Generate one for each group centered at the group mean, and compare with the data.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{To check the fit of the equal variance-covariance assumption, simulate points on the ellipse corresponding to the \emph{pooled sample variance-covariance matrix}. Generate one for each group centered at the group mean, and compare with the data.}
:::
\index{classification!pooled variance-covariance}
\index{pooled variance-covariance}
```{r}
#| message: false
#| code-summary: "Code for adding ellipses to data"
#| code-fold: true
# Create an ellipse corresponding to pooled vc
pool_ell <- gen_vc_ellipse(p_vc_pool,
                           xm=rep(0, ncol(p_vc_pool)))
# Add means to produce ellipses for each species
p_lda_pool <- data.frame(rbind(
  pool_ell + matrix(rep(p_lda$means[1,],
                        each=nrow(pool_ell)), ncol=4),
  pool_ell + matrix(rep(p_lda$means[2,],
                        each=nrow(pool_ell)), ncol=4),
  pool_ell + matrix(rep(p_lda$means[3,],
                        each=nrow(pool_ell)), ncol=4)))
# Create one data set with means, data, ellipses
p_lda_pool$species <- factor(rep(levels(penguins_sub$species),
                                 rep(nrow(pool_ell), 3)))
p_lda_pool$type <- "ellipse"
p_lda_means <- data.frame(
  p_lda$means,
  species=factor(rownames(p_lda$means)),
  type="mean")
p_data <- data.frame(penguins_sub[,1:5],
                     type="data")
p_lda_all <- bind_rows(p_lda_means,
                       p_data,
                       p_lda_pool)
p_lda_all$type <- factor(p_lda_all$type,
                         levels=c("mean", "data", "ellipse"))
shapes <- c(3, 4, 20)
p_pch <- shapes[p_lda_all$type]
```
```{r echo=knitr::is_html_output()}
#| eval: false
#| message: false
#| code-summary: "Code to generate animated gifs"
#| code-fold: true
# Code to run the tour
animate_xy(p_lda_all[,1:4], col=p_lda_all$species, pch=p_pch)
load("data/penguins_tour_path.rda")
render_gif(p_lda_all[,1:4],
           planned_tour(pt1),
           display_xy(col=p_lda_all$species, pch=p_pch,
                      axes="off", half_range = 0.7),
           gif_file="gifs/penguins_lda_pooled1.gif",
           frames=500,
           loop=FALSE)
# Focus on one species
render_gif(p_lda_all[p_lda_all$species == "Gentoo",1:4],
           planned_tour(pt1),
           display_xy(col="#F5191C",
                      pch=p_pch[p_lda_all$species == "Gentoo"],
                      axes="off", half_range = 0.7),
           gif_file="gifs/penguins_lda_pooled2.gif",
           frames=500,
           loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-penguins-lda-pooled-html layout-ncol=2}
![All species](gifs/penguins_lda_pooled1.gif){#fig-lda-pooled1 fig-alt="Animation showing a tour of the pooled variance-covariance ellipse, computed for each species, overlaid on the data. The shape of the ellipse for each group is reasonably similar to the spread of the points in all projections." width=300}
![Gentoo](gifs/penguins_lda_pooled2.gif){#fig-lda-pooled2 fig-alt="Animation showing a tour of the pooled variance-covariance ellipse overlaid on the data for the Gentoo penguins (red). The shape of the ellipse is reasonably similar to the spread of the points in all projections." width=300}
Checking how the pooled variance-covariance matches the spread of points in each group.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-penguins-lda-pooled-pdf layout-ncol=2}
![All species](images/penguins_lda_pooled1.png){#fig-lda-pooled1 fig-alt="A single frame from a tour of the pooled variance-covariance ellipse, computed for each species, overlaid on the data. The ellipse has a north-east to south-west orientation, and is almost circular, and reasonably matched the spread of the points in each group."}
![Gentoo](images/penguins_lda_pooled2.png){#fig-lda-pooled2 fig-alt="A single frame from a tour of the pooled variance-covariance ellipse overlaid on the data for the Gentoo penguins (red). The ellipse has a north-west to south-east orientation, and is relatively skinny. This roughly matches the spread of the points."}
Checking how the pooled variance-covariance matches the spread of points in each group.
:::
:::
::: {.content-visible when-format="html"}
::: insight
From the tour, we can see that equal elliptical variance-covariance is a reasonable assumption for the penguins data: in all projections the ellipses reasonably match the spread of the observations.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{From the tour, we can see that equal elliptical variance-covariance is a reasonable assumption for the penguins data: in all projections the ellipses reasonably match the spread of the observations.
}
:::
## Examining results
The boundaries of a classification model can be examined by:
1. generating a large number of test points in the domain of the data, and
2. predicting the class of each test point.
We'll look at this in 2D using an LDA model fitted to `bl` and `bd` of the `penguins` data.
```{r}
#| message: false
#| code-fold: false
p_bl_bd_lda <- lda(species~bl+bd, data=penguins_sub,
                   prior = c(1/3, 1/3, 1/3))
```
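The two steps listed above are essentially what `classifly::explore()` automates. A minimal hand-rolled version, shown here on the built-in `iris` data so that it runs stand-alone, generates a grid over two predictors and predicts the class at every grid point:

```r
# Generate a grid of test points over two predictors and predict
# the class of each, to trace out the LDA prediction regions.
library(MASS)

fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
grid <- expand.grid(
  Sepal.Length = seq(min(iris$Sepal.Length),
                     max(iris$Sepal.Length), length.out = 100),
  Sepal.Width  = seq(min(iris$Sepal.Width),
                     max(iris$Sepal.Width),  length.out = 100))
grid$pred <- predict(fit, grid)$class

# The counts give the relative size of each prediction region
table(grid$pred)
```

Plotting `grid` with small points coloured by `pred`, and overlaying the observed data, gives the same kind of display as the figures below.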
The fitted model means
$\bar{x}_{Adelie} = ($ `r p_bl_bd_lda$means[1,1]`, `r p_bl_bd_lda$means[1,2]`$)^\top$, $\bar{x}_{Chinstrap} = ($ `r p_bl_bd_lda$means[2,1]`, `r p_bl_bd_lda$means[2,2]`$)^\top$, and $\bar{x}_{Gentoo} = ($ `r p_bl_bd_lda$means[3,1]`, `r p_bl_bd_lda$means[3,2]`$)^\top$ can be added to the plots.
The boundaries can be examined using the `explore()` function from the `classifly` package, which generates observations in the range of all values of `bl` and `bd` and predicts their class. @fig-lda-2D-boundary shows the resulting prediction regions, with the observed data and the sample means overlaid.
```{r}
#| message: false
#| label: fig-lda-2D-boundary
#| fig-cap: "Prediction regions of the LDA model for two variables of the three species of penguins indicated by the small points. Large points are the observations, and the sample mean of each species is represented by the plus. The boundaries between groups can be seen to be roughly half-way between the means, taking the elliptical spread into account, and mostly distinguishes the three species."
#| fig-alt: "Square divided into three regions by the coloured points, primarily a wedge of yellow (Chinstrap) extending from the top right divides the other two groups. The regions mostly contain the observed values, with just a few over the boundary in the wrong region."
#| fig-width: 3
#| fig-height: 3
#| out-width: 70%
#| code-fold: false
# Compute points in domain of data and predict
library(classifly)
p_bl_bd_lda_boundaries <- explore(p_bl_bd_lda, penguins_sub)
p_bl_bd_lda_m1 <- ggplot(p_bl_bd_lda_boundaries) +
  geom_point(aes(x=bl, y=bd,
                 colour=species,
                 shape=.TYPE), alpha=0.8) +
  scale_color_discrete_divergingx("Zissou 1") +
  scale_shape_manual(values=c(46, 16)) +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none")
p_bl_bd_lda_means <- data.frame(p_bl_bd_lda$means,
                                species=rownames(p_bl_bd_lda$means))
p_bl_bd_lda_m1 +
  geom_point(data=p_bl_bd_lda_means,
             aes(x=bl, y=bd),
             colour="black",
             shape=3,
             size=3)
```
\index{classification!boundaries}
\index{tour!slice}
This approach readily extends to higher dimensions. One first fits the model with all four variables, and uses `explore()` to generate points in the 4D space with predictions, giving a representation of the prediction regions. `r ifelse(knitr::is_html_output(), '@fig-penguins-lda-boundaries-html', '@fig-penguins-lda-boundaries-pdf')`(a) shows the results using a slice tour [@slicetour]. Points inside the slice are shown at a larger size. The slice is made at the centre of the data, to show the boundaries in this neighbourhood. As the tour progresses we see a thin slice through the centre of the data, parallel to the projection plane. In most projections there is some small overlap of points between groups, which happens because we are examining a 4D object in 2D projections. The slice helps to alleviate this, allowing a focus on the boundaries in the centre of the cube. In all projections the boundaries between groups are linear, as would be expected from LDA. We can also see that the model roughly divides the cube into three relatively equally-sized regions.
`r ifelse(knitr::is_html_output(), '@fig-penguins-lda-boundaries-html', '@fig-penguins-lda-boundaries-pdf')`(b) shows the three prediction regions, represented by points in 4D, projected into the discriminant space. Linear boundaries neatly divide the full space, which is to be expected because the LDA model computes its classification rule in this 2D space.
\index{classification!discriminant space}
```{r echo=knitr::is_html_output()}
#| code-fold: false
p_lda <- lda(species ~ ., penguins_sub[,1:5], prior = c(1/3, 1/3, 1/3))
p_lda_boundaries <- explore(p_lda, penguins_sub)
```
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: "Code for generating slice tour"
# Code to run the tour
p_lda_boundaries$species
animate_slice(p_lda_boundaries[p_lda_boundaries$.TYPE == "simulated",1:4],
              col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == "simulated"],
              v_rel=0.02, axes="bottomleft")
render_gif(p_lda_boundaries[p_lda_boundaries$.TYPE == "simulated",1:4],
           planned_tour(pt1),
           display_slice(v_rel=0.02,
                         col=p_lda_boundaries$species[p_lda_boundaries$.TYPE == "simulated"],
                         axes="bottomleft"),
           gif_file="gifs/penguins_lda_boundaries.gif",
           frames=500,
           loop=FALSE)
```
```{r echo=knitr::is_html_output()}
#| code-summary: "Code for projecting into LDA space"
# Project the boundaries into the 2D discriminant space
p_lda_b_sub <- p_lda_boundaries[
  p_lda_boundaries$.TYPE == "simulated",
  c(1:4, 6)]
p_lda_b_sub_ds <- data.frame(as.matrix(p_lda_b_sub[,1:4]) %*%
                               p_lda$scaling)
p_lda_b_sub_ds$species <- p_lda_b_sub$species
p_lda_b_sub_ds_p <- ggplot(p_lda_b_sub_ds,
                           aes(x=LD1, y=LD2, colour=species)) +
  geom_point(alpha=0.5) +
  geom_point(data=p_lda_pred_x1,
             aes(x=LD1, y=LD2, shape=species),
             inherit.aes = FALSE) +
  scale_color_discrete_divergingx("Zissou 1") +
  scale_shape_manual(values=c(1, 2, 3)) +
  theme_minimal() +
  theme(aspect.ratio = 1,
        legend.position = "bottom",
        legend.title = element_blank())
```
::: {.content-visible when-format="html"}
::: {#fig-penguins-lda-boundaries-html layout-ncol=2}
![4D](gifs/penguins_lda_boundaries.gif){#fig-lda-4D-boundaries fig-alt="Animation plot, where three groups of coloured points can be seen. They roughly break the square into three regions with linear boundaries, which varies in each projection." width=300}
```{r}
#| echo: false
#| label: fig-lda-2D-boundaries
#| fig-width: 5
#| fig-height: 5
#| fig-cap: "Discriminant space"
#| fig-alt: "Scatterplot of points divided neatly into three colour groups by linear boundaries. The regions are split at the middle, and with the boundary between blue and red, blue and yellow, and yellow and red, starting near 11, 4, and o'clock, respectively."
p_lda_b_sub_ds_p
```
Examining the boundaries produced by the LDA model in the full 4D with a slice tour (left) and in the discriminant space (right). Large points indicate observations within the slice, and dots are observations outside the slice. Focusing on the within-slice points, there is some overlap of points between regions in most views, which reflects the occlusion of 4D shapes when examining projections with thin slices. The linear boundaries are seen exactly in the discriminant space; that is, they are orthogonal to these two dimensions.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-penguins-lda-boundaries-pdf layout-ncol=2 fig-alt="FIX-ME"}
![4D](images/penguins_lda_boundaries.png){#fig-lda-4D-boundaries fig-alt="A scatterplot with large points showing roughly linear boundaries between three groups." width=200}
![Discriminant space](images/fig-lda-2D-boundaries-1.png){#fig-lda-4D-ds fig-alt="Scatterplot of points divided neatly into three colour groups by linear boundaries. The regions are split at the middle, and with the boundary between blue and red, blue and yellow, and yellow and red, starting near 11, 4, and o'clock, respectively." width=200}
```{r}
#| echo: false
#| eval: false
#| fig-width: 5
#| fig-height: 5
#| out-width: 50%
#| fig-cap: "Discriminant space"
# Note: had to use the image above between quarto compile
# broke when this chunk was evaluated.
p_lda_b_sub_ds_p
```
Examining the boundaries produced by the LDA model in the full 4D with a slice tour (left shows a single frame) and in the discriminant space (right). Large points indicate observations within the slice, and dots are observations outside the slice. Focusing on the within-slice points, there is some overlap of points between regions in most views, which reflects the occlusion of 4D shapes when examining projections with thin slices. The linear boundaries are seen exactly in the discriminant space; that is, they are orthogonal to these two dimensions.
:::
:::
::: {.content-visible when-format="html"}
::: insight
From the tour, we can see that the LDA boundaries divide the classes only in the discriminant space. It is not using the space orthogonal to the 2D discriminant space. You can see this because the boundary is sharp in just one 2D projection, while most of the projections show some overlap of regions.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{From the tour, we can see that the LDA boundaries divide the classes only in the discriminant space. It is not using the space orthogonal to the 2D discriminant space. You can see this because the boundary is sharp in just one 2D projection, while most of the projections show some overlap of regions.}
:::
## Exercises {-}
1. For the `simple_clusters` data, compute the LDA model, and make a plot of the data, with points coloured by class. Overlay variance-covariance ellipses, and a $+$ indicating the sample mean of each class. Is it reasonable to assume that the two classes are sampled from populations with the same variance-covariance?
2. Examine the clusters corresponding to the classes in the `clusters` data set, using a tour. Based on the shape of the data is the assumption of equal variance-covariance reasonable?
3. Examine the pooled variance-covariance for the `clusters` data, overlaid on the data in a tour of the 5D space. Does it fit the variance of each cluster nicely?
4. Fit an LDA model to the `simple_clusters` data. Examine the boundaries produced by the model, in 2D.
5. Fit an LDA model to the `clusters` data. Examine the boundaries produced by the model in 5D.
6. Assess the LDA assumptions for the `multicluster` data. Is LDA an appropriate model?
7. Compute the first 12 PCs of the `sketches` data. Check the assumption of equal, elliptical variance-covariance of the 6 groups. Regardless of whether you decide that the assumption is satisfied or not, fit an LDA to the 12 PCs. Extract the discriminant space (the `x` component of the `predict` object), and examine the separation (or not) of the 6 groups in this 5D space. Is LDA providing a good classification model for this data?
8. Even though the `bushfires` data does not satisfy the assumptions for LDA, fit LDA to the first five PCs. Examine the class differences in the 3D discriminant space.
9. Compute the boundary between classes, for the LDA model where the prior probability reflects the sample size, and the LDA model where the priors are equal for all groups. How does the boundary between lightning caused fires and the other groups change?
```{r}
#| eval: false
#| echo: false
library(mulgar)
library(dplyr)
library(tourr)
library(MASS)
data("sketches_train")
sketches_pca <- prcomp(sketches_train[,1:784])
ggscree(sketches_pca, q=25, guide=FALSE)
sketches_pc <- as.data.frame(sketches_pca$x[,1:12])
sketches_pc$word <- sketches_train$word
animate_xy(sketches_pc[,1:12], col=sketches_pc$word)
sketches_lda <- lda(word~., data=sketches_pc)
sketches_pred <- predict(sketches_lda, sketches_pc)
sketches_ds <- data.frame(sketches_pred$x) %>%
mutate(word = sketches_pc$word)
animate_xy(sketches_ds[,1:5], col=sketches_ds$word)
# bushfires
data(bushfires)
bushfires_std <- bushfires %>%
dplyr::select(`rf`:`log_dist_road`) %>%
mutate_if(is.numeric, function(x) (x-
mean(x, na.rm=TRUE))/
sd(x, na.rm=TRUE))
bushfires_pca <- prcomp(bushfires_std)
bushfires_pc <- data.frame(bushfires_pca$x[,1:5]) %>%
mutate(cause = bushfires$cause)
bushfires_lda <- lda(cause~., data=bushfires_pc)
bushfires_pred <- predict(bushfires_lda, bushfires_pc)
bushfires_ds <- data.frame(bushfires_pred$x) %>%
mutate(cause = factor(bushfires$cause))
animate_xy(bushfires_ds[,1:3], col=bushfires_ds$cause)
```