## Principal component analysis
\index{dimension reduction!principal component analysis (PCA)}
Reducing dimensionality using principal component analysis (PCA) dates back to @pearson-pca and @hotelling-pca, and @joliffe2016 provides a current overview. The goal is to find a smaller set of variables, $q (< p)$, that contains as much of the information in the original variables as possible. The new variables, known as principal components (PCs), are linear combinations of the original variables, and can be used to represent the data in a lower-dimensional space.
The process is essentially an optimisation procedure, although PCA has an analytical solution. It solves the problem of
$$
\max_{a_k} ~\text{Var} (Xa_k),
$$
where $X$ is the $n \times p$ data matrix, $a_k (k=1, ..., p)$ is a 1D projection vector, called an eigenvector, and $\text{Var}(Xa_k)$ is the corresponding eigenvalue. The maximisation is subject to $a_k^\top a_k = 1$, with each new eigenvector orthogonal to the previous ones. PCA is thus a sequential process: it finds the direction in the high-dimensional space (given by the first eigenvector) where the data is most varied, then the second most varied direction, and so on. The eigenvectors define the combinations of the original variables, and the eigenvalues give the amount of variance explained by the reduced number of variables.
\index{dimension reduction!eigenvalue}
\index{dimension reduction!eigenvector}
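The eigen-connection can be checked numerically. Below is a minimal sketch (not part of the analyses in this chapter) showing that `prcomp()` returns the eigenvectors and eigenvalues of the covariance matrix, up to sign:
```{r}
#| eval: false
#| code-fold: false
# Sketch: prcomp() matches the eigendecomposition of the
# covariance matrix (eigenvectors agree up to sign)
X <- matrix(rnorm(300), ncol = 3)
e <- eigen(cov(X))
p <- prcomp(X)
e$values                          # eigenvalues
p$sdev^2                          # same values: variances of the PCs
abs(e$vectors) - abs(p$rotation)  # near zero: vectors match up to sign
```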
PCA is broadly useful for summarising linear association, by forming combinations of variables that are highly correlated. However, high correlation can also occur when there are outliers or clustering, so PCA is commonly used to detect these patterns too.
::: {.content-visible when-format="html"}
::: info
With visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{With visualisation we want to assess whether it is appropriate to use PCA to summarise any linear association by using combinations of variables that are highly correlated. It can help to detect other patterns that might affect the PCA results such as outliers, clustering or non-linear dependence.}
:::
PCA is not very effective when the distribution of the variables is highly skewed, so it can be helpful to transform variables to make them more symmetrically distributed before conducting PCA. It is also possible to summarise different types of structure by generalising the optimisation criterion to any function of the projected data, $f(XA)$, which is called *projection pursuit* (PP). PP has a long history (@Kr64a, @FT74, @DF84, @JS87, @Hu85), and there are regularly new developments (e.g. @Lee2009, @perisic2009, @Lee2013, @loperfido, @bickel2018, @zhang2023).
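To give a flavour of projection pursuit, a guided tour optimises a projection index other than variance while touring. A small sketch (not part of the analyses in this chapter) using the `holes()` index from the `tourr` package on its built-in `flea` data:
```{r}
#| eval: false
#| code-fold: false
# Sketch: projection pursuit with a guided tour, optimising the
# holes() index (low density in the centre) rather than variance
library(tourr)
animate_xy(flea[, 1:6], guided_tour(holes()))
```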
## Determining how many dimensions
We would start by examining the data using a grand tour. The goal is to check whether there might be potential issues for PCA, such as skewness, outliers or clustering, or even non-linear dependencies.
We'll start by showing PCA on the simulated data from @sec-dimension-overview. The scree plots are produced using the `mulgar::ggscree()` function, and include a grey guideline to help decide how many PCs are sufficient. This guideline is generated by taking the median of the eigenvalues obtained by doing PCA on 100 samples from a standard multivariate normal distribution. Any values much lower than this line indicate that those PCs are not contributing to the explanation of variation. For these three simulated examples, the scree plots show that PCA correctly identifies the data as 2D, 3D and 5D respectively.
\index{dimension reduction!scree plot}
```{r echo=knitr::is_html_output()}
#| message: FALSE
#| error: FALSE
#| warning: FALSE
#| code-summary: "Code to make scree plots"
# Conduct PCA and make the scree plot for
# the 2-, 3- and 5-D planar data
library(dplyr)
library(ggplot2)
library(mulgar)
data(plane)
data(box)
library(geozoo)
cube5d <- data.frame(cube.solid.random(p=5, n=300)$points)
colnames(cube5d) <- paste0("x", 1:5)
cube5d <- data.frame(apply(cube5d, 2,
function(x) (x-mean(x))/sd(x)))
p_pca <- prcomp(plane)
b_pca <- prcomp(box)
c_pca <- prcomp(cube5d)
p_scree <- ggscree(p_pca, q = 5) + theme_minimal()
b_scree <- ggscree(b_pca, q = 5) + theme_minimal()
c_scree <- ggscree(c_pca, q = 5) + theme_minimal()
```
```{r}
#| label: fig-2D-pca
#| echo: false
#| fig-width: 9
#| fig-height: 3
#| out-width: 100%
#| fig-cap: "Scree plots for the three simulated data sets shown in Figure 3.2. The 2D in 5D is clearly recognised by PCA to be 2D because the variance drops substantially between 2-3 principal components. The 3D in 5D is possibly 3D because the variance drops from 3-4 principal components. The fully 5D data has no drop in variance, and all values are close to the typical value one would observe if the data was fully 5D."
library(patchwork)
p_scree + b_scree + c_scree + plot_layout(ncol=3)
```
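The grey guide line in these scree plots can be mimicked directly. A sketch of the idea (the exact computation inside `ggscree()` may differ in detail):
```{r}
#| eval: false
#| code-fold: false
# Sketch of the guide: median eigenvalues from PCA on 100 samples
# of p=5 standard normal variables, each with n=100 observations
set.seed(1)
evals <- replicate(100,
  prcomp(matrix(rnorm(100 * 5), ncol = 5))$sdev^2)
apply(evals, 1, median)  # one guide value per PC
```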
The next step is to look at the coefficients for the selected number of PCs. @tbl-plane-pcs shows the coefficients for the first two PCs of the `plane` data. All five variables contribute, with `x1`, `x2`, `x3` contributing more to `PC1`, and `x4`, `x5` contributing more to `PC2`. @tbl-box-pcs shows the coefficients for the first three PCs of the `box` data. Variables `x1`, `x2`, `x3` contribute strongly to `PC1`; `PC2` has contributions from all variables except `x3`; and variables `x4` and `x5` contribute strongly to `PC3`.
\index{dimension reduction!coefficients}
\index{dimension reduction!principal components}
```{r echo=knitr::is_html_output()}
#| label: tbl-plane-pcs
#| tbl-cap: "Coefficients for the first two PCs for the plane data."
#| code-summary: "Code to print PC coefficients"
library(gt)
p_pca$rotation[,1:2] %>%
as_tibble(rownames="Variable") %>%
gt() %>%
fmt_number(columns = c(PC1, PC2),
decimals = 2)
```
```{r echo=knitr::is_html_output()}
#| label: tbl-box-pcs
#| tbl-cap: "Coefficients for the first three PCs for the box data."
#| code-summary: "Code to print PC coefficients"
b_pca$rotation[,1:3] %>%
as_tibble(rownames="Variable") %>%
gt() %>%
fmt_number(columns = c(PC1, PC2, PC3),
decimals = 2)
```
In each of these simulated data sets, all five variables contributed to the dimension reduction. If we added two purely noise variables to the plane data, as done in @sec-dimension-overview, the scree plot in @fig-plane-noise-scree would indicate that the data is now 4D, and we would get a different interpretation of the coefficients from the PCA, see @tbl-plane-noise-pcs. We see that `PC1` and `PC2` are approximately the same as before, with the main variables being (`x1`, `x2`, `x3`) and (`x4`, `x5`) respectively. `PC3` and `PC4` are composed primarily of `x6` and `x7`.
```{r echo=knitr::is_html_output()}
#| label: fig-plane-noise-scree
#| fig-cap: Additional noise variables expands the plane data to 4D.
#| code-fold: false
#| fig-width: 6
#| fig-height: 4
#| out-width: 80%
set.seed(5143)
plane_noise <- plane
plane_noise$x6 <- rnorm(100)
plane_noise$x7 <- rnorm(100)
plane_noise <- data.frame(apply(plane_noise, 2, function(x) (x-mean(x))/sd(x)))
pn_pca <- prcomp(plane_noise)
ggscree(pn_pca, q = 7) + theme_minimal()
```
```{r echo=knitr::is_html_output()}
#| label: tbl-plane-noise-pcs
#| tbl-cap: "Coefficients for PCs 1-4 of the plane plus noise data."
#| code-summary: "Code to print PC coefficients"
pn_pca$rotation[,1:4] %>%
as_tibble(rownames="Variable") %>%
gt() %>%
fmt_number(columns = c(PC1, PC2, PC3, PC4),
decimals = 2)
```
### Example: pisa
\index{data!pisa}
The `pisa` data contains simulated math, reading and science scores, totalling 30 variables. PCA is used here to examine the association between them. We might expect the data to be 3D, but what we see suggests it is primarily 1D. This means that a student who scores well in math will likely also score well in reading and science.
```{r}
#| code-fold: false
data(pisa)
pisa_std <- pisa %>%
filter(CNT == "Australia") %>%
select(-CNT) %>%
mutate_all(mulgar:::scale2)
pisa_pca <- prcomp(pisa_std)
pisa_scree <- ggscree(pisa_pca, q = 15) + theme_minimal()
```
The scree plot in `r ifelse(knitr::is_html_output(), '@fig-pisa-pca-html', '@fig-pisa-pca-pdf')` shows a big drop from one to two PCs in the amount of variance explained. A grand tour on the 30 variables can be run using `animate_xy()`:
```{r eval=FALSE}
#| code-fold: false
animate_xy(pisa_std, half_range=1)
```
or rendered as an animated gif using `render_gif()`:
```{r eval=FALSE}
#| code-fold: false
render_gif(pisa_std,
grand_tour(),
display_xy(half_range=0.9),
gif_file="gifs/pisa_gt.gif",
frames=500,
width=400,
height=400,
loop=FALSE)
```
and we can see that the data is elliptical in most projections, sometimes shrinking to be a small circle. This pattern strongly indicates that there is one primary direction of variation in the data, with only small variation in any direction away from it. Shrinking to a small circle is analogous to how *a pencil or cigar or water bottle in 3D looks from some angles*.
::: {.content-visible when-format="html"}
::: {#fig-pisa-pca-html fig-align="center" layout-ncol=2}
```{r}
#| label: fig-pisa-scree-html
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Scree plot"
#| fig-width: 6
#| fig-height: 4
#| out-width: 80%
#| echo: false
pisa_scree
```
![Grand tour](gifs/pisa_gt.gif){#fig-pisa-gt fig-alt="Tour showing lots of linear projections of the pisa data. You can see strong linear dependence." fig.align="center"}
Scree plot and tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-pisa-pca-pdf fig-align="center" layout-ncol=2}
![Scree plot](images/fig-pisa-scree-1.png){#fig-pisa-scree-pdf fig-alt="FIX ME"}
![Grand tour frame](images/pisa_gt_249.png){#fig-pisa-gt fig-alt="Selected linear projection of the pisa data from a grand tour. You can see strong linear dependence." fig.align="center"}
Scree plot and a frame from a tour of the `pisa` data, with 30 variables being the plausible scores for Australian students. In combination, these suggest that the data is effectively 1D. {{< fa play-circle >}}
:::
:::
The coefficients of the first PC (first eigenvector) are roughly equal in magnitude (as shown below), which tells us that all variables contribute roughly equally. Interestingly, they are all negative, which is not actually meaningful: with different software these could easily have been all positive. The sign of the coefficients can be reversed, as long as all of them are reversed; it is the same as an arrow that points one way being flipped to point the other way.
```{r}
#| code-summary: "Code to print PC coefficients"
round(pisa_pca$rotation[,1], 2)
```
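If the negative signs are distracting, a PC can be flipped, provided the coefficients and the scores are flipped together. A small sketch (not run here):
```{r}
#| eval: false
#| code-fold: false
# Flip the sign of PC1 in both the coefficients and the scores;
# the fitted model is unchanged
pisa_pca$rotation[, 1] <- -pisa_pca$rotation[, 1]
pisa_pca$x[, 1] <- -pisa_pca$x[, 1]
```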
::: {.content-visible when-format="html"}
::: insight
The tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math, probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is "synthetic", simulated rather than observed.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{The tour verifies that the `pisa` data is primarily 1D, indicating that a student who scores well in math, probably scores well in reading and science, too. More interestingly, the regular shape of the data strongly indicates that it is "synthetic", simulated rather than observed.}
:::
### Example: aflw
\index{data!aflw}
This data has player statistics for all the matches in the 2021 season. We would be interested to know which variables contain similar information, and thus might be combined into single variables. We would expect that many statistics group into a few small sets, such as offensive and defensive skills. We might also expect that some of the statistics are skewed: most players have low values and just a handful of players are stellar. It is also possible that there are some extreme values. These are interesting features, but they would distract from the main purpose of grouping the statistics. Thus the tour is used to check for potential problems with the data prior to conducting PCA.
```{r}
#| label: pca-libraries
#| eval: TRUE
#| echo: TRUE
#| message: FALSE
#| error: FALSE
#| warning: FALSE
#| code-fold: false
library(tourr)
data(aflw)
aflw_std <- aflw %>%
mutate_if(is.numeric, function(x) (x-
mean(x, na.rm=TRUE))/
sd(x, na.rm=TRUE))
```
We look at all `r ncol(aflw_std[,7:35])` player statistics in a grand tour, shown in `r ifelse(knitr::is_html_output(), '@fig-aflw-gt-html', '@fig-aflw-gt-pdf')`.
```{r}
#| label: aflw-gt
#| eval: false
#| code-summary: "Code to generate tour"
animate_xy(aflw_std[,7:35], half_range=0.9)
render_gif(aflw_std[,7:35],
grand_tour(),
display_xy(half_range=0.9),
gif_file="gifs/aflw_gt.gif",
frames=500,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-aflw-gt-html}
![](gifs/aflw_gt.gif){fig-alt="Tour showing lots of linear projections of the aflw data. You can see linear dependence, and some outliers." fig.align="center"}
Grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are "average"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-aflw-gt-pdf fig.align="center" fig-alt="Example linear projection of the aflw data from a grand tour. You can see linear dependence, and some outliers." layout-ncol=2}
![](images/aflw_gt_70.png){width=228}
![](images/aflw_gt_329.png){width=228}
Two frames from a grand tour of the AFLW player statistics. Most player statistics concentrate near the centre, indicating most players are "average"! There are a few outliers appearing in different combinations of the skills, which one would expect to be the star players for particular skill sets. {{< fa play-circle >}}
:::
:::
No major surprises! There is a small amount of skewness, and there are no major outliers. Skewness indicates that most players have reasonably similar skills (bunching of points), except for some key players (the moderate outliers). The skewness could be reduced by applying a log or square root transformation to some variables prior to running the PCA. However, we elect not to do this because the moderate outliers are of interest. These correspond to talented players that we'd like to explore further with the analysis.
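If the skewness were a problem, a transformation step could precede the standardisation. A sketch only (not run; `clearances` is picked arbitrarily for illustration):
```{r}
#| eval: false
#| code-fold: false
# Sketch: reduce right-skewness with a log transform before
# standardising; log1p handles zero counts gracefully
aflw_tf <- aflw %>%
  mutate(clearances = log1p(clearances)) %>%
  mutate_if(is.numeric, function(x) (x -
             mean(x, na.rm=TRUE))/
             sd(x, na.rm=TRUE))
```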
Below we have the conventional summary of the PCA: a scree plot showing the reduction in variance to be explained as each additional PC is considered. It is also conventional to look at a table summarising the proportions of variance explained by the PCs, but with almost 30 variables it is easier to decide on the number of PCs needed from the scree plot.
```{r echo=knitr::is_html_output()}
#| label: fig-aflw-pca
#| fig-cap: "Scree plot showing decay in variance of PCs. There are sharp drops for the first four PCs, and then smaller declines."
#| alt-text: "Scree plot showing variance vertically against PC number horizontally. Variance drops from close to 10 for PC 1 to about 1.2 for PC 4 then slowly decays through to PC 29."
#| fig-width: 6
#| fig-height: 4
#| out-width: 80%
#| code-summary: "Code to make screeplot"
aflw_pca <- prcomp(aflw_std[,7:35],
scale = FALSE,
retx=TRUE)
ggscree(aflw_pca, q = 29) + theme_minimal()
```
\index{dimension reduction!scree plot}
From the scree plot in @fig-aflw-pca, we see sharp drops from one to two PCs and from two to three, then smaller drops. After four PCs there is another small drop at six PCs, followed by a gradual decay. We will choose four PCs to examine more closely. This explains `r round(summary(aflw_pca)$importance[3,4]*100,1)`% of the variance.
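For reference, the conventional table of variance proportions can be printed for the first few PCs:
```{r}
#| eval: false
#| code-fold: false
# Standard deviation, proportion of variance and cumulative
# proportion for the first six PCs
summary(aflw_pca)$importance[, 1:6]
```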
```{r echo=knitr::is_html_output()}
#| label: tbl-aflw-pcs
#| tbl-cap: "Coefficients for the first four PCs. PC 1 contrasts some with PC 1, with the first having large coefficients primarily on field play statistics, and the second having large coefficients on the scoring statistics."
#| code-summary: "Code to print PC coefficients"
library(gt)
aflw_pca$rotation[,1:4] %>%
as_tibble(rownames="Variable") %>%
arrange(desc(PC1), desc(PC2), desc(PC3)) %>%
gt() %>%
fmt_number(columns = c(PC1, PC2, PC3, PC4),
decimals = 2)
```
When there are this many variables, it can be hard to digest the combinations of variables contributing most to each PC. Rearranging the table by sorting on a selected PC can help. @tbl-aflw-pcs has been sorted according to the PC 1 coefficients.
PC 1 is primarily composed of `disposals`, `possessions`, `kicks`, `metres`, `uncontested`, `contested`, ... primarily the field play statistics! It is quite common in PCA for the first PC to be a combination of all variables, which suggests that there is one main direction of variation in the data. Here it is not quite that: PCA suggests that the primary variation is through a combination of field skills, or basic football playing skills.
The second PC contrasts with the first, being primarily a combination of `shots`, `goals`, `marks_in50`, `accuracy`, and `behinds`, contrasted against `rebounds_in50` and `intercepts`. The positive coefficients are primarily offensive skills and the negative coefficients are defensive skills. This PC is a reasonable measure of the offensive versus defensive skills of a player.
\index{dimension reduction!interpretation}
We could continue to interpret each PC by examining large coefficients, to help decide how many PCs make a suitable summary of the information in the data. Briefly, PC 3 is mixed, but it is possibly a measure of a player's worth, because `time_pct` has a large coefficient: players that are on the field longer will contribute strongly to this new variable. It also has large (and opposite) contributions from `clearances`, `tackles` and `contested_marks`. PC 4 appears to be related to aggressive play, with `clangers`, `turnovers`, `bounces` and `frees_against` featuring. All four PCs have useful information.
For deeper exploration, when we tour the four PCs, we'd like to be able to stop and identify players. This can be done by creating a pre-computed animation, with additional mouse-over, made using `plotly`. However, it is not size-efficient and is only feasible with a small number of observations, because all of the animation frames, each containing the fully projected data, are composed into a single object, which gets large very quickly.
::: {.content-visible when-format="html"}
The result is shown in @fig-aflw-pcatour. We can see that the shape of the four PCs is similar to that of all the variables: bunching of points in the centre with a lot of moderate outliers.
:::
```{r echo=knitr::is_html_output()}
#| label: aflw-plotly
#| eval: false
#| code-summary: "Code to make tour animation"
library(plotly)
library(htmlwidgets)
set.seed(20)
b <- basis_random(4, 2)
aflw_pct <- tourr::save_history(aflw_pca$x[,1:4],
tour_path = grand_tour(),
start = b,
max_bases = 5)
# To reconstruct projected data plots, later
save(aflw_pct, file="data/aflw_pct.rda")
aflw_pcti <- interpolate(aflw_pct, 0.1)
aflw_anim <- render_anim(aflw_pca$x[,1:4],
frames=aflw_pcti,
obs_labels=paste0(aflw$surname,
aflw$given_name))
aflw_gp <- ggplot() +
geom_path(data=aflw_anim$circle,
aes(x=c1, y=c2,
frame=frame), linewidth=0.1) +
geom_segment(data=aflw_anim$axes,
aes(x=x1, y=y1,
xend=x2, yend=y2,
frame=frame),
linewidth=0.1) +
geom_text(data=aflw_anim$axes,
aes(x=x2, y=y2,
frame=frame,
label=axis_labels),
size=5) +
geom_point(data=aflw_anim$frames,
aes(x=P1, y=P2,
frame=frame,
label=obs_labels),
alpha=0.8) +
xlim(-1,1) + ylim(-1,1) +
coord_equal() +
theme_bw() +
theme(axis.text=element_blank(),
axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank())
aflw_pctour <- ggplotly(aflw_gp,
width=500,
height=550) %>%
animation_button(label="Go") %>%
animation_slider(len=0.8, x=0.5,
xanchor="center") %>%
animation_opts(easing="linear", transition = 0)
htmlwidgets::saveWidget(aflw_pctour,
file="html/aflw_pca.html",
selfcontained = TRUE)
```
::: {.content-visible when-format="html"}
::: {#fig-aflw-pcatour}
<iframe width="750" height="550" src="html/aflw_pca.html" title="Animation of four PCs of the aflw data with interactive labelling. "></iframe>
Animation of four PCs of the aflw data with interactive labelling.
:::
:::
```{r}
#| eval: false
#| echo: false
animate_pca(aflw_pca$x[,1:5],
pc_coefs = aflw_pca$rotation[,1:5],
col = "orange",
pch = 16,
cex = 0.5,
half_range=1.5)
animate_xy(aflw_pca$x[,1:5])
```
```{r echo=knitr::is_html_output()}
#| message: false
#| warning: false
#| code-summary: "Code to generate interactive plot of frame 18"
load("data/aflw_pct.rda")
aflw_pcti <- interpolate(aflw_pct, 0.1)
f18 <- matrix(aflw_pcti[,,18], ncol=2)
p18 <- render_proj(aflw_pca$x[,1:4], f18,
obs_labels=paste0(aflw$surname,
aflw$given_name))
pg18 <- ggplot() +
geom_path(data=p18$circle, aes(x=c1, y=c2)) +
geom_segment(data=p18$axes, aes(x=x1, y=y1, xend=x2, yend=y2)) +
geom_text(data=p18$axes, aes(x=x2, y=y2, label=rownames(p18$axes))) +
geom_point(data=p18$data_prj, aes(x=P1, y=P2, label=obs_labels)) +
xlim(-1,1) + ylim(-1, 1) +
#ggtitle("Frame 18") +
theme_bw() +
theme(
axis.text=element_blank(),
axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank())
```
::: {.content-visible when-format="html"}
```{r eval=knitr::is_html_output()}
#| label: fig-aflw-pcaplots-html
#| message: false
#| warning: false
#| fig-cap: "Frame 18 re-plotted so that players can be identified on mouse-over."
#| fig-width: 5
#| fig-height: 4
#| out-width: 80%
library(plotly)
ggplotly(pg18, width=500, height=500)
```
:::
::: {.content-visible when-format="pdf"}
```{r eval=knitr::is_latex_output()}
#| label: fig-aflw-pcaplots-pdf
#| echo: false
#| message: false
#| warning: false
#| fig-cap: "Frame 18 re-plotted so that players can be identified. Here some players are labelled, but ideally this plot is interactive and any player can be identified. {{< fa play-circle >}}"
#| fig-width: 5
#| fig-height: 4
#| out-width: 60%
p18_labels <- p18[["data_prj"]] %>% as_tibble() %>% filter(obs_labels %in% c("AhrensLauren", "LivingstoneStacey", "ParkerAlyce", "BonniciBrittany", "HookerDana", "BowersKiara"))
pg18 + theme(aspect.ratio=1, legend.position="none") +
geom_point(data=p18_labels, aes(x=P1, y=P2,
label=obs_labels,
colour=obs_labels), shape=7) +
geom_text(data=p18_labels, aes(x=P1, y=P2,
label=obs_labels,
colour=obs_labels),
vjust = 1.2)
```
:::
For any particular frame, like frame 18 re-plotted in `r ifelse(knitr::is_html_output(), '@fig-aflw-pcaplots-html', '@fig-aflw-pcaplots-pdf')`, we can investigate further. Here there is a branching pattern, where the branch points in the direction of PC 1. Mousing over the players at the tip of this branch, we find players like Alyce Parker, Brittany Bonnici, Dana Hooker and Kiara Bowers. If you look up the bios of these players you'll find they all have generally good player descriptions like "elite disposals", "powerful left foot", "hard-running midfielder", "best and fairest".
In the direction of PC 2, you'll find players like Lauren Ahrens and Stacey Livingstone, who are star defenders. Players at this end of PC 2 have high scores on `intercepts` and `rebounds_in50`.
Another interesting frame for inspecting PC 2 is frame 59. PC 2 at one end has players with strong goal-scoring skills, and at the other, strong defensive skills. So mousing over the other end of PC 2 finds players like Gemma Houghton and Katie Brennan, who are known for their goal scoring. The branching pattern is an interesting one, because it tells us that some combinations of skills are lacking among all players; primarily, there appears to be a distinction between defensive skills and general playing skills. It's not quite as simple as this, though, because the branching is only visible when PC 1 and PC 2 are examined together with PC 3.
PCA is useful for getting a sense of the variation in a high-dimensional data set. Interpreting the principal components is often useful, but it can be discombobulating. For the `aflw` data it would be good to think of PCA as a guide to the main directions of variation, and to follow it with more direct engineering of variables into interesting player characteristics. For example, calculate offensive skill as an equal combination of goals, accuracy, shots and behinds, as sketched below. A set of new variables specifically computed to measure particular skills would make explaining an analysis easier.
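A small sketch of this kind of variable engineering (the equal weights and the choice of variables are just one possibility):
```{r}
#| eval: false
#| code-fold: false
# Hand-crafted offensive skill score as an equal-weight combination
# of the standardised scoring statistics
aflw_skills <- aflw_std %>%
  mutate(offense = (goals + accuracy + shots + behinds) / 4)
```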
::: {.content-visible when-format="html"}
::: insight
The tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{The tour verifies that PCA on the `aflw` data is complicated and doesn't capture all of the variation. However, it does provide useful insights. It detected outstanding players, and indicated the different skill sets of top goal scorers and top defensive players.}
:::
## Examining the PCA model in the data space
\index{model-in-the-data-space}
When you choose a smaller number of PCs ($k$) than the number of original variables, this is essentially producing a model for the data. The model is the lower-dimensional $k$-D space. It is analogous to a linear regression model, except that the residuals from the model are $(p-k)$-D.
It is common to show the model, that is, the data projected into the $k$-D model space. When $k=2$ this is called a "biplot". For the `plane` and `plane_noise` data the biplots are shown in @fig-plane-biplot. This is useful for checking which variables contribute most to the new principal component variables, and also for checking for any problems that might have affected the fit, such as outliers, clusters or non-linearity. Interestingly, biplots are typically only made in 2D, even if the data should be summarised by more than two PCs. Occasionally you will see a biplot made for PC $j$ vs PC $k$ also. With the `pca_tour()` function in the `tourr` package you can view a $k$-D biplot. This will display the $k$ PCs with the axes displaying the original variables, and thus show their contribution to the PCs.
```{r echo=knitr::is_html_output()}
#| label: fig-plane-biplot
#| fig-cap: "Biplots of the plane (a) and plane + noise (b) data. All five variables contribute strongly to the two principal components in (a): PC1 is primarily `x1`, `x2` and `x3` and PC2 is primarily `x4` and `x5`. In (b) the same four variables contribute in almost the same way, with variables `x6` and `x7` contributing very little. The data was constructed this way, that these two dimensions were purely noise."
#| code-summary: Code for biplots
#| fig-width: 8
#| fig-height: 4
#| out-width: 100%
library(ggfortify)
library(patchwork)
plane_pca <- prcomp(plane)
pl1 <- autoplot(plane_pca, loadings = TRUE,
loadings.label = TRUE) +
ggtitle("(a)") +
theme_minimal() +
theme(aspect.ratio=1)
plane_noise_pca <- prcomp(plane_noise)
pl2 <- autoplot(plane_noise_pca, loadings = TRUE,
loadings.label = TRUE) +
ggtitle("(b)") +
theme_minimal() +
theme(aspect.ratio=1)
pl1 + pl2
```
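A $k$-D biplot can also be viewed as a tour. A sketch for the `plane` data (assuming the `animate_pca()` function and its `pc_coefs` argument, following the usage elsewhere in this book's code):
```{r}
#| eval: false
#| code-fold: false
# Tour the first four PCs, with axes showing the contributions
# of the original variables to the PCs
animate_pca(plane_pca$x[, 1:4],
            pc_coefs = plane_pca$rotation[, 1:4],
            half_range = 1.5)
```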
It can be useful to examine this model using the tour. The model is simply a plane in high dimensions. This would be considered to be the model in the data space. The reason to do this is to check how well the model fits the data. The plane corresponding to the model should be oriented along the main direction of the points, and the spread of points around the plane should be small. We should also be able to see if there has been any strong non-linear relationship missed by the model, or outliers and clusters.
The function `pca_model()` from the `mulgar` package can be used to represent the model as a $k$-D wire-frame plane. `r ifelse(knitr::is_html_output(), '@fig-plane-box-model-html', '@fig-plane-box-model-pdf')` shows the models for the `plane` and `box` data, 2D and 3D respectively.
::: {.content-visible when-format="html"}
::: info
We look at the model in the data space to check how well the model fits the data. If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{We look at the model in the data space to check how well the model fits the data. If it fits well, the points will cluster tightly around the model representation, with little spread in other directions.}
:::
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: Code for model-in-the-data
plane_m <- pca_model(plane_pca)
plane_m_d <- rbind(plane_m$points, plane)
animate_xy(plane_m_d, edges=plane_m$edges,
axes="bottomleft",
edges.col="#E7950F",
edges.width=3)
render_gif(plane_m_d,
grand_tour(),
display_xy(half_range=0.9,
edges=plane_m$edges,
edges.col="#E7950F",
edges.width=3),
gif_file="gifs/plane_model.gif",
frames=500,
width=400,
height=400,
loop=FALSE)
box_pca <- prcomp(box)
box_m <- pca_model(box_pca, d=3)
box_m_d <- rbind(box_m$points, box)
animate_xy(box_m_d, edges=box_m$edges,
axes="bottomleft", edges.col="#E7950F", edges.width=3)
render_gif(box_m_d,
grand_tour(),
display_xy(half_range=0.9,
edges=box_m$edges,
edges.col="#E7950F",
edges.width=3),
gif_file="gifs/box_model.gif",
frames=500,
width=400,
height=400,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-plane-box-model-html fig-align="center" layout-ncol=2}
![Model for the 2D in 5D data.](gifs/plane_model.gif){#fig-plane-model fig-alt="FIX ME." fig.align="center"}
![Model for the 3D in 5D data.](gifs/box_model.gif){#fig-box-model fig-alt="FIX ME." fig.align="center"}
PCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-plane-box-model-pdf fig-align="center" layout-ncol=2}
![Model for the 2D in 5D data.](images/plane_model_55.png){#fig-plane-model fig-alt="FIX ME." fig.align="center"}
![Model for the 3D in 5D data.](images/box_model_13.png){#fig-box-model fig-alt="FIX ME." fig.align="center"}
PCA model overlaid on the data for the 2D in 5D, and 3D in 5D simulated data. {{< fa play-circle >}}
:::
:::
### Example: pisa
\index{data!pisa}
The model for the `pisa` data is a 1D vector, shown in `r ifelse(knitr::is_html_output(), '@fig-pisa-model-html', '@fig-pisa-model-pdf')`. In this example there is good agreement between the model and the data.
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: Code for model-in-the-data
pisa_model <- pca_model(pisa_pca, d=1, s=2)
pisa_all <- rbind(pisa_model$points, pisa_std)
animate_xy(pisa_all, edges=pisa_model$edges,
edges.col="#E7950F", edges.width=3)
render_gif(pisa_all,
grand_tour(),
display_xy(half_range=0.9,
edges=pisa_model$edges,
edges.col="#E7950F",
edges.width=5),
gif_file="gifs/pisa_model.gif",
frames=500,
width=400,
height=400,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-pisa-model-html}
![](gifs/pisa_model.gif){fig-alt="Something here" fig.align="center"}
PCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-pisa-model-pdf fig-alt="Something here" fig.align="center"}
![](images/pisa_model_17.png){width=300}
PCA model of the `pisa` data. The 1D model captures the primary variation in the data and there is a small amount of spread in all directions away from the model. {{< fa play-circle >}}
:::
:::
::: {.content-visible when-format="html"}
::: insight
The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{The `pisa` data fits fairly closely to the 1D PCA model. The variance of points away from the model is symmetric and relatively small. This suggests the 1D model is a reasonable summary of the test scores.}
:::
### Example: aflw
\index{data!aflw}
It is less useful to examine the PCA model for the `aflw` data, because the main patterns of interest were the exceptional players. However, we will do it anyway! `r ifelse(knitr::is_html_output(), '@fig-aflw-model-html', '@fig-aflw-model-pdf')` shows the 4D PCA model overlaid on the data. Even though the distribution of points is not as symmetric and balanced as in the other examples, we can see that the cube structure mirrors the variation. We can also see that the relationships between variables are not strictly linear, because the spread extends unevenly away from the box.
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: Code for model-in-the-data
aflw_model <- pca_model(aflw_pca, d=4, s=1)
aflw_all <- rbind(aflw_model$points, aflw_std[,7:35])
animate_xy(aflw_all, edges=aflw_model$edges,
edges.col="#E7950F",
edges.width=3,
half_range=0.8,
axes="off")
render_gif(aflw_all,
grand_tour(),
display_xy(half_range=0.8,
edges=aflw_model$edges,
edges.col="#E7950F",
edges.width=3,
axes="off"),
gif_file="gifs/aflw_model.gif",
frames=500,
width=400,
height=400,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-aflw-model-html}
![](gifs/aflw_model.gif){ fig-alt="Something here" fig.align="center"}
PCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. However, the model roughly captures the linear associations, and leaves unexplained and unequal variation in different directions.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-aflw-model-pdf fig-alt="Something here" fig.align="center"}
![](images/aflw_model_70.png){width=300}
PCA model of the `aflw` data. The linear model is not ideal for this data, which has other patterns like outliers, and some branching. However, the model roughly captures the linear associations, and leaves unexplained variation in different directions. {{< fa play-circle >}}
:::
:::
::: {.content-visible when-format="html"}
::: insight
From the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is some larger variation away from the model in some combinations of variables than others.
:::
:::
::: {.content-visible when-format="pdf"}
\insightbox{From the tour we see that the 4D model leaves substantial variation unexplained. It is also not symmetric, and there is some larger variation away from the model in some combinations of variables than others.}
:::
## When relationships are not linear
### Example: outliers
\index{outliers}
@fig-plane-n-o-scree shows the scree plot for the planar data with noise and outliers. It is very similar to the scree plot for the data without the outliers (@fig-plane-noise-scree). However, what we see from `r ifelse(knitr::is_html_output(), '@fig-p-o-pca-html', '@fig-p-o-pca-pdf')` is that PCA loses the outliers. The animation in (a) shows the full data, where the outliers, marked by colour and the labels 1 and 2, are clearly unusual in some projections. When we examine the tour of the first four PCs (as suggested by the scree plot), the outliers are no longer unusual; they are almost contained in the point cloud. The reason becomes clear when all the PCs are plotted: the outliers are clearly detectable only in PC5, PC6 and PC7.
```{r}
#| echo: false
plane_noise_outliers <- plane_noise
plane_noise_outliers[101,] <- c(2, 2, -2, 0, 0, 0, 0)
plane_noise_outliers[102,] <- c(0, 0, 0,-2, -2, 0, 0)
```
```{r}
#| label: fig-plane-n-o-scree
#| code-summary: Code for screeplot
#| fig-cap: "Scree plot of the planar data with noise and an outlier. It is almost the same as the data without the outliers."
#| fig-width: 6
#| fig-height: 4
#| out-width: 80%
plane_n_o_pca <- prcomp(plane_noise_outliers)
ggscree(plane_n_o_pca, q = 7) + theme_minimal()
```
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: "Code to generate tours"
clrs <- hcl.colors(12, "Zissou 1")
p_col <- c(rep("black", 100), clrs[11], clrs[11])
p_obs_labels <- c(rep("", 100), "1", "2")
animate_xy(plane_n_o_pca$x[,1:4],
col=p_col,
obs_labels=p_obs_labels)
animate_xy(plane_noise_outliers,
col=p_col,
obs_labels=p_obs_labels)
render_gif(plane_noise_outliers,
grand_tour(),
display_xy(half_range=0.8,
col=p_col,
obs_labels=p_obs_labels),
gif_file="gifs/plane_n_o_clr.gif",
frames=500,
width=200,
height=200,
loop=FALSE)
render_gif(plane_n_o_pca$x[,1:4],
grand_tour(),
display_xy(half_range=0.8,
col=p_col,
obs_labels=p_obs_labels),
gif_file="gifs/plane_n_o_pca.gif",
frames=500,
width=200,
height=200,
loop=FALSE)
```
::: {.content-visible when-format="html"}
::: {#fig-p-o-pca-html fig-align="center" layout-ncol=2}
![Outliers clearly visible](gifs/plane_n_o_clr.gif){#fig-plane-n-o-clr width=250}
![Outliers not clearly visible in PC1-4](gifs/plane_n_o_pca.gif){#fig-plane-n-o-pca width=250}
Examining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-p-o-pca-pdf fig-align="center" layout-ncol=2}
![Outliers clearly visible](images/plane_n_o_clr_181.png){#fig-plane-n-o-clr width=210}
![Outliers not clearly visible in PC1-4](images/plane_n_o_pca_181.png){#fig-plane-n-o-pca width=210}
Examining the handling of outliers in the PCA of the planar data with noise variables and two outliers. PCA has lost these two extreme values. {{< fa play-circle >}}
:::
:::
```{r echo=knitr::is_html_output()}
#| label: fig-plane-o-n-pairs
#| message: false
#| warning: false
#| fig-width: 5
#| fig-height: 5
#| out-width: 80%
#| fig-cap: "From the scatterplot matrix we can see that the outliers are present in PC5, PC6 and PC7. That means by reducing the dimensionality to the first four PCs the model has missed some important characteristics in the data."
#| code-summary: "Code to make scatterplot matrix"
library(GGally)
ggscatmat(plane_n_o_pca$x) + theme_minimal() +
theme(axis.title = element_blank(),
axis.text = element_blank())
```
### Example: Non-linear associations
\index{nonlinearity}
`r ifelse(knitr::is_html_output(), '@fig-plane-nonlin-html', '@fig-plane-nonlin-pdf')` shows the tour of the full 5D data containing non-linear relationships, in comparison with a tour of the first three PCs, as recommended by the scree plot (@fig-plane-nonlin-scree). The PCs capture a clear and very clean non-linear relationship, but it looks like they have missed some of the complexities of the relationships. The scatterplot matrix of all five PCs (@fig-plane-nonlin-pairs) shows that PC4 and PC5 contain interesting features: more non-linearity, and curiously, an outlier.
```{r echo=knitr::is_html_output()}
#| label: fig-plane-nonlin-scree
#| code-summary: Code for screeplot
#| fig-cap: "Scree plot of the non-linear data suggests three PCs."
#| fig-width: 6
#| fig-height: 4
#| out-width: 80%
data(plane_nonlin)
plane_nonlin_pca <- prcomp(plane_nonlin)
ggscree(plane_nonlin_pca, q = 5) + theme_minimal()
```
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: "Code to generate tour"
animate_xy(plane_nonlin_pca$x[,1:3])
render_gif(plane_nonlin_pca$x[,1:3],
grand_tour(),
display_xy(half_range=0.8),
gif_file="gifs/plane_nonlin_pca.gif",
frames=500,
width=200,
height=200)
```
::: {.content-visible when-format="html"}
::: {#fig-plane-nonlin-html fig-align="center" layout-ncol=2}
![All five variables](gifs/plane_nonlin.gif){#fig-nonlinear2 width=230}
![First three PCs](gifs/plane_nonlin_pca.gif){#fig-plane-nonlin-pca width=230}
Comparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-plane-nonlin-pdf fig-align="center" layout-ncol=2}
![All five variables](images/plane_nonlin_61.png){#fig-nonlinear2 width=210}
![First three PCs](images/plane_nonlin_pca_129.png){#fig-plane-nonlin-pca width=210}
Comparison of the full data and first three principal components. Non-linear relationships between several variables can be seen in a tour on all five variables. The first three principal components reveal a strong non-linear relationship. Some of the non-linearity is clearly visible in the reduced dimension space, but the full data has more complexities. {{< fa play-circle >}}
:::
:::
```{r echo=knitr::is_html_output()}
#| label: fig-plane-nonlin-pairs
#| message: false
#| warning: false
#| fig-width: 5
#| fig-height: 5
#| out-width: 80%
#| fig-cap: "From the scatterplot matrix we can see that the there is a non-linear relationship visible in PC1 and PC2, with perhaps a small contribution from PC3. However, we can see that when the data is reduced to three PCs, it misses catching all on the non-linear relationships and also interestingly it seems that there is an unusual observation also."
#| code-summary: "Code to make scatterplot matrix"
ggscatmat(plane_nonlin_pca$x) +
theme_minimal() +
theme(axis.title = element_blank(),
axis.text = element_blank())
```
::: {.content-visible when-format="html"}
::: info
One of the dangers of PCA is that interesting and curious details of the data only emerge in the last PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.
:::
:::
::: {.content-visible when-format="pdf"}
\infobox{One of the dangers of PCA is that interesting and curious details of the data only emerge in the last PCs, which are usually discarded. The tour, and examining the smaller PCs, can help to discover them.}
:::
## Exercises {-}
1. Make a scatterplot matrix of the first four PCs of the `aflw` data. Is the branch pattern visible in any pair?
2. Construct five new variables to measure these skills: offense, defense, playing time, ball movement, and errors. Using the tour, examine the relationships between these variables. Map out how a few players could be characterised based on these directions of skills.
3. Symmetrise any `aflw` variables that have skewed distributions using a log or square root transformation. Then re-do the PCA. What do we learn that is different about associations between the skill variables?
4. Examine the `bushfires` data using a grand tour on the numeric variables, ignoring the `cause` (class) variable. Note any issues such as outliers, or skewness that might affect PCA. How many principal components would be recommended by the scree plot? Examine this PCA model with the data, and explain how well it does or doesn't fit.
5. Use `pca_tour()` to examine the first five PCs of the `bushfires` data. How do all of the variables contribute to this reduced space?
6. Reduce the dimension of the `sketches` data to 12 PCs. How much variation does this explain? Is there any obvious clustering in this lower dimensional space?
```{r}
#| label: aflw-pairs
#| eval: false
#| echo: false
#| message: false
#| warning: false
library(GGally)
ggscatmat(aflw_pca$x, columns=1:4)
```
```{r}
#| eval: false
#| echo: false
library(mulgar)
library(dplyr)
data(bushfires)
bushfires_std <- bushfires %>%
dplyr::select(`rf`:`log_dist_road`) %>%
mutate_if(is.numeric, function(x) (x-
mean(x, na.rm=TRUE))/
sd(x, na.rm=TRUE))
bushfires_pca <- prcomp(bushfires_std)
ggscree(bushfires_pca, q=20)
bushfires_pc <- data.frame(bushfires_pca$x[,1:5])
bushfires_model <- pca_model(bushfires_pca, d=5, s=1)
bushfires_all <- rbind(bushfires_model$points,
bushfires_std)
animate_xy(bushfires_all, edges=bushfires_model$edges,
edges.col="#E7950F",
edges.width=3,
half_range=12,
axes="off")
```
## Project {-}
Linear dimension reduction can optimise for criteria other than variance, and here we will explore one example: the algorithm implemented in the `dobin` package finds a basis in which the first few directions are optimised for the detection of outliers in the data. We will examine how it performs for the `plane_noise_outliers` data (the example where outliers were hidden in the first four principal components).
1. Start by looking up the documentation of `dobin::dobin`. How many parameters does the method depend on?
2. We first apply the function to the `plane_noise_outliers` data using default values for all parameters.
3. Recall that the outliers were added in rows 101 and 102 of the data. Make scatterplots showing the projections onto the first, second and third components found by `dobin`, using colour to highlight the outliers. Are they visible as outliers with three components?
4. Adjust the `frac` parameter of the `dobin` function to `frac = 0.99` and repeat the graphical evaluation from point 3. How does it compare to the previous solution?
```{r}
#| eval: false
#| echo: false
#| message: false
#| warning: false
library(dobin)
library(GGally)
dobin_pno_1 <- as.data.frame(dobin(plane_noise_outliers)$coords)
dobin_pno_1$outlier <- c(rep("no", 100), rep("yes", 2))
ggscatmat(dobin_pno_1, columns = 1:3, color = "outlier")
# outliers not visible in V1 vs V2
# with V3 we can identify one of the two outliers
dobin_pno_2 <- as.data.frame(dobin(plane_noise_outliers, frac = 0.99)$coords)
dobin_pno_2$outlier <- c(rep("no", 100), rep("yes", 2))
ggscatmat(dobin_pno_2, columns = 1:3, color = "outlier")
# with three dimensions we capture both outliers in our graph
```