-
Notifications
You must be signed in to change notification settings - Fork 1
/
A1-toolbox.qmd
413 lines (311 loc) · 21.8 KB
/
A1-toolbox.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
# Toolbox {#toolbox}
## Using tours in the `tourr` package
### Installation
You can install the released version of `tourr` from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("tourr")
```
and the development version from the [GitHub repo](https://github.com/ggobi/tourr) with:
``` r
# install.packages("remotes")
remotes::install_github("ggobi/tourr")
```
### Getting started
To run a tour in R, use one of the animate functions. The following code will show a 2D tour displayed as a scatterplot on a 6D data set with three labelled classes.
``` r
animate_xy(flea[,-7], col=flea$species)
```
@tourr remains a good reference for learning more about this package. The package [website](http://ggobi.github.io/tourr/) has a list of current functionality.
### Different tours
The two main components of the tour algorithm are the projection dimension which affects the choice of display to use, and the algorithm that delivers the projection sequence. The primary functions for these two parts are
1. For display of different projection dimensions:
- `display_dist()`: choice of density, histogram or average shifted histogram (ash) display of the 1D projections.
- `display_xy()`, `display_density2d()`, `display_groupxy()`, `display_pca()`, `display_sage()`, `display_slice()`, `display_trails()`: choices in display of 2D projections.
- `display_depth()`, `display_stereo()`: choices to display 3D projections.
- `display_pcp()`, `display_scatmat()`, `display_stars()`, `display_faces()`: choices for displaying three or more variables.
- `display_image()`: to use with multispectral images, where different combinations of spectral bands are displayed. See @WPS98 and @Symanzik2002NewAO for applications.
- `display_andrews()`: 1D projections as Andrews curves.
2. To change the way projections are delivered:
- `grand_tour()`: Smooth sequence of random projections to view all possible projections as quickly as possible. Good for getting an overview of the high-dimensional data, especially when you don't know what you are looking for.
- `guided_tour()`: Follow a projection pursuit optimisation to find projections that have particular patterns. This is used when you want to learn if the data has particular patterns, such as clustering or outliers. Use the `holes()` index to find projections with gaps that allow one to see clusters, or `lda_pp()` or `pda_pp()` when class labels are known and you want to find the projections where the clusters are separated.
- `little_tour()`: Smoothly interpolate between pairs of variables, to show all the marginal views of the data.
- `local_tour()`: Makes small movements around a chosen projections to explore a small neighbourhood. Very useful to learn if small distances away from a projection change the pattern substantially or not.
- `radial_tour()`: Interpolates a chosen variable out of the projection, and then back into the projection. This is useful for assessing importance of variables to pattern in a projection. If the pattern changes a lot when the variable is rotated out, then the variable is important for producing it.
- `dependendence_tour()`: Delivers two sequences of 1D grand tours, to examine associations between two sets of variables. This is useful for displaying two groups of variables as in multiple regression, or multivariate regression or canonical correlation analysis, as two independent 1D projections.
- `frozen_tour()`: This is an interesting one! it allows the coefficient for some variables to be fixed, and others to vary.
### The importance of scale
Scaling of multivariate data is really important in many ways. It affects most model fitting, and can affect the perception of patterns when data is visualised. Here we describe a few scaling issues to take control of when using tours.
**Pre-processing data**
It is generally useful to standardise your data to have mean 0 and variance-covariance equal to the identity matrix before using the tour. We use the tour to discover associations between variables. Characteristics of single variables should be examined and understood before embarking on looking for high-dimensional structure.
The `rescale` parameter in the `animate()` function will scale all variables to range between 0 and 1, prior to starting the tour. This will force all to have the same range. It is not the default, and without this data with different ranges across variable may have some strange patterns. You should set `rescale=TRUE`. If you have already scaled the data yourself, even if using a different scaling such as using standardised variables then the default `rescale=FALSE` is best.
A more severe transformation that can be useful prior to starting a tour is to **sphere** the data. This is also an option in the `animate()` function, but is `FALSE` by default. Sphering is the same as conducting a principal component analysis, and using the principal components as the variables. It removes all linear association between variables! This can be especially useful if you want to focus on finding non-linear associations, including clusters, and outliers.
**Scaling to fit into plot region**
The `half_range` parameter in most of the display types sets the range used to scale the data into the plot. It is estimated when a tour is started, but you may need to change it if you find that the data keeps escaping the plot window or is not fully using the space. Space expands exponentially as dimension increases, and the estimation takes this into account. However, different distributions of data points lead to different variance of observations in high-dimensional space. A skewed distribution will be more varied than a normal distribution. It is hard to estimate precisely how the data should be scaled so that it fits nicely into the plot space for all projections viewed.
The `center` parameter is used to centre each projection by setting the mean to be at the middle of the plot space. With different distributions the mean of the data can vary around the plot region, and this can be distracting. Fixing the mean of each projection to always be at the center of the plot space makes it easier to focus on other patterns.
### Saving your tour
The functions `save_history()` and `planned_tour()` allow the tour path to be pre-computed, and re-played in your chosen way. The tour path is saved as a list of projection vectors, which can also be passed to external software for displaying tours easily. Only a minimal set of projections is saved, by default, and a full interpolation path of projections can always be generated from it using the `interpolate()` function.
Versions and elements of tours can be saved for publication using a variety of functions:
- `render_gif()`: Save a tour as an animated gif, using the `gifski` package.
- `render_proj()`: Save an object that can be used to produce a polished rendering of a single projection, possibly with `ggplot`.
- `render_anim()`: Creates an object containing a sequence of projections that can be used with `plotly()` to produce an HTML animation, with interactive control.
### Understanding your tour path
`r ifelse(knitr::is_html_output(), '@fig-tour-paths-html', '@fig-tour-paths-pdf')` shows tour paths on 3D data spaces. For 1D projections the space of all possible projections is a $p$-dimensional sphere (@fig-tourpaths1d). For 2D projections the space of all possible projections is a $p\times 2$-dimensional torus (@fig-tourpaths2d)! The geometry is elegant.
In these figures, the space is represented by the light colour, and is constructed by simulating a large number of random projections. The two darker colours indicate paths generated by a grand tour and a guided tour. The grand tour will cover the full space of all possible projections if allowed to run for some time. The guided tour will quickly converge to an optimal projection, so will cover only a small part of the overall space.
```{r echo=knitr::is_html_output()}
#| code-summary: "Load libraries"
#| message: false
library(ferrn)
library(tourr)
library(geozoo)
library(dplyr)
library(purrr)
```
```{r}
#| echo: false
#| eval: false
flea_std <- apply(flea[,1:6], 2, function(x) (x-mean(x))/sd(x))
# 1D paths
set.seed(1101)
gt1 <- save_history(flea_std[,c(1,4,5)],
tour_path = grand_tour(1),
max_bases=30)
pp1 <- save_history(flea_std[,c(1,4,5)],
tour_path = guided_tour(holes(), d=1))
gt1i <- interpolate(gt1)
pp1i <- interpolate(pp1)
#s1 <- sphere.hollow(3, 5000)$points
s1 <- map(1:5000, ~basis_random(n = 3, d=1)) %>%
purrr::flatten_dbl() %>% matrix(ncol = 3, byrow = TRUE)
gt1_m <- apply(gt1i, 1, c)
pp1_m <- apply(pp1i, 1, c)
s1 <- bind_cols(s1, rep("sphere", 5000))
gt1_m <- bind_cols(gt1_m, rep("grand", nrow(gt1_m)))
pp1_m <- bind_cols(pp1_m, rep("guided", nrow(pp1_m)))
p1 <- rbind(s1, gt1_m, pp1_m)
colnames(p1) <- c(paste0("V", 1:3), "type")
p1$type <- factor(p1$type, levels=c("grand", "sphere", "guided"))
animate_xy(p1[,1:3], col=p1$type, palette="Teal-Rose",
cex=c(rep(0.5, 5000), rep(1, nrow(p1)-5000)),
axes="off")
render_gif(p1[,1:3],
grand_tour(),
display_xy(col=p1$type, palette="Teal-Rose",
cex=c(rep(0.5, 5000), rep(1, nrow(p1)-5000)),
axes="off"),
gif_file = "gifs/tour_paths1d.gif",
frames = 500,
width = 300,
height = 300)
# 2D paths
set.seed(1209)
gt2 <- save_history(flea_std[,c(1, 4, 5)],
tour_path = grand_tour(),
max_bases=30)
pp2 <- save_history(flea_std[,c(1, 4, 5)],
tour_path = guided_tour(holes()))
gt2i <- interpolate(gt2)
pp2i <- interpolate(pp2)
s2 <- map(1:5000, ~basis_random(n=3, d=2)) %>%
purrr::flatten_dbl() %>% matrix(ncol = 6, byrow = TRUE)
gt2_m1 <- apply(gt2i[,1,], 1, c)
gt2_m2 <- apply(gt2i[,2,], 1, c)
gt2_m <- bind_cols(gt2_m1, gt2_m2)
pp2_m1 <- apply(pp2i[,1,], 1, c)
pp2_m2 <- apply(pp2i[,2,], 1, c)
pp2_m <- bind_cols(pp2_m1, pp2_m2)
s2 <- bind_cols(s2, rep("torus", 5000))
gt2_m <- bind_cols(gt2_m, rep("grand", nrow(gt2_m)))
pp2_m <- bind_cols(pp2_m, rep("guided", nrow(pp2_m)))
p2 <- rbind(s2, gt2_m, pp2_m)
colnames(p2) <- c(paste0("V", 1:6), "type")
p2$type <- factor(p2$type, levels=c("grand", "torus", "guided"))
animate_xy(p2[,1:6], col=p2$type, palette="Teal-Rose",
cex=c(rep(0.5, 5000), rep(1, nrow(p2)-5000)),
axes="off")
render_gif(p2[,1:6],
grand_tour(),
display_xy(col=p2$type, palette="Teal-Rose",
cex=c(rep(0.5, 5000), rep(1, nrow(p2)-5000)),
axes="off"),
gif_file = "gifs/tour_paths2d.gif",
frames = 500,
width = 300,
height = 300)
```
::: {.content-visible when-format="html"}
::: {#fig-tour-paths-html layout-ncol=2}
![1D tour paths](gifs/tour_paths1d.gif){#fig-tourpaths1d width=48%}
![2D tour paths](gifs/tour_paths2d.gif){#fig-tourpaths2d width=48%}
Grand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-tour-paths-pdf layout-ncol=2}
![1D tour paths](images/tour_paths1d.png){#fig-tourpaths1d width=210}
![2D tour paths](images/tour_paths2d.png){#fig-tourpaths2d width=210}
Grand and guided tour paths of 1D and 2D projections of 3D data. The light points represent the space of all 1D and 2D projections respectively. You can see the grand tour is more comprehensively covering the space, as expected, whereas the guided tour is more focused, and quickly moves to the best projection. {{< fa play-circle >}}
:::
:::
## What not to do
### Discrete and categorical data
Tour methods are for numerical data, particularly real-valued measurements. If your data is numerical, but discrete the data can look artificially clustered. `r ifelse(knitr::is_html_output(), '@fig-discrete-data-html', '@fig-discrete-data-pdf')` shows an example. The data is numeric but discrete, so it is ok to examine it in a tour. In this example, there will be overplotting of observations and the artificial clustering (plot a). It can be helpful to jitter observations, by adding a small amount of noise (plot b). This helps to remove the artificial clustering, but preserve the main pattern which is the strong linear association. Generally, jittering is a useful tool for working with discrete data, so that you can focus on examining the multivariate association. If the data is categorical, with no natural ordering of categories, the tour is not advised.
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: "Discrete data code"
set.seed(430)
df <- data.frame(x1 = sample(1:6, 107, replace=TRUE)) %>%
mutate(x2 = x1 + sample(1:2, 107, replace=TRUE),
x3 = x1 - sample(1:2, 107, replace=TRUE),
x4 = sample(1:3, 107, replace=TRUE))
animate_xy(df)
render_gif(df,
grand_tour(),
display_xy(),
gif_file = "gifs/discrete_data.gif",
frames = 100,
width = 300,
height = 300)
dfj <- df %>%
mutate(x1 = jitter(x1, 2),
x2 = jitter(x2, 2),
x3 = jitter(x3, 2),
x4 = jitter(x4, 2))
animate_xy(dfj)
render_gif(dfj,
grand_tour(),
display_xy(),
gif_file = "gifs/jittered_data.gif",
frames = 100,
width = 300,
height = 300)
```
::: {.content-visible when-format="html"}
::: {#fig-discrete-data-html layout-ncol=2}
![Discrete data](gifs/discrete_data.gif){#fig-discrete width=40%}
![Jittered data](gifs/jittered_data.gif){#fig-jittered width=40%}
Discrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering.
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-discrete-data-pdf layout-ncol=2}
![Discrete data](images/discrete_data.png){#fig-discrete width=220}
![Jittered data](images/jittered_data.png){#fig-jittered width=220}
Discrete data can look like clusters, which is misleading. Adding a small amount of jitter (random number) can help. The noise is not meaningful but it could allow the viewer to focus on linear or non-linear association between variables without being distracted by artificial clustering. {{< fa play-circle >}}
:::
:::
### Missing values
```{r echo=knitr::is_html_output()}
#| code-summary: "Code to handle missing values"
library(naniar)
library(ggplot2)
library(colorspace)
data("oceanbuoys")
ob_p <- oceanbuoys %>%
filter(year == 1993) %>%
ggplot(aes(x = air_temp_c,
y = humidity)) +
geom_miss_point() +
scale_color_discrete_divergingx(palette="Zissou 1") +
theme_minimal() +
theme(aspect.ratio=1)
ob_nomiss_below <- oceanbuoys %>%
filter(year == 1993) %>%
rename(st = sea_temp_c,
at = air_temp_c,
hu = humidity) %>%
select(st, at, hu) %>%
rowwise() %>%
mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%
add_shadow(st, at, hu) %>%
impute_below_if(.predicate = is.numeric)
ob_nomiss_mean <- oceanbuoys %>%
filter(year == 1993) %>%
rename(st = sea_temp_c,
at = air_temp_c,
hu = humidity) %>%
select(st, at, hu) %>%
rowwise() %>%
mutate(anymiss = factor(ifelse(naniar:::any_na(c(st, at, hu)), TRUE, FALSE))) %>%
add_shadow(st, at, hu) %>%
impute_mean_if(.predicate = is.numeric)
ob_p_below <- ob_nomiss_below %>%
ggplot(aes(x=st, y=hu, colour=anymiss)) +
geom_point() +
scale_color_discrete_divergingx(palette="Zissou 1") +
theme_minimal() +
theme(aspect.ratio=1, legend.position = "None")
ob_p_mean <- ob_nomiss_mean %>%
ggplot(aes(x=st, y=hu, colour=anymiss)) +
geom_point() +
scale_color_discrete_divergingx(palette="Zissou 1") +
theme_minimal() +
theme(aspect.ratio=1, legend.position = "None")
```
```{r echo=knitr::is_html_output()}
#| eval: false
#| code-summary: "Code to make animation"
animate_xy(ob_nomiss_below[,1:3], col=ob_nomiss$anymiss)
render_gif(ob_nomiss_below[,1:3],
grand_tour(),
display_xy(col=ob_nomiss_below$anymiss),
gif_file = "gifs/missing_values1.gif",
frames = 100,
width = 300,
height = 300)
render_gif(ob_nomiss_mean[,1:3],
grand_tour(),
display_xy(col=ob_nomiss_mean$anymiss),
gif_file = "gifs/missing_values2.gif",
frames = 100,
width = 300,
height = 300)
```
Missing values can also pose a problem for high-dimensional visualisation, but they shouldn't just be ignored or removed. Methods used in 2D to display missings as done in the `naniar` package [@naniar] like placing them below the complete data don't translate well to high dimensions.
`r ifelse(knitr::is_html_output(), '@fig-missings-html', '@fig-missings-pdf')` illustrates this. It leads to artificial clustering of observations (@fig-below-highD). It is better to impute the values, and mark them with colour when plotting. The cases are then included in the visualisation so we can assess the multivariate relationships, and also obtain some sense of how these cases should be handled, or imputed. In the example in @fig-imputed-highD we imputed the values simply, using the mean of the complete cases. We can see this is not an ideal approach for imputation for this data because some of the imputed values are outside the domain of the complete cases.
::: {.content-visible when-format="html"}
::: {#fig-missings-html layout-ncol=2}
```{r}
#| echo: false
#| label: fig-missings-below-2D-html
#| fig-cap: "Missings below in 2D"
#| fig-width: 3
#| fig-height: 3
ob_p_below
```
![Missings below in high-D](gifs/missing_values1.gif){#fig-below-highD width=40%}
```{r}
#| echo: false
#| label: fig-missings-mean-2D-html
#| fig-cap: "Missings imputed in 2D"
#| fig-width: 3
#| fig-height: 3
ob_p_mean
```
![Missings imputed in high-D](gifs/missing_values2.gif){#fig-imputed-highD width=40%}
Ways to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d).
:::
:::
::: {.content-visible when-format="pdf"}
::: {#fig-missings-pdf layout-ncol=2}
![Missings below in 2-D](images/fig-missings-below-2D-pdf-1.png){#fig-below-2D-pdf width=210}
![Missings below in high-D](images/missing_values1.png){#fig-below-highD width=210}
![Missings imputed in 2-D](images/fig-missings-mean-2D-pdf-1.png){#fig-mean-2D-pdf width=210}
![Missings imputed in high-D](images/missing_values2.png){#fig-imputed-highD width=210}
Ways to visualise missings for 2D don't transfer to higher dimensions. When the missings are set at 10% below the complete cases it appears to be clustered data when viewed in a tour (b). It is better to impute the value, and use colour to indicate that it is originally a missing value (d). {{< fa play-circle >}}
:::
:::
### Context such as time and space
We occasionally hear statements like "time is the fourth dimension" or "space is the fifth dimension". This is not a useful way to think about dimensionality.
If you have data with spatial or temporal context, we recommend avoiding using the time index or the spatial coordinates along with the multiple variables in the tour. Time and space are different types of variables, and should not be combined with the multivariate measurements.
For multivariate temporal data, we recommend using a dependence tour, where one axis is reserved for the time index, and the other axis is used to tour on the multiple variables. For spatial data, we recommend using an image tour, where horizontal and vertical axes are used for spatial coordinates and colour of a tile is used for the tour of multiple variables.
## Tours in other software
There are tours available in various software packages. For most examples we use the `tourr` package, but the same purpose could be achieved by using other software. We also use some of the software this book, when the `tourr` package is not up for the task. For information about these packages, their websites are the best places to start
- [liminal](https://sa-lee.github.io/liminal/): to combine tours with (non-linear) dimension reduction algorithms.
- [detourr](https://casperhart.github.io/detourr/): animations for {tourr} using `htmlwidgets` for performance and portability.
- [langevitour](https://logarithmic.net/langevitour/): HTML widget that shows tours projections of a high-dimensional dataset with an animated scatterplot.
- [woylier](https://numbats.github.io/woylier/): alternative method for generating a tour path by interpolating between d-D frames in p-D space rather than d-D planes.
- [spinifex](https://nspyrison.github.io/spinifex/): manual control of dynamic projections of numeric multivariate data.
- [ferrn](https://huizezhang-sherry.github.io/ferrn/): extracts key components in the data object collected by the guided tour optimisation, and produces diagnostic plots.
## Supporting software
- [classifly](https://github.com/hadley/classifly): This package is used heavily for supervised classification.
The `explore()` function is used to explore the classification model. It will predict the class of a sample of points in the predictor space (`.TYPE=simulated`), and return this in a data frame with the observed data (`.TYPE=actual`). The variable `.BOUNDARY` indicates that a point is within a small distance of the classification boundary, when the value is `FALSE`. The variable `.ADVANTAGE` gives an indication of the confidence with which an observation is predicted, so can also be used to select simulated points near the boundary.