-
Notifications
You must be signed in to change notification settings - Fork 4
/
122-mds.Rmd
162 lines (130 loc) · 9.12 KB
/
122-mds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# Multidimensional scaling {#mds}
```{r include=FALSE}
library(tidyverse)
library(fields)
library(ggmap)
library(vegan)
library(ggrepel)
library(broom)
library(kableExtra)
# library(ggvegan) # remotes::install_github("gavinsimpson/ggvegan")
```
Multidimensional scaling (MDS) is another approach to ordination. The purpose is similar to the goals of PCA, but the methods are a bit different. With MDS, the starting point is a similarity, or distance matrix, providing measurements between all pairs of data. Many different [functions](https://pages.mtu.edu/~shanem/psy5220/daily/Day16/MDS.html#:~:text=Multidimentional%20scaling%20(MDS)%20is%20used,measures%20between%20objects%20or%20cases.) can be used to compute this similarity matrix. Commonly distance functions are used. If quantitative data have very skewed distributions, or there is no good way to interpret the distance between points, the data may be converted to ranks (1, 2, 3, ...) before the similarity measure is computed; then the method is called non-metric multidimensional scaling (NMDS) in recognition of the fact that the information being presented is not related to a distance.
### Cities on a map
An easy to understand example of MDS starts with a matrix giving the distance between each pair of cities in a set. The MDS visualization then scatters these points across the plane, reconstructing the geographic separation of the points.
The dataset below is built from a database of world cities, selected to have only cities of more than 5 million plus some cities in Canada. I included a maximum of 4 cities per country so that one one region of the globe was too strongly concentrated in points. You can get the data from [simplemaps](https://simplemaps.com/resources/free-country-cities) and make your own subset.
```{r include=FALSE}
# cities0 <- read_csv("~/Downloads/simplemaps_worldcities_basicv1.73/worldcities.csv", n_max = 1000) # This is the original data
# Keep the 25 largest cities, plus a few from Canada; this is the filtering code
# cities <- cities0 %>% filter(iso2 == "CA" | population > 5000000) %>% group_by(iso2) %>% filter(dense_rank(-population) < 5)
# cities <- bind_rows(cities, cities0 %>% filter(city == "Halifax", iso2 == "CA"))
# write_csv(cities, "static/selected_cities.csv")
```
```{r}
cities <- read_csv("static/selected_cities.csv")
ggplot(cities, aes(x=lng, y=lat)) +
geom_point() +
geom_text_repel(aes(label=city), size=2)
```
```{r mds-map-cache, message=FALSE, cache=TRUE, warning=FALSE}
mymap <- get_stamenmap(bbox = c(left = -150, bottom = -60, right = 175, top = 75), zoom=3, maptype = "terrain-background")
ggmap(mymap) +
geom_point(data = cities, aes(x=lng, y=lat)) +
geom_text_repel(data = cities, aes(x =lng, label=city), size=2)
```
Now we calculate the distance between all pairs of cities. I've shown two ways to do this -- using great circle distance along the surface of the Earth, and using a distance that projects the surface of the Earth onto a plane first. If the goal is to reconstruct the map shown above, use this projected distance. If you would like instead to show great-circle distances between cities in the plane, use the geodesic distance.
```{r}
city_distance <- dist(cities %>% dplyr::select(lng, lat) %>% as.matrix()) # distance in equirectangular projection
# https://en.wikipedia.org/wiki/Equirectangular_projection
```
Perform [MDS](https://en.wikipedia.org/wiki/Multidimensional_scaling) (principal coordinates analysis). The direction of the
two main axes could be reversed relative to the original map; I've reversed the x and y axes to match our customary view of the world.
```{r}
mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) %>%
ggplot(aes(x = V1, y = V2)) +
geom_point() +
geom_text_repel(aes(label = city), size=2) +
scale_x_reverse() + scale_y_reverse() +
labs(title = "Map reconstructed from equirectangular distance matrix",
x = "", y = "")
```
```{r}
city_distance <- rdist.earth(cities %>% dplyr::select(lng, lat) %>% as.matrix(),
miles = FALSE) # geodesic distance
mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) %>%
ggplot(aes(x = V1, y = V2)) +
geom_point() +
geom_text_repel(aes(label = city), size=2) +
scale_x_reverse() + scale_y_reverse() +
labs(title = "Map reconstructed from great-circle distance matrix",
x = "", y = "")
```
## NMDS example
[Morse code](https://en.wikipedia.org/wiki/Morse_code) is a way of sending text messages using just two symbols: dot and dash, which was designed to be transmitted by a person clicking a key to make sounds and the receiver listening to the sounds and translating the message as it arrives.
The dataset below was created as part of an experiment to measure the rate at which patterns of sounds for one symbol were confused with sounds for a different symbol. The matrix is symmetric dissimilarity measure. Rows and columns vary across the 36 symbols tested (26 letters and 10 numeric digits). All diagonals are 0s. The off diagonal values are large if the sounds are not likely to be confuesed. Smaller dissimilarities correspond to symbols that are more likely to be confused. A small excerpt of the table is shown.
```{r}
morse.dist <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-dist.txt',
row.names = 1, head = T)
names(morse.dist) <- rownames(morse.dist)
morse.dist[1:5,1:5] %>% kable() %>% row_spec(0, monospace=TRUE) %>%
column_spec(1, monospace=TRUE) %>%
kable_styling(full_width = FALSE)
```
Use the function `metaMDS` from the package `vegan` to perform the NMDS ordination. The `ordiplot` function shows the objects from the dissimilarity matrix on a two-dimensional "ordination plot". Points that are closer together are more likely to be confused (they are less dissimilar).
```{r message=FALSE}
NMDS <- metaMDS(morse.dist, trace=0)
NMDS
ordiplot(NMDS, cex = 1.5, type = 't')
# stressplot(NMDS)
```
This next plot compares the distances in the original data to the distances as represented by the ordination. If the ordination represents the original data well, this will be close to a straight line. There is a point for every pairwise comparison. In this case there are 36 * 35 / 2 = 630 distances to compare. I have labeled some of the points with the largest distortion imposed by the ordination.
```{r}
symbols <- rownames(NMDS$points)
NMDS[c("dist", "dhat", "iidx", "jidx")] %>% as_tibble() %>%
mutate(comparison = paste0(symbols[iidx], "/", symbols[jidx]),
comparison2 = case_when( abs(dist-dhat) < 40 ~ "", TRUE ~ comparison)) %>%
ggplot(aes(dist, dhat)) +
geom_point(size=0.5, color="blue") +
geom_text_repel(aes(label = comparison2)) +
theme_bw()
```
You can also access the points from the ordination and make the "ordiplot" using ggplot. The relative position of points on the plat is the only thing that matters -- any rotation or translation of the plot contains the same information.
```{r}
NMDS$points %>% as_tibble(rownames = "Symbol") %>%
ggplot(aes(x = MDS1, y = MDS2 )) +
geom_text(aes(label=Symbol)) +
theme_bw()
# geom_point(size = 0.5) +
# geom_text_repel(aes(label = Symbol))
```
You might want to understand the NMDS analysis in terms of some properties of the Morse Code signals. This table has the length (1-5) and the ratio of short to long (dots and dashes) signals in each symbol. The `envfit` function then finds the direction each of these variables increases most rapidly across the ordination plane. The `summary` reports the direction and the correlation between these variables and the position on the ordination plot. The arrow for "length" follows the pattern in the ordination very well, while the arrow for the ratio of short to long only accounts for about half of the variation in the ordination plot.
```{r}
morse.attr <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-attr.txt',
row.names = 1, head = T)
ef <- envfit(NMDS, morse.attr)
ef
```
The arrows show the direction of most rapid average increase of each variable.
```{r}
ordiplot(NMDS, cex = 1.5, type = 't')
plot(ef)
```
Here is a method to reproduce this plot using `ggplot`. First use `str(ef)` to examine the structure of the result from `envfit`. Then plot the points and arrows using ggplot.
```{r}
arrows1 <- ef$vectors$arrows %>% as_tibble(rownames = "code")
as_tibble(NMDS$points, rownames = "code") %>%
ggplot(aes(x = MDS1, y = MDS2, label = code)) +
geom_text() +
geom_text(data = arrows1, aes(x = 25*NMDS1, y = 25*NMDS2)) +
geom_segment(data = arrows1, aes(x = 20*NMDS1, y = 20*NMDS2, xend = 0, yend = 0)) +
theme_bw()
```
## Further reading
* [Vegan](https://cran.r-project.org/web/packages/vegan/vignettes/intro-vegan.pdf) package and the [ggvegan](https://github.com/gavinsimpson/ggvegan) package
* https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/
* https://rpubs.com/collnellphd/GWU_nmds
* [Morse code](https://www.davidzeleny.net/anadat-r/doku.php/en:data:morse) experiment description and [analysis](https://www.davidzeleny.net/anadat-r/doku.php/en:pcoa_nmds_examples)