---
title: "Introduction to data visualization using ggplot2 (part 1)"
author: "BBL and SCP"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    code_folding: show
---
Part 2 is [here](https://rpubs.com/bpbond/727256).
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(emo) # install via devtools::install_github("hadley/emo")
```
# Topics
* Data visualization concepts
* A grammar of graphics
* An introduction to ggplot2
* The pieces of a ggplot2 plot
* Implications for data structure
* Data, aesthetics, geoms, labels, themes, facets
* Accessibility
* Saving plots
* Fancier things
* Resources
**Goal: understand the principles that ggplot is built on, and the steps needed to create a wide variety of basic plots.**
# Assumptions
<span style="color: red;">**We assume you're familiar with the basic mechanics of R:**</span>
* Starting R/RStudio
* Scripts, variables, and data frames
So _not_ at this level :)
<img src="images-ggplot2/notepad.png" width = "75%">
**This is intended to be a hands-on workshop**, so we also assume:
* You have R (and probably RStudio) installed
* You have the [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) package installed
# Data visualization {#dataviz}
Visualizing data is [critical](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149):
![](https://miro.medium.com/max/600/1*W--cGoA3_n2ZlU6Xs4o2iQ.gif)
**The x and y mean, standard deviation, and x-y correlation are unchanged throughout this animation.**
Another example of this is [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet):
<img src="images-ggplot2/638px-Anscombe's_quartet_3.svg.png" width = "100%">
**All four of _these_ datasets have identical `mean(x)`, `mean(y)`, `var(x)`, `var(y)`, `cor(x, y)`, and regression (intercept, slope, r-squared).** `r emo::ji("exploding_head")`
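The quartet ships with base R as the `anscombe` data frame (columns `x1`..`x4` and `y1`..`y4`), so the claim is easy to check. The sketch below is one way to do it; the name `quartet_stats` is ours, and the statistics agree closely rather than to the last digit.

```{r anscombe-check}
# Compute the summary statistics for each of the four x/y pairs
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))  # intercept and slope of the linear regression
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(fit[1]), slope = unname(fit[2]))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# One column per dataset, one row per statistic
round(quartet_stats, 2)
```

Try `plot(anscombe$x1, anscombe$y1)` and its three siblings to see how different the datasets look.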
Lots of research has been done on effective data visualization with respect to science communication. Read a bit of it. [For example](https://www.sciencedirect.com/science/article/pii/S2666389920301896) here are one author's ten principles of effective data visualization:
* Diagram First: identify the information you want to share
* **Use the Right Software**
* **Use an Effective Geometry and Show Data**
* **Colors _Always_ Mean Something**
* Include Uncertainty
* **Panel, when Possible**
* Data and Models Are Different Things
* Simple Visuals, Detailed Captions
* Consider an Infographic
* Get an Opinion
To these I would only add "know your audience".
Remember, data visualization can have [consequences](https://xkcd.com/523/)!
![](https://imgs.xkcd.com/comics/decline.png)
## Plotting in base R
One of the simplest datasets included with R is `cars`:
```{r plot-cars, warning=FALSE}
cars
plot(cars)
```
That seems pretty good! What's the problem?
Well, what about `iris`? This is a [famous](https://rpubs.com/AjinkyaUC/Iris_DataSet) dataset; from the help (`?iris`):
>This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are _Iris setosa_, _versicolor_, and _virginica_.
<img src="images-ggplot2/iris.png" width = "100%">
```{r show-iris, warning=FALSE}
iris
```
**Note that each row of `iris` is an _individual flower_; there are four observations per row.** We'll come back to this structural point later.
Let's plot two of its columns against each other, coloring by species:
```{r plot-iris-base}
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)
# Legend entries and colors must be matched to the plot by hand
legend(7, 4.3,
       levels(iris$Species),
       col = seq_along(levels(iris$Species)),
       pch = 1)
```
This is a bunch of code for such a simple plot; note that:
* The `plot` code understands numeric vectors, so we need to repeatedly specify `iris$<column>`
* This means the default axis labels are ugly (though they can be changed)
* The legend is _totally disconnected_ from the plot: we have to do everything (color
assignment, etc.) manually
Things quickly get worse if we want more complexity or features. What's the underlying problem?
>Without a grammar, there is no underlying theory, so most graphics packages are just a big collection of special cases.
From the [ggplot2 book](https://ggplot2-book.org/introduction.html).
# A grammar of graphics
Above we made some scatterplots, perhaps the simplest graph type.
>What precisely is a scatterplot? You have seen many before and have probably even drawn some by hand. A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value.
<img src="images-ggplot2/wickham-2010.png" width = "100%">
This insight predates Hadley Wickham's [original paper](https://vita.had.co.nz/papers/layered-grammar.pdf), but in the context of R that paper laid the groundwork for ggplot2:
>To be precise, the layered grammar defines the components of a plot as:
>
>* a default dataset and set of mappings from variables to aesthetics,
>* one or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings,
>* one scale for each aesthetic mapping used,
>* a coordinate system,
>* the facet specification.
We are learning about (a subset of) these steps today.
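To make the list above concrete, here is a sketch that spells out each component explicitly for a simple `iris` scatterplot. In everyday use most of these are filled in by defaults (for example, `geom_point()` stands in for the `layer()` call); the choice of columns is ours, for illustration only.

```{r grammar-components}
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +  # default dataset and mappings
  layer(geom = "point",          # geometric object
        stat = "identity",       # statistical transformation (none)
        position = "identity") + # position adjustment (none)
  scale_x_continuous() +         # one scale per mapped aesthetic
  scale_y_continuous() +
  coord_cartesian() +            # coordinate system
  facet_null()                   # facet specification (a single panel)
p
```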
# Steps to a ggplot2 plot
Say we have a plot we want to make, a slightly more complicated version of Wickham (2010) Figure 2 above:
<img src="images-ggplot2/layers-final-plot.png" width = "100%">
In the grammar of graphics / ggplot2 system, plots are built up from sequential
layers: these are procedural steps, but also literal visual _layers_,
the net result of which is the final plot. Later steps can modify and
override what's 'presented' by previous layers.
Visually:
<img src="images-ggplot2/layers-all.png" width = "100%">
We're going to walk through these layers, one by one.
## 7. The dataset
<img src="images-ggplot2/layers-7-data.png" width = "100%">
The first (or in back-to-front numbering, as in the image above,
the seventh) step involves our data.
As noted above, the _structure_ of our data has implications for how we plot it; more precisely, to effectively use ggplot2 we want our data to be structured a certain way. But again `r emo::ji("smile")` let's come back to that point.
Generally, our data for plotting should be in **tabular** format, with rows and named columns. In R this is typically a `data.frame` or a `tibble`.
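A quick way to confirm that a dataset is in this shape is to inspect it with a few base R functions. This is only a sketch using the built-in `iris` data:

```{r check-tabular}
is.data.frame(iris)   # is this a data frame?
names(iris)           # the named columns
dim(iris)             # number of rows and columns
sapply(iris, class)   # the type of each column
```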
## 6. The ggplot call
<img src="images-ggplot2/layers-6-ggplot.png" width = "100%">
Hey, `iris` is a data frame. Let's call `ggplot()` on it!
```{r ggplot-call, warning=FALSE}
library(ggplot2)
ggplot(iris)
```
Well, that was disappointing.
Remember how easy `plot(cars)` was above...why didn't anything happen here? Well, `ggplot()` doesn't know how to map our plot _aesthetics_ to our _data_, and it doesn't know what _geom_ to use for subsequent visualization.
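One way to see this is to look inside the object that `ggplot()` returns. The sketch below counts the layers and aesthetic mappings of a bare call; both counts are zero, so there is nothing to draw yet. (Poking at `p$layers` and `p$mapping` is for illustration only, not something you normally need to do.)

```{r ggplot-empty-layers}
library(ggplot2)

p <- ggplot(iris)
length(p$layers)   # number of layers (geoms) added so far
length(p$mapping)  # number of aesthetics mapped so far
```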
## 5. Aesthetics mapping
<img src="images-ggplot2/layers-5-aesthetics.png" width = "100%">
As we said above, the _aesthetics_ of each layer in our plot can either be
* constant, or
* mapped to a column of data
Inverting this statement means that
* Any non-constant aesthetic has to be _its own column_ in the data
This idea of mapping aesthetics to columns thus has implications for the _structure_ of our data.
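As a small sketch of the constant-versus-mapped distinction: an aesthetic set _outside_ `aes()` is a constant, while one set _inside_ `aes()` is mapped to a column. `geom_point()` is used here only so that something gets drawn; the plot names are ours.

```{r aes-constant-vs-mapped}
library(ggplot2)

# Constant aesthetic: set outside aes(); every point gets the same colour
p_constant <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(colour = "blue")

# Mapped aesthetic: set inside aes(); colour follows the Species column
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(colour = Species))

p_constant
p_mapped
```

Note that the mapped version only works because `Species` is its own column.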
## Interlude: data structure
Remember what `iris` looks like:
```{r show-iris-again, warning=FALSE, echo=FALSE}
iris
```
This is problematic. What if we wanted an aesthetic like `color` to depend on what dimension or organ we're measuring?
**`iris` is structured in a form convenient for humans, but not one
particularly handy for computers.**
In general it's best to start with your data in ["tidy" form](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), a.k.a. long form,
when preparing to use ggplot2. This means that every row contains exactly **one**
observation; specifically:
* Each _variable_ forms a column.
* Each _observation_ forms a row.
* Each type of observational unit forms a table.
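To illustrate these rules away from `iris`, here is a sketch with a made-up two-plant dataset (the data and column names are hypothetical). The wide table has one row per plant; the tidy table has one row per plant-year observation.

```{r tidy-sketch}
# Hypothetical wide table: one row per plant, one column per measured year
wide <- data.frame(plant = c("a", "b"),
                   height_2020 = c(10, 12),
                   height_2021 = c(15, 18))

# Tidy (long) equivalent: each variable is a column, each observation a row
tidy <- data.frame(plant = rep(wide$plant, times = 2),
                   year = rep(c(2020, 2021), each = nrow(wide)),
                   height = c(wide$height_2020, wide$height_2021))
tidy
```

In the tidy version, `year` is available as a column and could be mapped to an aesthetic.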
### Long (tidy) data
With all this in mind, it's clear we need to _reshape_ our data. Let's assume,
for the rest of this workshop, that we're particularly interested in comparing
observations of _petals_ versus those of _sepals_:
```{r}
# Here we use base R's "reshape" function
# There are many alternatives; in particular, check out
# the powerful "tidyr" package
iris_long <- reshape(iris,
                     varying = c("Sepal.Length",
                                 "Sepal.Width",
                                 "Petal.Length",
                                 "Petal.Width"),
                     timevar = "dimension",
                     direction = "long")
iris_long
```
**Note that this is _not_ strictly "tidy data", per the definition above. Why not?**
With this reshaping, we can proceed to map _aesthetics_ to _columns_.
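For comparison, here is a sketch of roughly the same reshaping done with the tidyr package mentioned in the code comments above. It assumes tidyr is installed; the name `iris_long_tidyr` is ours. The result is a tibble with `Sepal`, `Petal`, and `dimension` columns, though unlike `reshape()` it does not add an `id` column.

```{r reshape-tidyr}
library(tidyr)

# Split each column name at the "." : the first part ("Sepal"/"Petal")
# becomes a value column, the second part ("Length"/"Width") goes into
# a new "dimension" column
iris_long_tidyr <- pivot_longer(
  iris,
  cols = -Species,
  names_to = c(".value", "dimension"),
  names_sep = "\\."
)
iris_long_tidyr
```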