# Introduction to dplyr {#dplyr-intro}
```{r include = FALSE}
source("common.R")
```
<!--Original content: https://stat545.com/block009_dplyr-intro.html-->
## Intro
[dplyr][dplyr-web] is a package for data manipulation, developed by Hadley Wickham and Romain Francois. It is built to be fast, highly expressive, and open-minded about how your data is stored. It is installed as part of the [tidyverse][tidyverse-web] meta-package and, as a core package, it is among those loaded via `library(tidyverse)`.
dplyr's roots are in an earlier package called [plyr][plyr-web], which implements the ["split-apply-combine" strategy for data analysis][split-apply-combine] [@wickham2011a]. Where plyr covers a diverse set of inputs and outputs (e.g., arrays, data frames, lists), dplyr has a laser-like focus on data frames or, in the tidyverse, "tibbles". dplyr is a package-level treatment of the `ddply()` function from plyr, because "data frame in, data frame out" proved to be so incredibly important.
Have no idea what I'm talking about? Not sure if you care? If you use these base R functions: `subset()`, `apply()`, `[sl]apply()`, `tapply()`, `aggregate()`, `split()`, `do.call()`, `with()`, `within()`, then you should keep reading. Also, if you use `for()` loops a lot, you might enjoy learning other ways to iterate over rows or groups of rows or variables in a data frame.
### Load dplyr and gapminder
I choose to load the tidyverse, which will load dplyr, among other packages we use incidentally below.
```{r start_dplyr}
library(tidyverse)
```
Also load gapminder.
```{r message = FALSE, warning = FALSE}
library(gapminder)
```
### Say hello to the `gapminder` tibble
The `gapminder` data frame is a special kind of data frame: a tibble.
```{r}
gapminder
```
Its tibble-ness is why we get nice compact printing. For a reminder of the problems with base data frame printing, go type `iris` in the R Console or, better yet, print a data frame to screen that has lots of columns.
Note how `gapminder`'s `class()` includes `tbl_df`; the "tibble" terminology is a nod to this.
```{r}
class(gapminder)
```
There will be some functions, like `print()`, that know about tibbles and do something special. There will be others that do not, like `summary()`, in which case the regular data frame treatment will happen, because every tibble is also a regular data frame.
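
As a quick check of that last claim, here is a minimal sketch. It assumes only that the gapminder package is installed, and it uses base R's `is.data.frame()` and `inherits()` to do the checking.

```{r}
library(gapminder)

# A tibble passes the ordinary data frame test ...
is.data.frame(gapminder)

# ... and additionally carries the tibble class.
inherits(gapminder, "tbl_df")

# summary() gives the usual data frame treatment, here on two columns.
summary(gapminder[c("year", "lifeExp")])
```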
To turn any data frame into a tibble use `as_tibble()`:
```{r}
as_tibble(iris)
```
## Think before you create excerpts of your data ...
If you feel the urge to store a little snippet of your data:
```{r}
(canada <- gapminder[241:252, ])
```
Stop and ask yourself ...
> Do I want to create mini datasets for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?
If YES, __use proper data aggregation techniques__ or faceting in ggplot2 -- __don’t subset the data__. Or, more realistically, only subset the data as a temporary measure while you develop your elegant code for computing on or visualizing these data subsets.
If NO, then maybe you really do need to store a copy of a subset of the data. But seriously consider whether you can achieve your goals by simply using the `subset =` argument of, e.g., the `lm()` function, to limit computation to your excerpt of choice. Lots of functions offer a `subset =` argument!
Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible.
Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.
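
To make the two alternatives above concrete, here is a rough sketch that stores no excerpt at all. The first part uses `group_by()` and `summarise()`, two dplyr verbs that are not introduced in this chapter, to compute one summary row per continent. The second part uses the `subset =` argument of `lm()` to fit a model on Canada only. The object names `by_continent` and `canada_fit` are made up for illustration.

```{r}
library(dplyr)
library(gapminder)

# Aggregation: one row per continent, no per-continent copies of the data.
by_continent <- summarise(
  group_by(gapminder, continent),
  mean_lifeExp = mean(lifeExp)
)
by_continent

# Model fitting: limit the computation to Canada via `subset =`
# instead of storing a Canada-only data frame first.
canada_fit <- lm(lifeExp ~ year, data = gapminder, subset = country == "Canada")
coef(canada_fit)
```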
## Use `filter()` to subset data row-wise
`filter()` takes logical expressions and returns the rows for which all are `TRUE`.
```{r}
filter(gapminder, lifeExp < 29)
filter(gapminder, country == "Rwanda", year > 1979)
filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
```
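
Since `filter()` keeps the rows for which all expressions are `TRUE`, separating conditions with a comma behaves like joining them with `&`. A minimal sketch, assuming dplyr and gapminder are installed (the names `with_comma` and `with_and` are just for illustration):

```{r}
library(dplyr)
library(gapminder)

# Two conditions given as separate arguments ...
with_comma <- filter(gapminder, country == "Rwanda", year > 1979)

# ... and the same two conditions combined explicitly with `&`.
with_and <- filter(gapminder, country == "Rwanda" & year > 1979)

# Both calls should return the same rows.
all.equal(with_comma, with_and)
```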
Compare with some base R code to accomplish the same things:
```{r eval = FALSE}
gapminder[gapminder$lifeExp < 29, ] ## repeat `gapminder`, [i, j] indexing is distracting
subset(gapminder, country == "Rwanda") ## almost same as filter; quite nice actually
```
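
One practical difference between these approaches shows up when the condition evaluates to `NA`. The sketch below uses a small made-up data frame, `toy`, rather than `gapminder`: `[` indexing returns a row of `NA`s for the missing value, whereas `filter()` and `subset()` keep only rows where the condition is `TRUE`.

```{r}
library(dplyr)

# Hypothetical toy data with a missing value in `x`.
toy <- data.frame(id = 1:3, x = c(1, NA, 3))

# `[` indexing: the NA comparison produces an extra all-NA row.
toy[toy$x < 2, ]

# filter() and subset() both drop the row where the condition is NA.
filter(toy, x < 2)
subset(toy, x < 2)
```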
Under no circumstances should you subset your data the way I did at first:
```{r eval = FALSE}
excerpt <- gapminder[241:252, ]
```
Why is this a terrible idea?
* It is not self-documenting. What is so special about rows 241 through 252?
* It is fragile. This line of code will produce different results if someone changes the row order of `gapminder`, e.g. sorts the data earlier in the script.

Instead, filter on the country name:
```{r eval = FALSE}
filter(gapminder, country == "Canada")
```
This call explains itself and is fairly robust.
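
The fragility is easy to demonstrate. In this sketch, `resorted` is a hypothetical stand-in for what happens when an earlier step in a script sorts the data by year; `arrange()` is a dplyr verb not covered in this chapter.

```{r}
library(dplyr)
library(gapminder)

# Hypothetical earlier step in the script: sort the data by year.
resorted <- arrange(gapminder, year)

# The same row numbers pick out Canada only in the original row order.
unique(as.character(gapminder$country[241:252]))
unique(as.character(resorted$country[241:252]))

# The filter() call finds the Canada rows regardless of row order.
nrow(filter(gapminder, country == "Canada"))
nrow(filter(resorted, country == "Canada"))
```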
## Meet the new pipe operator
Before we go any further, we should exploit the new pipe operator that the tidyverse imports from the [magrittr][magrittr-web] package by Stefan Bache. This is going to change your data analytical life. You no longer need to enact multi-operation commands by nesting them inside each other, like so many [Russian nesting dolls][wiki-nesting-dolls]. This new syntax leads to code that is much easier to write and to read.
Here's what it looks like: `%>%`. The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac).
Let's demo then I'll explain.
```{r}
gapminder %>% head()
```
This is equivalent to `head(gapminder)`. The pipe operator takes the thing on the left-hand-side and __pipes__ it into the function call on the right-hand-side -- literally, drops it in as the first argument.
Never fear, you can still specify other arguments to this function! To see the first 3 rows of `gapminder`, we could say `head(gapminder, 3)` or this:
```{r}
gapminder %>% head(3)
```
I've advised you to think "gets" whenever you see the assignment operator, `<-`. Similarly, you should think "then" whenever you see the pipe operator, `%>%`.
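
To see the "then" reading in action, here is a small sketch that writes the same two-step operation both ways, assuming dplyr and gapminder are installed. The nested form has to be read from the inside out; the piped form reads top to bottom. The names `nested` and `piped` are just for illustration.

```{r}
library(dplyr)
library(gapminder)

# Nested: head() wraps filter(), so you read from the inside out.
nested <- head(filter(gapminder, country == "Canada"), 3)

# Piped: take gapminder, then filter to Canada, then show the first 3 rows.
piped <- gapminder %>%
  filter(country == "Canada") %>%
  head(3)

# Both forms should give the same result.
all.equal(nested, piped)
```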
You are probably not impressed yet, but the magic will soon happen.
## Use `select()` to subset the data on variables or columns
Back to dplyr....
Use `select()` to subset the data on variables or columns. Here's a conventional call:
```{r}
select(gapminder, year, lifeExp)
```
And here's the same operation, but written with the pipe operator and piped through `head()`:
```{r}
gapminder %>%
  select(year, lifeExp) %>%
  head(4)
```
Think: "Take `gapminder`, then select the variables year and lifeExp, then show the first 4 rows."
## Revel in the convenience
Here's the data for Cambodia, but only certain variables:
```{r}
gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)
```
and what a typical base R call would look like:
```{r end_dplyr}
gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]
```
## Pure, predictable, pipeable
We've barely scratched the surface of dplyr but I want to point out key principles you may start to appreciate. If you're new to R or "programming with data", feel free to skip this section and [move on](#dplyr-single).
dplyr's verbs, such as `filter()` and `select()`, are what's called [pure functions][wiki-pure-fxns]. To quote from Wickham's [Advanced R Programming book][adv-r-fxns] [-@wickham2015a]:
> The functions that are the easiest to understand and reason about are pure functions: functions that always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they don’t affect the state of the world in any way apart from the value they return.
In fact, these verbs are a special case of pure functions: they take the same flavor of object as input and output. Namely, a data frame or one of the other data receptacles dplyr supports.
And finally, the data is __always__ the very first argument of the verb functions.
This set of deliberate design choices, together with the new pipe operator, produces a highly effective, low friction [domain-specific language][adv-r-dsl] for data analysis.
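
Here is a minimal sketch of what these principles look like in practice, assuming dplyr and gapminder are installed. The names `rows_before` and `rwanda` are made up for illustration.

```{r}
library(dplyr)
library(gapminder)

rows_before <- nrow(gapminder)

# The data is the first argument; the result is only kept if you assign it.
rwanda <- filter(gapminder, country == "Rwanda")

# No side effects: the input still has all of its rows.
nrow(gapminder) == rows_before

# Same flavor of object in and out.
identical(class(rwanda), class(gapminder))

# Same input, same output.
all.equal(filter(gapminder, country == "Rwanda"), rwanda)
```

Because the output has the same class as the input, it can be handed straight to the next verb, which is what makes these functions pipeable.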
Go to the next chapter, [dplyr functions for a single dataset](#dplyr-single), for more dplyr!
## Resources
dplyr official stuff:
* Package home [on CRAN][dplyr-cran].
  - Note there are several vignettes, with the [introduction][dplyr-vignette-intro] being the most relevant right now.
  - The [one on window functions][dplyr-vignette-window-fxns] will also be interesting to you now.
* Development home [on GitHub][dplyr-github].
* [Tutorial HW delivered][useR-2014-dropbox] (note this links to a DropBox folder) at useR! 2014 conference.
[RStudio Data Transformation Cheat Sheet][rstudio-dplyr-cheatsheet-download], covering dplyr. Remember you can get to these via *Help > Cheatsheets.*
[Data transformation][r4ds-transform] chapter of [R for Data Science][r4ds] [@wickham2016].
<!--TODO: This should probably be updated with something more recent-->
[Excellent slides][tj-mahr-slides] on pipelines and dplyr by TJ Mahr, talk given to the Madison R Users Group.
<!--TODO: This should probably be updated with something more recent-->
Blog post [Hands-on dplyr tutorial for faster data manipulation in R][dataschool-dplyr] by Data School, that includes a link to an R Markdown document and links to videos.
Chapter \@ref(join-cheatsheet): cheatsheet I made for dplyr join functions (not relevant yet but soon).
```{r links, child="links.md"}
```