-
Notifications
You must be signed in to change notification settings - Fork 0
/
12 tidy data.Rmd
389 lines (283 loc) · 8.51 KB
/
12 tidy data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
---
title: "Tidy Data with TidyR.R"
author: "Russ Conte"
date: "9/9/2018"
output: html_document
---
Let's look at the same data four different ways (actually five ways if you count 4a and 4b as separate ways)
```{r}
library(tidyverse)
table1 #this is the only one of the sets here that is tidy
table2
table3
table4a
table4b
```
What is tidydata? From the text:
There are three interrelated rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
An example of working with tidydata (notice how easy this is to work with)
Calculate cases per 100,000 population:
```{r}
table1 %>%
mutate(rate=cases/population * 100000)
```
Compute cases per year:
```{r}
table1 %>%
count(year, wt=cases)
```
Visualize changes over time:
```{r}
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group=country), color="grey50") +
geom_point(aes(color=country))
```
##12.2.1 Exercises
1. Using prose, describe how the variables and observations are organised in each of the sample tables.
Table 1 is tidydata - each variable is a column, each observation is its own row, each value is in its own cell
Table 2 combines type into one variable (types and cases)
Table 3 has rates as fractions, not values
Table 4a summarizes data but does not state what is being summarized. There are only values, no units
Table 4b similar to Table 4a, summaries but no units, only values
2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
They are all challenging in separate ways, table1 is the easiest by far.
Here is one way to calculate rates in table
```{r}
table1
mutate(table1, rate=cases/population*10000)
```
## 12.3.1, Gathering
The issue here is that sometimes column names are not column names, but <i>values</i> in the dataset.
Let's look at table4a as an example of this problem:
```{r}
table4a
```
The solution is to apply gathering to table4a. The process (from the text):
The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable whose values form the column names. I call that the key, and here it is year.
The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.
```{r}
table4a %>%
gather('1999', '2000', key="year", value="cases")
```
Much better in the gathered version!
Let's look at table4b, and see how to gather the data in this table:
```{r}
table4b
```
Now let's gather the data to transform it into tidy data (note the values in table4b are populations of each country):
```{r}
table4b %>%
gather(`1999`,`2000`, key="year", value="population")
```
Now we combine tables 4a and 4b into one table. This actually uses left_join, which we will learn about later:
```{r}
tidy4a <- table4a %>%
gather(`1999`,`2000`,key="year", value="cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key="year", value="population")
left_join(tidy4a, tidy4b)
```
## 12.3.2 Spreading
From the text: "Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows."
Let's start by looking at table2:
```{r}
table2
```
The issue is that each observation is spread across two rows! We need to spread the data to make it tidy!
```{r}
spread(table2, key=type, value=count)
```
## 12.3.3 Exercises
Why are gather and spread not perfectly symmetrical? (the column names are switched, not sure what other
asymmetries exist here)
```{r}
stocks <- tibble(
year=c(2015,2015,2016,2016),
half=c(1,2,1,2),
return=c(1.88, 0.59, 0.92, 0.17)
)
stocks
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
```
Why does the code below fail? Let's start by looking at table4a:
```{r}
table4a
```
What's wrong with this code?
```{r}
table4a %>%
gather(1999,2000, key="year", value="cases")
```
To fix the code, put 1999 and 2000 within `` marks:
```{r}
table4a %>%
gather(`1999`,`2000`, key="year", value="cases")
```
3. Why does spreading this tibble fail?
```{r}
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people
```
Let's try to spread the 'people' data set:
```{r}
people %>%
spread(key="key", value="value")
```
we can fix this by adding a Year variable
4. Tidy this simple tibble. Do you need to spread or gather it? What are the variables?
```{r}
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no",20,12
)
preg
```
Let's see what happens when we gether the preg data set:
```{r}
tidy5 <- preg %>%
gather('male', 'female', gender='male', value='female')
tidy5
```
## 12.4 Separate and unite (pull):
### 12.4.1 - Separate pulls apart one column into multiple columns.
```{r}
table3
```
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
We can specify the charater to separate by (note that 'cases' and 'population' are character variables):
```{r}
table3 %>%
separate(rate, into=c("cases", "population"), sep="/")
```
we can ask separate to convert the character columns into numbers:
```{r}
table3 %>%
separate(rate, into=c("cases", "population"), sep="/",convert = TRUE)
```
There are other ways to separate, such as by digit location (giving the last two years of the year)
```{r}
table3 %>%
separate(year, into=c("Century", "Year"), sep=2)
```
We can use Unite to combine columns, as follows:
```{r}
table5 %>%
unite(new, century, year)
```
Let's get rid of the _ in the number:
```{r}
table5 %>%
unite(new, century, year, sep="")
```
## 12.5 Missing Values
From the text:
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:
Explicitly, i.e. flagged with NA.
Implicitly, i.e. simply not present in the data.
Example (note both kinds of NA are here - there is no value for Q1 2016, and we have an NA):
```{r}
stocks <- tibble(
year=c(2015,2015,2015,2015,2016,2016,2016),
qtr=c(1,2,3,4,2,3,4),
return=c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
```
We can make the implicit missing value (Q1 2016) explicit by spreading the data:
```{r}
stocks %>%
spread(year, return)
```
It's possible to turn explicit missing values into implicit missing values, using na.rm=TRUE:
```{r}
stocks %>%
spread(year, return) %>%
gather(year, return,`2015`, `2016`, na.rm=TRUE)
```
Another way to make missing values explicit is to use complete(): (I like this solution!)
```{r}
stocks %>%
complete(year, qtr)
```
## Case Study (I like this example a lot! WHO data on tuberculosis)
Let's start by looking at the data - it's a World Health Organization data set, 7,240 rows and 60 columns, with lots of missing data:
```{r}
who
View(who)
```
Let's start by gathering the number of cases by year, country and iso:
```{r}
who1 <- who %>%
gather(
new_sp_m014:newrel_f65, key="key",
value="cases",
na.rm=TRUE
)
who1
```
We can get some idea of the structure of the data by counting the cases:
```{r}
who1 %>%
count(key)
```
```{r}
who2 <- who1 %>%
mutate(key=stringr::str_replace(key, "newrel", "rew_rel"))
who2
```
We can split the cases using separate (this is really cool and super easy to do!)
```{r}
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep="_")
who3
```
This counts 'new', but I'm not sure why there are two rows of data reported here - what's the difference???
```{r}
who3 %>%
count(new)
```
We can drop redundant columns! :)
```{r}
who4 <- who3 %>%
select(-new, -iso2, -iso3)
who4
```
Now let's split sexage by splitting after the m or f character:
```{r}
who5 <- who4 %>%
separate(sexage, c("sex", "age"), se=1)
who5
```
Hadley and Garrett do this in one (big?) connected set of pipe commands:
```{r}
who %>%
gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
```