-
Notifications
You must be signed in to change notification settings - Fork 0
/
3_Data_Transformation_with_dplyr.Rmd
460 lines (322 loc) · 9.67 KB
/
3_Data_Transformation_with_dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
---
title: "3. Data Transformation with dplyr"
author: "Russ Conte"
date: "8/5/2018"
output: html_document
---
##5.1.1 Prerequisites
These lines of code will setup our dataset, nycflights13, and tidyverse which will include dplyr.
```{r, echo=TRUE}
library(nycflights13)
library(tidyverse)
```
Let's have a look at the flights data set (note - this is a tibble, more on that later):
```{r}
flights
```
If you want to see the entire data set, use View(dataset), for example:
```{r}
View(flights)
```
## 5.1.3 dplyr basics: Select, Filter, Mutate, Arrange, Summarize (memorize these five terms!)
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
Note - these can <i>all</i> be used in conjunction with group_by()
How this works:
1. The first argument is the data frame
2. The second argument is the action (filter, arrange, etc)
3. The result is a new (<i>not saved</i>) data frame
Let's filter the flights to show flights on January 1, for all years:
```{r}
filter(flights, month==1, day==1)
```
Note that dplyr <i><b>never</i></b> saves the output, so you'll have to save it specifically if you want it later:
```{r}
jan1 <- filter(flights, day==1, day==1)
jan1
```
To both print out the results AND save them to a variable, wrap them in parentheses:
```{r}
(jan10 <- filter(flights, day==10, month==1))
```
## 5.2.4 Exercises
1. Find all flights that had:
A. Had an arrival delay of two or more hours (120 minutes)
```{r}
filter(flights, arr_delay>120, dep_delay>120)
```
B. Flew to Houston (IAH or HOU)
```{r}
filter(flights, dest %in% c("HOU", "IAH"))
```
3. Were operated by United, American, or Delta
```{r}
filter(flights, carrier %in% c("UA", "AA", "DL"))
```
4. Departed in summer (July, August, and September)
```{r}
filter(flights, month %in% c(7,8,9))
```
Arrived more than two hours late, but didn’t leave late
```{r}
filter(flights, arr_delay>120, dep_delay<=0)
```
Were delayed by at least an hour, but made up over 30 minutes in flight
```{r}
filter(flights, dep_delay>=60, arr_delay<=30)
```
Departed between midnight and 6am (inclusive)
```{r}
filter(flights, between(dep_time, 0000, 0600))
```
2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
Between is useful for finding values between to given values, such as x is between 2 and 5
```{r}
filter(flights, between(dep_time, 0000,0600))
```
3. How many flights have a missing dep_time?
```{r}
sum(is.na(flights$dep_time))
```
What other variables are missing? What might these rows represent?
```{r}
colnames(flights)[colSums(is.na(flights)) > 0]
sum(is.na(flights$dep_time))
sum(is.na(flights$dep_delay))
sum(is.na(flights$arr_time))
sum(is.na(flights$arr_delay))
sum(is.na(flights$air_time))
```
These might show any number of issues, such as staff shortages, data entry problems, etc.
## 5.3 Arrange rows with Arrange
```{r}
arrange(flights, year, month, day)
```
use desc() to list in descending order:
```{r}
arrange(flights, desc(dep_delay))
```
## 5.3.1 Exercises
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
```{r}
arrange(flights,desc(is.na(dep_delay)),desc(is.na(dep_time)), desc(is.na(arr_time)), desc(is.na(arr_delay)),
desc(is.na(air_time)))
```
Sort flights to find the most delayed flights.
```{r}
arrange(flights, desc(dep_delay))
```
Find the flights that left earliest.
```{r}
deptime <- filter(flights, dep_time>0) #this eliminates the NAs
arrange(deptime, dep_time)
```
Sort flights to find the fastest flights.
```{r}
arrange(flights, arr_delay)
```
Which flights travelled the longest?
```{r}
arrange(flights, desc(air_time))
```
Which flights traveled the shortest?
```{r}
arrange(flights, air_time)
```
Which travelled the shortest?
```{r}
arrange(flights, distance)
```
## 5.4 Select columns with Select
```{r}
select(flights, year, day, month)
```
Select all columns between year and day (inclusive), note the use of the colon symbol.
```{r}
select(flights, year:day)
```
Select all columns <i><b>except</i></b> those between year and day
```{r}
select(flights, -(year:day))
```
rename is a very good function to rename variables. Do NOT use select to rename variables because select
drops all the other variables in the data set.
```{r}
rename(flights, tail_num=tailnum)
```
To move specific columns to the start of a data frame and keep everything, use everything():
```{r}
select(flights, time_hour, air_time, everything())
```
## 5.4.1 Exercises
1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
```{r}
library(tidyverse)
library(nycflights13)
select(flights, dep_time, dep_delay, arr_time, arr_delay )
select(flights, dep_time, dep_delay, arr_time, arr_delay, everything())
```
What happens if you include the name of a variable multiple times in a select() call?
Let's find out!! :)
```{r}
select(flights, day,day,day,day,day)
```
The variable only shows up one time. Nice! :)
3.What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
```
4. Does the result of running the following code surprise you?
```{r}
select(flights, contains("TIME"))
```
yes it does surprise me, usually R is extremely case sensitive, but this seems to not be case sensitive.
How do the select helpers deal with case by default? How can you change that default?
```{r}
library(tidyverse)
?contains
```
From the help:
ignore.case If TRUE, the default, ignores case when matching names.
Thus to change the default, ignore.case=FALSE:
```{r}
select(flights, contains("TIME",ignore.case = FALSE))
```
## 5.5 Add new variable with mutate()
```{r}
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
```
It is possible to refer to columns that were just created:
```{r}
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
Use transmute to only keep the new columns:
```{r}
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
## 5.5.1, Userful creation functions
Modular arithmetic breaks numbers up into pieces:
%/% is interger division
%% gives the remainder
```{r}
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
```
Logs (made famous by John Napier in the 19th century)
We can use log(), log10(), log2(), etc. Hadley recommends log2() because it's easy to interpret: A difference of 1
means doubling the value, a difference of -1 means halving the value. For example:
```{r}
x=2
print(paste0("x= ",x))
print(paste0("log2(x) =",log2(x)))
```
lead() and lag()
These are offsets, as follows:
```{r}
x=1:42
lead(x)
lag(x)
```
Cumulative and rolling averages - I can see lots of ways these can be used in analysis!
```{r}
x
cumsum(x)
library(RcppRoll)
library(dplyr)
cummean(x)
```
Ranking examples:
```{r}
y <- c(1,2,2,NA,3,4,5,6)
min_rank(y)
min_rank(desc(y))
```
## 5.5.2 Exercises
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
```{r}
transmute(flights,
dep_time,
dep_hour = dep_time %/% 100,
dep_minute = dep_time %% 100,
dep_time_total=dep_hour*60+dep_minute,
sched_dep_time,
sched_dep_time_hour = sched_dep_time %/% 100,
sched_dep_time_min = sched_dep_time %% 100,
sched_dep_time_total=sched_dep_time_hour*60+ sched_dep_time_min
)
```
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
```{r}
transmute(flights, air_time, arr_time-dep_time)
```
I expect to see the same values, it seems odd the values are not equal. A set of flights from Boston was created to greatly simplify the problem, since there are no time zone or change of day issues with virtually all flights
```{r}
boston <- filter(flights, dest=="BOS")
select(boston, air_time, arr_time, arr_delay, dep_time, dep_delay)
```
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
I would expect those to be related as follows:
dep_time = sched_dep_time + dep_delay. Let's take a look using only flights from Boston:
```{r}
transmute(boston, dep_time, new1=sched_dep_time + dep_delay)
```
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
```{r}
?ranking # to give information about the ranking functions
flights %>% filter(between(row_number(flights$arr_delay), 1, 10))
```
What does 1:3 + 1:10 return? Why? Let's find out!
```{r}
1:3 + 1:10
```
What trigonometric functions does R provide?
```{r}
?trigonometry
```
cos(x)
sin(x)
tan(x)
acos(x)
asin(x)
atan(x)
atan2(y, x)
cospi(x)
sinpi(x)
tanpi(x)
## 5.6 Grouped summaries with summarise
As Hadley writes, "The last key verb is summarise(). It collapses a data frame to a single row:"
```{r}
summarise(flights, delay=mean(dep_delay, na.rm=TRUE))
```
Here is how to group data, for example by month:
```{r}
by_month <- group_by(flights, year, month)
summarise(by_month, delay=mean(dep_delay, na.rm=TRUE))
```
Let's compare that with flights from Boston:
by_month1