3_Data_Transformation_with_dplyr.Rmd

---
title: "3. Data Transformation with dplyr"
author: "Russ Conte"
date: "8/5/2018"
output: html_document
---

##5.1.1 Prerequisites

These lines of code will setup our dataset, nycflights13, and tidyverse which will include dplyr.
```{r, echo=TRUE}
library(nycflights13)
library(tidyverse)
```

Let's have a look at the flights data set (note - this is a tibble, more on that later):

```{r}
flights
```

If you want to see the entire data set, use View(dataset), for example:

```{r}
View(flights)
```

## 5.1.3 dplyr basics: Select, Filter, Mutate, Arrange, Summarize (memorize these five terms!)


Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
Note - these can <i>all</i> be used in conjunction with group_by()

How this works:

1. The first argument is the data frame
2. The second argument is the action (filter, arrange, etc)
3. The result is a new (<i>not saved</i>) data frame

Let's filter the flights to show flights on January 1, for all years:

```{r}
filter(flights, month==1, day==1)
```

Note that dplyr <i><b>never</i></b> saves the output, so you'll have to save it specifically if you want it later:

```{r}
jan1 <- filter(flights, day==1, day==1)
jan1
```

To both print out the results AND save them to a variable, wrap them in parentheses:

```{r}
(jan10 <- filter(flights, day==10, month==1))
```

## 5.2.4 Exercises

1. Find all flights that had:

A. Had an arrival delay of two or more hours (120 minutes)

```{r}
filter(flights, arr_delay>120, dep_delay>120)
```

B. Flew to Houston (IAH or HOU)

```{r}
filter(flights, dest %in% c("HOU", "IAH"))
```

3. Were operated by United, American, or Delta

```{r}
filter(flights, carrier %in% c("UA", "AA", "DL"))
```

4. Departed in summer (July, August, and September)

```{r}
filter(flights, month %in% c(7,8,9))
```

Arrived more than two hours late, but didn’t leave late

```{r}
filter(flights, arr_delay>120, dep_delay<=0)
```

Were delayed by at least an hour, but made up over 30 minutes in flight

```{r}
filter(flights, dep_delay>=60, arr_delay<=30)
```

Departed between midnight and 6am (inclusive)

```{r}
filter(flights, between(dep_time, 0000, 0600))
```

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

Between is useful for finding values between to given values, such as x is between 2 and 5

```{r}
filter(flights, between(dep_time, 0000,0600))
```

3. How many flights have a missing dep_time?

```{r}
sum(is.na(flights$dep_time))
```

What other variables are missing? What might these rows represent?

```{r}
colnames(flights)[colSums(is.na(flights)) > 0]
sum(is.na(flights$dep_time))
sum(is.na(flights$dep_delay))
sum(is.na(flights$arr_time))
sum(is.na(flights$arr_delay))
sum(is.na(flights$air_time))
```

These might show any number of issues, such as staff shortages, data entry problems, etc.

## 5.3 Arrange rows with Arrange

```{r}
arrange(flights, year, month, day)
```

use desc() to list in descending order:

```{r}
arrange(flights, desc(dep_delay))
```

## 5.3.1 Exercises

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

```{r}
arrange(flights,desc(is.na(dep_delay)),desc(is.na(dep_time)), desc(is.na(arr_time)), desc(is.na(arr_delay)),
        desc(is.na(air_time)))
```

Sort flights to find the most delayed flights. 

```{r}
arrange(flights, desc(dep_delay))
```

Find the flights that left earliest.

```{r}
deptime <- filter(flights, dep_time>0) #this eliminates the NAs
arrange(deptime, dep_time)
```

Sort flights to find the fastest flights.

```{r}
arrange(flights, arr_delay)
```

Which flights travelled the longest?

```{r}
arrange(flights, desc(air_time))
```

Which flights traveled the shortest?

```{r}
arrange(flights, air_time)
```


Which travelled the shortest?

```{r}
arrange(flights, distance)
```

## 5.4 Select columns with Select

```{r}
select(flights, year, day, month)
```

Select all columns between year and day (inclusive), note the use of the colon symbol.

```{r}
select(flights, year:day)
```

Select all columns <i><b>except</i></b> those between year and day

```{r}
select(flights, -(year:day))
```

rename is a very good function to rename variables. Do NOT use select to rename variables because select
drops all the other variables in the data set.

```{r}
rename(flights, tail_num=tailnum)
```

To move specific columns to the start of a data frame and keep everything, use everything():
```{r}
select(flights, time_hour, air_time, everything())
```

## 5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

```{r}
library(tidyverse)
library(nycflights13)
select(flights, dep_time, dep_delay, arr_time, arr_delay )
select(flights, dep_time, dep_delay, arr_time, arr_delay, everything())
```

What happens if you include the name of a variable multiple times in a select() call?

Let's find out!! :)

```{r}
select(flights, day,day,day,day,day)
```

The variable only shows up one time. Nice! :)

3.What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
```

4. Does the result of running the following code surprise you? 

```{r}
select(flights, contains("TIME"))
```

yes it does surprise me, usually R is extremely case sensitive, but this seems to not be case sensitive.

How do the select helpers deal with case by default? How can you change that default?

```{r}
library(tidyverse)
?contains
```

From the help:
ignore.case	 If TRUE, the default, ignores case when matching names.

Thus to change the default, ignore.case=FALSE:

```{r}
select(flights, contains("TIME",ignore.case = FALSE))
```

## 5.5 Add new variable with mutate()

```{r}
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
```

It is possible to refer to columns that were just created:

```{r}
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

Use transmute to only keep the new columns:

```{r}
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

## 5.5.1, Userful creation functions

Modular arithmetic breaks numbers up into pieces:
  %/% is interger division
  %% gives the remainder

```{r}
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
```

Logs (made famous by John Napier in the 19th century)
We can use log(), log10(), log2(), etc. Hadley recommends log2() because it's easy to interpret: A difference of 1
means doubling the value, a difference of -1 means halving the value. For example:

```{r}
x=2
print(paste0("x= ",x))
print(paste0("log2(x) =",log2(x)))
```

lead() and lag()

These are offsets, as follows:

```{r}
x=1:42
lead(x)
lag(x)
```

Cumulative and rolling averages - I can see lots of ways these can be used in analysis!

```{r}
x
cumsum(x)
library(RcppRoll)
library(dplyr)
cummean(x)

```

Ranking examples:

```{r}
y <- c(1,2,2,NA,3,4,5,6)
min_rank(y)
min_rank(desc(y))

```

## 5.5.2 Exercises

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

```{r}
transmute(flights,
  dep_time,
  dep_hour = dep_time %/% 100,
  dep_minute = dep_time %% 100,
  dep_time_total=dep_hour*60+dep_minute,
  
  sched_dep_time,
  sched_dep_time_hour = sched_dep_time %/% 100,
  sched_dep_time_min = sched_dep_time %% 100,
  sched_dep_time_total=sched_dep_time_hour*60+ sched_dep_time_min
  )

```

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

```{r}
transmute(flights, air_time, arr_time-dep_time)
```

I expect to see the same values, it seems odd the values are not equal. A set of flights from Boston was created to greatly simplify the problem, since there are no time zone or change of day issues with virtually all flights

```{r}
boston <- filter(flights, dest=="BOS")

select(boston, air_time, arr_time, arr_delay, dep_time, dep_delay)
```

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

I would expect those to be related as follows:

dep_time = sched_dep_time + dep_delay. Let's take a look using only flights from Boston:

```{r}
transmute(boston, dep_time, new1=sched_dep_time + dep_delay)
```

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

```{r}
?ranking # to give information about the ranking functions
flights %>% filter(between(row_number(flights$arr_delay), 1, 10))
```

What does 1:3 + 1:10 return? Why? Let's find out!

```{r}
1:3 + 1:10
```

What trigonometric functions does R provide?

```{r}
?trigonometry
```

cos(x)
sin(x)
tan(x)

acos(x)
asin(x)
atan(x)
atan2(y, x)

cospi(x)
sinpi(x)
tanpi(x)

## 5.6 Grouped summaries with summarise

As Hadley writes, "The last key verb is summarise(). It collapses a data frame to a single row:"

```{r}
summarise(flights, delay=mean(dep_delay, na.rm=TRUE))
```

Here is how to group data, for example by month:

```{r}
by_month <- group_by(flights, year, month)
summarise(by_month, delay=mean(dep_delay, na.rm=TRUE))
```

Let's compare that with flights from Boston:

by_month1