---
title: "Data Transformation"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

<!-- This file by Jake Thompson is licensed under a Creative Commons Attribution 4.0 International License, adapted from the original work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->

```{r setup, include = FALSE}
library(tidyverse)
library(babynames)
library(nycflights13)
library(skimr)
```


## Babynames

```{r}
babynames
skim(babynames)
```


## Your Turn 1

Run the `skim_with()` command, and then skim `babynames` again to see how the output differs.

```{r}
skim_with(integer = list(p25 = NULL, p75 = NULL))
```


## Your Turn 2

Alter the code to select just the `n` column:

```{r}
select(babynames, name, prop)
```


## `select()` helpers

```{r}
select(storms, name:pressure)
select(storms, -c(name, pressure))
select(storms, starts_with("w"))
select(storms, ends_with("e"))
select(storms, contains("d"))
select(storms, matches("^.{4}$"))
select(storms, one_of(c("name", "names", "Name")))
select(storms, num_range("x", 1:5))
```


## Consider

Which of these is NOT a way to select the `name` and `n` columns together?

```{r}
select(babynames, -c(year, sex, prop))
select(babynames, name:n)
select(babynames, starts_with("n"))
select(babynames, ends_with("n"))
```


## Your Turn 3

Show:

* All of the names where prop is greater than or equal to 0.08  
* All of the children named "Daenerys"  
* All of the names that have a missing value for `n`  

```{r}

```
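
As a hedged illustration of the three kinds of test listed above, here is a minimal sketch on a made-up tibble. `pets` and its columns are hypothetical and are not part of `babynames`. Note that a missing value has to be tested with `is.na()`, because comparing against `NA` with `==` never returns `TRUE`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data (not part of babynames)
pets <- tribble(
  ~name, ~species, ~weight,
  "Rex",    "dog",      30,
  "Tom",    "cat",       4,
  "Ava",    "cat",      NA
)

# Comparison: rows where weight is at least 4 (the NA row is dropped)
heavy <- filter(pets, weight >= 4)

# Equality: use ==, not =
cats <- filter(pets, species == "cat")

# Missing values: use is.na(), not == NA
unweighed <- filter(pets, is.na(weight))
```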


## Your Turn 4

Use Boolean operators to alter the code below to return only the rows that contain:

* Girls named Sea 
* Names that were used by exactly 5 or 6 children in 1880  
* Names that are one of Acura, Lexus, or Yugo

```{r}
filter(babynames, name == "Sea" | name == "Anemone")
```
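
The sketch below shows the three Boolean patterns this exercise relies on (`&`, `|`, and `%in%`) on a made-up tibble. `cars_toy` and its columns are hypothetical and are used only for illustration.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data (not part of babynames)
cars_toy <- tribble(
  ~make,   ~doors, ~year,
  "Acme",       2,  1990,
  "Acme",       4,  1995,
  "Bolt",       4,  1990,
  "Comet",      5,  1990
)

# AND: both conditions must hold
acme_1990 <- filter(cars_toy, make == "Acme" & year == 1990)

# OR: each side of | needs a complete test; a comma adds an AND condition
four_or_five <- filter(cars_toy, year == 1990, doors == 4 | doors == 5)

# %in%: shorthand for several == tests joined by |
bolt_or_comet <- filter(cars_toy, make %in% c("Bolt", "Comet"))
```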


## Your Turn 5

Arrange `babynames` by `n`. Add `prop` as a second (tie-breaking) variable to arrange on. Can you tell what the smallest value of `n` is?

```{r}

```
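
To make the idea of a tie-breaking variable concrete, here is a small sketch on made-up data. The `scores` tibble is hypothetical. Rows are sorted by the first variable, and the second variable only decides the order among rows that tie on the first.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data
scores <- tribble(
  ~player, ~wins, ~points,
  "A",         2,      15,
  "B",         1,      30,
  "C",         2,      10
)

# Sort by wins; players A and C tie on wins, so points orders them
by_wins <- arrange(scores, wins, points)
by_wins
```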


## Your Turn 6

* Use `desc()` to find the names with the highest `prop`.
* Then, use `desc()` to find the names with the highest `n`.

```{r}

```
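
A minimal sketch of `desc()` on made-up data follows. The `scores` tibble is hypothetical. `desc()` reverses the sort order of one variable, so it can be mixed with ascending variables in the same `arrange()` call.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data
scores <- tribble(
  ~player, ~wins, ~points,
  "A",         2,      15,
  "B",         1,      30,
  "C",         2,      10
)

# Largest points first
top_points <- arrange(scores, desc(points))

# Most wins first; ties on wins are broken by points in ascending order
most_wins <- arrange(scores, desc(wins), points)
```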


## Steps and the pipe

```{r}
babynames %>%
  filter(year == 2015, sex == "M") %>%
  select(name, n) %>%
  arrange(desc(n))
```


## Your Turn 7

Use `%>%` to write a sequence of functions that: 

1. Filter babynames to just the girls that were born in 2015  
2. Select the `name` and `n` columns  
3. Arrange the results so that the most popular names are near the top.

```{r}

```


## Your Turn 8

1. Trim `babynames` to just the rows that contain your `name` and your `sex`  
2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)  
3. Plot the results as a line graph with `year` on the x axis and `prop` on the y axis, colored by `sex`

```{r}

```
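
The filter, select, and plot pattern described above might look like the following sketch on made-up data. The `sales` tibble and its columns are hypothetical. Note that the pipeline switches from `%>%` to `+` once `ggplot()` is called.

```{r}
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical toy data
sales <- tribble(
  ~year, ~region, ~product, ~share,
   2001, "north",    "tea",   0.10,
   2002, "north",    "tea",   0.12,
   2003, "north",    "tea",   0.15,
   2001, "south",    "tea",   0.20,
   2002, "south",    "tea",   0.18,
   2003, "south",    "tea",   0.17,
   2001, "north",    "jam",   0.30
)

tea_plot <- sales %>%
  filter(product == "tea") %>%          # keep only the rows of interest
  select(year, region, share) %>%       # keep only the plotted columns
  ggplot(aes(x = year, y = share, color = region)) +
  geom_line()                           # one line per region

tea_plot
```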


## Your Turn 9

Use `summarize()` to compute three statistics about the data:

1. The first (minimum) year in the dataset  
2. The last (maximum) year in the dataset  
3. The total number of children represented in the data

```{r}

```
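
As a rough sketch of how `summarize()` collapses a table to a single row of statistics, consider the made-up data below. The `visits` tibble and the output column names are hypothetical.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data
visits <- tribble(
  ~year, ~count,
   2010,      5,
   2012,      7,
   2011,      3
)

# Each name = expression pair becomes one column of the one-row result
visit_summary <- visits %>%
  summarize(first_year = min(year), last_year = max(year), total = sum(count))

visit_summary
```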


## Your Turn 10

Extract the rows where `name == "Khaleesi"`. Then use `summarize()` to find:

1. The total number of children named Khaleesi
2. The first year Khaleesi appeared in the data

```{r}

```


## Toy data for transforming

```{r}
# Toy dataset to use
pollution <- tribble(
       ~city,   ~size, ~amount, 
  "New York", "large",      23,
  "New York", "small",      14,
    "London", "large",      22,
    "London", "small",      16,
   "Beijing", "large",      121,
   "Beijing", "small",      56
)
```


## Summarize

```{r}
pollution %>% 
 summarize(mean = mean(amount), sum = sum(amount), n = n())
```

```{r}
pollution %>% 
  group_by(city) %>%
  summarize(mean = mean(amount), sum = sum(amount), n = n())
```


## Your Turn 11

Use `group_by()`, `summarize()`, and `arrange()` to display the ten most popular baby names. Compute popularity as the total number of children of a single sex who were given a name.

```{r}

```
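
Grouping by two variables before summarizing can be sketched on made-up data as follows. The `harvest` tibble is hypothetical. Each distinct combination of the grouping variables gets its own total, and `head()` is used here as one simple way to keep the top rows after sorting.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per year, fruit, and colour
harvest <- tribble(
  ~year,  ~fruit, ~colour, ~n,
   2001, "apple",   "red", 10,
   2002, "apple",   "red", 15,
   2001, "apple", "green",  8,
   2002, "apple", "green",  4,
   2001, "grape",   "red", 20,
   2002, "grape",   "red",  1
)

top_two <- harvest %>%
  group_by(fruit, colour) %>%      # one group per fruit/colour combination
  summarize(total = sum(n)) %>%    # total across years within each group
  arrange(desc(total)) %>%         # largest totals first
  head(2)                          # keep the first two rows

top_two
```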


## Your Turn 12

* Use grouping to calculate the number of children born each year.
* Plot the results as a line graph.

```{r}

```


## Mutate

```{r}
babynames %>%
  mutate(percent = round(prop * 100, 2))
```


## Your Turn 13

Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `prop` to smallest `prop`.

```{r}

```
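
A minimal sketch of `min_rank()` on made-up data is shown here. The `quiz` tibble is hypothetical. By default `min_rank()` gives rank 1 to the smallest value, so wrapping the variable in `desc()` gives rank 1 to the largest. Tied values share the lowest rank of the tie, and the next rank is skipped.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data
quiz <- tribble(
  ~student, ~score,
  "Ann",        10,
  "Bob",        30,
  "Cy",         20,
  "Dee",        30
)

# Bob and Dee tie for rank 1, so rank 2 is skipped and Cy gets rank 3
ranked <- quiz %>%
  mutate(rank = min_rank(desc(score)))

ranked
```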


## Your Turn 14

* Compute each name's rank _within its year and sex_. 
* Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest.

```{r}

```
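
The two-step pattern above (rank within one grouping, then summarize over another) can be sketched on made-up data. The `results` tibble is hypothetical. A second `group_by()` replaces the first grouping, and because rank 1 is the highest rank, an ascending sort puts the best median rank first.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data: three teams scored in each of three seasons
results <- tribble(
  ~season,  ~team, ~points,
        1,  "Red",      50,
        1, "Blue",      40,
        1, "Gold",      30,
        2,  "Red",      45,
        2, "Blue",      35,
        2, "Gold",      20,
        3,  "Red",      55,
        3, "Blue",      60,
        3, "Gold",      10
)

median_ranks <- results %>%
  group_by(season) %>%                        # rank within each season
  mutate(rank = min_rank(desc(points))) %>%
  group_by(team) %>%                          # regroup: replaces the season grouping
  summarize(median_rank = median(rank)) %>%
  arrange(median_rank)                        # rank 1 is best, so sort ascending

median_ranks
```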


## Flights data

```{r}
flights
skim(flights)
```


## Toy data

```{r}
band <- tribble(
   ~name,     ~band,
  "Mick",  "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

instrument <- tribble(
    ~name,   ~plays,
   "John", "guitar",
   "Paul",   "bass",
  "Keith", "guitar"
)

instrument2 <- tribble(
  ~artist,   ~plays,
   "John", "guitar",
   "Paul",   "bass",
  "Keith", "guitar"
)
```


## Mutating joins

```{r}
band %>% left_join(instrument, by = "name")
```


## Your Turn 15

Which airlines had the largest arrival delays? Complete the code below.

1. Join `airlines` to `flights`
2. Compute and order the average arrival delays by airline. Display full names, no codes.

```{r}
flights %>%
  drop_na(arr_delay) %>%
  ___ %>%
  group_by(___) %>%
  ___ %>%
  arrange(___) 
```


## Different names

```{r}
band %>% left_join(instrument2, by = c("name" = "artist"))
```


## Your Turn 16

How many airports in `airports` are serviced by flights originating in New York (i.e., flights in our dataset)? Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set.


```{r}
___ %>%
  ___(___, by = ___) %>%
  distinct(___)
```


***

# Takeaways

* Extract variables with `select()`  
* Extract cases with `filter()`  
* Arrange cases with `arrange()`  

* Make tables of summaries with `summarize()`  
* Make new variables with `mutate()`  
* Do groupwise operations with `group_by()`

* Connect operations with `%>%`  

* Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets
* Use `semi_join()` or `anti_join()` to filter datasets against each other
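
To illustrate the difference between the two families of joins in the last two bullets, here is a short sketch that rebuilds the `band` and `instrument` toy tables from earlier in this notebook. The result names (`both_tables`, `either_table`, `has_instrument`, `no_instrument`) are hypothetical. Mutating joins add columns from the second table, while filtering joins only keep or drop rows of the first.

```{r}
library(dplyr)
library(tibble)

# Same toy data as in the "Toy data" section
band <- tribble(
   ~name,     ~band,
  "Mick",  "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

instrument <- tribble(
    ~name,   ~plays,
   "John", "guitar",
   "Paul",   "bass",
  "Keith", "guitar"
)

# Mutating joins: columns from instrument are added to band
both_tables  <- band %>% inner_join(instrument, by = "name")  # only names found in both
either_table <- band %>% full_join(instrument, by = "name")   # every name from either table

# Filtering joins: rows of band are kept or dropped; no columns are added
has_instrument <- band %>% semi_join(instrument, by = "name")
no_instrument  <- band %>% anti_join(instrument, by = "name")
```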