12 tidy data.Rmd

---
title: "Tidy Data with TidyR.R"
author: "Russ Conte"
date: "9/9/2018"
output: html_document
---

Let's look at the same data represented four different ways (five if you count table4a and table4b separately):

```{r}
library(tidyverse)
table1 #this is the only one of the sets here that is tidy
table2
table3
table4a
table4b
```

What is tidy data? From the text:

There are three interrelated rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.


An example of working with tidy data (notice how easy it is to work with).

Calculate cases per 100,000 population:

```{r}
table1 %>% 
  mutate(rate=cases/population * 100000)
```

Compute cases per year:

```{r}
table1 %>% 
  count(year, wt=cases)
```

Visualize changes over time:

```{r}
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group=country), color="grey50") +
  geom_point(aes(color=country))
```

## 12.2.1 Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

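Table 1 is tidy data: each variable has its own column, each observation (a country in a year) has its own row, and each value has its own cell.

Table 2 spreads each observation across two rows: `type` holds variable names (cases, population) and `count` holds their values.

Table 3 stores two variables in one column: `rate` is a character string of the form cases/population, not a computed value.

Table 4a holds only the case counts, with one row per country and the years (1999, 2000) as column names. Nothing in the table says that the values are cases.

Table 4b has the same layout as Table 4a, but the values are populations.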

2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

    1. Extract the number of TB cases per country per year.
    2. Extract the matching population per country per year.
    3. Divide cases by population, and multiply by 10000.
    4. Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?
They are all challenging in different ways; table1 is the easiest by far, because cases and population already sit side by side in each row.
Here is one way to calculate rates in table1:


```{r}
table1
mutate(table1, rate=cases/population*10000)

```
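The exercise asks for table2 and table4a + table4b, so here is a rough sketch of one way to do those as well. The names `t2_cases`, `t2_pop`, `t2_rate` and `t4_rate` are just my own labels, and the join-based approach is only one of several that would work.

```{r}
library(dplyr)
library(tidyr)

# table2: cases and population live on separate rows, so pull each out first
t2_cases <- table2 %>%
  filter(type == "cases") %>%
  rename(cases = count) %>%
  select(-type)
t2_pop <- table2 %>%
  filter(type == "population") %>%
  rename(population = count) %>%
  select(-type)

# line the two pieces up by country and year, then compute the rate
t2_rate <- t2_cases %>%
  inner_join(t2_pop, by = c("country", "year")) %>%
  mutate(rate = cases / population * 10000)
t2_rate

# table4a + table4b: check the countries are in the same order,
# then divide the matching year columns directly
stopifnot(identical(table4a$country, table4b$country))
t4_rate <- tibble(
  country = table4a$country,
  `1999` = table4a$`1999` / table4b$`1999` * 10000,
  `2000` = table4a$`2000` / table4b$`2000` * 10000
)
t4_rate
```

Both take noticeably more code than the single mutate() on table1, which is the point of the exercise.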

## 12.3.1 Gathering

The issue here is that sometimes column names are not names of variables, but *values* of a variable.
Let's look at table4a as an example of this problem:

```{r}
table4a
```


The solution is to gather those columns into a new pair of variables. To describe the operation we need three parameters (from the text):

1. The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
2. The name of the variable whose values form the column names. I call that the key, and here it is year.
3. The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.

```{r}
table4a %>% 
  gather(`1999`, `2000`, key="year", value="cases")
```

Much better in the gathered version!
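For comparison, here is a sketch of the same reshaping with pivot_longer(), assuming tidyr 1.0.0 or later is installed (newer tidyr versions recommend it over gather()). The object name `longer4a` is my own.

```{r}
library(tidyr)

# names_to plays the role of gather()'s key, values_to the role of value
longer4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
longer4a
```

The rows come out in a different order than with gather() (country by country rather than year by year), but the content is the same, and year is still a character column.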

Let's look at table4b, and see how to gather the data in this table:

```{r}
table4b
```

Now let's gather the data to transform it into tidy data (note the values in table4b are populations of each country):

```{r}
table4b %>% 
  gather(`1999`,`2000`, key="year", value="population")
```

Now we combine tables 4a and 4b into one table. This actually uses left_join, which we will learn about later:

```{r}
tidy4a <- table4a %>% 
  gather(`1999`,`2000`,key="year", value="cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key="year", value="population")
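# name the join columns explicitly instead of relying on the automatic match
left_join(tidy4a, tidy4b, by = c("country", "year"))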
```

## 12.3.2 Spreading

From the text: "Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows."

Let's start by looking at table2:

```{r}
table2
```

The issue is that each observation is spread across two rows! We need to spread the data to make it tidy!

```{r}
spread(table2, key=type, value=count)
```
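A sketch of the same step with pivot_wider(), again assuming tidyr 1.0.0 or later. The name `wider2` is my own.

```{r}
library(tidyr)

# names_from plays the role of spread()'s key, values_from the role of value
wider2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
wider2
```

The result should have the same shape as table1: one row per country and year, with cases and population as columns.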

## 12.3.3 Exercises

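1. Why are gather and spread not perfectly symmetrical? The column order changes, and column type information is lost: `year` starts as a number, but after spreading and gathering it comes back as a character column, because gather builds it from column names.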

```{r}
stocks <- tibble(
  year=c(2015,2015,2016,2016),
  half=c(1,2,1,2),
  return=c(1.88, 0.59, 0.92, 0.17)
)
stocks
stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)

```
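To see the type change directly, here is a small sketch that checks the class of year after the round trip, with and without gather()'s `convert` argument. The names `stocks_half`, `round_trip` and `round_trip_conv` are my own; the data is the same as above.

```{r}
library(tidyr)
library(tibble)

stocks_half <- tibble(
  year = c(2015, 2015, 2016, 2016),
  half = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# spread then gather: year is rebuilt from column names, so it is character
round_trip <- stocks_half %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`)
class(round_trip$year)

# convert = TRUE asks gather() to guess a better type for the key column
round_trip_conv <- stocks_half %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE)
class(round_trip_conv$year)
```

Even with `convert = TRUE` the round trip is not exact: year comes back as an integer, while it started as a double.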

2. Why does the code below fail? Let's start by looking at table4a:

```{r}
table4a
```

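What's wrong with this code? (The chunk sets `error=TRUE` so the document still knits when the code fails.)

```{r, error=TRUE}
table4a %>% 
  gather(1999,2000, key="year", value="cases")
```

The bare numbers 1999 and 2000 are read as column positions, and table4a has only three columns. To fix the code, wrap 1999 and 2000 in backticks so they are treated as column names: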

```{r}
table4a %>% 
  gather(`1999`,`2000`, key="year", value="cases")

```

3. Why does spreading this tibble fail?

```{r}
people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
people
```

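Let's try to spread the 'people' data set (the chunk sets `error=TRUE` so the document still knits when the code fails):

```{r, error=TRUE}
people %>% 
  spread(key="key", value="value")
```

Spreading fails because "Phillip Woods" has two `age` rows, so `name` and `key` together do not identify a single value. We can fix this by adding a variable that tells the duplicate rows apart, such as a year or an observation number.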
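Here is a minimal sketch of that kind of fix. Since the data has no real year, it uses an observation counter instead: `obs` is a hypothetical helper column that numbers repeated (name, key) pairs, and `people_wide` is my own name for the result.

```{r}
library(dplyr)
library(tidyr)
library(tibble)

# same data as above, repeated so this chunk runs on its own
people <- tribble(
  ~name,             ~key,    ~value,
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

people_wide <- people %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>%  # 1 for the first age/height, 2 for a repeat
  ungroup() %>%
  spread(key = key, value = value)
people_wide
```

Phillip Woods now gets two rows, one per observation, and the second has a missing height because only one height was recorded.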

4. Tidy this simple tibble. Do you need to spread or gather it? What are the variables?

```{r}
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no",20,12
)
preg
```

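Let's see what happens when we gather the preg data set. We need to gather, because `male` and `female` are values of a variable, not variables themselves. The variables are pregnant, gender and count:

```{r}
tidy5 <- preg %>%
  gather(male, female, key = "gender", value = "count")
tidy5
```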

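## 12.4 Separating and uniting

### 12.4.1 Separate

separate() pulls apart one column into multiple columns. By default it splits wherever it finds a non-alphanumeric character. In table3, the rate column holds both cases and population: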

```{r}
table3
```

```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"))
```

We can specify the character to separate by (note that 'cases' and 'population' are still character columns):

```{r}
table3 %>% 
  separate(rate, into=c("cases", "population"), sep="/")
```

We can ask separate to convert the character columns into numbers:

```{r}
table3 %>% 
  separate(rate, into=c("cases", "population"), sep="/",convert = TRUE)
```


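There are other ways to separate, such as by position. Passing an integer to sep splits at that character position (here, after the first two digits, giving the century and the last two digits of the year):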

```{r}
table3 %>% 
  separate(year, into=c("Century", "Year"), sep=2)
```

We can use unite() to combine columns (by default it joins the values with an underscore), as follows:

```{r}
table5 %>% 
  unite(new, century, year)
```

Let's get rid of the _ in the number by setting sep to an empty string:

```{r}
table5 %>% 
  unite(new, century, year, sep="")
```

## 12.5 Missing Values

From the text:
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

1. Explicitly, i.e. flagged with NA.
2. Implicitly, i.e. simply not present in the data.

Example (note that both kinds of missing value are here: the return for Q4 2015 is an explicit NA, and Q1 2016 is simply not in the data):

```{r}
stocks <- tibble(
  year=c(2015,2015,2015,2015,2016,2016,2016),
  qtr=c(1,2,3,4,2,3,4),
  return=c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
```

We can make the implicit missing value (Q1 2016) explicit by spreading the data:

```{r}
stocks %>% 
  spread(year, return)
```

It's possible to turn explicit missing values into implicit ones by setting `na.rm=TRUE` in gather():

```{r}
stocks %>% 
  spread(year, return) %>% 
  gather(year, return,`2015`, `2016`, na.rm=TRUE)
```

Another way to make missing values explicit is to use complete(): (I like this solution!)

```{r}
stocks %>% 
  complete(year, qtr)
```
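As a quick check on what complete() does, this sketch counts rows and missing values before and after. The names `stocks_q` and `completed` are my own; the data is the same as above.

```{r}
library(tidyr)
library(tibble)

stocks_q <- tibble(
  year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# before: 7 rows, one explicit NA (Q4 2015)
nrow(stocks_q)
sum(is.na(stocks_q$return))

# after: every year/qtr combination gets a row, so Q1 2016 appears as an NA
completed <- stocks_q %>%
  complete(year, qtr)
nrow(completed)
sum(is.na(completed$return))
```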

## 12.6 Case Study

I like this example a lot! It uses WHO data on tuberculosis.

Let's start by looking at the data. It's a World Health Organization data set with 7,240 rows and 60 columns, and lots of missing data:

```{r}
who
if (interactive()) View(who) # View() opens a data viewer, so skip it when knitting
```

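Let's start by gathering the columns new_sp_m014 through newrel_f65 into a key column and a cases column, dropping the missing values (country, iso2, iso3 and year stay as they are):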

```{r}
who1 <- who %>% 
  gather(
    new_sp_m014:newrel_f65, key="key",
    value="cases",
    na.rm=TRUE
  )
who1
```

We can get some idea of the structure of the data by counting the rows for each key:

```{r}
who1 %>% 
  count(key)
```

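The `newrel` keys are missing the underscore that the other keys have after `new`, so we make the names consistent by replacing "newrel" with "new_rel":

```{r}
who2 <- who1 %>% 
  mutate(key=stringr::str_replace(key, "newrel", "new_rel"))
who2
```

We can split the cases using separate (this is really cool and super easy to do!)

```{r}
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep="_")
who3
```

Counting `new` shows a single value, so the column is constant in this data set. (A second row such as `rew` would mean the replacement string in str_replace() was mistyped.)

```{r}
who3 %>% 
  count(new)
```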

We can drop redundant columns! :)

```{r}
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
who4
```

Now let's split sexage by splitting after the m or f character:

```{r}
who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep=1)
who5
```

Hadley and Garrett do this in one (big?) connected set of pipe commands:

```{r}
who %>%
  gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```
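For what it's worth, the same tidying can probably be done in a single reshaping step with pivot_longer() and a regular expression, assuming tidyr 1.0.0 or later. This is only a sketch: the pattern is adapted from the tidyr pivoting documentation, and `who_tidy` is my own name.

```{r}
library(tidyr)
library(dplyr)

who_tidy <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    # three capture groups: type of case, sex (one character), age group
    names_to = c("var", "sex", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "value",
    values_drop_na = TRUE
  ) %>%
  select(-iso2, -iso3)
who_tidy
```

The optional underscore in `new_?` handles the `newrel` columns, so no separate str_replace() step is needed here.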