forked from dmi3kno/R-tidyverse-20180215
-
Notifications
You must be signed in to change notification settings - Fork 0
/
SWC_R_2018-02-22_purrr.Rmd
140 lines (104 loc) · 5.92 KB
/
SWC_R_2018-02-22_purrr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: "Functional programming in R"
output:
html_document:
toc: true
toc_float:
collapsed: false
smooth_scroll: false
html_notebook: default
date: "18 February 2018"
---
## Functional programming
In the `tidyr` lesson we were manually downloading spreadsheets to join to `gapminder`. But what if someone wants to rerun our script? They wouldnt know which files to download or where to find those files. And what if we need to rerun our analysis for some other metrics or import 100 new columns from gapminder spreadsheets?
There's a saying among programmers: "Whenever you repeat code more than twice, write a function". So what are functions?
Functions are reusable piece of code that encapsulates logic that can be applied to different objects repeatedly. In R everything is a function (even `+`). Functions take **arguments** and return an object (functions can return only one object, however complex it is). Here's an example of a simple function:
```{r}
add_noise <- function(x){
len <- length(x)
x + rnorm(len)
}
add_noise(1:10)
```
What does it do? Well, it takes a vector and adds normally distributed random values to each of its elements. Notice there's an argument `x` and the result of the last operation, namely `x + rnorm(len)` is returned as the output. How can you break this function, i.e. how can you trick it to return error?
How do you go about writing a function? It may be difficult to start from scratch, so here's an advice: "Do it for one, turn it into a function to run it for any". In other words, dont focus on writing a function until you have a piece of code that works for one element. Then think how the code can be generalized, by distilling some "arguments" and thinking of the return objects and how those may be used by subsequent operations. We are going to write our first function very soon, but for now remember this: "Whenever you are tempted to copy-paste, you should probably write a function instead".
There's a lot to learn about writing good function, including how to provide error handling and foresee problems your users may run into. See respective section of [STAT545 course](http://stat545.com/cm102_writing-functions.html).
## Mission: Getting the data out of googlesheets
```{r}
library(tidyverse)
gapminder <- gapminder::gapminder
gapminder_datasources <- read_csv("gapminder_datasources.csv")
```
Lets go back to the table and note how url is built. Then we will build a template for url ()
```{r}
library(glue)
library(readxl)
# lets pick a key
key="0ArfEDsV3bBwCcGhBd2NOQVZ1eWowNVpSNjl1c3lRSWc"
file_url <- glue("https://docs.google.com/spreadsheet/pub?key={k}&output=xlsx", k=key)
download.file(file_url, "temp.xlsx", mode="wb")
sheet_df <- read_excel("temp.xlsx")
file.remove("temp.xlsx")
```
## **Challenge 1.** {.tabset .tabset-fade .tabset-pills}
### Assignment
> 1. Turn the code above into your first function. It should take a google sheet key, and return a raw data frame from gapminder. Test it on one of the keys you can find on gapminder page.
>
> 2. While we are at it, generalize code we wrote in the last session to reformat these types of dataframes. Add the necessary post-processing code and create new function which takes a google sheet key and returns a cleaned up dataframe.
>
> TIP: You will have a challenge of naming columns. One solution could be to not "tidy"" the tables up completely, but rather keep them in the "long" format (with variable name in a separate column).
### Solution
```{r}
read_gapminder_sheet1 <- function(key){
file_url <- glue("https://docs.google.com/spreadsheet/pub?key={k}&output=xlsx", k=key)
download.file(file_url, "temp.xlsx", mode="wb")
raw_df <- read_excel("temp.xlsx", sheet = "Data")
file.remove("temp.xlsx")
raw_df
}
read_gapminder_sheet2 <- function(key){
file_url <- glue("https://docs.google.com/spreadsheet/pub?key={k}&output=xlsx", k=key)
download.file(file_url, "temp.xlsx", mode="wb")
raw_df <- read_excel("temp.xlsx", sheet = "Data")
file.remove("temp.xlsx")
# saves away cleaned up name of first column and renames it to `country`
tbl_name <- names(raw_df)[1]
names(raw_df)[1] <- "country"
# tidy the table using generic column name
long_df <- raw_df %>%
gather(key=year, value=value, -country) %>%
mutate(year=as.integer(year),
value=as.numeric(value),
var_name = make.names(tbl_name))
long_df
}
```
## Scaling it up with purrr
Lets find the spreadsheets we were downloading manually last time:
```{r}
sheets_of_interest <- gapminder_datasources %>%
filter(str_detect(indicatorName, "[Tt]otal [Ff]ertility|[Ii]nfant [Mm]ortality"), is.na(subcategory)) %>%
pull(sheet_id)
```
So now if we now map this function to our `sheets_of_interest` we should be able to get a list of cleaned up datasets. This list of tibbles of similar structure is very typical type of animal in R. Since the column names are matching, we could probably just append them together. In `purrr` package there's a function `map_df` which outputs a dataframe by first producing a list of similar dataframes and then binding them together.
```{r}
gapminder_web_long <- map_df(sheets_of_interest, read_gapminder_sheet2)
#View(gapminder_web_long)
```
## **Bonus Challenge 2.** {.tabset .tabset-fade .tabset-pills}
### Assignment
> 1. The only thing left is just to put the column names back where they belong - to column headers. Take the "long" dataframe we just created and convert it to `tidy` format
>
> 2. Join it to gapminder dataset, preserving 1704 rows in original gapminder dataframe. What type of join makes sense here? Save your complete dataset in "gapminder_plus2.csv"
>
> TIP: you will have to create a unique index for groups you are trying to collapse together
### Solution
```{r}
gapminder_web_long %>%
group_by(country, var_name) %>%
mutate(country_year_id=1:n()) %>%
spread(key=var_name, value=value) %>%
select(-country_year_id) %>%
right_join(gapminder) %>%
write_csv("gapminder_plus2.csv")
```