---
title: "Vectors"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
  html_notebook: default
date: "18 February 2018"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Strings
In this lesson we will look at some of the functions for manipulating strings. For more details, please review Chapter 14 of [R for Data Science](http://r4ds.had.co.nz/strings.html) and the Working with Strings [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf).
Strings are everywhere: web pages, text databases, data labels, factors. A vector whose elements are all strings is called a character vector. The `stringr` package (built on top of the `stringi` package) is an awesome Tidyverse toolbox for operating on strings. With the functions in this package you can:
- Detect matches
- Subset strings
- Manage string lengths
- Mutate (alter) strings
- Join and split strings
- Order strings
- Apply regular expressions, a powerful framework for string operations

Let's use some of the functions in the `stringr` package to fix some of the country names in gapminder. For example, there are country names where the word "Republic" is abbreviated to "Rep." and placed at the end of the string. Let's find those countries:
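
As a quick tour of the capabilities listed above, here is a minimal sketch on a hand-made character vector. The vector `fruit_names` is hypothetical and exists only for illustration; the functions are all from `stringr`.

```{r}
library(stringr)

# Hypothetical example data, not taken from gapminder
fruit_names <- c("apple", "banana", "kiwi")

str_detect(fruit_names, "an")             # detect matches: one TRUE/FALSE per element
str_subset(fruit_names, "i")              # subset: keep only the elements that match
str_length(fruit_names)                   # lengths: number of characters per element
str_pad(fruit_names, width = 6)           # manage lengths: pad to a common width
str_to_upper(fruit_names)                 # mutate: change case
str_c(fruit_names, collapse = ", ")       # join into a single string
str_split("a-b-c", "-")                   # split: returns a list of character vectors
str_sort(fruit_names, decreasing = TRUE)  # order alphabetically, here in reverse
```

Every function takes the character vector as its first argument, which is what makes them easy to chain with `%>%`.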
```{r}
suppressPackageStartupMessages(library(tidyverse))
gapminder <- gapminder::gapminder
gapminder %>% filter(str_detect(country, "Rep")) %>% distinct(country)
```
Note `str_detect()`, which takes a pattern and returns `TRUE` or `FALSE` for each element. Most functions in `stringr` are vectorized, so you can safely use them on data frame columns. How would we replace the abbreviations of interest with the proper values?
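```{r}
library(stringr)
dr_string <- "Congo, Dem. Rep."
dr_pattern <- ", Dem\\. Rep\\.$"  # escaped dots match a literal "."; $ anchors the match to the end
str_detect(dr_string, dr_pattern)  # does the pattern occur?
str_replace(dr_string, dr_pattern, "")  # drop the abbreviation
str_c("Democratic Republic of ", str_replace(dr_string, dr_pattern, ""))  # prepend the full form
str_locate(dr_string, dr_pattern)  # start and end positions of the match
str_sub(dr_string, 1, str_locate(dr_string, dr_pattern)[1,1]-1)  # text before the match
str_sub(dr_string, str_locate(dr_string, dr_pattern)[1,2]+1, str_length(dr_string))  # text after the match (empty here)
```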
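
Because `str_replace()` is vectorized, the same fix can be applied to many names in one call. The sketch below is one possible approach, not necessarily the one intended for the lesson: it uses a capture group so that the part before the abbreviation is reused in the replacement. The vector `countries` is a hand-typed, hypothetical sample rather than the full gapminder column.

```{r}
library(stringr)

# Hypothetical sample of country names, typed by hand for illustration
countries <- c("Congo, Dem. Rep.", "Korea, Dem. Rep.", "Canada")

# (.*) captures the text before ", Dem. Rep."; \\1 re-inserts it in the replacement.
# Names that do not match the pattern are returned unchanged.
fixed_names <- str_replace(countries, "^(.*), Dem\\. Rep\\.$", "Democratic Republic of \\1")
fixed_names
```

Inside a pipeline the same call could sit in `mutate()`, after converting the `country` factor with `as.character()`.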
## **Challenge 1.** {.tabset .tabset-fade .tabset-pills}
### Assignment
> 1. Use `read_lines()` to import the content of the [table of data sheets](https://www.gapminder.org/wp-content/themes/gapminder/cronJobs/json.js) from the Gapminder website and remove any text preceding the first curly bracket `{`.
>
> 2. Enclose the remaining text in square brackets and read it as a list using `jsonlite::fromJSON()`. Inspect the resulting object in the object explorer (double-click it in the Global Environment). How many elements does the list have?
TIP: You may need to run `install.packages("jsonlite")` to be able to work with the JSON format.
### Solution
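```{r}
json_txt <- read_lines("https://www.gapminder.org/wp-content/themes/gapminder/cronJobs/json.js")
# str_sub(start=20) drops the first 19 characters, i.e. the text before the first "{";
# the remainder is wrapped in [ ] so that it parses as a JSON array
sources_raw <- jsonlite::fromJSON(paste0("[", str_sub(json_txt, start=20), "]"),
                                  simplifyVector = FALSE)
#View(sources_raw)
sources <- sources_raw[[1]]  # unwrap the single element created by the added brackets
#View(sources)
```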
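
The fixed offset `start=20` depends on the exact prefix of the downloaded file. A more defensive variant, sketched here on a made-up string, locates the first `{` instead of hard-coding its position. The string `js_txt` and its contents are assumptions for illustration; the real file may look different.

```{r}
library(stringr)

# Hypothetical stand-in for the downloaded text; the real prefix may differ
js_txt <- 'var sheets = {"id1": {"indicatorName": "Population"}}'

# Position of the first "{" (fixed() turns off regex interpretation)
first_brace <- str_locate(js_txt, fixed("{"))[1, "start"]
json_body <- str_sub(js_txt, start = first_brace)

# Wrap in square brackets and parse without simplification, as in the solution
parsed <- jsonlite::fromJSON(str_c("[", json_body, "]"), simplifyVector = FALSE)
length(parsed)                 # one element: the object inside the added brackets
parsed[[1]]$id1$indicatorName  # nested values are reachable with [[ and $
```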
## Subsetting lists
Lists are a special type of non-atomic vector that can contain any type of data, including other lists (which is why they are also called recursive vectors). We can extract list elements using conventional subsetting methods:
```{r}
sources[1] # does not "peel off" layers of lists, only reduces the number of elements retrieved
sources[[1]] # "peels off" one layer - returns a list corresponding to first entry
sources[[1]]$indicatorName # returns category name for first entry
sources[[1]]$dataprovider
sources[[2]]$indicatorName # second entry (second line) in table
```
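
The difference between `[` and `[[` is easier to see on a small object. The list `toy` below is hypothetical: it only mimics the shape of `sources` (a named list of lists) and its values are made up.

```{r}
# Hypothetical two-entry list mimicking the shape of `sources`
toy <- list(
  sheet_a = list(indicatorName = "Population", dataprovider = "Provider A"),
  sheet_b = list(indicatorName = "Income", dataprovider = "Provider B")
)

class(toy[1])                   # still a list, now holding only the first entry
length(toy[1])                  # 1
names(toy[[1]])                 # the entry itself: its fields are visible
toy[["sheet_b"]]$indicatorName  # entries can also be selected by name
```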
How do we get all values of `indicatorName` at once? Effectively, we need to "iterate" over the elements of `sources`, visiting them one at a time and extracting `indicatorName` from each.
```{r}
library(purrr)
map(sources, `[[`, "indicatorName") %>% simplify() %>% head() # named vector
map(sources, "indicatorName") %>% simplify() %>% head() # named vector
```
This pattern of "iteration" followed by "simplification" of a flat list into a vector is supported in purrr by a family of functions: `map_chr()`, `map_dbl()`, `map_int()`, etc. These functions are type-stable: they return what they promise or fail with an error.
```{r}
map_chr(sources, "indicatorName") %>% head()
```
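
To see what "fail with an error" means in practice, here is a minimal sketch on hypothetical records in which one field has inconsistent types. The list `records` is invented for illustration; `tryCatch()` is used only so that the chunk keeps running after the error.

```{r}
library(purrr)

# Hypothetical records: `year` is numeric in one entry and text in the other
records <- list(
  list(name = "a", year = 2001),
  list(name = "b", year = "unknown")
)

record_names <- map_chr(records, "name")  # works: every value is a single string
record_names

# map_dbl() will not turn "unknown" into a number; it raises an error instead
year_result <- tryCatch(map_dbl(records, "year"), error = function(e) "error")
year_result
```

Failing loudly is the point: a plain `map()` followed by `simplify()` would quietly return a character vector here.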
## **Challenge 2.** {.tabset .tabset-fade .tabset-pills}
### Assignment
> 1. Make a new data frame with the following fields:
> - indicatorName
> - category
> - subcategory
> - dataprovider
> - dataprovider_link
> Populate it by pulling the respective values from the list we just created. How would you extract `sheet_id`, which is encoded as a list element name?
>
> 2. Save the cleaned-up data frame to a csv file.
### Solution
```{r}
sources_df <- tibble(indicatorName = map_chr(sources, 1),  # extract by position
                     category = map(sources, "category") %>% simplify(),
                     subcategory = map_chr(sources, `[[`, "subcategory"),
                     dataprovider = map_chr(sources, "dataprovider"),
                     dataprovider_link = map_chr(sources, "dataprovider_link"),
                     sheet_id = names(sources)  # sheet_id is stored as the element name
                     )
sources_df
write_csv(sources_df, "gapminder_datasources.csv")
# alternative solution with list-columns to be covered later
#enframe(sources) %>%
#  mutate(value=map(value, enframe)) %>%
#  unnest() %>%
#  mutate(value=simplify(value)) %>%
#  spread(key=name1, value=value)
```
Generally speaking, `map_*()` takes a list and iterates over its elements, applying a function to each one. When that function is `[[` (i.e. "extract"), it can be omitted, leaving only its argument (i.e. the name or the position of the element you want to extract).
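
The three spellings can be compared side by side. This is a small sketch on a hypothetical list (`toy`, with made-up values), not on the Gapminder data:

```{r}
library(purrr)

# Hypothetical named list of records, for illustration only
toy <- list(
  sheet_a = list(indicatorName = "Population", category = "Demography"),
  sheet_b = list(indicatorName = "Income", category = "Economy")
)

by_function <- map_chr(toy, `[[`, "indicatorName")  # explicit extractor function
by_name     <- map_chr(toy, "indicatorName")        # shortcut: element name
by_position <- map_chr(toy, 1)                      # shortcut: element position

identical(by_function, by_name)
identical(by_name, by_position)
names(by_name)  # the outer element names are carried over to the result
```

Extracting by name is usually the safer choice, since extracting by position silently returns the wrong field if the order of elements ever changes.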