# Character vectors {#character-vectors}
```{r include = FALSE}
source("common.R")
```
<!--Original content: https://stat545.com/block028_character-data.html-->
## Character vectors: where they fit in
We've spent a lot of time working with big, beautiful data frames that are clean and wholesome, like the Gapminder data.
But real life will be much nastier. You will bring data into R from the outside world and discover there are problems. You might think: how hard can it be to deal with character data? And the answer is: it can be very hard!
* [Stack Exchange outage][stackexchange-outage]
* [Regexes to validate/match email addresses][email-regex]
* [Fixing an Atom bug][fix-atom-bug]
Here we discuss common remedial tasks for cleaning and transforming character data, also known as "strings". A data frame or tibble will consist of one or more *atomic vectors* of a certain class. This lesson deals with things you can do with vectors of class `character`.
## Resources
I start with this because we cannot possibly do this topic justice in a short amount of time. Our goal is to make you aware of broad classes of problems and their respective solutions. Once you have a character problem in real life, these resources will be extremely helpful as you delve deeper.
### Manipulating character vectors
* [stringr package][stringr-web].
    - A core package in the `tidyverse`. It is installed via `install.packages("tidyverse")` and loaded via `library(tidyverse)`. Of course, you can also install or load it individually.
    - Main functions start with `str_`. Auto-complete is your friend.
    - Replacements for base functions re: string manipulation and regular expressions (see below).
    - Main advantages over base functions: greater consistency about inputs and outputs. Outputs are more ready for your next analytical task.
* [tidyr package][tidyr-web].
    - Especially useful for functions that split one character vector into many and *vice versa*: `separate()`, `unite()`, `extract()`.
* Base functions: `nchar()`, `strsplit()`, `substr()`, `paste()`, `paste0()`.
* The [glue package][glue-web] is fantastic for string interpolation. If `stringr::str_interp()` doesn't get your job done, check out the glue package.
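
To give a flavor of the string interpolation mentioned in the last bullet, here is a minimal sketch. The variables `fruit_name` and `n_kg` are made up for this illustration; expressions inside `{}` are evaluated and pasted into the string.

```{r}
library(glue)

# Hypothetical values, only for illustration
fruit_name <- "kiwi"
n_kg <- 3

# Anything inside {} is evaluated as R code and inserted into the string
msg <- glue("I bought {n_kg} kg of {fruit_name}.")
msg
```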
### Regular expressions resources
A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as "write only", because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.
* We again prefer the [stringr package][stringr-cran] over base functions. Why?
    - Wraps [stringi][stringi-cran], which is a great place to look if stringr isn't powerful enough.
    - Standardized on [ICU regular expressions][icu-regex], so you can stop toggling `perl = TRUE/FALSE` at random.
    - Results come back in a form that is much friendlier for downstream work.
* The [Strings chapter][r4ds-strings] of [R for Data Science][r4ds] [@wickham2016] is a great resource.
* Older STAT 545 lessons on regular expressions have some excellent content. This lesson draws on them, but makes more rigorous use of stringr and uses example data that is easier to support long-term.
    - [2014 Intro to regular expressions](#oldies) by TA Gloria Li (Appendix \@ref(oldies)).
    - [2015 Regular expressions and character data in R](#oldies) by TA Kieran Samuk (Appendix \@ref(oldies)).
* RStudio Cheat Sheet on [Regular Expressions in R][rstudio-regex-cheatsheet].
* Regex testers:
    - [regex101.com][regex101]
    - [regexr.com][regexr]
* [`rex` R package][rex-github]: make regular expressions from human-readable expressions.
* Base functions: `grep()` and friends.
### Character encoding resources
* [Strings subsection of data import chapter][r4ds-readr-strings] in [R for Data Science][r4ds] [@wickham2016].
* Screeds on the Minimum Everyone Needs to Know about encoding:
    - [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)][unicode-no-excuses]
    - [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text][programmers-encoding]
* Chapter \@ref(character-encoding) - I've translated this blog post [Guide to fixing encoding problems in Ruby][encoding-probs-ruby] into R as the first step to developing a lesson.
### Character vectors that live in a data frame
* Certain operations are facilitated by tidyr. These are described below.
* For a general discussion of how to work on variables that live in a data frame, see [Vectors versus tibbles](#oldies) (Appendix \@ref(oldies)).
## Load the tidyverse, which includes stringr
```{r start_char_vectors}
library(tidyverse)
```
## Regex-free string manipulation with stringr and tidyr
Basic string manipulation tasks:
* Study a single character vector
    - How long are the strings?
    - Presence/absence of a literal string
* Operate on a single character vector
    - Keep/discard elements that contain a literal string
    - Split into two or more character vectors using a fixed delimiter
    - Snip out pieces of the strings based on character position
    - Collapse into a single string
* Operate on two or more character vectors
    - Glue them together element-wise to get a new character vector.
*`fruit`, `words`, and `sentences` are character vectors that ship with stringr for practicing.*
### Detect or filter on a target string
Determine presence/absence of a literal string with `str_detect()`. Spoiler: later we see `str_detect()` also detects regular expressions.
Which fruits actually use the word "fruit"?
```{r}
str_detect(fruit, pattern = "fruit")
```
What's the easiest way to get the actual fruits that match? Use `str_subset()` to keep only the matching elements. Note we are storing this new vector `my_fruit` to use in later examples!
```{r}
(my_fruit <- str_subset(fruit, pattern = "fruit"))
```
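To connect the two functions: `str_subset()` gives the same result as logical indexing with `str_detect()`. A minimal sketch on a small made-up vector `x` (not one of the stringr practice vectors), so the result is easy to check by eye:

```{r}
library(stringr)

# Made-up vector, small enough to check by eye
x <- c("grapefruit", "banana", "kiwi fruit")

hits <- str_detect(x, pattern = "fruit")

by_index  <- x[hits]                           # logical indexing
by_subset <- str_subset(x, pattern = "fruit")  # same result in one call

identical(by_index, by_subset)
sum(hits)  # number of matching elements
```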
### String splitting by delimiter
Use `stringr::str_split()` to split strings on a delimiter. Some of our fruits are compound words, like "grapefruit", but some have two words, like "ugli fruit". Here we split on a single space `" "`; we show the use of a regular expression later.
```{r}
str_split(my_fruit, pattern = " ")
```
It's a bummer that we get a *list* back. But it must be so! In full generality, splitting strings must return a list, because who knows how many pieces there will be?
If you are willing to commit to the number of pieces, you can use `str_split_fixed()` and get a character matrix. You're welcome!
```{r}
str_split_fixed(my_fruit, pattern = " ", n = 2)
```
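If you do keep the list, you will usually want to pull a specific piece out of each element. Here is one possible sketch on a made-up vector `x`; it uses base `vapply()`, and also shows the `simplify = TRUE` argument of `str_split()`, which returns a matrix without committing to `n` in advance.

```{r}
library(stringr)

x <- c("ugli fruit", "grapefruit", "star fruit")  # made-up input

pieces <- str_split(x, pattern = " ")

# First piece of every element; vapply() guarantees a character vector back
first_piece <- vapply(pieces, function(p) p[1], character(1))
first_piece

# simplify = TRUE returns a character matrix, padded with "" where a piece is missing
piece_mat <- str_split(x, pattern = " ", simplify = TRUE)
piece_mat
```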
If the to-be-split variable lives in a data frame, `tidyr::separate()` will split it into 2 or more variables.
```{r}
my_fruit_df <- tibble(my_fruit)
my_fruit_df %>%
  separate(my_fruit, into = c("pre", "post"), sep = " ")
```
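When some elements have fewer pieces than `into` has names, `separate()` fills the gap with `NA` and, by default, warns about it. If that is what you expect, the `fill` argument lets you say so explicitly. A minimal sketch, with a made-up data frame `df`:

```{r}
library(tibble)
library(tidyr)

# Made-up data: the second name has no space, so there is no "post" piece
df <- tibble(name = c("ugli fruit", "grapefruit"))

# fill = "right" pads missing pieces on the right with NA, without a warning
df_split <- separate(df, name, into = c("pre", "post"), sep = " ", fill = "right")
df_split
```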
### Substring extraction (and replacement) by position
Count characters in your strings with `str_length()`. Note this is different from the length of the character vector itself.
```{r}
length(my_fruit)
str_length(my_fruit)
```
You can snip out substrings based on character position with `str_sub()`.
```{r}
head(fruit) %>%
  str_sub(1, 3)
```
The `start` and `end` arguments are vectorised. __Example:__ a sliding 3-character window.
```{r}
tibble(fruit) %>%
  head() %>%
  mutate(snip = str_sub(fruit, 1:6, 3:8))
```
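`str_sub()` also accepts negative positions, which count backwards from the end of each string. A short sketch on a made-up vector `x`:

```{r}
library(stringr)

x <- c("apple", "banana", "kiwi")  # made-up input

last_three <- str_sub(x, -3, -1)  # last three characters of each string
last_three

trimmed <- str_sub(x, 2, -2)      # drop the first and last character
trimmed
```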
Finally, `str_sub()` also works for assignment, i.e. on the left hand side of `<-`.
```{r}
(x <- head(fruit, 3))
str_sub(x, 1, 3) <- "AAA"
x
```
### Collapse a vector
You can collapse a character vector of length `n > 1` to a single string with `str_c()`, which also has other uses (see the [next section](#catenate-vectors)).
```{r}
head(fruit) %>%
  str_c(collapse = ", ")
```
### Create a character vector by catenating multiple vectors {#catenate-vectors}
If you have two or more character vectors of the same length, you can glue them together element-wise, to get a new vector of that length. Here are some ... awful smoothie flavors?
```{r}
str_c(fruit[1:4], fruit[5:8], sep = " & ")
```
Element-wise catenation can be combined with collapsing.
```{r}
str_c(fruit[1:4], fruit[5:8], sep = " & ", collapse = ", ")
```
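One difference from base `paste0()` that is worth knowing: `str_c()` propagates missing values instead of turning them into the string `"NA"`. A small sketch with a made-up vector `x`:

```{r}
library(stringr)

x <- c("apple", NA, "kiwi")  # made-up input with a missing value

with_str_c  <- str_c("fruit: ", x)   # the length-1 prefix is recycled; NA stays NA
with_paste0 <- paste0("fruit: ", x)  # base R: NA becomes the text "NA"

with_str_c
with_paste0
```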
If the to-be-combined vectors are variables in a data frame, you can use `tidyr::unite()` to make a single new variable from them.
```{r}
fruit_df <- tibble(
  fruit1 = fruit[1:4],
  fruit2 = fruit[5:8]
)
fruit_df %>%
  unite("flavor_combo", fruit1, fruit2, sep = " & ")
```
### Substring replacement
You can replace a pattern with `str_replace()`. Here we use an explicit string-to-replace, but later we revisit with a regular expression.
```{r}
str_replace(my_fruit, pattern = "fruit", replacement = "THINGY")
```
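Note that `str_replace()` only replaces the *first* match in each string. To replace every match, use `str_replace_all()`. A quick sketch on a made-up vector `x`:

```{r}
library(stringr)

x <- c("banana", "papaya")  # made-up input

first_only <- str_replace(x, pattern = "a", replacement = "A")      # first match per string
every_one  <- str_replace_all(x, pattern = "a", replacement = "A")  # all matches

first_only
every_one
```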
A special case that comes up a lot is replacing `NA`, for which there is `str_replace_na()`.
```{r}
melons <- str_subset(fruit, pattern = "melon")
melons[2] <- NA
melons
str_replace_na(melons, "UNKNOWN MELON")
```
If the `NA`-afflicted variable lives in a data frame, you can use `tidyr::replace_na()`.
```{r}
tibble(melons) %>%
  replace_na(replace = list(melons = "UNKNOWN MELON"))
```
And that concludes our treatment of regex-free manipulations of character data!
## Regular expressions with stringr
```{r echo = FALSE, fig.cap = "From [\\@ThePracticalDev](https://twitter.com/ThePracticalDev/status/774309983467016193)", out.width = "50%"}
knitr::include_graphics("img/regexbytrialanderror-big-smaller.png")
```
### Load gapminder
The country names in the `gapminder` dataset are convenient for examples. Load it now and store the `r nlevels(gapminder::gapminder$country)` unique country names to the object `countries`.
```{r}
library(gapminder)
countries <- levels(gapminder$country)
```
### Characters with special meaning
Frequently your string tasks cannot be expressed in terms of a fixed string, but can be described in terms of a **pattern**. Regular expressions, aka "regexes", are the standard way to specify these patterns. In regexes, specific characters and constructs take on special meaning in order to match multiple strings.
The first metacharacter is the period `.`, which stands for any single character except a newline (which, by the way, is represented by `\n`). The regex `i.a` will match all countries that have an `i`, followed by any single character, followed by an `a`. Yes, regexes are case sensitive, i.e. "Italy" does not match.
```{r}
str_subset(countries, pattern = "i.a")
```
Notice that `i.a` matches "ina", "ica", "ita", and more.
**Anchors** can be included to express where the expression must occur within the string. The `^` indicates the beginning of string and `$` indicates the end.
Note how the regex `i.a$` matches many fewer countries than `i.a` alone. Likewise, more elements of `my_fruit` match `d` than `^d`, which requires "d" at string start.
```{r}
str_subset(countries, pattern = "i.a$")
str_subset(my_fruit, pattern = "d")
str_subset(my_fruit, pattern = "^d")
```
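The two anchors can also be combined to require that the *entire* string matches. A minimal sketch on a made-up vector `x`:

```{r}
library(stringr)

x <- c("apple", "pineapple", "apple pie")  # made-up input

starts <- str_subset(x, pattern = "^apple")   # "apple" at the start
ends   <- str_subset(x, pattern = "apple$")   # "apple" at the end
whole  <- str_subset(x, pattern = "^apple$")  # the string is exactly "apple"

starts
ends
whole
```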
The metacharacter `\b` indicates a **word boundary** and `\B` indicates NOT a word boundary. This is our first encounter with something called "escaping" and right now I just want you to accept that we need to prepend a second backslash to use these sequences in regexes in R. We'll come back to this tedious point later.
```{r}
str_subset(fruit, pattern = "melon")
str_subset(fruit, pattern = "\\bmelon")
str_subset(fruit, pattern = "\\Bmelon")
```
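The same idea on a three-element, made-up vector `x`, where it is easy to see which element each pattern picks out:

```{r}
library(stringr)

x <- c("melon", "watermelon", "honeydew melon")  # made-up input

at_boundary  <- str_subset(x, pattern = "\\bmelon")  # "melon" starts a word
not_boundary <- str_subset(x, pattern = "\\Bmelon")  # "melon" sits inside a word

at_boundary
not_boundary
```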
### Character classes
Characters can be specified via classes. You can make them explicitly "by hand" or use some pre-existing ones. The [2014 STAT 545 regex lesson](#oldies) (Appendix \@ref(oldies)) has a good list of character classes. Character classes are usually given inside square brackets, `[]`, but a few come up so often that we have a metacharacter for them, such as `\d` for a single digit.
Here we match `ia` at the end of the country name, preceded by one of the characters in the class. Or, in the negated class, preceded by anything but one of those characters.
```{r}
## make a class "by hand"
str_subset(countries, pattern = "[nls]ia$")
## use ^ to negate the class
str_subset(countries, pattern = "[^nls]ia$")
```
Here we revisit splitting `my_fruit` with two more general ways to match whitespace: the `\s` metacharacter and the POSIX class `[:space:]`. Notice that we must prepend an extra backslash `\` to escape `\s` and the POSIX class has to be surrounded by two sets of square brackets.
```{r}
## remember this?
# str_split_fixed(my_fruit, pattern = " ", n = 2)
## alternatives
str_split_fixed(my_fruit, pattern = "\\s", n = 2)
str_split_fixed(my_fruit, pattern = "[[:space:]]", n = 2)
```
Let's see the country names that contain punctuation.
```{r}
str_subset(countries, "[[:punct:]]")
```
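The `\d` metacharacter mentioned above works the same way as the classes in these examples. A small sketch on a made-up vector `x`; it also uses `str_extract()`, which returns the first match itself rather than the whole string:

```{r}
library(stringr)

x <- c("room 101", "no digits", "a1b2")  # made-up input

has_digit   <- str_subset(x, pattern = "\\d")             # note the doubled backslash
has_digit2  <- str_subset(x, pattern = "[[:digit:]]")     # POSIX class, same idea
first_digit <- str_extract(x, pattern = "[0-9]")          # first digit, NA if none

has_digit
first_digit
```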
### Quantifiers
You can decorate characters (and other constructs, like metacharacters and classes) with information about how many characters they are allowed to match.
| quantifier | meaning | quantifier | meaning |
|------------|-----------|------------|----------------------------|
| * | 0 or more | {n} | exactly n |
| + | 1 or more | {n,} | at least n |
| ? | 0 or 1 | {,m} | at most m |
| | | {n,m} | between n and m, inclusive |
Explore these by inspecting matches for `l` followed by `e`, allowing for various numbers of characters in between.
`l.*e` will match strings with 0 or more characters in between, i.e. any string with an `l` eventually followed by an `e`. This is the most inclusive regex for this example, so we store the result as `matches` to use as a baseline for comparison.
```{r}
(matches <- str_subset(fruit, pattern = "l.*e"))
```
Change the quantifier from `*` to `+` to require at least one intervening character. The strings that no longer match all contain a literal `le`, with no `l` before it and no `e` after it.
```{r}
list(match = intersect(matches, str_subset(fruit, pattern = "l.+e")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "l.+e")))
```
Change the quantifier from `*` to `?` to require at most one intervening character. In the strings that no longer match, the shortest gap between `l` and following `e` is at least two characters.
```{r}
list(match = intersect(matches, str_subset(fruit, pattern = "l.?e")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "l.?e")))
```
Finally, we remove the quantifier and allow for no intervening characters. The strings that no longer match lack a literal `le`.
```{r}
list(match = intersect(matches, str_subset(fruit, pattern = "le")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "le")))
```
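The examples above cover `*`, `+`, and `?`. The brace quantifiers from the table work the same way. Here is a minimal sketch on a made-up vector `x`, with anchors added so that the count applies to the whole string:

```{r}
library(stringr)

x <- c("b", "be", "bee", "beee")  # made-up input

exactly_two <- str_subset(x, pattern = "^be{2}$")    # b followed by exactly 2 e's
one_or_two  <- str_subset(x, pattern = "^be{1,2}$")  # between 1 and 2 e's
two_or_more <- str_subset(x, pattern = "^be{2,}$")   # at least 2 e's
zero_or_one <- str_subset(x, pattern = "^be?$")      # 0 or 1 e

exactly_two
one_or_two
two_or_more
zero_or_one
```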
### Escaping
You've probably caught on by now that there are certain characters with special meaning in regexes, including `$ * + . ? [ ] ^ { } | ( ) \`.
What if you really need the plus sign to be a literal plus sign and not a regex quantifier? You will need to *escape* it by prepending a backslash. But wait ... there's more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to prepend two backslashes in order to match a literal plus sign in a regex.
This will be more clear with examples!
#### Escapes in plain old strings
Here is routine, non-regex use of backslash `\` escapes in plain vanilla R strings. We intentionally use `cat()` instead of `print()` here.
* To escape quotes inside quotes:
```{r}
cat("Do you use \"airquotes\" much?")
```
Sidebar: eliminating the need for these escapes is exactly why people use double quotes inside single quotes and *vice versa*.
* To insert newline (`\n`) or tab (`\t`):
```{r}
cat("before the newline\nafter the newline")
cat("before the tab\tafter the tab")
```
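The reason for preferring `cat()` here: `print()` shows a string in its escaped form, while `cat()` shows the characters the string actually contains. A small sketch with a made-up string `x` that holds a single backslash:

```{r}
x <- "a\\b"   # made-up string: a, ONE backslash, b

print(x)      # print() displays the escaped form, with two backslashes
cat(x, "\n")  # cat() displays the actual characters
nchar(x)      # the backslash counts as one character
```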
#### Escapes in regular expressions
Examples of using escapes in regexes to match characters that would otherwise have a special interpretation.
We know several `gapminder` country names contain a period. How do we isolate them? Although it's tempting, this command `str_subset(countries, pattern = ".")` won't work! An unescaped `.` matches any single character, so every country name would be returned.
```{r}
## cheating using a POSIX class ;)
str_subset(countries, pattern = "[[:punct:]]")
## using two backslashes to escape the period
str_subset(countries, pattern = "\\.")
```
A last example that matches an actual square bracket.
```{r end_char_vectors}
(x <- c("whatever", "X is distributed U[0,1]"))
str_subset(x, pattern = "\\[")
```
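If you only want to match literal text, two alternatives can spare you the double backslashes. The sketch below uses a made-up vector `x`. `fixed()` tells stringr to treat the pattern as a plain string rather than a regex. Raw strings, written `r"(...)"`, assume R 4.0.0 or later and remove the string-level layer of escaping, so the regex can be written with a single backslash.

```{r}
library(stringr)

x <- c("Congo, Dem. Rep.", "Canada", "1+1=2")  # made-up input

escaped  <- str_subset(x, pattern = "\\.")        # regex with an escaped period
literal  <- str_subset(x, pattern = fixed("."))   # no regex interpretation at all
plus     <- str_subset(x, pattern = fixed("+"))   # literal plus sign
raw_form <- str_subset(x, pattern = r"(\.)")      # raw string, requires R >= 4.0.0

escaped
plus
identical(r"(\.)", "\\.")  # both spellings produce the same two-character string
```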
### Groups and backreferences
Your first use of regex is likely to be simple matching: detecting or isolating strings that match a pattern.
But soon you will want to use regexes to transform the strings in character vectors. That means you need a way to address specific parts of the matching strings and to operate on them.
You can use parentheses inside regexes to define *groups* and you can refer to those groups later with *backreferences*.
For now, this lesson will refer you to other places to read up on this:
* STAT 545 [2014 Intro to regular expressions](#oldies) by TA Gloria Li (Appendix \@ref(oldies)).
* The [Strings chapter][r4ds-strings] of [R for Data Science][r4ds] [@wickham2016].
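
In the meantime, here is a small taste of what groups and backreferences look like in stringr. This is only a sketch on made-up vectors `x` and `y`, not a full treatment. In a replacement string, `\\1` and `\\2` refer to the first and second parenthesized groups. Inside a pattern, `\\1` matches whatever the first group matched.

```{r}
library(stringr)

x <- c("ugli fruit", "star fruit")  # made-up input

# Swap the two words: \\1 is the first group, \\2 the second
swapped <- str_replace(x, pattern = "(\\w+) (\\w+)", replacement = "\\2 \\1")
swapped

# str_match() returns a matrix: full match first, then one column per group
groups <- str_match(x, pattern = "(\\w+) (\\w+)")
groups

# A backreference inside the pattern: any two characters, immediately repeated
y <- c("banana", "kiwi", "papaya")  # made-up input
repeated <- str_subset(y, pattern = "(..)\\1")
repeated
```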
```{r links, child="links.md"}
```