-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_read_data_wrangle_strings.rmd
501 lines (351 loc) · 19 KB
/
01_read_data_wrangle_strings.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
---
editor_options:
chunk_output_type: console
---
# Reading files and string manipulation
![](opening-image.png)
Load the packages for the day.
```{r load_packages_01}
library(readr)
library(stringr)
library(glue)
```
## Data import and export with `readr`
Data in the wild with which ecologists and evolutionary biologists deal is most often in the form of a text file, usually with the extensions `.csv` or `.txt`. Often, such data has to be written to file from within `R`. `readr` contains a number of functions to help with reading and writing text files.
### Reading data
Reading in a csv file with `readr` is done with the `read_csv` function, a faster alternative to the base R `read.csv`. Here, `read_csv` is applied to the `mtcars` example.
```{r}
# get the filepath of the example
some_example = readr_example("mtcars.csv")
# read the file in
some_example = read_csv(some_example)
head(some_example)
```
The `read_csv2` function is useful when dealing with files where the separator between columns is a semicolon `;`, and where the decimal point is represented by a comma `,`.
Other variants include:
- `read_tsv` for tab-separated files, and
- `read_delim`, a general case which allows the separator to be specified manually.
`readr` import function will attempt to guess the column type from the first *N* lines in the data. This *N* can be set using the function argument `guess_max`. The `n_max` argument sets the number of rows to read, while the `skip` argument sets the number of rows to be skipped before reading data.
By default, the column names are taken from the first row of the data, but they can be manually specified by passing a character vector to `col_names`.
There are some other arguments to the data import functions, but the defaults usually *just work*.
### Writing data
Writing data uses the `write_*` family of functions, with implementations for `csv`, `csv2` etc. (represented by the asterisk), mirroring the import functions discussed above. `write_*` functions offer the `append` argument, which allow a data frame to be added to an existing file.
These functions are not covered here.
### Reading and writing lines
Sometimes, there is text output generated in `R` which needs to be written to file, but is not in the form of a dataframe. A good example is model outputs. It is good practice to save model output as a text file, and add it to version control.
Similarly, it may be necessary to import such text, either for display to screen, or to extract data.
This can be done using the `readr` functions `read_lines` and `write_lines`. Consider the model summary from a simple linear model.
```{r}
# get the model
model = lm(mpg ~ wt, data = mtcars)
```
The model summary can be written to file. When writing lines to file, BE AWARE OF THE DIFFERENCES BETWEEN UNIX AND WINODWS line separators. Usually, this causes no trouble.
```{r}
# capture the model summary output
model_output = capture.output(summary(model))
# save it to file
write_lines(x = model_output,
path = "model_output.txt")
```
This model output can be read back in for display, and each line of the model output is an element in a character vector.
```{r}
# read in the model output and display
model_output = read_lines("model_output.txt")
# use cat to show the model output as it would be on screen
cat(model_output, sep = "\n")
```
These few functions demonstrate the most common uses of `readr`, but most other use cases for text data can be handled using different function arguments, including reading data off the web, unzipping
compressed files before reading, and specifying the column types to control for type conversion errors.
### Excel files {-}
Finally, data is often shared or stored by well meaning people in the form of Microsoft Excel sheets. Indeed, Excel (especially when synced regularly to remote storage) is a good way of noting down observational data in the field. The `readxl` package allows importing from Excel files, including reading in specific sheets.
## String manipulation with `stringr`
`stringr` is the tidyverse package for string manipulation, and exists in an interesting symbiosis with the `stringi` package. For the most part, stringr is a wrapper around stringi, and is almost always more than sufficient for day-to-day needs.
`stringr` functions begin with `str_`.
### Putting strings together
Concatenate two strings with `str_c`, and duplicate strings with `str_dup`. Flatten a list or vector of strings using `str_flatten`.
```{r string_joining_01, echo=TRUE}
# str_c works like paste(), choose a separator
str_c("this string", "this other string", sep = "_")
# str_dup works like rep
str_dup("this string", times = 3)
# str_flatten works on lists and vectors
str_flatten(string = as.list(letters), collapse = "_")
str_flatten(string = letters, collapse = "-")
```
`str_flatten` is especially useful when displaying the type of an object that returns a list when `class` is called on it.
```{r}
# get the class of a tibble and display it as a single string
class_tibble = class(tibble::tibble(a = 1))
str_flatten(string = class_tibble, collapse = ", ")
```
### Detecting strings
Count the frequency of a pattern in a string with `str_count`. Returns an inteegr.
Detect whether a pattern exists in a string with `str_detect`. Returns a logical and can be used as a predicate.
Both are vectorised, i.e, automatically applied to a vector of arguments.
```{r count_matches_01}
# there should be 5 a-s here
str_count(string = "ababababa", pattern = "a")
# vectorise over the input string
# should return a vector of length 2, with integers 5 and 3
str_count(string = c("ababbababa", "banana"), pattern = "a")
# vectorise over the pattern to count both a-s and b-s
str_count(string = "ababababa", pattern = c("a", "b"))
```
Vectorising over both string and pattern works as expected.
```{r}
# vectorise over both string and pattern
# counts a-s in first input, and b-s in the second
str_count(string = c("ababababa", "banana"),
pattern = c("a", "b"))
# provide a longer pattern vector to search for both a-s
# and b-s in both inputs
str_count(string = c("ababababa", "banana"),
pattern = c("a", "b",
"b", "a"))
```
`str_locate` locates the search pattern in a string, and returns the start and end as a two column matrix.
```{r}
# the behaviour of both str_locate and str_locate_all is
# to find the first match by default
str_locate(string = "banana", pattern = "ana")
```
```{r}
# str_detect detects a sequence in a string
str_detect(string = "Bananageddon is coming!",
pattern = "na")
# str_detect is also vectorised and returns a two-element logical vector
str_detect(string = "Bananageddon is coming!",
pattern = c("na", "don"))
# use any or all to convert a multi-element logical to a single logical
# here we ask if either of the patterns is detected
any(str_detect(string = "Bananageddon is coming!",
pattern = c("na", "don")))
```
Detect whether a string starts or ends with a pattern. Also vectorised.
Both have a `negate` argument, which returns the negative, i.e., returns `FALSE` if the search pattern is detected.
```{r}
# taken straight from the examples, because they suffice
fruit <- c("apple", "banana", "pear", "pineapple")
# str_detect looks at the first character
str_starts(fruit, "p")
# str_ends looks at the last character
str_ends(fruit, "e")
# an example of negate = TRUE
str_ends(fruit, "e", negate = TRUE)
```
`str_subset` [WHICH IS NOT RELATED TO `str_sub`] helps with subsetting a character vector based on a `str_detect` predicate.
In the example, all elements containing "banana" are subset.
`str_which` has the same logic except that it returns the vector position and not the elements.
```{r}
# should return a subset vector containing the first two elements
str_subset(c("banana",
"bananageddon is coming",
"applegeddon is not real"),
pattern = "banana")
# returns an integer vector
str_which(c("banana",
"bananageddon is coming",
"applegeddon is not real"),
pattern = "banana")
```
### Matching strings
`str_match` returns all positive matches of the patttern in the string.
The return type is a `list`, with one element per search pattern.
A simple case is shown below where the search pattern is the phrase "banana".
```{r}
str_match(string = c("banana",
"bananageddon",
"bananas are bad"),
pattern = "banana")
```
The search pattern can be extended to look for multiple subsets of the search pattern. Consider searching for dates and times.
Here, the search pattern is a `regex` pattern that looks for a set of four digits (`\\d{4}`) and a month name `(\\w+)` seperated by a hyphen. There's much more to be explored in dealing with dates and times in [`lubridate`](https://lubridate.tidyverse.org/), another `tidyverse` package.
The return type is a list, each element is a character matrix where the first column is the string subset matching the full search pattern, and then as many columns as there are parts to the search pattern. The parts of interest in the search pattern are indicated by wrapping them in parentheses. For example, in the case below, wrapping `[-.]` in parentheses will turn it into a distinct part of the search pattern.
```{r}
# first with [-.] treated simply as a separator
str_match(string = c("1970-somemonth-01",
"1990-anothermonth-01",
"2010-thismonth-01"),
pattern = "(\\d{4})[-.](\\w+)")
# then with [-.] actively searched for
str_match(string = c("1970-somemonth-01",
"1990-anothermonth-01",
"2010-thismonth-01"),
pattern = "(\\d{4})([-.])(\\w+)")
```
Multiple possible matches are dealt with using `str_match_all`. An example case is uncertainty in date-time in raw data, where the date has been entered as `1970-somemonth-01 or 1970/anothermonth/01`.
The return type is a list, with one element per input string. Each element is a character matrix, where each row is one possible match, and each column after the first (the full match) corresponds to the parts of the search pattern.
```{r}
# first with a single date entry
str_match_all(string = c("1970-somemonth-01 or maybe 1990/anothermonth/01"),
pattern = "(\\d{4})[\\-\\/]([a-z]+)")
# then with multiple date entries
str_match_all(string = c("1970-somemonth-01 or maybe 1990/anothermonth/01",
"1990-somemonth-01 or maybe 2001/anothermonth/01"),
pattern = "(\\d{4})[\\-\\/]([a-z]+)")
```
### Simpler pattern extraction
The full functionality of `str_match_*` can be boiled down to the most common use case, extracting one or more full matches of the search pattern using `str_extract` and `str_extract_all` respectively.
`str_extract` returns a character vector with the same length as the input string vector, while `str_extract_all` returns a list, with a character vector whose elements are the matches.
```{r}
# extracting the first full match using str_extract
str_extract(string = c("1970-somemonth-01 or maybe 1990/anothermonth/01",
"1990-somemonth-01 or maybe 2001/anothermonth/01"),
pattern = "(\\d{4})[\\-\\/]([a-z]+)")
# extracting all full matches using str_extract all
str_extract_all(string = c("1970-somemonth-01 or maybe 1990/anothermonth/01",
"1990-somemonth-01 or maybe 2001/anothermonth/01"),
pattern = "(\\d{4})[\\-\\/]([a-z]+)")
```
### Breaking strings apart
`str_split`, str_sub,
In the above date-time example, when reading filenames from a path, or when working sequences separated by a known pattern generally, `str_split` can help separate elements of interest.
The return type is a list similar to `str_match`.
```{r}
# split on either a hyphen or a forward slash
str_split(string = c("1970-somemonth-01",
"1990/anothermonth/01"),
pattern = "[\\-\\/]")
```
This can be useful in recovering simulation parameters from a filename, but may require some knowledge of `regex`.
```{r}
# assume a simulation output file
filename = "sim_param1_0.01_param2_0.05_param3_0.01.ext"
# not quite there
str_split(filename, pattern = "_")
# not really
str_split(filename,
pattern = "sim_")
# getting there but still needs work
str_split(filename,
pattern = "(sim_)|_*param\\d{1}_|(.ext)")
```
`str_split_fixed` split the string into as many pieces as specified, and can be especially useful dealing with filepaths.
```{r}
# split on either a hyphen or a forward slash
str_split_fixed(string = "dir_level_1/dir_level_2/file.ext",
pattern = "/",
n = 2)
```
### Replacing string elements
`str_replace` is intended to replace the search pattern, and can be co-opted into the task of recovering simulation parameters or other data from regularly named files. `str_replace_all` works the same way but replaces all matches of the search pattern.
```{r}
# replace all unwanted characters from this hypothetical filename with spaces
filename = "sim_param1_0.01_param2_0.05_param3_0.01.ext"
str_replace_all(filename,
pattern = "(sim_)|_*param\\d{1}_|(.ext)",
replacement = " ")
```
`str_remove` is a wrapper around `str_replace` where the replacement is set to `""`. This is not covered here.
Having replaced unwanted characters in the filename with spaces, `str_trim` offers a way to remove leading and trailing whitespaces.
```{r}
# trim whitespaces from this filename after replacing unwanted text
filename = "sim_param1_0.01_param2_0.05_param3_0.01.ext"
filename_with_spaces = str_replace_all(filename,
pattern = "(sim_)|_*param\\d{1}_|(.ext)",
replacement = " ")
filename_without_spaces = str_trim(filename_with_spaces)
filename_without_spaces
# the result can be split on whitespaces to return useful data
str_split(filename_without_spaces, " ")
```
### Subsetting within strings
When strings are highly regular, useful data can be extracted from a string using `str_sub`. In the date-time example, the year is always represented by the first four characters.
```{r}
# get the year as characters 1 - 4
str_sub(string = c("1970-somemonth-01",
"1990-anothermonth-01",
"2010-thismonth-01"),
start = 1, end = 4)
```
Similarly, it's possible to extract the last few characters using negative indices.
```{r}
# get the day as characters -2 to -1
str_sub(string = c("1970-somemonth-01",
"1990-anothermonth-21",
"2010-thismonth-31"),
start = -2, end = -1)
```
Finally, it's also possible to replace characters within a string based on the position. This requires using the assignment operator `<-`.
```{r}
# replace all days in these dates to 01
date_times = c("1970-somemonth-25",
"1990-anothermonth-21",
"2010-thismonth-31")
# a strictly necessary use of the assignment operator
str_sub(date_times,
start = -2, end = -1) <- "01"
date_times
```
### Padding and truncating strings
Strings included in filenames or plots are often of unequal lengths, especially when they represent numbers. `str_pad` can pad strings with suitable characters to maintain equal length filenames, with which it is easier to work.
```{r}
# pad so all values have three digits
str_pad(string = c("1", "10", "100"),
width = 3,
side = "left",
pad = "0")
```
Strings can also be truncated if they are too long.
```{r}
str_trunc(string = c("bananas are great and wonderful
and more stuff about bananas and
it really goes on about bananas"),
width = 27,
side = "right", ellipsis = "etc. etc.")
```
### Stringr aspects not covered here
Some `stringr` functions are not covered here. These include:
- `str_wrap` (of dubious use),
- `str_interp`, `str_glue*` (better to use `glue`; see below),
- `str_sort`, `str_order` (used in sorting a character vector),
- `str_to_case*` (case conversion), and
- `str_view*` (a graphical view of search pattern matches).
- `word`, `boundary` etc. The use of word is covered below.
[`stringi`](https://cran.r-project.org/web/packages/stringi/), of which `stringr` is a wrapper, offers a lot more flexibility and control.
## String interpolation with `glue`
The idea behind string interpolation is to procedurally generate new complex strings from pre-existing data.
`glue` is as simple as the example shown.
```{r}
# print that each car name is a car model
cars = rownames(head(mtcars))
glue('The {cars} is a car model')
```
This creates and prints a vector of car names stating each is a car model.
The related `glue_data` is even more useful in printing from a dataframe.
In this example, it can quickly generate command line arguments or filenames.
```{r}
# use dataframes for now
parameter_combinations = data.frame(param1 = letters[1:5],
param2 = 1:5)
# for command line arguments or to start multiple job scripts on the cluster
glue_data(parameter_combinations,
'simulation-name {param1} {param2}')
# for filenames
glue_data(parameter_combinations,
'sim_data_param1_{param1}_param2_{param2}.ext')
```
Finally, the convenient `glue_sql` and `glue_data_sql` are used to safely write SQL queries where variables from data are appropriately quoted. This is not covered here, but it is good to know it exists.
`glue` has some more functions --- `glue_safe`, `glue_collapse`, and `glue_col`, but these are infrequently used. Their functionality can be found on the `glue` github page.
## Strings in `ggplot`
`ggplot` has two `geoms` (wait for the `ggplot` tutorial to understand more about geoms) that work with text: `geom_text` and `geom_label`. These geoms allow text to be pasted on to the main body of a plot.
Often, these may overlap when the data are closely spaced. The package `ggrepel` offers another `geom`, `geom_text_repel` (and the related `geom_label_repel`) that help arrange text on a plot so it doesn't overlap with other features. This is *not perfect*, but it works more often than not.
More examples can be found on the [ggrepl website](https://github.com/slowkow/ggrepel).
Here, the arguments to `geom_text_repel` are taken both from the mtcars data (position), as well as from the car brands extracted using the `stringr::word` (labels), which tries to separate strings based on a regular pattern.
The details of `ggplot` are covered in a later tutorial.
```{r}
library(ggplot2)
library(ggrepel)
# prepare car labels using word function
car_labels = word(rownames(mtcars))
ggplot(mtcars,
aes(x = wt, y = mpg,
label = rownames(mtcars)))+
geom_point(colour = "red")+
geom_text_repel(aes(label = car_labels),
direction = "x",
nudge_x = 0.2,
box.padding = 0.5,
point.padding = 0.5)
```
This is not a good looking plot, because it breaks other rules of plot design, such as whether this sort of plot should be made at all. Labels and text need to be applied sparingly, for example drawing attention or adding information to outliers.