-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathstring-theory.rmd
430 lines (280 loc) · 22.3 KB
/
string-theory.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
# String Theory
## Basic data types
R has several core data structures:
- Vectors
- Factors
- Lists
- Matrices/arrays
- Data frames
<span class="objclass">Vectors</span> form the basis of R data structures. There are two main types- <span class="objclass">atomic</span> and <span class="objclass">lists</span>. All elements of an atomic vector are the same type.
Examples include:
- character
- numeric (double)
- integer
- logical
### Character strings
When dealing with text, objects of class character are what you'd typically be dealing with.
```{r create_a_char, eval=F}
x = c('... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory')
x
```
Not much to it, but be aware there is no real limit to what is represented as a character vector. For example, in a data frame, you could have a column where each entry is one of the works of Shakespeare.
### Factors
Although not exactly precise, one can think of factors as integers with labels. So, the underlying representation of a variable for <span class="objclass">sex</span> is 1:2 with labels 'Male' and 'Female'. They are a special class with attributes, or metadata, that contains the information about the <span class="objclass">levels</span>.
```{r factor_atts}
x = factor(rep(letters[1:3], e=10))
attributes(x)
```
While the underlying representation is numeric, it is important to remember that factors are *categorical*. They can't be used as numbers would be, as the following demonstrates.
```{r factor_sum, eval=TRUE, error=TRUE}
as.numeric(x)
sum(x)
```
Any numbers could be used, what we're interested in are the labels, so a 'sum' doesn't make any sense. All of the following would produce the same factor.
```{r factor_rep, eval=FALSE}
factor(c(1, 2, 3), labels=c('a', 'b', 'c'))
factor(c(3.2, 10, 500000), labels=c('a', 'b', 'c'))
factor(c(.49, 1, 5), labels=c('a', 'b', 'c'))
```
Because of the integer+metadata representation, factors are actually smaller than character strings, often notably so.
```{r size_comparison}
x = sample(state.name, 10000, replace=T)
format(object.size(x), units='Kb')
format(object.size(factor(x)), units='Kb')
format(object.size(as.integer(factor(x))), units='Kb')
```
However, if memory is really a concern, it's probably not that using factors will help, but rather better hardware.
### Analysis
It is important to know that raw text cannot be analyzed quantitatively. There is no magic that takes a categorical variable with text labels and estimates correlations among words and other words or numeric data. *Everything* that can be analyzed must have some numeric representation first, and this is where factors come in. For example, here is a data frame with two categorical predictors (`factor*`), a numeric predictor (`x`), and a numeric target (`y`). What follows is what it looks like if you wanted to run a regression model in that setting.
```{r dummy, eval=-3}
df =
crossing(factor_1 = c('A', 'B'),
factor_2 = c('Q', 'X', 'J')) %>%
mutate(x=rnorm(6),
y=rnorm(6))
df
model.matrix(lm(y ~ x + factor_1 + factor_2, data=df))
```
```{r dummy_pretty, echo=FALSE}
model.matrix(lm(y ~ x + factor_1 + factor_2, data=df)) %>%
kable_df()
```
The <span class="func">model.matrix</span> function exposes the underlying matrix that is actually used in the regression analysis. You'd get a coefficient for each column of that matrix. As such, even the intercept must be represented in some fashion. For categorical data, the default coding scheme is <span class="emph">dummy coding</span>. A reference category is arbitrarily chosen (it doesn't matter which, and you can always change it), while the other categories are represented by indicator variables, where a 1 represents the corresponding label and everything else is zero. For details on this coding scheme or others, consult any basic statistical modeling book.
In addition, you'll note that in all text-specific analysis, the underlying information is numeric. For example, with topic models, the base data structure is a document-term matrix of counts.
### Characters vs. Factors
The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. If you know the relatively few levels the data can take, you'll generally want to use factors, or at least know that statistical packages and methods will require them. In addition, factors allow you to easily overcome the silly default alphabetical ordering of category levels in some very popular visualization packages.
For other things, such as text analysis, you'll almost certainly want character strings instead, and in many cases it will be required. It's also worth noting that a lot of base R and other behavior will coerce strings to factors. This made a lot more sense in the early days of R, but is not really necessary these days.
For more on this stuff see the following:
- http://adv-r.had.co.nz/Data-structures.html
- http://forcats.tidyverse.org/
- http://r4ds.had.co.nz/factors.html
- https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
- http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh
## Basic Text Functionality
### Base R
A lot of folks new to R are not aware of just how much basic text processing R comes with out of the box. Here are examples of note.
- <span class="func">paste</span>: glue text/numeric values together
- <span class="func">substr</span>: extract or replace substrings in a character vector
- <span class="func">grep</span> family: use regular expressions to deal with patterns of text
- <span class="func">strsplit</span>: split strings
- <span class="func">nchar</span>: how many characters in a string
- <span class="func">as.numeric</span>: convert a string to numeric if it can be
- <span class="func">strtoi</span>: convert a string to integer if it can be (faster than as.integer)
- <span class="func">adist</span>: string distances
I probably use <span class="func">paste</span>/<span class="func">paste0</span> more than most things when dealing with text, as string concatenation comes up so often. The following provides some demonstration.
```{r paste}
paste(c('a', 'b', 'cd'), collapse='|')
paste(c('a', 'b', 'cd'), collapse='')
paste0('a', 'b', 'cd') # shortcut to collapse=''
paste0('x', 1:3)
```
Beyond that, use of regular expression and functionality included in the <span class="func">grep</span> family is a major way to save a lot of time during data processing. I leave that to its own section later.
### Useful packages
A couple packages will probably take care of the vast majority of your standard text processing needs. Note that even if they aren't adding anything to the functionality of the base R functions, they typically will have been optimized in some fashion, particularly with regard to speed.
- <span class="pack">stringr</span>/<span class="pack">stringi</span>: More or less the same stuff you'll find with <span class="func">substr</span>, <span class="func">grep</span> etc. except easier to use and/or faster. They also add useful functionality not in base R (e.g. <span class="func">str_to_title</span>). The <span class="pack">stringr</span> package is mostly a wrapper for the <span class="pack">stringi</span> functions, with some additional functions.
- <span class="pack">tidyr</span>: has functions such as <span class="func">unite</span>, <span class="func">separate</span>, <span class="func">replace_na</span> that can often come in handy when working with data frames.
- <span class="pack">glue</span>: a newer package that can be seen as a fancier <span class="func">paste</span>. Most likely it will be useful when creating functions or shiny apps in which variable text output is desired.
One issue I have with both packages and base R is that often they return a list object, when it should be simplifying to the vector format it was initially fed. This sometimes requires an additional step or two of further processing that shouldn't be necessary, so be prepared for it[^str_all].
### Other
In this section, I'll add some things that come to mind that might come into play when you're dealing with text.
#### Dates
Dates are not character strings. Though they may start that way, if you actually want to treat them as dates you'll need to convert the string to the appropriate date class. The <span class="pack">lubridate</span> package makes dealing with dates much easier. It comes with conversion, extraction and other functionality that will be sure to save you some time.
```{r lubridate}
library(lubridate)
today()
today() + 1
today() + dyears(1)
leap_year(2016)
span = interval(ymd("2017-07-01"), ymd("2017-07-04"))
span
as.duration(span)
span %/% minutes(1)
```
This package makes dates so much easier, you should always use it when dealing with them.
#### Categorical Time
In regression modeling with few time points, one often has to decide on whether to treat the year as categorical (factor) or numeric (continuous). This greatly depends on how you want to tell your data story or other practical concerns. For example, if you have five years in your data, treating <span class="objclass">year</span> as categorical means you are interested in accounting for unspecified things that go on in a given year. If you treat it as numeric, you are more interested in trends. Either is fine.
#### Web
A major resource for text is of course the web. Packages like <span class="pack">rvest</span>,<span class="pack">httr</span>, <span class="pack">xml2</span>, and many other packages specific to website <span class="emph">APIs</span> are available to help you here. See the [R task view for web technologies](https://cran.r-project.org/web/views/WebTechnologies.html) as a starting point.
##### Encoding
Encoding can be a sizable PITA sometimes, and will often come up when dealing with webscraping and other languages. The <span class="pack">rvest</span> and <span class="pack">stringr</span> packages may be able to get you past some issues at least. See their respective functions <span class="func">repair_encoding</span> and <span class="func">str_conv</span> as starting points on this issue.
### Summary of basic text functionality
Being familiar with commonly used string functionality in base R and packages like <span class="pack">stringr</span> can save a ridiculous amount of time in your data processing. The more familiar you are with them the easier time you'll have with text.
## Regular Expressions
A <span class="emph">regular expression</span>, regex for short, is a sequence of characters that can be used as a search pattern for a string. Common operations are to merely detect, extract, or replace the matching string. There are actually many different flavors of regex for different programming languages, which are all flavors that originate with the Perl approach, or can enable the Perl approach to be used. However, knowing one means you pretty much know the others with only minor modifications if any.
To be clear, not only is regex another language, it's nigh on indecipherable. You will not learn much regex, but what you do learn will save a potentially enormous amount of time you'd otherwise spend trying to do things in a more haphazard fashion. Furthermore, practically every situation that will come up has already been asked and answered on [Stack Overflow](https://stackoverflow.com/questions/tagged/regex), so you'll almost always be able to search for what you need.
Here is an example:
`^r.*shiny[0-9]$`
What is *that* you may ask? Well here is an example of strings it would and wouldn't match.
```{r regex_intro_ex}
string = c('r is the shiny', 'r is the shiny1', 'r shines brightly')
grepl(string, pattern='^r.*shiny[0-9]$')
```
What the regex is esoterically attempting to match is any string that starts with 'r' and ends with 'shiny_' where _ is some single digit. Specifically, it breaks down as follows:
- **^** : starts with, so ^r means starts with r
- **.** : any character
- **\*** : match the preceding zero or more times
- **shiny** : match 'shiny'
- **[0-9]** : any digit 0-9 (note that we are still talking about strings, not actual numbered values)
- **$** : ends with preceding
### Typical Uses
None of it makes sense, so don't attempt to do so. Just try to remember a couple key approaches, and search the web for the rest.
Along with ^ . * [0-9] $, a couple more common ones are:
- **[a-z]** : letters a-z
- **[A-Z]** : capital letters
- **+** : match the preceding one or more times
- **()** : groupings
- **|** : logical or e.g. [a-z]|[0-9] (a lower-case letter or a number)
- **?** : preceding item is optional, and will be matched at most once. Typically used for 'look ahead' and 'look behind'
- **\\** : escape a character, like if you actually wanted to search for a period instead of using it as a regex pattern, you'd use \\., though in R you need \\\\, i.e. double slashes, for escape.
In addition, in R there are certain predefined characters that can be called:
- **[:punct:]** : punctuation
- **[:blank:]** : spaces and tabs
- **[:alnum:]** : alphanumeric characters
Those are just a few. The key functions can be found by looking at the help file for the <span class="func">grep</span> function (`?grep`). However, the <span class="pack">stringr</span> package has the same functionality with perhaps a slightly faster processing (though that's due to the underlying <span class="pack">stringi</span> package).
See if you can guess which of the following will turn up `TRUE`.
```{r quick_regex_exercise, eval=FALSE}
grepl(c('apple', 'pear', 'banana'), pattern='a')
grepl(c('apple', 'pear', 'banana'), pattern='^a')
grepl(c('apple', 'pear', 'banana'), pattern='^a|a$')
```
Scraping the web, munging data, just finding things in your scripts ... you can potentially use this all the time, and not only with text analysis, as we'll now see.
### dplyr helper functions
The <span class="pack">dplyr</span> package comes with some poorly documented[^poordoc] but quite useful helper functions that essentially serve as human-readable regex, which is a very good thing. These functions allow you to select variables[^helperrows] based on their names. They are usually just calling base R functions in the end.
- <span class="func">starts_with</span>: starts with a prefix (same as regex '^blah')
- <span class="func">ends_with</span>: ends with a prefix (same as regex 'blah$')
- <span class="func">contains</span>: contains a literal string (same as regex 'blah')
- <span class="func">matches</span>: matches a regular expression (put your regex here)
- <span class="func">num_range</span>: a numerical range like x01, x02, x03. (same as regex 'x[0-9][0-9]')
- <span class="func">one_of</span>: variables in character vector. (if you need to quote variable names, e.g. within a function)
- <span class="func">everything</span>: all variables. (a good way to spend time doing something only to accomplish what you would have by doing nothing, or a way to reorder variables)
<div class='note'>
For more on using <span class="pack">stringr</span> and regular expressions in R, you may find [this cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) useful.
<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;">
</div>
## Text Processing Examples
### Example 1
Let's say you're dealing with some data that has been handled typically, that is to say, poorly. For example, you have a variable in your data representing whether something is from the north or south region.
```{r label_problem, echo=FALSE}
df = data_frame(
id = 1:500,
x = round(rnorm(500), 2),
region = sample(c('north', 'north ', 'south', 'South', ' South', 'North ', 'North'), 500, replace=T)
)
DT::datatable(df,
rownames=F,
options=list(dom='t',
autoWidth = TRUE,
columnDefs = list(list(width = '50px', targets = 0),
list(className = 'dt-center', targets = 0:1),
list(className = 'dt-right', targets = 2))),
width='300px'
)
```
<br>
It might seem okay until...
```{r label_problem2, echo=1, eval=2}
table(df$region)
kable_df(table(df$region))
```
Even if you spotted the casing issue, there is still a white space problem[^excel]. Let's say you want this to be capitalized 'North' and 'South'. How might you do it? It's actually quite easy with the <span class="pack">stringr</span> tools.
```{r label_problem3, eval=FALSE}
library(stringr)
df %>%
mutate(region = str_trim(region),
region = str_to_title(region))
```
The <span class="func">str_trim</span> function trims white space from either side (or both), while <span class="func">str_to_title</span> converts everything to first letter capitalized.
```{r label_problem4, echo=2, eval=c(1,3)}
df_corrected = df %>%
mutate(region = str_trim(region),
region = str_to_title(region))
table(df_corrected$region)
kable_df(table(df_corrected$region))
```
Compare that to how you would have done it before knowing how to use text processing tools. One might have spent several minutes with some find and replace approach in a spreadsheet, or maybe even several `if... else` statements in R until all problematic cases were taken care of. Not very efficient.
### Example 2
Suppose you import a data frame, and the data was originally in wide format, where each column represented a year of data collection for the individual. Since it is bad form for data columns to have numbers for names, when you import it, the result looks like the following.
```{r rename_chunk, echo=FALSE}
df = data.frame(id=1:20, round(matrix(rnorm(100), ncol=5), 2))
DT::datatable(df, rownames=F,
options=list(dom='tp',
autoWidth = TRUE,
columnDefs = list(list(width = '50px', targets = 0),
list(className = 'dt-center', targets = 1)))
)
```
<br>
So, the problem now is to change the names to be Year_1, Year_2, etc. You might think you might have to use <span class="func">colnames</span> and manually create a string of names to replace the current ones.
```{r rename_chunk2, eval=FALSE}
colnames(df)[-1] = c('Year_1', 'Year_2', 'Year_3', 'Year_4', 'Year_5')
```
Or perhaps you're thinking of the paste0 function, which works fine and saves some typing.
```{r rename_chunk3, eval=FALSE}
colnames(df)[-1] = paste0('Year_', 1:5)
```
However, data sets may be hundreds of columns, and the columns of data may have the same pattern but not be next to one another. For example, the first few dozen columns are all data that belongs to the first wave, etc. It is tedious to figure out which columns you don't want, but even then you're resulting to using magic numbers with the above approach, and one column change to data will mean that redoing the name change will fail.
However, the following accomplishes what we want, and is reproducible regardless of where the columns are in the data set.
```{r rename_chunk4}
df %>%
rename_at(vars(num_range('X', 1:5)),
str_replace, pattern='X', replacement='Year_') %>%
head()
```
Let's parse what it's specifically doing.
- <span class="func">rename_at</span> allows us to rename specific columns
- Which columns? X1 through X:5. The <span class="func">num_range</span> helper function creates the character strings X1, X2, X3, X4, and X5.
- Now that we have the names, we use vars to tell <span class="func">rename_at</span> which ones. It would have allowed additional sets of variables as well.
- <span class="func">rename_at</span> needs a function to apply to each of those column names. In this case the function is <span class="func">str_replace</span>, to replace patterns of strings with some other string
- The specific arguments to <span class="func">str_replace</span> (pattern to be replaced, replacement pattern) are also supplied.
So in the end we just have to use the <span class="func">num_range</span> helper function within the function that tells <span class="func">rename_at</span> what it should be renaming, and let <span class="func">str_replace</span> do the rest.
## Exercises
1. In your own words, state the difference between a character string and a factor variable.
<br>
2. Consider the following character vector.
```{r ex_paste}
x = c('A', '1', 'Q')
```
How might you paste the elements together so that there is an underscore `_` between characters and no space ("A_1_Q")? If you highlight the next line you'll see the hint.
<span style="color:#fffff8">Revisit how we used the collapse argument within paste. `paste(..., collapse=?)`</span>
Paste Part 2: The following application of paste produces this result.
```{r ex2_paste}
paste(c('A', '1', 'Q'), c('B', '2', 'z'))
```
Now try to produce `"A - B" "1 - 2" "Q - z"`. To do this, note that one can paste any number of things together (i.e. more than two). So try adding ' - ' to it.
<br>
3. Use regex to grab the Star Wars names that have a number. Use both <span class="func">grep</span> and <span class="func">grepl</span> and compare the results
```{r ex_regex, eval=FALSE, echo=-1}
# grep(starwars$name, pattern = '[0-9]', value=T)
grep(starwars$name, pattern = ?)
```
Now use your hacking skills to determine which one is the tallest.
<br>
4. Load the <span class="pack">dplyr</span> package, and use the its [helper functions][dplyr helper functions] to grab all the columns in the <span class="objclass">starwars</span> data set (comes with the package) with `color` in the name but without referring to them directly. The following shows a generic example. There are several ways to do this. Try two if you can.
```{r ex_dplyr, eval=FALSE}
starwars %>%
select(helper_function('pattern'))
```
[^poordoc]: At least they're exposed now.
[^excel]: This is a very common issue among Excel users, and just one of the many reasons not to use it.
[^helperrows]: For rows, you'll have to use a <span class="func">grepl</span>/<span class="func">str_detect</span> approach. For example, `filter(grepl(col1, pattern='^X'))` would subset to only rows where col1 starts with X.
[^str_all]: I also don't think it necessary to have separate functions for str_* functions in <span class="pack">stringr</span> depending on whether, e.g. I want 'all' matches (practically every situation) or just the first (very rarely). It could have just been an additional argument with default `all=TRUE`.