---
title: 'Tidyverse: using stringr, dplyr, and tibble to clean up catch phrases'
author: "Matthew Lucich"
date: "4/2/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## Cleaning up catch phrases from classic movies
Source: https://www.kaggle.com/thomaskonstantin/150-famous-movie-catchphrases-with-context?select=Catchphrase.csv
I chose a text-only dataset to demonstrate stringr's string manipulation functions. Most examples also store the data in a tibble and use dplyr's data management functions.
```{r, warning=FALSE}
# Load the data as a tibble (directly from a GitHub gist) and rename the columns to snake_case
catch_phrases <- read_csv("https://gist.githubusercontent.com/mattlucich/afc4b9c362e303c1f6ba8880877f0b60/raw/a08a96361705b00c4ee8f4ec0b3d324f864ae419/catchphrase.csv") %>%
  rename(catchphrase = Catchphrase,
         movie_name = `Movie Name`,
         context = Context)
```
## 1: How do I remove repeated extraneous characters from my data?
**Answer**: Use stringr's str_replace_all() function, which replaces every match of the character/pattern of interest (str_replace() would replace only the first match in each string).
```{r}
# Remove extraneous line breaks
catch_phrases$catchphrase <- catch_phrases$catchphrase %>% str_replace_all("\n" , "")
head(catch_phrases)
```
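As a small illustration of the difference, the sketch below uses a made-up string (`toy_phrase` is hypothetical, not taken from the dataset) and replaces line breaks with a space. Whether a space or an empty string is the better replacement for the real data depends on where its line breaks fall.

```{r}
library(stringr)

# Hypothetical toy string with two line breaks (not from the dataset)
toy_phrase <- "Here's\nlooking at\nyou, kid."

# str_replace() only touches the first match in each string
first_only <- str_replace(toy_phrase, "\n", " ")

# str_replace_all() touches every match
every_match <- str_replace_all(toy_phrase, "\n", " ")

first_only
every_match
```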
## 2: How do I convert uppercase text to title case?
**Answer**: Wrap stringr's str_to_title() in dplyr's mutate() to convert every value in the column to title case, overwriting the original values.
```{r}
# Use stringr and dplyr to convert all of the movie names to title case
catch_phrases <- catch_phrases %>% mutate(movie_name = str_to_title(movie_name))
head(catch_phrases)
```
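For a quick look at the case helpers on their own, here is a minimal sketch on a made-up title (`toy_title` is a hypothetical value). Note that str_to_title() capitalizes every word, including short ones such as "the".

```{r}
library(stringr)

# Hypothetical upper-case title (not from the dataset)
toy_title <- "GONE WITH THE WIND"

title_case <- str_to_title(toy_title)  # first letter of every word capitalized
lower_case <- str_to_lower(toy_title)  # everything lower case

title_case
lower_case
```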
## 3: How do I filter for rows in a tibble containing certain characters/patterns?
**Answer**: Use stringr's str_detect() to detect the character/pattern of interest, then use dplyr's filter() to return only rows where the str_detect() condition is true.
```{r}
# Filter for the high-energy quotes (i.e. ones with exclamation points)
exclamation_points <- catch_phrases %>% filter(str_detect(catchphrase, "!"))
# Proportion of quotes that have exclamation points (multiply by 100 for a percentage)
nrow(exclamation_points) / nrow(catch_phrases)
```
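Because str_detect() returns a logical vector, the same proportion can also be computed by taking its mean. A minimal sketch on made-up quotes (`toy_quotes` is hypothetical):

```{r}
library(stringr)

# Hypothetical quotes (not from the dataset)
toy_quotes <- c("Yippee-ki-yay!", "I'll be back.", "Show me the money!", "Here's Johnny!")

# TRUE where the quote contains an exclamation point
has_exclamation <- str_detect(toy_quotes, "!")

# The mean of a logical vector is the share of TRUE values
share_exclaimed <- mean(has_exclamation)
share_exclaimed
```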
## 4: How do I count how many matches a string has with a particular character/pattern (and filter out tibble rows with zero matches)?
**Answer**: Use stringr's str_count() to count the number of matches for the character/pattern of interest. Use dplyr's mutate() to store the count for each row in a new column. Then use dplyr's filter() to return only rows with at least one match, select() to return only the columns of interest, and arrange() to sort the data in descending order by match count.
```{r}
# Count exclamation points per quote, keep quotes with at least one, and sort by count
catch_phrases %>%
  mutate(exc_count = str_count(catchphrase, "!")) %>%
  filter(exc_count > 0) %>%
  select(catchphrase, exc_count) %>%
  arrange(desc(exc_count))
```
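One thing to watch for when counting punctuation: the pattern is a regular expression by default, so a character such as "." matches any character. The sketch below, on made-up quotes (`toy_counts_in` is hypothetical), wraps the pattern in fixed() to count a literal period.

```{r}
library(stringr)

# Hypothetical quotes (not from the dataset)
toy_counts_in <- c("Run! Run! Run!", "No.", "Help!")

# "!" has no special meaning in a regex, so it can be counted directly
exc_counts <- str_count(toy_counts_in, "!")

# "." as a regex matches any character; fixed() treats it literally
any_char_count <- str_count("No.", ".")
period_count <- str_count("No.", fixed("."))

exc_counts
any_char_count
period_count
```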
## 5: How do I concatenate columns?
**Answer**: Use stringr's str_glue_data() function to combine multiple columns into one string per row, with literal text placed before, between, or after the column values. The example below passes in the whole tibble but uses only the row number, movie_name, and catchphrase, with a dash between the last two.
```{r}
# Combine into one string
cp_glue <- catch_phrases %>% str_glue_data("{rownames(.)} {movie_name} - {catchphrase}")
head(cp_glue)
```
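To see the template mechanics in isolation, here is a minimal sketch on a two-row tibble (`toy_movies` is hypothetical). It numbers the rows with seq_along() on a column, which is one possible alternative to rownames(.).

```{r}
library(stringr)
library(tibble)

# Hypothetical two-row tibble (not the real dataset)
toy_movies <- tibble(
  movie_name = c("Casablanca", "Jaws"),
  catchphrase = c("Here's looking at you, kid.", "You're gonna need a bigger boat.")
)

# Anything inside {} is evaluated with the tibble's columns in scope
toy_glue <- str_glue_data(toy_movies, "{seq_along(movie_name)} {movie_name} - {catchphrase}")
toy_glue
```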
## 6: How do I order a vector alphabetically?
**Answer**: Use dplyr's pull() function to extract the catchphrase column from the catch_phrases tibble, converting it into a vector. Use stringr's str_sort() function to order the vector alphabetically.
```{r}
# Convert catchphrase column to vector
cp_vec <- catch_phrases %>% pull(catchphrase)
# Sort the catchphrases alphabetically (locale-aware; the default locale is "en")
cp_sort <- str_sort(cp_vec)
head(cp_sort)
```
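str_sort() also accepts `decreasing` and `numeric` arguments. A minimal sketch on made-up strings (`toy_takes` is hypothetical) shows how `numeric = TRUE` sorts embedded digits as numbers rather than character by character:

```{r}
library(stringr)

# Hypothetical strings containing numbers (not from the dataset)
toy_takes <- c("Take 10", "Take 2", "Take 1")

default_sort <- str_sort(toy_takes)                   # "10" sorts before "2"
numeric_sort <- str_sort(toy_takes, numeric = TRUE)   # digits compared as numbers
reverse_sort <- str_sort(toy_takes, numeric = TRUE, decreasing = TRUE)

default_sort
numeric_sort
reverse_sort
```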
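## 7: How do I filter for catchphrases pertaining to a subject?
**Answer**: Use dplyr's `pull()` to convert the column to a vector, then stringr's `str_match()` with a regex of subject words to find the catchphrases that mention the subject. Phrases without a match come back as NA and are dropped: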
```{r}
# Convert catchphrase column to vector
cp_vec <- catch_phrases %>% pull(catchphrase)
# Define a case-insensitive regex for words pertaining to a subject like "time"
time <- "(?i).*(yesterday|today|tomorrow|day|time|hour|morning|afternoon|night|later|always).*"
# Match the catchphrases against the subject regex
cp_time <- cp_vec %>% str_match(time)
# Keep the full-match column and drop phrases that did not match (NA)
cp_time <- cp_time[, 1][!is.na(cp_time[, 1])]
cp_time
```
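When only the matching strings are needed (not the captured groups), stringr's str_subset() is a shorter alternative to str_match() plus NA filtering. This is a sketch of that swapped-in approach on made-up quotes (`toy_time_quotes` and `time_words` are hypothetical, and the word list is shortened):

```{r}
library(stringr)

# Hypothetical quotes (not from the dataset)
toy_time_quotes <- c("I'll be back.", "Tomorrow is another day.", "Show me the money!")

# regex(ignore_case = TRUE) plays the same role as the (?i) flag
time_words <- regex("yesterday|today|tomorrow|day|time", ignore_case = TRUE)

# Keep only the strings that contain at least one of the words
toy_time_matches <- str_subset(toy_time_quotes, time_words)
toy_time_matches
```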
### Extension by Daniel Moscoe
## 7: How do I sort catchphrases by length?
**Answer**: Use dplyr's mutate() function together with stringr's str_length() to add a column holding the length of each catchphrase. Then use dplyr's arrange() function to sort the tibble by that new column.
```{r}
# Add a column with each catchphrase's character count, then sort shortest first
cp_by_len <- catch_phrases %>%
  mutate(length = str_length(catchphrase)) %>%
  arrange(length)
head(cp_by_len)
```
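For a plain character vector rather than a tibble, one possible approach is to index the vector by the order of its lengths. A minimal sketch on made-up strings (`toy_lengths_in` is hypothetical):

```{r}
library(stringr)

# Hypothetical strings (not from the dataset)
toy_lengths_in <- c("Rosebud", "Go", "Here's Johnny!")

# order() returns the positions that would sort the lengths ascending
toy_by_length <- toy_lengths_in[order(str_length(toy_lengths_in))]
toy_by_length
```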
## 8: How do I wrap a string to create paragraphs with a maximum line width?
**Answer**: Use stringr's str_wrap() function to insert line breaks into a string. Line breaks are only inserted between words, so lines stay within the given width unless a single word is longer than it.
```{r}
cp_wrap <- catch_phrases %>%
mutate("wrapped" = str_wrap(catchphrase, width = 30))
head(cp_wrap$wrapped)
```
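The wrapped strings contain "\n" characters, so printing them with cat() shows the actual line breaks. A minimal sketch on a made-up sentence (`toy_long` is hypothetical), with a width of 20 chosen only for illustration:

```{r}
library(stringr)

# Hypothetical sentence (not from the dataset)
toy_long <- "Frankly, my dear, I don't give a damn."

# Insert line breaks so each line aims to fit within 20 characters
toy_wrapped <- str_wrap(toy_long, width = 20)

# cat() renders the embedded line breaks
cat(toy_wrapped)

# Length of each resulting line
toy_line_lengths <- str_length(str_split(toy_wrapped, "\n")[[1]])
toy_line_lengths
```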
## 9: How can I extract text that matches a regexp?
**Answer**: Use stringr's str_extract() with the regexp you wish to match. It returns the first match in each string, or NA when there is none. For this data, we can use this strategy (with moderate success) to extract the name of the character who delivers the catchphrase.
```{r}
# Take the text at the start of context, up to a following comma or apostrophe
cp_char <- catch_phrases %>%
  mutate(character = str_extract(context, "^[^,]+(?=[,\\'])"))
head(cp_char)
```
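To show the NA behaviour on its own, here is a minimal sketch with made-up context strings (`toy_context` is hypothetical) and a simplified pattern that only looks for a following comma:

```{r}
library(stringr)

# Hypothetical context strings (not from the dataset)
toy_context <- c("Rick Blaine, saying goodbye to Ilsa", "No comma here")

# Simplified pattern: everything from the start up to (not including) the first comma
toy_speaker <- str_extract(toy_context, "^[^,]+(?=,)")
toy_speaker
```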
### Extension by Daniel Sullivan
## 10: How can I make sure all of my strings are a uniform length?
**Answer**: Use str_pad() to pad strings to a target width with a chosen character, specifying whether the padding goes at the beginning (left), the end (right), or both sides. Strings already at or beyond the target width are left unchanged. For this data, we can pad the catchphrase column to 276 characters.
```{r}
# Left-pad every catchphrase with spaces to a width of 276 characters
cp_pad <- catch_phrases %>%
  mutate(pad_catchphrase = str_pad(catchphrase, width = 276, side = "left", pad = " "))
head(cp_pad$pad_catchphrase)
```
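Rather than hard-coding the width, one option is to derive it from the longest string. The sketch below does this on made-up strings (`toy_pad_in` is hypothetical) and also shows that str_pad() never shortens a string:

```{r}
library(stringr)

# Hypothetical strings (not from the dataset)
toy_pad_in <- c("Go", "Rosebud", "Here's Johnny!")

# Use the longest string as the target width
toy_width <- max(str_length(toy_pad_in))
toy_padded <- str_pad(toy_pad_in, width = toy_width, side = "left", pad = " ")

# A width smaller than the string leaves it unchanged
toy_too_narrow <- str_pad("Rosebud", width = 3)

toy_padded
toy_too_narrow
```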
## 11: How do I trim whitespace from entries in a column?
**Answer**: Use str_trim(), specifying whether to remove whitespace from the left, the right, or both sides. Here we can clean up the column we made above with str_pad().
```{r}
# Remove the leading spaces added by the padding step
cp_pad$pad_catchphrase <- cp_pad$pad_catchphrase %>%
  str_trim(side = "left")
head(cp_pad$pad_catchphrase)
```
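str_trim() only touches the ends of a string. If repeated spaces inside the string also need cleaning, stringr's str_squish() handles both. A minimal sketch on a made-up string (`toy_spacey` is hypothetical):

```{r}
library(stringr)

# Hypothetical string with extra whitespace (not from the dataset)
toy_spacey <- "  so   much  space "

left_trimmed <- str_trim(toy_spacey, side = "left")  # leading whitespace only
both_trimmed <- str_trim(toy_spacey, side = "both")  # both ends, interior untouched
squished <- str_squish(toy_spacey)                   # ends trimmed, interior runs collapsed

left_trimmed
both_trimmed
squished
```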
## 12: How do I cap the character length and add an ellipsis where the text was truncated?
**Answer**: Use the str_trunc() function, specifying the string(s) you want to truncate, the maximum width, the side to truncate from, and the string to use as the ellipsis. The maximum width includes the ellipsis.
```{r}
# Cap each catchphrase at 40 characters, cutting from the right
cp_trunc <- catch_phrases$catchphrase %>%
  str_trunc(40, side = "right", ellipsis = "...")
head(cp_trunc)
```
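A minimal sketch on made-up strings (`toy_trunc_in` is hypothetical) shows that the ellipsis counts toward the width and that strings already short enough are returned unchanged:

```{r}
library(stringr)

# Hypothetical strings (not from the dataset)
toy_trunc_in <- c("You're gonna need a bigger boat.", "Rosebud")

# Width 15 = 12 characters of text plus the 3-character ellipsis
toy_truncated <- str_trunc(toy_trunc_in, 15, side = "right", ellipsis = "...")
toy_truncated
```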