# Topic modeling {#topicmodeling}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 150)
library(ggplot2)
theme_set(theme_light())
```
Topic modeling is a method for unsupervised classification of documents, by modeling each document as a mixture of topics and each topic as a mixture of words. [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a particularly popular method for fitting a topic model.
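Without going deep into the mathematics, LDA's generative story can be summarized in standard notation (a sketch; the symbols $\theta_d$, $\beta_k$, and $\alpha$ follow convention and are not used elsewhere in this chapter): each document $d$ draws topic proportions from a Dirichlet prior, and each word is drawn from the word distribution of a randomly chosen topic.

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$$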
We can use tidy text principles, as described in [Chapter 2](#tidytext), to approach topic modeling using consistent and effective tools. In particular, we'll be using tidying functions for LDA objects from the [topicmodels package](https://cran.r-project.org/package=topicmodels).
## The great library heist
Suppose a vandal has broken into your study and torn apart four of your books:
* *Great Expectations* by Charles Dickens
* *The War of the Worlds* by H.G. Wells
* *Twenty Thousand Leagues Under the Sea* by Jules Verne
* *Pride and Prejudice* by Jane Austen
This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books? This is a challenging problem since the individual chapters are **unlabeled**: we don't know in advance what words might distinguish them into groups. We'll thus use topic modeling to discover how the chapters cluster into distinct topics, each of them (presumably) representing one of the books.
We'll retrieve the text of these four books using the gutenbergr package:
```{r titles}
library(dplyr)
titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
"Pride and Prejudice", "Great Expectations")
```
```{r eval = FALSE}
library(gutenbergr)
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
```
```{r topic_books, echo = FALSE}
load("data/books.rda")
```
As pre-processing, we divide these into chapters, use tidytext's `unnest_tokens` to separate them into words, then remove `stop_words`, similar to the analyses we've done in Chapter \@ref(sentiment). We're treating every chapter as a separate "document", each with a name like `Great Expectations_1` or `Pride and Prejudice_11`. (In practice, each document might be one newspaper article, or one blog post).
```{r word_counts, dependson = "topic_books"}
library(tidytext)
library(stringr)
library(tidyr)
# Divide into documents, each representing one chapter
by_chapter <- books %>%
group_by(title) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
# Split into words
by_chapter_word <- by_chapter %>%
unnest_tokens(word, text)
# Find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
word_counts
```
## Latent Dirichlet allocation with the topicmodels package
Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a `DocumentTermMatrix` (from the tm package). As described in Chapter \@ref(dtm), we can cast a one-token-per-row table into a `DocumentTermMatrix` with tidytext's `cast_dtm`:
```{r chapters_dtm}
chapters_dtm <- word_counts %>%
cast_dtm(document, word, n)
chapters_dtm
```
We're now ready to use the [topicmodels](https://cran.r-project.org/package=topicmodels) package, specifically the `LDA` function, to create a four-topic latent Dirichlet allocation model. (In this case we know there are four topics because there are four books; in other problems we may need to try a few different values of `k`, as sketched below).
```{r chapters_lda}
library(topicmodels)
# setting a seed so that the output is predictable
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
class(chapters_lda)
```
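When the number of topics isn't known in advance, a common approach is to fit models for several values of `k` and compare them; a minimal sketch (the candidate values and the use of topicmodels' `perplexity` function here are illustrative, not part of the original analysis):

```{r try_k, eval = FALSE}
# Fit one model per candidate number of topics (illustrative values)
ks <- c(2, 4, 8, 16)
models <- lapply(ks, function(k) LDA(chapters_dtm, k = k, control = list(seed = 1234)))

# Lower perplexity indicates a better statistical fit,
# though interpretability of the topics matters at least as much
sapply(models, perplexity)
```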
Fitting the model above produces an object of class `LDA_VEM`, containing the full details of the fit, such as how words are associated with topics and how topics are associated with documents.
The tidytext package now gives us the option of *returning* to a tidy analysis, using the `tidy` and `augment` verbs borrowed from the [broom package](https://github.com/dgrtwo/broom). Note the similarity with the tidying functions in Chapter \@ref(casting): each takes a model object and returns a tidy data frame. In particular, we start with the `tidy` verb.
```{r chapters_lda_td}
chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td
```
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model estimates $\beta$, the probability of that term being generated from that topic.
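Since each topic's $\beta$ values form a probability distribution over the vocabulary, they should sum to one within every topic; a quick sanity check (a sketch, not part of the original analysis):

```{r beta_check, dependson = "chapters_lda_td"}
chapters_lda_td %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))
```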
We could use dplyr's `top_n` to find the top 5 terms within each topic:
```{r top_terms}
top_terms <- chapters_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
```
This model lends itself to a visualization:
```{r top_terms_plot, fig.height=6, fig.width=7}
library(ggplot2)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
```
These topics are pretty clearly associated with the four books! There's no question that the topic of "nemo", "sea", and "nautilus" belongs to *Twenty Thousand Leagues Under the Sea*, and that "jane", "darcy", and "elizabeth" belongs to *Pride and Prejudice*. We see "pip" and "joe" from *Great Expectations* and "martians", "black", and "night" from *The War of the Worlds*.
## Per-document classification
Each chapter was a "document" in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?
```{r chapters_lda_gamma_raw}
chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_lda_gamma
```
Setting `matrix = "gamma"` returns a tidied version with one-document-per-topic-per-row. Now that we have these document classifications, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:
```{r chapters_lda_gamma}
chapters_lda_gamma <- chapters_lda_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_lda_gamma
```
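As with $\beta$, the $\gamma$ values are probabilities, so they should sum to one within each chapter; a quick sanity check (a sketch, not part of the original analysis):

```{r gamma_check, dependson = "chapters_lda_gamma"}
chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  summarize(total_gamma = sum(gamma))
```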
Then we can plot how the `gamma` probabilities are distributed across topics within each title:
```{r chapters_lda_gamma_plot, fig.width=8, fig.height=6}
ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
geom_histogram() +
facet_wrap(~ title, nrow = 2)
```
We notice that almost all of the chapters from *Pride and Prejudice*, *War of the Worlds*, and *Twenty Thousand Leagues Under the Sea* were uniquely identified as a single topic each.
```{r chapter_classifications}
chapter_classifications <- chapters_lda_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup() %>%
arrange(gamma)
chapter_classifications
```
Are any chapters associated with the "wrong" topic? We can check by finding the consensus book for each topic: the title most common among the chapters assigned to it. This matches our earlier visualization:
```{r book_topics}
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)
book_topics
```
Then we see which chapters were misidentified:
```{r}
chapter_classifications %>%
inner_join(book_topics, by = "topic") %>%
count(title, consensus)
```
We see that only a few chapters from *Great Expectations* were misclassified. Not bad for unsupervised clustering!
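To see exactly which chapters were misclassified, we can filter for cases where the original title and the consensus disagree (a small follow-up sketch):

```{r misclassified, dependson = "book_topics"}
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)
```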
## By word assignments: `augment`
One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. Generally, the more words in a document are assigned to a topic, the more weight (`gamma`) will go on that document-topic classification.
We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the `augment` verb.
```{r assignments}
assignments <- augment(chapters_lda, data = chapters_dtm)
```
We can combine this with the consensus book titles to find which words were incorrectly classified.
```{r assignments2, dependson = "assignments"}
assignments <- assignments %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(book_topics, by = c(".topic" = "topic"))
assignments
```
We can, for example, create a "confusion matrix" using dplyr's `count` and tidyr's `spread`:
```{r dependson = "assignments2"}
assignments %>%
count(title, consensus, wt = count) %>%
spread(consensus, n, fill = 0)
```
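The same confusion matrix can be easier to scan as a heatmap; a sketch using ggplot2's `geom_tile` (the percent scaling is an illustrative choice, not part of the original analysis):

```{r confusion_heatmap, dependson = "assignments2", fig.width=7, fig.height=6}
library(scales)
assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", labels = percent_format()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")
```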
We notice that almost all the words for *Pride and Prejudice*, *Twenty Thousand Leagues Under the Sea*, and *War of the Worlds* were correctly assigned, while *Great Expectations* had a fair amount of misassignment.
What were the most commonly mistaken words?
```{r wrong_words, dependson = "assignments2"}
wrong_words <- assignments %>%
filter(title != consensus)
wrong_words
wrong_words %>%
count(title, consensus, term, wt = count) %>%
ungroup() %>%
arrange(desc(n))
```
Notice the word "flopson" here; these wrong words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm "flopson" appears only in *Great Expectations*:
```{r dependson = "word_counts"}
word_counts %>%
filter(word == "flopson")
```
This misassignment happens because the LDA algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.
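Because of this stochasticity, one reasonable robustness check is to re-fit the model with different random seeds and compare the resulting topics; a minimal sketch (the seed values are arbitrary):

```{r seed_check, eval = FALSE}
# Different seeds can converge to different topic solutions
lda_a <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
lda_b <- LDA(chapters_dtm, k = 4, control = list(seed = 5678))

# Compare each run's top terms to see whether the topics are stable
tidy(lda_b) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, -beta)
```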
## Alternative LDA implementations
The `LDA` function in the topicmodels package is only one implementation of the latent Dirichlet allocation algorithm. For example, the [mallet](https://cran.r-project.org/package=mallet) package implements a wrapper around the [MALLET](http://mallet.cs.umass.edu/) Java package for text mining. Unlike topicmodels, it takes non-tokenized documents and performs the tokenization itself, so we first collapse each chapter back into a single string:
```{r mallet_model, dependson = "word_counts", results = "hide"}
library(mallet)
collapsed <- by_chapter_word %>%
anti_join(stop_words, by = "word") %>%
mutate(word = str_replace(word, "'", "")) %>%
group_by(document) %>%
summarize(text = paste(word, collapse = " "))
# The mallet package requires a file of stopwords
# Since we've already filtered them, we can give it an empty file
file.create(empty_file <- tempfile())
docs <- mallet.import(collapsed$document, collapsed$text, empty_file)
mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(200)
```
Once the model is fitted, the tidying functions work almost identically to the tidiers described in the rest of this chapter. In particular, they take the model object (and, for `augment`, the original term counts) and return the same kinds of tidy data frames:
```{r}
# word-topic pairs
tidy(mallet_model)
# document-topic pairs
tidy(mallet_model, matrix = "gamma")
# column needs to be named "term" for "augment"
word_counts <- rename(word_counts, term = word)
augment(mallet_model, word_counts)
```