# Topic modeling {#topicmodeling}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 150)
library(ggplot2)
theme_set(theme_light())
```
Topic modeling is a method for unsupervised classification of documents, by modeling each document as a mixture of topics and each topic as a mixture of words. [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a particularly popular method for fitting a topic model.
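Without going deep into the mathematics, LDA's generative story can be summarized in standard notation (a sketch; the symbols $\theta_d$, $\beta_k$, and $\alpha$ follow convention and are not used elsewhere in this chapter): each document $d$ draws topic proportions from a Dirichlet prior, and each word is drawn from the word distribution of a randomly chosen topic.

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$$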
We can use tidy text principles, as described in [Chapter 2](#tidytext), to approach topic modeling using consistent and effective tools. In particular, we'll be using tidying functions for LDA objects from the [topicmodels package](https://cran.r-project.org/package=topicmodels).
## The great library heist
Suppose a vandal has broken into your study and torn apart four of your books:
* *Great Expectations* by Charles Dickens
* *The War of the Worlds* by H.G. Wells
* *Twenty Thousand Leagues Under the Sea* by Jules Verne
* *Pride and Prejudice* by Jane Austen
This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books? This is a challenging problem since the individual chapters are **unlabeled**: we don't know in advance what words might distinguish them into groups. We'll thus use topic modeling to discover how the chapters cluster into distinct topics, each of them (presumably) representing one of the books.
We'll retrieve the text of these four books using the gutenbergr package:
```{r titles}
library(dplyr)
titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
"Pride and Prejudice", "Great Expectations")
```
```{r eval = FALSE}
library(gutenbergr)
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
```
```{r topic_books, echo = FALSE}
load("data/books.rda")
```
As pre-processing, we divide these into chapters, use tidytext's `unnest_tokens` to separate them into words, then remove `stop_words`, similar to the analyses we've done in Chapter \@ref(sentiment). We're treating every chapter as a separate "document", each with a name like `Great Expectations_1` or `Pride and Prejudice_11`. (In practice, each document might be one newspaper article, or one blog post).
```{r word_counts, dependson = "topic_books"}
library(tidytext)
library(stringr)
library(tidyr)
# Divide into documents, each representing one chapter
by_chapter <- books %>%
group_by(title) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
# Split into words
by_chapter_word <- by_chapter %>%
unnest_tokens(word, text)
# Find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
word_counts
```
## Latent Dirichlet allocation with the topicmodels package
Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a `DocumentTermMatrix` (from the tm package). As described in Chapter \@ref(dtm), we can cast a one-token-per-row table into a `DocumentTermMatrix` with tidytext's `cast_dtm`:
```{r chapters_dtm}
chapters_dtm <- word_counts %>%
cast_dtm(document, word, n)
chapters_dtm
```
We're now ready to use the [topicmodels](https://cran.r-project.org/package=topicmodels) package, specifically the `LDA` function, to create a four-topic latent Dirichlet allocation model. (In this case we know there are four topics because there are four books; in other problems we may need to try a few different values of `k`, as sketched below).
```{r chapters_lda}
library(topicmodels)
# setting a seed so that the output is predictable
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
class(chapters_lda)
```
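When the number of topics isn't known in advance, a common approach is to fit models for several values of `k` and compare them; a minimal sketch (the candidate values and the use of topicmodels' `perplexity` function here are illustrative, not part of the original analysis):

```{r try_k, eval = FALSE}
# Fit one model per candidate number of topics (illustrative values)
ks <- c(2, 4, 8, 16)
models <- lapply(ks, function(k) LDA(chapters_dtm, k = k, control = list(seed = 1234)))

# Lower perplexity indicates a better statistical fit,
# though interpretability of the topics matters at least as much
sapply(models, perplexity)
```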
Fitting the model above produces an object of class `LDA_VEM`, containing the full details of the fit, such as how words are associated with topics and how topics are associated with documents.
The tidytext package now gives us the option of *returning* to a tidy analysis, using the `tidy` and `augment` verbs borrowed from the [broom package](https://github.com/dgrtwo/broom). Note the similarity with the tidying functions in Chapter \@ref(casting): each takes a model object and returns a tidy data frame. In particular, we start with the `tidy` verb.
```{r chapters_lda_td}
chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td
```
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model estimates $\beta$, the probability of that term being generated from that topic.
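Since each topic's $\beta$ values form a probability distribution over the vocabulary, they should sum to one within every topic; a quick sanity check (a sketch, not part of the original analysis):

```{r beta_check, dependson = "chapters_lda_td"}
chapters_lda_td %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))
```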
We could use dplyr's `top_n` to find the top 5 terms within each topic:
```{r top_terms}
top_terms <- chapters_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
```
This model lends itself to a visualization:
```{r top_terms_plot, fig.height=6, fig.width=7}
library(ggplot2)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
```
These topics are pretty clearly associated with the four books! There's no question that the topic of "nemo", "sea", and "nautilus" belongs to *Twenty Thousand Leagues Under the Sea*, and that "jane", "darcy", and "elizabeth" belongs to *Pride and Prejudice*. We see "pip" and "joe" from *Great Expectations* and "martians", "black", and "night" from *The War of the Worlds*.
## Per-document classification
Each chapter was a "document" in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?
```{r chapters_lda_gamma_raw}
chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_lda_gamma
```
Setting `matrix = "gamma"` returns a tidied version with one-document-per-topic-per-row. Now that we have these document classifications, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:
```{r chapters_lda_gamma}
chapters_lda_gamma <- chapters_lda_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_lda_gamma
```
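As with $\beta$, the $\gamma$ values are probabilities, so they should sum to one within each chapter; a quick sanity check (a sketch, not part of the original analysis):

```{r gamma_check, dependson = "chapters_lda_gamma"}
chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  summarize(total_gamma = sum(gamma))
```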
Then we can plot how the `gamma` probabilities are distributed across topics within each title:
```{r chapters_lda_gamma_plot, fig.width=8, fig.height=6}
ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
geom_histogram() +
facet_wrap(~ title, nrow = 2)
```
We notice that almost all of the chapters from *Pride and Prejudice*, *War of the Worlds*, and *Twenty Thousand Leagues Under the Sea* were uniquely identified as a single topic each.
```{r chapter_classifications}
chapter_classifications <- chapters_lda_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup() %>%
arrange(gamma)
chapter_classifications
```
Are any chapters associated with the "wrong" topic? We can check by finding the consensus book for each topic: the title most common among the chapters assigned to it. This matches our earlier visualization:
```{r book_topics}
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)
book_topics
```
Then we see which chapters were misidentified:
```{r}
chapter_classifications %>%
inner_join(book_topics, by = "topic") %>%
count(title, consensus)
```
We see that only a few chapters from *Great Expectations* were misclassified. Not bad for unsupervised clustering!
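To see exactly which chapters were misclassified, we can filter for cases where the original title and the consensus disagree (a small follow-up sketch):

```{r misclassified, dependson = "book_topics"}
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)
```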
## By word assignments: `augment`
One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. Generally, the more words in a document are assigned to a topic, the more weight (`gamma`) will go on that document-topic classification.
We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the `augment` verb.
```{r assignments}
assignments <- augment(chapters_lda, data = chapters_dtm)
```
We can combine this with the consensus book titles to find which words were incorrectly classified.
```{r assignments2, dependson = "assignments"}
assignments <- assignments %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(book_topics, by = c(".topic" = "topic"))
assignments
```
We can, for example, create a "confusion matrix" using dplyr's `count` and tidyr's `spread`:
```{r dependson = "assignments2"}
assignments %>%
count(title, consensus, wt = count) %>%
spread(consensus, n, fill = 0)
```
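The same confusion matrix can be easier to scan as a heatmap; a sketch using ggplot2's `geom_tile` (the percent scaling is an illustrative choice, not part of the original analysis):

```{r confusion_heatmap, dependson = "assignments2", fig.width=7, fig.height=6}
library(scales)
assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", labels = percent_format()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")
```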
We notice that almost all the words for *Pride and Prejudice*, *Twenty Thousand Leagues Under the Sea*, and *War of the Worlds* were correctly assigned, while *Great Expectations* had a fair amount of misassignment.
What were the most commonly mistaken words?
```{r wrong_words, dependson = "assignments2"}
wrong_words <- assignments %>%
filter(title != consensus)
wrong_words
wrong_words %>%
count(title, consensus, term, wt = count) %>%
ungroup() %>%
arrange(desc(n))
```
Notice the word "flopson" here; these wrong words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm "flopson" appears only in *Great Expectations*:
```{r dependson = "word_counts"}
word_counts %>%
filter(word == "flopson")
```
This misassignment happens because the LDA algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.
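Because of this stochasticity, one reasonable robustness check is to re-fit the model with different random seeds and compare the resulting topics; a minimal sketch (the seed values are arbitrary):

```{r seed_check, eval = FALSE}
# Different seeds can converge to different topic solutions
lda_a <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
lda_b <- LDA(chapters_dtm, k = 4, control = list(seed = 5678))

# Compare each run's top terms to see whether the topics are stable
tidy(lda_b) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, -beta)
```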
## Alternative LDA implementations
The `LDA` function in the topicmodels package is only one implementation of the latent Dirichlet allocation algorithm. For example, the [mallet](https://cran.r-project.org/package=mallet) package implements a wrapper around the [MALLET](http://mallet.cs.umass.edu/) Java package for text mining. Unlike topicmodels, it takes non-tokenized documents and performs the tokenization itself, so we first collapse each chapter back into a single string:
```{r mallet_model, dependson = "word_counts", results = "hide"}
library(mallet)
collapsed <- by_chapter_word %>%
anti_join(stop_words, by = "word") %>%
mutate(word = str_replace(word, "'", "")) %>%
group_by(document) %>%
summarize(text = paste(word, collapse = " "))
# The mallet package requires a file of stopwords
# Since we've already filtered them, we can give it an empty file
file.create(empty_file <- tempfile())
docs <- mallet.import(collapsed$document, collapsed$text, empty_file)
mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(200)
```
Once the model is fitted, the tidying functions work almost identically to the tidiers described in the rest of this chapter. In particular, they take the model object (and, for `augment`, the original term counts) and return the same kinds of tidy data frames:
```{r}
# word-topic pairs
tidy(mallet_model)
# document-topic pairs
tidy(mallet_model, matrix = "gamma")
# column needs to be named "term" for "augment"
word_counts <- rename(word_counts, term = word)
augment(mallet_model, word_counts)
```