Wouter van Atteveldt & Kasper Welbers 2020-03
LDA, which stands for Latent Dirichlet Allocation, is one of the most popular approaches for probabilistic topic modeling. The goal of topic modeling is to automatically assign topics to documents without requiring human supervision. Although the idea of an algorithm figuring out topics might sound close to magical (mostly because people have too high expectations of what these ‘topics’ are), and the mathematics might be a bit challenging, it is actually quite simple to fit an LDA topic model in R.
A good first step towards understanding what topic models are and how they can be useful is to simply play around with them, so that’s what we’ll do here. First, let’s create a document-term matrix from the inaugural speeches in quanteda, at the paragraph level, since we can expect each paragraph to be mostly about a single topic:
library(quanteda)
# Split the inaugural speeches into paragraphs and create a document-term matrix,
# removing punctuation and English stopwords
corp = corpus_reshape(data_corpus_inaugural, to = "paragraphs")
dfm = dfm(corp, remove_punct = TRUE, remove = stopwords("english"))
# Keep only terms that occur in at least 5 documents
dfm = dfm_trim(dfm, min_docfreq = 5)
To run LDA from a dfm, first convert it to the topicmodels format, and then run LDA. Note the use of `set.seed(.)` to make sure that the analysis is reproducible.
library(topicmodels)
# Convert the quanteda dfm to the format used by topicmodels
dtm = convert(dfm, to = "topicmodels")
# Fit an LDA model with 10 topics using Gibbs sampling
set.seed(1)
m = LDA(dtm, method = "Gibbs", k = 10, control = list(alpha = 0.1))
m
## A LDA_Gibbs topic model with 10 topics.
Although LDA will figure out the topics, we do need to decide ourselves how many topics we want. Also, there are certain hyperparameters (such as alpha) that we can tweak to have some control over the topic distributions. For now, we won’t go into details, but do note that we could also have asked for 100 topics, and our results would have been quite different.
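To get a feel for these choices, you can simply refit the model with different settings and compare the top terms. A minimal sketch, using purely illustrative values (25 topics and a higher alpha, which allows more mixed documents; the name `m_alt` is just for this example):

set.seed(1)
# Hypothetical alternative settings: more topics, higher alpha
m_alt = LDA(dtm, method = "Gibbs", k = 25, control = list(alpha = 0.5))
terms(m_alt, 5)

For the rest of this tutorial we stick with the 10-topic model `m`.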
We can use `terms` to look at the top terms per topic:
terms(m, 5)
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 |
|---|---|---|---|---|---|---|---|---|---|
| people | years | government | states | us | war | world | government | us | government |
| upon | world | public | united | one | nations | must | every | can | congress |
| shall | time | revenue | people | america | peace | can | states | new | law |
| may | president | business | now | every | foreign | peace | union | world | shall |
| public | new | must | great | must | united | freedom | people | let | upon |
The `posterior` function gives the posterior distributions of words per topic and of topics per document, which can be used to plot a word cloud of terms proportional to their occurrence in a topic:
topic = 6
words = posterior(m)$terms[topic, ]
topwords = head(sort(words, decreasing = T), n=50)
head(topwords)
## war nations peace foreign united states
## 0.02292936 0.02202018 0.01838349 0.01420129 0.01129194 0.01092827
Now we can plot these words:
library(wordcloud)
wordcloud(names(topwords), topwords)
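To inspect all topics at once, you could loop over the topics and draw a word cloud for each in a grid; a rough sketch (the panel layout and scaling are just illustrative choices, and small panels may drop some words):

par(mfrow = c(2, 5))  # 2 x 5 grid, one panel per topic
for (t in 1:10) {
  w = head(sort(posterior(m)$terms[t, ], decreasing = TRUE), 25)
  wordcloud(names(w), w, scale = c(2, 0.5))
}
par(mfrow = c(1, 1))  # reset the plotting layout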
We can also look at the topics per document, to find the top documents per topic:
topic.docs = posterior(m)$topics[, topic]
topic.docs = sort(topic.docs, decreasing=T)
head(topic.docs)
## 1813-Madison.5 1889-Harrison.18 1813-Madison.7 1949-Truman.39
## 0.9608696 0.8741935 0.8714286 0.8714286
## 1909-Taft.21 1949-Truman.25
## 0.8310345 0.8294118
Given the document ids of the top documents, we can look up the text in the `corp` corpus:
topdoc = names(topic.docs)[1]
topdoc_corp = corp[docnames(corp) == topdoc]
texts(topdoc_corp)
## 1813-Madison
## "As the war was just in its origin and necessary and noble in its objects, we can reflect with a proud satisfaction that in carrying it on no principle of justice or honor, no usage of civilized nations, no precept of courtesy or humanity, have been infringed. The war has been waged on our part with scrupulous regard to all these obligations, and in a spirit of liberality which was never surpassed."
Finally, we can see which president preferred which topics:
# Match the document variables to the rows of the dtm
docs = docvars(dfm)[match(rownames(dtm), docnames(dfm)), ]
# Average topic proportions per president, and plot them as a heatmap
tpp = aggregate(posterior(m)$topics, by = docs["President"], mean)
rownames(tpp) = tpp$President
heatmap(as.matrix(tpp[-1]))
As you can see, the topics form a sort of ‘block’ distribution, with more modern presidents and older presidents using quite different topics. So, either the role of presidents changed, or language use changed, or (probably) both.
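Such temporal dynamics can also be inspected directly by aggregating the topic proportions per year rather than per president; a quick sketch, plotting the war/peace topic from above over time (the `tpy` name and plotting choices are just illustrative):

# Average topic proportions per year; the first column of tpy is Year, topic columns follow in order
tpy = aggregate(posterior(m)$topics, by = docs["Year"], mean)
plot(tpy$Year, tpy[[topic + 1]], type = "l",
     xlab = "Year", ylab = paste("Proportion of topic", topic))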
To get a better fit of such temporal dynamics, see the session on structural topic models, which allow you to condition topic proportions and/or contents on metadata covariates such as source or date.
`LDAvis` is a nice interactive visualization of LDA results. It needs the LDA and DTM information in a slightly different format than what’s readily available, but you can use the code below to create that format from the LDA model `m` and the `dtm`. If you don’t have it yet, you’ll have to install the `LDAvis` package, and you might also have to install the `servr` package.
library(LDAvis)
# Remove empty documents (LDAvis cannot handle documents without words)
dtm = dtm[slam::row_sums(dtm) > 0, ]
# Extract the word-topic (phi) and document-topic (theta) matrices from the model
phi = as.matrix(posterior(m)$terms)
theta = as.matrix(posterior(m)$topics)
vocab = colnames(phi)
# Document lengths and overall term frequencies from the dtm
doc.length = slam::row_sums(dtm)
term.freq = slam::col_sums(dtm)[match(vocab, colnames(dtm))]
json = createJSON(phi = phi, theta = theta, vocab = vocab,
                  doc.length = doc.length, term.frequency = term.freq)
serVis(json)
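`serVis` starts a local web server to display the visualization. If you would rather save it as a standalone set of files (for instance to host or share), you can pass an output directory; the directory name below is just an example:

# Write the visualization files to a folder instead of opening a browser
serVis(json, out.dir = "ldavis_output", open.browser = FALSE)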