Here, I train some n-gram language models on WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-2.
I implemented 4 types of language models: a unigram model, a smoothed unigram model, a bigram model, and a smoothed bigram model.
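The smoothed variants differ from the plain models only in how word probabilities are estimated. As a rough sketch of the idea for the unigram case (assuming add-one/Laplace smoothing purely for illustration; the actual smoothing scheme may differ), every word in the vocabulary receives a nonzero probability:

```python
from collections import Counter

def smoothedUnigramProb(word, tokenCounts: Counter, vocabSize: int) -> float:
    # Add-one (Laplace) smoothing, an assumption for this sketch: each count
    # is inflated by 1, so unseen words no longer get probability 0.
    total = sum(tokenCounts.values())
    return (tokenCounts.get(word, 0) + 1) / (total + vocabSize)
```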
Each model implements the following methods:

- `generateSentence(self)`: Return a sentence generated by the language model. It should be a list of the form `[<s>, w(1), ..., w(n), </s>]`, where each `w(i)` is a word in the vocabulary (including `<UNK>` but excluding `<s>` and `</s>`). I assume that `<s>` starts each sentence (with probability $1$). The following words `w(1), ..., w(n), </s>` are generated according to the language model's distribution. The number of words $n$ is not fixed; instead, I stop the sentence as soon as I generate the stop token `</s>`. (A sketch of this method and the next appears after this list.)
- `getSentenceLogProbability(self, sentence)`: Return the logarithm of the probability of `sentence`, which is again a list of the form `[<s>, w(1), ..., w(n), </s>]`.
- `getCorpusPerplexity(self, testCorpus)`: Compute the perplexity (the inverse probability, normalized by the number of words) of `testCorpus` according to the model. For a corpus $W$ with $N$ words and a bigram model, Jurafsky and Martin tell us to compute perplexity as follows:

  $$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
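To make the first two methods concrete, here is a minimal sketch of an unsmoothed bigram variant. The class name `BigramModel`, the internal `counts` dictionary, the `prob` helper, and the toy corpus are assumptions made for this illustration, not the exact code in this repo; the smoothed models differ only in reserving probability mass for unseen events.

```python
import math
import random
from collections import defaultdict

class BigramModel:
    """Minimal unsmoothed bigram sketch; method names mirror the interface above."""

    def __init__(self, corpus):
        # corpus: a list of sentences, each already wrapped as ['<s>', ..., '</s>']
        self.counts = defaultdict(lambda: defaultdict(int))
        for sentence in corpus:
            for prev, word in zip(sentence, sentence[1:]):
                self.counts[prev][word] += 1

    def prob(self, prev, word):
        # Maximum-likelihood estimate of P(word | prev); unseen bigrams get 0,
        # which is exactly the problem the smoothed variants address.
        total = sum(self.counts[prev].values())
        return self.counts[prev][word] / total if total else 0.0

    def generateSentence(self):
        # Start at <s> and sample each next word from P(. | previous word)
        # until the stop token </s> is drawn.
        sentence = ['<s>']
        while sentence[-1] != '</s>':
            words, weights = zip(*self.counts[sentence[-1]].items())
            sentence.append(random.choices(words, weights=weights)[0])
        return sentence

    def getSentenceLogProbability(self, sentence):
        # Sum log P(w(i) | w(i-1)) over the sentence instead of multiplying
        # the raw probabilities.
        return sum(math.log(self.prob(prev, word))
                   for prev, word in zip(sentence, sentence[1:]))

# Toy usage (a stand-in for the real WikiText-2 preprocessing):
toy = [['<s>', 'the', 'cat', 'sat', '</s>'],
       ['<s>', 'the', 'dog', 'sat', '</s>']]
model = BigramModel(toy)
print(model.generateSentence())
print(model.getSentenceLogProbability(['<s>', 'the', 'dog', 'sat', '</s>']))
```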
In order to avoid underflow, I did all of my calculations in log-space. That is, instead of multiplying probabilities, I added the logarithms of the probabilities and exponentiated the result:

$$PP(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)$$
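As a sketch of that log-space computation (written here as a free function over any model exposing `getSentenceLogProbability`; in my code it is a method, and the exact convention for counting $N$ is an assumption):

```python
import math

def getCorpusPerplexity(model, testCorpus):
    # exp of the negative average log-probability: equivalent to the N-th-root
    # formula above, but computed without multiplying tiny probabilities.
    logProbSum = 0.0
    numWords = 0
    for sentence in testCorpus:
        logProbSum += model.getSentenceLogProbability(sentence)
        numWords += len(sentence) - 1  # count w(1), ..., w(n), </s>; skip <s>
    return math.exp(-logProbSum / numWords)
```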
See my code for more!