NLP-Language-Models

Here, I train some n-gram language models on WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-2.
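For context, below is a minimal preprocessing sketch (not the repository's actual code). It assumes the raw files from the pytorch/examples link above are available locally as train.txt, valid.txt, and test.txt, and that words below an arbitrary count threshold are mapped to <UNK>; the threshold and file paths are my assumptions.

```python
from collections import Counter

START, END, UNK = "<s>", "</s>", "<UNK>"

def load_corpus(path):
    """Read a raw WikiText-2 file into a list of token lists, wrapped in <s> ... </s>."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append([START] + tokens + [END])
    return sentences

def replace_rare_words(sentences, min_count=2):
    """Map tokens seen fewer than min_count times to <UNK> (threshold is an assumption)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else UNK for tok in sent]
            for sent in sentences]

# Hypothetical local paths; adjust to wherever the raw data was downloaaded.
train_corpus = replace_rare_words(load_corpus("wikitext-2/train.txt"))
```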

I implemented 4 types of language models: a unigram model, a smoothed unigram model, a bigram model, and a smoothed bigram model. Each model implements the following methods:

  • generateSentence(self): Return a sentence generated by the language model. It is a list of the form [<s>, w(1), ..., w(n), </s>], where each w(i) is a word in the vocabulary (including <UNK> but excluding <s> and </s>). I assume that <s> starts each sentence (with probability $1$). The following words w(1), ..., w(n), </s> are generated according to the language model's distribution. The number of words n is not fixed; instead, the sentence stops as soon as the stop token </s> is generated.

  • getSentenceLogProbability(self, sentence): Return the logarithm of the probability of sentence, which is again a list of the form [<s>, w(1), ..., w(n), </s>].

  • getCorpusPerplexity(self, testCorpus): Compute the perplexity of testCorpus according to the model, i.e. the inverse probability of the corpus normalized by the number of words. For a corpus $W$ with $N$ words and a bigram model, Jurafsky and Martin tell us to compute perplexity as follows:

$$Perplexity(W) = \Big [ \prod_{i=1}^N \frac{1}{P(w^{(i)}|w^{(i-1)})} \Big ]^{1/N}$$

In order to avoid underflow, I did all of my calculations in log-space. That is, instead of multiplying probabilities, I added the logarithms of the probabilities and exponentiated the result:

$$\prod_{i=1}^N P(w^{(i)}|w^{(i-1)}) = \exp\Big (\sum_{i=1}^N \log P(w^{(i)}|w^{(i-1)}) \Big ) $$
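To make the interface concrete, here is a minimal sketch of how the smoothed bigram variant could look, with perplexity computed entirely in log-space. This is not the repository's actual implementation; in particular, the choice of add-one (Laplace) smoothing and the prob helper are assumptions of mine.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

class SmoothedBigramModel:
    """Sketch of a Laplace-smoothed bigram model (smoothing choice is an assumption)."""

    def __init__(self, corpus):
        # corpus: list of sentences, each a list of tokens wrapped in <s> ... </s>
        self.bigram_counts = defaultdict(Counter)
        self.context_counts = Counter()
        for sentence in corpus:
            for prev, word in zip(sentence, sentence[1:]):
                self.bigram_counts[prev][word] += 1
                self.context_counts[prev] += 1
        # Possible next words: everything in the training data except <s>
        self.vocab = {w for sent in corpus for w in sent} - {START}

    def prob(self, prev, word):
        """Add-one (Laplace) smoothed estimate of P(word | prev)."""
        return ((self.bigram_counts[prev][word] + 1)
                / (self.context_counts[prev] + len(self.vocab)))

    def generateSentence(self):
        """Sample words from P(w | prev) until </s> is generated."""
        sentence, prev = [START], START
        while prev != END:
            words = list(self.vocab)
            weights = [self.prob(prev, w) for w in words]
            prev = random.choices(words, weights=weights)[0]
            sentence.append(prev)
        return sentence

    def getSentenceLogProbability(self, sentence):
        """Sum of log P(w_i | w_(i-1)) over the sentence; <s> is not predicted."""
        return sum(math.log(self.prob(prev, word))
                   for prev, word in zip(sentence, sentence[1:]))

    def getCorpusPerplexity(self, testCorpus):
        """exp(-(1/N) * total log probability), i.e. the formula above in log-space."""
        total_log_prob = sum(self.getSentenceLogProbability(s) for s in testCorpus)
        N = sum(len(s) - 1 for s in testCorpus)  # exclude <s> from the word count
        return math.exp(-total_log_prob / N)
```

The unigram variants follow the same pattern with P(w) in place of P(w | prev), and the unsmoothed models drop the add-one terms.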


See my code for more!
