In the order that they're used:

- train_model.py: ✅
- emission_probs.py: ✅
- transmission_probs.py: ✅
- viterbi.py: ❌
- viterbi1.py (viterbi test file): ❌
- accuracy.py: ✅
This repository details the creation of a part-of-speech (POS) tagger that uses a trigram hidden Markov model (HMM) and the Viterbi algorithm to predict the tags of the words in a sequence. It is trained and evaluated on a real-world text collection, the Brown corpus, which contains approximately 1 million words drawn from 500 text samples across 15 different genres, all published in the United States in 1961. The Brown corpus was compiled by Henry Kučera and W. Nelson Francis at Brown University in Rhode Island. For more information regarding the history of the Brown corpus, click here.
To view a subset of the Brown Corpus and how it is annotated, click here.
To view the original manual and tagset used in the Brown Corpus, click here.
To view the results of the POS Tagger, click here.
For a basic rundown of the repository, click here.
To skip to the section on how to run the POS Tagger yourself, click here.
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also known as "grammatical tagging," is the process of marking up words and punctuation in a text (corpus) as corresponding to particular parts of speech, based on both their definition and their context. Once performed by hand, POS tagging is now done through the use of algorithms which associate discrete terms, as well as "hidden" parts of speech, with a set of descriptive tags. This application merely scratches the surface of computational linguistics. POS-tagging algorithms fall into two distinct categories: rule-based and stochastic. Because applying a rule-based model to predict tags in a sequence is cumbersome and restricted to a linguist's understanding of allowable sentences in the context of language productivity, a stochastic approach is taken instead: POS tags are assigned to the words in a sequence using a trigram hidden Markov model and the Viterbi algorithm.
What are Trigram Hidden Markov Models (HMMs)?
The hidden Markov model, or HMM for short, is a probabilistic sequence model that assigns a label to each unit in a sequence of observations (i.e., the input sentences). The model computes a probability distribution over possible sequences of POS labels (from a training corpus) and then chooses the label sequence that maximizes the probability of generating the observed sequence. The HMM is widely used in natural language processing, since language consists of sequences at many levels, such as sentences, phrases, words, or even characters. The HMM can be enhanced to incorporate not only unobservable parts of speech but also observable components (i.e., the actual order of words in a sequence) through the use of a probability distribution over the set of trigrams in the given corpus. This allows our model to distinguish between homonyms, words that share the same spelling or pronunciation but differ in meaning and part of speech (e.g., "rose" (NN) as in "rose bush" and "rose" (VBD) as the past tense of "rise").
Decoding is the task of determining which sequence of variables is the underlying source of some sequence of observations. Mathematically, we would like to find the most probable sequence of hidden states Q = q1, q2, q3, ..., qN given as input an HMM λ = (A, B) and a sequence of observations O = o1, o2, o3, ..., oN, where A is a transition probability matrix with each element aij representing the probability of moving from a hidden state qi to another state qj such that

$$\sum_{j=1}^{N} a_{ij} = 1 \quad \forall i,$$

and B is a matrix of emission probabilities with each element representing the probability of an observation state oi being generated from a hidden state qi. In POS tagging, each hidden state corresponds to a single tag, and each observation state corresponds to a word in a given sentence. For example, the task of the decoder may be to find the best hidden tag sequence DT NNS VB that maximizes the probability of the observed sequence of words The boys eat.
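To make the A and B components concrete, here is a minimal Python sketch of how the transition and emission tables for the toy example above might be represented. The probabilities are hand-picked purely for illustration; they are not values estimated from the Brown corpus, and the dictionary layout is an assumption rather than the repository's actual data structure.

```python
# Hand-picked probabilities purely for illustration.

# Transition probabilities A, keyed by the tag trigram (q_{i-2}, q_{i-1}, q_i).
transition = {
    ("*", "*", "DT"): 0.6,
    ("*", "DT", "NNS"): 0.7,
    ("DT", "NNS", "VB"): 0.5,
    ("NNS", "VB", "STOP"): 0.4,
}

# Emission probabilities B, keyed by (word, tag).
emission = {
    ("the", "DT"): 0.9,
    ("boys", "NNS"): 0.4,
    ("eat", "VB"): 0.3,
}

# Joint probability of the tagged sentence "The/DT boys/NNS eat/VB" under this toy model.
words = ["the", "boys", "eat"]
tags = ["DT", "NNS", "VB"]
history = ["*", "*"] + tags + ["STOP"]

p = 1.0
for i in range(len(history) - 2):            # every tag trigram, including the STOP transition
    p *= transition[(history[i], history[i + 1], history[i + 2])]
for word, tag in zip(words, tags):           # one emission factor per word
    p *= emission[(word, tag)]
print(p)
```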
The task of decoding is ultimately defined as Eq. 1:

$$\hat{q}_1^n = \underset{q_1^n}{\arg\max}\; P(q_1^n \mid o_1^n) = \underset{q_1^n}{\arg\max}\; \frac{P(o_1^n \mid q_1^n)\, P(q_1^n)}{P(o_1^n)}$$

where the second equality is computed using Bayes' rule. Moreover, the denominator of the second equality in Eq. 1 can be dropped since it does not depend on q. This gives us Eq. 2:

$$\hat{q}_1^n = \underset{q_1^n}{\arg\max}\; P(o_1^n \mid q_1^n)\, P(q_1^n)$$
The trigram HMM tagger makes two assumptions to simplify the computation of the first equality in Eq. 2. The first is that the emission probability of a word depends only on its own tag and is independent of neighboring words and tags, which gives us Eq. 3:

$$P(o_1^n \mid q_1^n) \approx \prod_{i=1}^{n} P(o_i \mid q_i)$$

The second is a Markov assumption that the transition probability of a tag depends only on the previous two tags rather than the entire tag sequence, which gives us Eq. 4:

$$P(q_1^n) \approx \prod_{i=1}^{n+1} P(q_i \mid q_{i-1}, q_{i-2})$$

where q0 = q-1 = * is the special start symbol appended to the beginning of every tag sequence and qn+1 = STOP is the unique stop symbol marked at the end of every tag sequence.
In many cases, such as in the Brown corpus, we have a labeled corpus of sentences paired with the correct POS tag sequences (e.g., The/DT boys/NNS eat/VB), so POS tagging becomes a supervised learning problem. We can easily calculate the maximum likelihood estimate of a transition probability P(qi | qi-1, qi-2) by counting how often the tag qi follows its previous two tags qi-2 and qi-1 and dividing by the number of occurrences of those two tags (Eq. 5):

$$P(q_i \mid q_{i-1}, q_{i-2}) = \frac{C(q_{i-2}, q_{i-1}, q_i)}{C(q_{i-2}, q_{i-1})}$$

Likewise, we compute an emission probability P(oi | qi) using Eq. 6 as follows:

$$P(o_i \mid q_i) = \frac{C(q_i, o_i)}{C(q_i)}$$
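A minimal sketch of how these maximum likelihood estimates could be computed from a list of tagged sentences; the function and variable names here are hypothetical and are not necessarily those used in train_model.py or emission_probs.py:

```python
from collections import defaultdict

def estimate_probabilities(tagged_sentences):
    """Compute MLE trigram transition (Eq. 5) and emission (Eq. 6) probabilities.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    trigram_counts = defaultdict(int)   # C(q_{i-2}, q_{i-1}, q_i)
    bigram_counts = defaultdict(int)    # C(q_{i-2}, q_{i-1})
    tag_counts = defaultdict(int)       # C(q_i)
    word_tag_counts = defaultdict(int)  # C(q_i, o_i)

    for sentence in tagged_sentences:
        tags = ["*", "*"] + [tag for _, tag in sentence] + ["STOP"]
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[(word, tag)] += 1
        for i in range(2, len(tags)):
            trigram_counts[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bigram_counts[(tags[i - 2], tags[i - 1])] += 1

    # Eq. 5: P(q_i | q_{i-1}, q_{i-2}) = C(q_{i-2}, q_{i-1}, q_i) / C(q_{i-2}, q_{i-1})
    transition = {
        (t2, t1, t): count / bigram_counts[(t2, t1)]
        for (t2, t1, t), count in trigram_counts.items()
    }
    # Eq. 6: P(o_i | q_i) = C(q_i, o_i) / C(q_i)
    emission = {
        (word, tag): count / tag_counts[tag]
        for (word, tag), count in word_tag_counts.items()
    }
    return transition, emission

# Toy usage with a single tagged sentence:
transition, emission = estimate_probabilities([[("the", "DT"), ("boys", "NNS"), ("eat", "VB")]])
```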
Recall that the decoding task is to find:

$$\hat{q}_1^n = \underset{q_1^n}{\arg\max}\; P(o_1^n, q_1^n)$$

where the argmax is taken over all sequences q1, ..., qn such that qi ∈ S for i = 1, ..., n, and S is the set of all tags. We further assume that P(o1n, q1n) takes the form:

$$P(o_1^n, q_1^n) = \prod_{i=1}^{n+1} P(q_i \mid q_{i-1}, q_{i-2}) \prod_{i=1}^{n} P(o_i \mid q_i)$$

assuming that q0 = q-1 = * and qn+1 = STOP. Because the argmax is taken over all possible tag sequences, a brute-force search that computes the likelihood of the observation sequence for every possible hidden state sequence is very inefficient: its cost grows exponentially with the sentence length, on the order of O(|S|^n). Instead, the Viterbi algorithm, a kind of dynamic programming algorithm, is used to make the search computationally efficient, running in O(n|S|^3) time for a trigram HMM.
Define n to be the length of the input sentence and Sk, for k = -1, 0, ..., n, to be the set of possible tags at position k, such that S-1 = S0 = {*} and Sk = S for k ∈ {1, ..., n}. Define

$$r(q_{-1}, q_0, q_1, \ldots, q_k) = \prod_{i=1}^{k} P(q_i \mid q_{i-1}, q_{i-2}) \prod_{i=1}^{k} P(o_i \mid q_i)$$

and a dynamic programming table, or a cell, to be:

$$\pi(k, u, v) = \max_{\langle q_{-1}, q_0, q_1, \ldots, q_k \rangle :\, q_{k-1} = u,\; q_k = v} r(q_{-1}, q_0, q_1, \ldots, q_k)$$

which is the maximum probability of a tag sequence ending in tags u, v at position k. The Viterbi algorithm fills each cell recursively by taking the most probable of the extensions of the paths that lead to the current cell at time k, given that we have already computed the probability of being in every state at time k-1. The algorithm essentially works by initializing the first cell as

$$\pi(0, *, *) = 1$$

and, for any k ∈ {1, ..., n}, for any u ∈ Sk-1 and v ∈ Sk, recursively computing

$$\pi(k, u, v) = \max_{w \in S_{k-2}} \Big( \pi(k-1, w, u) \cdot P(v \mid u, w) \cdot P(o_k \mid v) \Big)$$

and ultimately returning

$$\max_{u \in S_{n-1},\, v \in S_n} \Big( \pi(n, u, v) \cdot P(\mathrm{STOP} \mid v, u) \Big)$$
The last component of the Viterbi algorithm is backpointers. The goal of the decoder is to produce not only the probability of the most probable tag sequence but also the resulting tag sequence itself. The best state sequence is computed by keeping track of the path of hidden states that led to each state and then backtracking along the best path in reverse, from the end to the start.
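The recursion, termination, and backpointer steps above can be sketched as follows. This is a simplified illustration, not a copy of viterbi.py: it assumes the transition and emission dictionaries are keyed as in the earlier count sketch ((q_{i-2}, q_{i-1}, q_i) and (word, tag), respectively), that probabilities are multiplied directly rather than summed in log space, and that unknown words have already been mapped to a known symbol (e.g., the _RARE_ handling described later) so at least one tag sequence has nonzero probability.

```python
def viterbi(words, tag_set, transition, emission):
    """Return the most probable tag sequence for `words` under a trigram HMM.

    transition: dict mapping (q_{i-2}, q_{i-1}, q_i) -> probability
    emission:   dict mapping (word, tag) -> probability
    """
    n = len(words)

    def tags_at(k):
        # Positions -1 and 0 can only hold the start symbol "*".
        return {"*"} if k <= 0 else tag_set

    # pi[(k, u, v)]: max probability of a tag sequence ending in (u, v) at position k.
    pi = {(0, "*", "*"): 1.0}
    backpointer = {}

    for k in range(1, n + 1):
        word = words[k - 1]
        for u in tags_at(k - 1):
            for v in tags_at(k):
                best_prob, best_w = 0.0, None
                for w in tags_at(k - 2):
                    prob = (pi.get((k - 1, w, u), 0.0)
                            * transition.get((w, u, v), 0.0)
                            * emission.get((word, v), 0.0))
                    if prob > best_prob:
                        best_prob, best_w = prob, w
                pi[(k, u, v)] = best_prob
                backpointer[(k, u, v)] = best_w

    # Termination: include the transition into the STOP symbol.
    best_prob, best_pair = 0.0, None
    for u in tags_at(n - 1):
        for v in tags_at(n):
            prob = pi.get((n, u, v), 0.0) * transition.get((u, v, "STOP"), 0.0)
            if prob > best_prob:
                best_prob, best_pair = prob, (u, v)

    # Follow the backpointers in reverse to recover the full tag sequence.
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = best_pair
    for k in range(n - 2, 0, -1):
        tags[k] = backpointer[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]
```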
Though utilizing a hidden Markov model in conjunction with the Viterbi algorithm can produce tagging results with approximately 50-70% accuracy on a test corpus, this is still well below the human agreement upper bound of 97% for POS tagging. To approach a higher accuracy rate, two additional features are utilized to enhance my POS tagging model. These features are detailed below:
Normally, transition probabilities are calculated using Eq. 5 above. However, these counts may return a value of zero for tag trigrams never seen in the training corpus, erroneously predicting that such a tag sequence can never occur (and, if the denominator count is itself zero, the estimate is not even defined). A common, effective solution is to estimate a trigram transition probability by aggregating weaker, yet more robust, estimators such as bigram and unigram probabilities. For instance, assume we have never seen the tag sequence DT NNS VB in a training corpus, so the trigram transition probability P(VB | DT, NNS) = 0, but it may still be possible to compute the bigram transition probability P(VB | NNS) as well as the unigram probability P(VB).
More generally, the maximum likelihood estimates of the following transition probabilities can be computed using counts from a training corpus (subsequently setting an estimate to zero if its denominator happens to be zero):

$$P(q_i \mid q_{i-1}, q_{i-2}) = \frac{C(q_{i-2}, q_{i-1}, q_i)}{C(q_{i-2}, q_{i-1})}, \qquad P(q_i \mid q_{i-1}) = \frac{C(q_{i-1}, q_i)}{C(q_{i-1})}, \qquad P(q_i) = \frac{C(q_i)}{N}$$

where N is the total number of tokens, not unique words, in the training corpus. The final trigram probability estimate is calculated as a weighted sum of the trigram, bigram, and unigram probability estimates above:

$$\tilde{P}(q_i \mid q_{i-1}, q_{i-2}) = \lambda_1\, P(q_i \mid q_{i-1}, q_{i-2}) + \lambda_2\, P(q_i \mid q_{i-1}) + \lambda_3\, P(q_i)$$
under the constraint λ1 + λ2 + λ3 = 1. These λ values are generally set using an algorithm called deleted interpolation, which is conceptually similar to leave-one-out cross-validation (LOOCV) in that each trigram is successively deleted from the training corpus and the λs are chosen to maximize the likelihood of the rest of the corpus. The deletion mechanism thereby helps set the λs so as not to overfit the training corpus and aids generalization. The λ values are experimentally determined in my repository.
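The repository determines its λ values experimentally; for reference, one common formulation of deleted interpolation for tag trigrams (in the style used by the TnT tagger) can be sketched as follows, assuming complete dictionaries of tag trigram, bigram, and unigram counts are available. The function names are hypothetical.

```python
def deleted_interpolation(trigram_counts, bigram_counts, unigram_counts, total_tokens):
    """Estimate (λ1, λ2, λ3) for the interpolated trigram probability.

    trigram_counts: dict (q_{i-2}, q_{i-1}, q_i) -> count
    bigram_counts:  dict (q_{i-1}, q_i) -> count
    unigram_counts: dict q_i -> count
    total_tokens:   total number of tag tokens N in the training corpus
    """
    lam_tri = lam_bi = lam_uni = 0.0
    for (t2, t1, t), count in trigram_counts.items():
        # MLE of each estimator with the current trigram notionally deleted
        # (hence the "- 1" terms); an estimator is 0 if its denominator vanishes.
        tri_denom = bigram_counts.get((t2, t1), 0) - 1
        tri = (count - 1) / tri_denom if tri_denom > 0 else 0.0
        bi_denom = unigram_counts.get(t1, 0) - 1
        bi = (bigram_counts.get((t1, t), 0) - 1) / bi_denom if bi_denom > 0 else 0.0
        uni = max(unigram_counts.get(t, 0) - 1, 0) / (total_tokens - 1)
        # Credit the whole trigram count to whichever estimator predicts it best.
        best = max(tri, bi, uni)
        if best == tri:
            lam_tri += count
        elif best == bi:
            lam_bi += count
        else:
            lam_uni += count
    total = lam_tri + lam_bi + lam_uni
    return lam_tri / total, lam_bi / total, lam_uni / total  # (λ1, λ2, λ3), summing to 1


def interpolated_transition(t2, t1, t, lams, trigram_p, bigram_p, unigram_p):
    """Weighted sum from the equation above: λ1·P(t|t1,t2) + λ2·P(t|t1) + λ3·P(t)."""
    lam1, lam2, lam3 = lams
    return (lam1 * trigram_p.get((t2, t1, t), 0.0)
            + lam2 * bigram_p.get((t1, t), 0.0)
            + lam3 * unigram_p.get(t, 0.0))
```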
In linguistics, Hockett's Design Features are a set of features that characterize human language and set it apart from animal communication. Of these features, one of the most important is "productivity," which refers to the idea that language-users can produce and understand an unlimited amount of novel utterances. Also related to productivity is the concept of grammatical patterning, which facilitates the use and comprehension of language. Language is not stagnant, but is constantly changing. Thus, in all human languages, new words and phrases are constantly being coined and added to a dictionary. Updating a dictionary of vocabularies is, however, too cumbersome and takes an unreasonable amount of effort. Because of this, it is important to have a good model for dealing with unknown words found in a test corpus to achieve a high accuracy with a trigram HMM POS tagger.
RARE is a simple technique in which every word (token) whose frequency in the training set is less than or equal to 5 is replaced with the special symbol _RARE_. Without this process, words such as person names and places that do not appear in the training set but are seen in the test set would have undefined maximum likelihood estimates of P(oi | qi) (i.e., the emission probabilities).
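A minimal sketch of this replacement step; the threshold of 5 and the _RARE_ symbol come from the description above, while the function name itself is hypothetical:

```python
from collections import Counter

RARE_SYMBOL = "_RARE_"
RARE_THRESHOLD = 5

def replace_rare_words(sentences):
    """Replace every token occurring at most RARE_THRESHOLD times with _RARE_.

    sentences: list of sentences, each a list of word tokens from the training set.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] > RARE_THRESHOLD else RARE_SYMBOL for word in sentence]
        for sentence in sentences
    ]

# Words in the test set that were never seen in training are likewise mapped to
# _RARE_ before their emission probabilities are looked up.
```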
Morphosyntactic subcategorization is a modification of RARE that serves as a better alternative: every word token whose frequency in the training set is less than or equal to 5 is instead replaced by a subcategory symbol chosen from a set of morphological characteristics (i.e., affixation). For example, we know that words with suffixes like -ion, -ment, -ence, or -ness, just to name a few, will be nouns, and that adjectives may have un- and in- as prefixes or -ious and -ble as suffixes.
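A sketch of how such subcategorization might look, using only the affixes mentioned above; the pseudo-word class names here are hypothetical placeholders, and emission_probs.py may use a different or richer set of categories:

```python
RARE_SYMBOL = "_RARE_"
NOUN_SUFFIXES = ("ion", "ment", "ence", "ness")
ADJECTIVE_SUFFIXES = ("ious", "ble")
ADJECTIVE_PREFIXES = ("un", "in")

def subcategorize(word):
    """Map a low-frequency word to a pseudo-word class based on its affixes."""
    lower = word.lower()
    if lower.endswith(NOUN_SUFFIXES):
        return "_RARE_NOUN_SUFFIX_"
    if lower.endswith(ADJECTIVE_SUFFIXES) or lower.startswith(ADJECTIVE_PREFIXES):
        return "_RARE_ADJECTIVE_AFFIX_"
    return RARE_SYMBOL  # fall back to plain _RARE_ when no pattern matches
```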
The Trigram HMM POS tagger is trained on a subset of the Brown corpus, which contains nearly 27,500 tagged sentences in total. The training set contains approximately 80% of the entire corpus, with the remaining 20% used as the test set. The accuracy of the tagger is measured by comparing the predicted tags for the test set with the true tags already provided in the corpus; the percentage of tags that the model gets correct is defined as the accuracy.
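The accuracy measure described above amounts to a straightforward token-level comparison, which might be sketched as follows (the function name is hypothetical and not necessarily the one used in accuracy.py):

```python
def tagging_accuracy(predicted_sequences, true_sequences):
    """Percentage of test-set tags predicted correctly (token-level accuracy)."""
    correct = 0
    total = 0
    for predicted, true in zip(predicted_sequences, true_sequences):
        for predicted_tag, true_tag in zip(predicted, true):
            correct += int(predicted_tag == true_tag)
            total += 1
    return 100.0 * correct / total
```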
Using a combination of deleted interpolation with morphosyntactic subcategorization, my POS tagger achieves an overall accuracy of: IN PROGRESS
The files are briefly explained below, in the order in which they were created and used:

- train_model.py is used to: get two separate lists containing cleaned sentences and POS sequences, respectively, from the training corpus; create dictionaries of POS unigrams, bigrams, and trigrams; and build a sample file of what these dictionaries contain at first glance, for the sole purpose of visualization.
- emission_probs.py is used to apply morphosyntactic subcategorization to the sentences list in order to calculate the emission probabilities for each word/POS pair, as well as to get a list of "known words" from the sentences list.
- transmission_probs.py is used to experimentally calculate the λ values for deleted interpolation, find the interpolated transmission probabilities for each n-gram, and ultimately deduce the overall accuracy of the POS tagger using the necessary components retrieved from train_model.py and emission_probs.py.
- viterbi.py is used to apply the Viterbi algorithm to retrieve the most probable sequence of POS tags for each sentence in the test set.
- accuracy.py is used to find the accuracy of the Viterbi algorithm by comparing the calculated POS sequences for each test sentence to the actual POS sequences for each test sentence.
Each file contains an in-depth description of how it works and the purpose of every function within it. While looking at each file in order, you will also notice how, and in what order, I calculate all the necessary pieces of data (found in each of my main drivers) to ultimately create my POS tagger. Lastly, you will notice that I modularize all functions in the event that they need to be imported and reused in different files.
THIS SECTION WILL CONTINUE TO BE MODIFIED UNTIL REPOSITORY IS FINALIZED
SECTION INFORMATION WILL BE ADDED WHEN REPOSITORY IS FINALIZED
TO BE COMPILED