Hidden Markov Model (HMM)
This repository implements a Hidden Markov Model (HMM) for Part-of-Speech (POS) tagging of Assamese-English code-mixed text.
POS tagging is the process of identifying and labelling the grammatical role of each word in a text. It supports applications such as machine translation and sentiment analysis. Although individual languages often have their own POS tag sets, this model uses a custom tag set. The table below defines the custom POS tags used in this model:
About Hidden Markov Model (HMM)
An HMM is a statistical model which assumes that a system moves through a sequence of hidden states according to transition probabilities, and that each state produces an observable output. It suits sequential data, which makes it a natural fit for POS tagging.
The key components involved in its working are:
- States: The hidden states represent the POS tags. These are the underlying variables that generate the observed data but are not directly observable.
- Observations: These are the observed tokens in a sentence or the variables that can be measured and observed.
- Transition probabilities: These describe the probability of moving from one hidden state to another, i.e. from one POS tag to the next.
- Emission probabilities: These describe the probability of observing an output given a hidden state, i.e. the probability of a word given a specific POS tag.
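To make the four components concrete, here is a minimal sketch of how they combine to score one tagged sentence. The tags (`NOUN`, `VERB`), the words, and all the numbers are hypothetical toy values chosen for illustration; they are not the custom tag set or the probabilities used in this repository.

```python
# Toy HMM parameters (hypothetical tags, words and numbers, for illustration only).
start = {"NOUN": 0.6, "VERB": 0.4}  # probability of each tag at sentence start
transition = {                      # P(next tag | current tag)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emission = {                        # P(word | tag)
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"dogs": 0.1, "bark": 0.9},
}


def joint_probability(words, tags):
    """Probability of seeing `words` together with the hidden sequence `tags`."""
    # First position: start probability times emission probability.
    prob = start[tags[0]] * emission[tags[0]].get(words[0], 0.0)
    # Later positions: one transition and one emission per word.
    for i in range(1, len(words)):
        prob *= transition[tags[i - 1]][tags[i]]
        prob *= emission[tags[i]].get(words[i], 0.0)
    return prob


p_noun_verb = joint_probability(["dogs", "bark"], ["NOUN", "VERB"])
p_verb_noun = joint_probability(["dogs", "bark"], ["VERB", "NOUN"])
print(p_noun_verb > p_verb_noun)  # the NOUN-VERB tagging scores higher
```

Tagging a sentence then amounts to finding the tag sequence with the highest such score.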
Algorithm:
- The model imports the libraries and reads the dataset.
- The transition matrix is initialised to store the transition probabilities and the emission matrix is initialised to store the emission probabilities.
- Count the tag-to-tag transitions and the tag-to-word emissions in the training data and store the counts in the transition and emission matrices.
- Convert the counts to probabilities:
  - For each tag in the transition matrix, normalise the counts by dividing each count by the total count of transitions from that tag, resulting in a probability distribution.
  - For each tag in the emission matrix, normalise the counts by dividing each count by the total count of emissions for that tag.
- Use the Viterbi algorithm to predict the POS tags: set up the starting probabilities, work through the sentence word by word, and finally trace back the best path to recover the most likely sequence of tags.
- For words not in the emission matrix (words not seen in training), assign a small fixed probability. This keeps unknown words from causing lookup errors or reducing every candidate path to zero probability.
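The steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function names `train` and `viterbi`, the `<START>` marker, the value `1e-6` for the small fixed probability, the lowercasing of words, and the toy training sentences are all hypothetical. Applying the same small probability to unseen transitions is also an assumption of this sketch.

```python
from collections import defaultdict

UNKNOWN_PROB = 1e-6  # assumed value for the "small fixed probability"
START = "<START>"    # hypothetical marker for the beginning of a sentence


def train(tagged_sentences):
    """Count transitions and emissions, then normalise each row to sum to 1."""
    transition = defaultdict(lambda: defaultdict(float))
    emission = defaultdict(lambda: defaultdict(float))
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transition[prev][tag] += 1
            emission[tag][word.lower()] += 1
            prev = tag
    for table in (transition, emission):
        for row in table.values():
            total = sum(row.values())
            for key in row:
                row[key] /= total
    # Return plain dicts so later lookups cannot create entries by accident.
    return ({k: dict(v) for k, v in transition.items()},
            {k: dict(v) for k, v in emission.items()})


def viterbi(words, transition, emission, unknown_prob=UNKNOWN_PROB):
    """Return the most likely tag sequence for `words`."""
    if not words:
        return []
    tags = list(emission)

    def trans(prev, tag):
        return transition.get(prev, {}).get(tag, unknown_prob)

    def emit(tag, word):
        return emission[tag].get(word.lower(), unknown_prob)

    # best[i][tag]: probability of the best path that ends in `tag` at word i.
    best = [{tag: trans(START, tag) * emit(tag, words[0]) for tag in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * trans(p, tag))
            best[i][tag] = best[i - 1][prev] * trans(prev, tag) * emit(tag, words[i])
            back[i][tag] = prev
    # Back-trace from the best final tag to recover the whole path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy training data with hypothetical tags, for illustration only.
training = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("sleep", "VERB")],
    [("dogs", "NOUN"), ("sleep", "VERB")],
]
transition, emission = train(training)
print(viterbi(["cats", "bark"], transition, emission))
print(viterbi(["birds", "sleep"], transition, emission))  # "birds" is unseen
```

The sketch multiplies raw probabilities to stay close to the description above. For long sentences the products can underflow, so summing log-probabilities is a common alternative.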
I used Google Colab for this model. To run it there:
- Create a new notebook (or file) on Google Colab.
- Paste the code.
- Upload your CSV dataset file to Google Colab.
- Update the "path for the CSV" part of the code to match your CSV file name and file path.
- Run the code.
- The output will be displayed and saved as a different CSV file.
You can also use VS Code or any other platform, since this is plain Python code:
- In this case, make sure the necessary libraries are installed and the dataset is loaded correctly.
- Run the program for the output.
If you need help or have any questions, feel free to reach out to me in the comments or via my socials:
- Discord: jessicasaikia
- Instagram: jessicasaikiaa
- LinkedIn: jessicasaikia (www.linkedin.com/in/jessicasaikia-787a771b2)
Additionally, you can find the custom dictionaries that I have used in this project and the dataset in their respective repositories on my profile. Have fun coding and good luck! :D