LexiNet is a project focused on building and evaluating n-gram language models for word prediction and guessing games. This repository includes code for training models, simulating games, and evaluating model performance.
Training is done on the train set and validation on the val set.
Validation Set Results
Total Games: 170671
Games Won: 109689
Accuracy: 64.27 %
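The accuracy reported above is simply the win rate, i.e. games won divided by total games:

```python
total_games, games_won = 170671, 109689
print(f"Accuracy: {games_won / total_games:.2%}")  # -> 64.27%
```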
```
~/Garage/lexinet $ tree -L 2
.
├── README.md
├── data
│   ├── test
│   └── train
├── game_results.csv
├── notebooks
│   └── EDA.ipynb
├── perplexity.md
├── requirements.txt
├── results
│   └── models
└── src
    ├── __init__.py
    ├── data_preparation.py
    ├── documentation.md
    ├── evaluate.py
    ├── game_simulator.py
    ├── player_agent.py
    └── train.py
```
- Purpose: The `train_gold` method trains an N-gram model with additional masking so that it can handle cases where parts of the N-gram are unknown or need to be generalized (see the sketch after this list).
- Padded Word: Each word is padded with the special tokens `'<s>'` and `'</s>'` to mark the start and end of the word; the padding length is n-1 on both sides. This ensures the model learns how words typically begin and end.
- N-gram Generation: The method generates N-grams (sequences of n consecutive items) from the padded word. Each N-gram is split into a prefix (all items except the last) and a suffix (the last item).
- Skipping End Tokens: If the suffix is the end token `'</s>'`, the N-gram is skipped, since the model never needs to predict an end token.
- Masking Technique:
  - The method identifies the positions within the N-gram that are not padding tokens (`'<s>'` or `'</s>'`).
  - It creates masked versions of the N-gram by replacing some of these positions with an underscore (`'_'`), iterating over the different combinations of masked positions.
  - The number of positions masked follows the `cnt - 1` and `n - 2` logic, which guarantees that at least one position remains unmasked in N-grams longer than 2.
- Updating Counts: The counts of these N-grams and their masked variants are accumulated in the `self.ngrams` dictionary, which helps the model generalize when parts of the N-gram are missing or uncertain.
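A minimal sketch of this procedure is shown below. It is not the repository's actual code: it is written as a free function rather than a method, assumes masks are applied only to the prefix (context) positions, and interprets the `cnt - 1` / `n - 2` rule as masking at most n-2 positions.

```python
from collections import defaultdict
from itertools import combinations

def train_gold(words, n):
    """Forward model: count (possibly masked) prefix -> next character.
    The repository stores these counts in self.ngrams; a plain dict is used here."""
    ngrams = defaultdict(lambda: defaultdict(int))
    for word in words:
        # Pad with n-1 start/end tokens so the model learns word boundaries.
        padded = ['<s>'] * (n - 1) + list(word) + ['</s>'] * (n - 1)
        for i in range(len(padded) - n + 1):
            *prefix, suffix = padded[i:i + n]
            if suffix == '</s>':                 # the end token never needs predicting
                continue
            ngrams[tuple(prefix)][suffix] += 1   # fully observed n-gram
            # Prefix positions holding real letters (not padding) can be masked.
            maskable = [j for j, tok in enumerate(prefix)
                        if tok not in ('<s>', '</s>')]
            # Assumption: "at least one position stays unmasked for n > 2"
            # is read as masking at most n-2 positions.
            max_masked = min(len(maskable), max(n - 2, 1))
            for cnt in range(1, max_masked + 1):
                for positions in combinations(maskable, cnt):
                    masked = list(prefix)
                    for j in positions:
                        masked[j] = '_'
                    ngrams[tuple(masked)][suffix] += 1
    return ngrams
```

In this sketch, with n = 3 the word "apple" contributes counts for the fully observed context `('a', 'p') -> 'p'` as well as the masked variants `('_', 'p') -> 'p'` and `('a', '_') -> 'p'`.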
- Purpose: The `train_reverse_gold` method is similar to `train_gold`, but it trains a reverse N-gram model that predicts the preceding character of a word given the succeeding context (see the sketch after this list).
- Padded Word: As in `train_gold`, each word is padded with `'<s>'` and `'</s>'` to handle the start and end of the word.
- N-gram Generation: N-grams are generated as before, but here the prefix is the first item of the N-gram and the suffix is the rest of the sequence.
- Skipping Start Tokens: If the prefix is the start token `'<s>'`, the N-gram is skipped, since the model never needs to predict anything before the start of a word.
- Masking Technique:
  - As in `train_gold`, the positions within the suffix that are not padding tokens are identified.
  - Masked versions of the suffix are generated by replacing some of these positions with an underscore.
  - This helps the model handle cases where parts of the context are unknown.
- Updating Counts: The counts of these reverse N-grams and their masked variants are accumulated in the `self.ngrams_rev` dictionary, allowing the model to predict the preceding letter from the following context.
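Analogously, a sketch of the reverse variant, again as a free function and under the same assumptions about the masking rule:

```python
from collections import defaultdict
from itertools import combinations

def train_reverse_gold(words, n):
    """Reverse model: count (possibly masked) following context -> preceding character.
    The repository stores these counts in self.ngrams_rev; a plain dict is used here."""
    ngrams_rev = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = ['<s>'] * (n - 1) + list(word) + ['</s>'] * (n - 1)
        for i in range(len(padded) - n + 1):
            ngram = padded[i:i + n]
            prefix, suffix = ngram[0], ngram[1:]  # prefix = first item, suffix = the rest
            if prefix == '<s>':                   # nothing precedes the start of a word
                continue
            ngrams_rev[tuple(suffix)][prefix] += 1
            # Suffix positions holding real letters (not padding) can be masked.
            maskable = [j for j, tok in enumerate(suffix)
                        if tok not in ('<s>', '</s>')]
            max_masked = min(len(maskable), max(n - 2, 1))  # same masking rule as train_gold
            for cnt in range(1, max_masked + 1):
                for positions in combinations(maskable, cnt):
                    masked = list(suffix)
                    for j in positions:
                        masked[j] = '_'
                    ngrams_rev[tuple(masked)][prefix] += 1
    return ngrams_rev
```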
Recreates the Hangman challenge.
Max lives: 6.
Find the game simulator documentation here.
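For orientation, a single Hangman round with 6 lives reduces to a loop like the sketch below; `guess_letter` is a hypothetical placeholder for the player agent, not the actual API of `src/player_agent.py` or `src/game_simulator.py`.

```python
def play_hangman(secret, guess_letter, max_lives=6):
    """Play one Hangman round. `guess_letter(pattern, guessed)` is any callable
    that returns the next letter to try, given the revealed pattern so far."""
    pattern = ['_'] * len(secret)
    guessed, lives = set(), max_lives
    while lives > 0 and '_' in pattern:
        letter = guess_letter(''.join(pattern), guessed)
        guessed.add(letter)
        if letter in secret:
            # Reveal every occurrence of the guessed letter.
            pattern = [c if c in guessed else '_' for c in secret]
        else:
            lives -= 1          # a wrong guess costs one of the 6 lives
    return '_' not in pattern   # True if the word was fully revealed (game won)
```

A player agent built on the n-gram counts sketched above would slot in as `guess_letter`.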