One proposed measure of an AI's intelligence is its ability to compress text without losing information. The idea rests on predictability: the more predictable a model judges a text to be, the more words can be omitted and later guessed back.
This notion spurred the creation of the Hutter Prize, a competition focused on compressing text to its utmost limit. Contestants aim to compress a given text, such as a 1GB file extracted from a Wikipedia snapshot, with the current record achieving compression down to about 115MB.
One of the challenges lies in the requirement that the decompression program must be included as part of the compressed file. This necessitates an efficient program, contributing to the overall challenge of reducing file size.
This project mimics this concept on a smaller scale, utilizing neural language models to predict and guess missing words for text compression.
A train/dev/test split corpus of text from Wikipedia was provided, consisting of single sentences. Each sentence was on a separate line, and the tokens were already tokenised with space separation, ensuring that splitting by space yielded the tokens. Additionally, the text had already been lowercased.
The main objective was to compress the text losslessly, so that it could always be decompressed back to the original string.
No further pre-processing of the text (such as stemming) was allowed, to prevent potentially unrecoverable information loss. The aim was to compress the test set, treated as a single large string, without applying any further processing before the compression/decompression algorithms.
- File Usage:
- `train.txt`: used for training the model.
- `dev.txt`: used to validate the model.
- `test.txt`: reserved for testing the model.
- Data Handling:
- Loaded `train.txt` and `dev.txt` into separate lists, tokenising each sentence by splitting on spaces.
- The test set wasn't loaded at this stage, per the instructions.
- Vocabulary Criteria:
- Extracted tokens occurring at least three times in the train set.
- Included edge token, unknown token, and pad token in the vocabulary.
- Results:
- Vocabulary size: 7874.
- The most frequent token in the train set was 'the'.
- Language Model Preparation:
- Processed the loaded token sequences for the train and dev sets using the created vocabulary (a sketch of this preparation follows this list).
- Incorporated edge tokens, unknown tokens, and pad tokens as suitable.
- This step wasn't performed for the test set.
- Test Set Handling:
- The test set text file was loaded as a single string and stored in a variable.
- This set wasn't subjected to further processing for language model suitability.
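The following is a minimal sketch of how the vocabulary construction and token-index preparation described above could look. The special-token strings (`<s>`, `<unk>`, `<pad>`), helper names, and the ordering of the vocabulary are illustrative assumptions rather than the exact identifiers used in the project.

```python
from collections import Counter

EDGE, UNK, PAD = "<s>", "<unk>", "<pad>"  # assumed names for the special tokens

def load_sentences(path):
    # Each line is one pre-tokenised, lowercased sentence; splitting on spaces
    # therefore recovers the tokens exactly.
    with open(path, encoding="utf-8") as f:
        return [line.strip().split(" ") for line in f if line.strip()]

train_sents = load_sentences("train.txt")
dev_sents = load_sentences("dev.txt")

# Vocabulary: tokens occurring at least three times in the train set,
# plus the edge, unknown, and pad tokens.
counts = Counter(tok for sent in train_sents for tok in sent)
vocab = [PAD, EDGE, UNK] + sorted(tok for tok, c in counts.items() if c >= 3)
token_to_index = {tok: i for i, tok in enumerate(vocab)}
index_to_token = vocab  # reverse mapping from index back to token string
edge_index, unk_index, pad_index = (token_to_index[t] for t in (EDGE, UNK, PAD))

def to_indexes(sent):
    # Prefix the edge token and map out-of-vocabulary tokens to the unknown token.
    return [edge_index] + [token_to_index.get(tok, unk_index) for tok in sent]

train_indexes = [to_indexes(s) for s in train_sents]
dev_indexes = [to_indexes(s) for s in dev_sents]
```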
This phase required the establishment of an evaluation function for language models and the creation of a mock model for testing purposes prior to the development of the actual language model.
The language model function was expected to adhere to the following signature:
- Input Parameter: `x_indexes`
  - A tensor providing the model's input token indexes for a batch of sentences, starting with the edge token.
  - Type: `int64`, Shape: (batch size, time steps)
- Returns:
  - A tensor of logits predicting the possible next token after each token in `x_indexes`.
  - Type: `float32`, Shape: (batch size, time steps, vocab size)
- Objective: Creating a mock model designed to mimic a language model's behavior based on straightforward rules.
- Usage: This mock model was utilized to assess the evaluation, compression, and decompression functions before the development of the actual language model.
Utilizing this mock model enabled a thorough validation of the evaluation function's functionality before the actual language model was constructed.
A `MockModel` module was devised to predict the subsequent token based on the following rules:
- If the actual previous token (not the predicted one) was 'the', the model predicted that the current token would be 'dog'.
- Otherwise, the model predicted that the current token would be 'the'.
This mock model operates by assigning logits to tokens, giving the predicted token a logit of 2 and all other tokens a logit of 0. The class name associated with this model is `MockModel`.
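A minimal PyTorch sketch of how such a mock model could be written. The class name `MockModel` and the input/output contract come from the report; the constructor arguments (vocabulary size and the indexes of 'the' and 'dog') are assumed for illustration.

```python
import torch
import torch.nn as nn

class MockModel(nn.Module):
    """Rule-based stand-in: if the previous actual token is 'the', predict 'dog';
    otherwise predict 'the'. The predicted token gets a logit of 2, all others 0."""

    def __init__(self, vocab_size, the_index, dog_index):
        super().__init__()
        self.vocab_size = vocab_size
        self.the_index = the_index
        self.dog_index = dog_index

    def forward(self, x_indexes):
        # x_indexes: int64 tensor of shape (batch size, time steps).
        batch_size, time_steps = x_indexes.shape
        logits = torch.zeros(batch_size, time_steps, self.vocab_size)
        is_the = (x_indexes == self.the_index).float()
        logits[..., self.dog_index] = is_the * 2.0          # after 'the', predict 'dog'
        logits[..., self.the_index] = (1.0 - is_the) * 2.0  # otherwise predict 'the'
        return logits  # float32, shape (batch size, time steps, vocab size)
```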
Logits vs. Probabilities:
- Logits: These represent the raw predictions generated by a classification model before normalization.
- Probabilities: This denotes the normalized output of a classification model after being processed by a softmax function.
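For example, a single vector of logits is turned into probabilities by the softmax function:

```python
import torch

logits = torch.tensor([2.0, 0.0, 0.0])  # raw, unnormalised scores
probs = torch.softmax(logits, dim=-1)   # normalised so the values sum to 1
print(probs)                            # approx. tensor([0.7870, 0.1065, 0.1065])
```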
A function was developed to gauge the perplexity of a language model on a dev set. This function took a model and a dataset of token indexes, returning the perplexity across the entire dataset.
Important Considerations:
- Perplexity computation involved incorporating the probability of the edge token at the end of each sentence.
- Pad tokens were disregarded in the computation.
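A minimal sketch of how such an evaluation function could be implemented, processing one sentence at a time; the helper names reuse the assumptions from the preparation sketch above, and the exact batching scheme used in the project may differ.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, index_sequences, edge_index, pad_index):
    """Perplexity of `model` over a list of token-index sequences
    (each sequence starts with the edge token)."""
    total_log_prob, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for seq in index_sequences:
            x = torch.tensor([seq], dtype=torch.int64)
            # Targets are the inputs shifted one step left, with the edge token
            # appended so its end-of-sentence probability is included.
            y = torch.tensor([seq[1:] + [edge_index]], dtype=torch.int64)
            logits = model(x)                                  # (1, time, vocab)
            log_probs = F.log_softmax(logits, dim=-1)
            token_log_probs = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
            mask = y != pad_index                              # pad tokens are ignored
            total_log_prob += token_log_probs[mask].sum().item()
            total_tokens += int(mask.sum())
    return math.exp(-total_log_prob / total_tokens)
```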
Evaluation of Mock Model:
- The function was used to compute the perplexity of the mock model on the dev set, resulting in a value of 7062.2.
This section introduced the implementation of the code responsible for the compression and decompression of text.
The compression algorithm operated in the following steps:
- Began with a string of text, `text`, and a language model, `model`.
- Extracted a list of tokens from `text`, known as `tokens`, and a corresponding list of token indexes, known as `indexes`.
- Applied the `model` to `indexes` to generate `predicted`, a list of the model's predicted next tokens.
- Replaced tokens in `tokens` with the letter 'X' wherever they matched the corresponding token in `predicted`, i.e. wherever the model predicted them correctly; unpredicted tokens remained unchanged. Each 'X' denoted that a token should be predicted at that position, potentially shortening the text whenever 'X' was shorter than the token it replaced.
- Returned the modified `tokens` as a space-separated string.
The decompression algorithm operated as follows:
- Utilized a string of compressed text named `text` and a language model `model`.
- Extracted a list of tokens from `text`, named `tokens`.
- Iterated through `tokens` from the start and halted at the first occurrence of 'X'.
- Converted all tokens preceding the 'X' to token indexes, termed `indexes`.
- Utilized `model` to predict the most probable token at the end of `indexes`.
- Replaced the 'X' in `tokens` with this predicted token.
- Repeated this process for every 'X' found.
- Once all 'X's had been replaced in `tokens`, returned the modified `tokens` as a space-separated string.
The compression function was designed to handle input text consisting of sentences separated by new lines and space-separated tokens. Its objective was to return a single string with each line in the input text being compressed. It was crucial to ensure that all out-of-vocabulary tokens remained unchanged in the output, explicitly preventing any unknown tokens in the result.
- The function didn't need to strictly adhere to the previously described algorithm, allowing flexibility in variable names and operations.
- Special attention was given to handling out-of-vocabulary tokens, replacing them with the unknown token when creating token indexes but ensuring they weren't present in the compressed output.
- The most probable token index for all token positions at once was obtained by applying `.argmax(1)` to the logits.
- Predictions were compared against the string tokens of the uncompressed sentence, preventing the unknown token from ever being counted as a correct prediction.
The compression of the sentence:
the dog bit the cat sensually .
Resulted in the compressed form:
X X bit X cat sensually .
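A minimal sketch of a compression function along these lines, reusing the assumed helpers (`token_to_index`, `index_to_token`, `edge_index`, `unk_index`) from the earlier sketches; variable names are illustrative.

```python
import torch

def compress(text, model, token_to_index, index_to_token, edge_index, unk_index):
    """Replace every correctly predicted token with 'X', line by line."""
    compressed_lines = []
    for line in text.split("\n"):
        tokens = line.split(" ")
        indexes = [edge_index] + [token_to_index.get(t, unk_index) for t in tokens]
        with torch.no_grad():
            logits = model(torch.tensor([indexes], dtype=torch.int64))[0]
        predicted = logits.argmax(1).tolist()  # most probable next token at every position
        new_tokens = []
        for i, token in enumerate(tokens):
            # Compare strings, not indexes, so an out-of-vocabulary token never
            # "matches" the unknown token and is always kept unchanged.
            predicted_token = index_to_token[predicted[i]]
            new_tokens.append("X" if predicted_token == token else token)
        compressed_lines.append(" ".join(new_tokens))
    return "\n".join(compressed_lines)
```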
The decompression function was designed to handle input text comprising sentences separated by new lines and space-separated tokens, including 'X' tokens. Its purpose was to return a single big string where each line in the compressed text could be decompressed back into the original input line.
- The function couldn't utilize the language model once to predict all 'X's at once due to the requirement that each sentence prefix leading up to the 'X' should not contain another 'X'. As a result, individual language model predictions had to be made for every 'X', using only the tokens preceding 'X' as input to the model (along with the edge token at the beginning).
- Ensuring that the input to the language model excluded 'X's was essential, replacing these 'X's with their predicted tokens when constructing the language model input.
The decompression of the compressed text:
X X bit X cat sensually .
Resulted in the decompressed form:
the dog bit the cat sensually .
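A corresponding sketch of the decompression function, under the same assumptions; each 'X' is resolved left to right using only the tokens that precede it.

```python
import torch

def decompress(text, model, token_to_index, index_to_token, edge_index, unk_index):
    """Replace each 'X' with the model's most probable token, left to right."""
    decompressed_lines = []
    for line in text.split("\n"):
        tokens = line.split(" ")
        for i, token in enumerate(tokens):
            if token != "X":
                continue
            # Earlier 'X's have already been replaced, so the prefix contains no 'X'.
            prefix = [edge_index] + [token_to_index.get(t, unk_index) for t in tokens[:i]]
            with torch.no_grad():
                logits = model(torch.tensor([prefix], dtype=torch.int64))[0]
            tokens[i] = index_to_token[logits[-1].argmax().item()]
        decompressed_lines.append(" ".join(tokens))
    return "\n".join(decompressed_lines)
```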
The space-saving amount was calculated by applying the mock model to the test set, using the following formula:

$$\text{space saving} = 1 - \frac{\text{size of compressed text}}{\text{size of original text}}$$

where the sizes are measured as the number of characters. Utilizing the mock model on the test set resulted in a space-saving amount of 2.4%.
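A small sketch of how this could be computed and checked, assuming the `compress`/`decompress` sketches above; `test_text` and `mock_model` are assumed variable names for the raw test string and a `MockModel` instance.

```python
compressed = compress(test_text, mock_model, token_to_index, index_to_token,
                      edge_index, unk_index)

# Lossless round trip: decompression must reproduce the original text exactly.
assert decompress(compressed, mock_model, token_to_index, index_to_token,
                  edge_index, unk_index) == test_text

space_saving = 1 - len(compressed) / len(test_text)
print(f"Space saving: {space_saving:.1%}")
```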
The objective was to train a neural language model on the train set. Following the training process, a graph showcasing the variation of the dev set perplexity with each epoch was generated, employing the previously developed perplexity function.
The `NeuralLanguageModel` was a neural architecture designed for language modeling and text compression, structured with the following layers:
- Embedding Layer
  - Converted token indexes into embeddings.
  - Input Shape: vocabulary size, Output Shape: embedding size.
- LSTM Cell
  - Processed the embeddings using an LSTM layer.
  - Input Shape: embedding size, Output Shape: state size.
- Dropout Layer
  - Applied dropout to the intermediate states.
  - Dropout Rate: 0.5.
- Batch Normalization
  - Normalized the LSTM output across batches.
- Output Layer
  - Produced logits predicting the next token.
  - Input Shape: state size, Output Shape: vocabulary size.
- Vocabulary Size: Determined the size of the vocabulary.
- Embedding Size: Defined the size of the embedding layer.
- State Size: Set the size of the LSTM layer.
- Dropout Rate: Dropout applied to intermediate states (default: 0.5).
- Weight Decay: Weight decay factor (default: 1e-5).
This architecture allowed the model to predict subsequent tokens in a sequence, aiding in text compression and language modeling tasks.
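A minimal PyTorch sketch of an architecture matching this description; the exact layer wiring (in particular how batch normalization is applied across time steps) is an assumption based on the list above. Weight decay (default 1e-5) would typically be passed to the optimiser rather than implemented inside the module.

```python
import torch
import torch.nn as nn

class NeuralLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, state_size, dropout_rate=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)  # indexes -> embeddings
        self.lstm = nn.LSTM(embedding_size, state_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)                    # on intermediate states
        self.batch_norm = nn.BatchNorm1d(state_size)               # normalise LSTM output
        self.output = nn.Linear(state_size, vocab_size)            # logits over the vocabulary

    def forward(self, x_indexes):
        # x_indexes: int64, (batch size, time steps)
        embedded = self.embedding(x_indexes)                       # (batch, time, embedding)
        states, _ = self.lstm(embedded)                            # (batch, time, state)
        states = self.dropout(states)
        # BatchNorm1d expects (batch, features, time), so transpose around it.
        states = self.batch_norm(states.transpose(1, 2)).transpose(1, 2)
        return self.output(states)                                 # (batch, time, vocab)
```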
Hyperparameter optimization was conducted to pinpoint the most effective configuration. Following extensive testing, it was determined that the most favorable hyperparameters were:
- Learning Rate: 0.01
- Embedding Size: 312
- State Size: 312
- Batch Size: 64
Furthermore, this specific combination of hyperparameters yielded a dev perplexity of 73.2. These hyperparameters were selected based on their ability to minimize perplexity during language model training.
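A hedged sketch of a training loop using these hyperparameters together with the sketches above; the choice of Adam, the clipping threshold of 1.0, and the number of epochs are assumptions (the report only states that gradient clipping, dropout, and weight decay were used).

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

model = NeuralLanguageModel(vocab_size=len(vocab), embedding_size=312, state_size=312)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-5)

def make_batch(sequences, edge_index, pad_index):
    # Inputs are the sequences as given; targets are shifted one step left and
    # end with the edge token; both are padded to the longest sentence in the batch.
    xs = [torch.tensor(seq, dtype=torch.int64) for seq in sequences]
    ys = [torch.tensor(seq[1:] + [edge_index], dtype=torch.int64) for seq in sequences]
    return (pad_sequence(xs, batch_first=True, padding_value=pad_index),
            pad_sequence(ys, batch_first=True, padding_value=pad_index))

for epoch in range(10):  # illustrative number of epochs
    model.train()
    for start in range(0, len(train_indexes), 64):  # batch size 64
        x, y = make_batch(train_indexes[start:start + 64], edge_index, pad_index)
        optimizer.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), y.reshape(-1),
                               ignore_index=pad_index)  # pad positions contribute nothing
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # assumed threshold
        optimizer.step()
    print(epoch, perplexity(model, dev_indexes, edge_index, pad_index))
```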
Training and evaluation graphs were generated to display the model's performance across epochs. These graphs provided insight into the training progress, the variation of the dev set perplexity, and other performance metrics, aiding in understanding the model's learning behaviour and validation performance.
The perplexity of the language model on the dev set was 73.3.
The trained model's space saving efficiency on the test text was quantified, ensuring that upon decompressing the compressed test text, an exact match was obtained with the original test text. Special attention was given to stripping off the newline character from the end of the test text for accurate comparison.
The test text was segmented into sentences, and each individual sentence underwent compression. The top 5 most compressed and top 5 least compressed sentences, based on the space saving metric, were printed alongside their respective compressed versions.
A thorough investigation was conducted to discern whether a sentence's compressibility correlated with its similarity to the train set. This involved extracting trigrams from the train set using `nltk.trigrams`
and calculating a domain similarity measure for each test sentence by counting overlapping trigrams with the train set. The resultant list mapping each sentence's domain similarity to its space saving amount was then used to plot a scatter plot. The analysis aimed to identify any correlation between the domain similarity measure and the space saving amount, which could be evident through a trend or pattern in the scatter plot.
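A minimal sketch of the trigram-based similarity measure; the normalisation (fraction of a sentence's trigrams that appear in the train set) and the helper `space_saving_of` for the per-sentence space saving are assumptions made for illustration.

```python
import nltk
import matplotlib.pyplot as plt

# All trigrams seen in the train set (train_sents as in the earlier sketch).
train_trigrams = {tri for sent in train_sents for tri in nltk.trigrams(sent)}

def domain_similarity(sentence_tokens):
    # Fraction of this sentence's trigrams that also occur in the train set.
    tris = list(nltk.trigrams(sentence_tokens))
    if not tris:
        return 0.0
    return sum(tri in train_trigrams for tri in tris) / len(tris)

points = [(domain_similarity(sent.split(" ")), space_saving_of(sent))
          for sent in test_text.split("\n")]

plt.scatter([sim for sim, _ in points], [saving for _, saving in points])
plt.xlabel("domain similarity (trigram overlap)")
plt.ylabel("space saving")
plt.show()
```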
Upon examination, the scatter plot didn't exhibit a linear trend and predominantly skewed towards very low space saving amounts, irrespective of domain similarity. This observation led to the realization that domain similarity alone is insufficient to explain compressibility. Several other factors contribute to a sentence's compressibility, not solely contingent on its similarity to the train set.
Domain similarity, while an important factor, fails to wholly elucidate compressibility. Other unaccounted factors influencing compressibility include sentence length, linguistic complexity, language structure, recurring word patterns, dataset size used for training, and the employed compression algorithm. Additionally, the model's predictive capability based on prior tokens in a sentence, beyond trigram-based domain similarity, can significantly influence compression outcomes, contributing to the observed discrepancy in the scatter plot.
A simple improvement in the compression algorithm involved encoding common phrases as single tokens or codes, effectively reducing the vocabulary size while maintaining predictive capabilities. For instance, phrases like "the dog" could be replaced with a single representative token, allowing compressible phrases to be represented more efficiently, resulting in enhanced compression.
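As an illustration of that idea, the following hedged sketch merges frequent adjacent token pairs into single tokens before the vocabulary is built; the frequency threshold and the underscore joining scheme are assumptions.

```python
from collections import Counter

def merge_common_bigrams(sentences, min_count=100):
    """Replace frequent adjacent token pairs with a single joined token."""
    pair_counts = Counter((a, b) for sent in sentences for a, b in zip(sent, sent[1:]))
    common = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_sentences = []
    for sent in sentences:
        merged, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in common:
                merged.append(sent[i] + "_" + sent[i + 1])  # e.g. "the_dog"
                i += 2
            else:
                merged.append(sent[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences
```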
The outcomes from the Neural Language Model indicated an incremental improvement in space saving, achieving between 3.5% and 3.8% compared to the 2.4% attained by the mock model. Despite this marginal gain in compression, the model's dev perplexity, ranging from 73.2 to 73.8, showed a substantial improvement over the mock model's perplexity of 7062.2. Several factors contributed to this, including regularization techniques such as dropout and weight decay to mitigate overfitting, gradient clipping to prevent exploding gradients, and hyperparameter tuning.
The model demonstrated a steady decrease in training error and perplexity without signs of overfitting, suggesting its ability to learn intrinsic patterns in English sentences. However, the depth of understanding within the language remains uncertain without further analysis of the model's predictions.
To enhance the model's performance further, potential improvements include employing more advanced architectures like transformer models, training on a diverse dataset to capture additional language intricacies, adjusting token omission thresholds for increased compression, and potentially integrating a semantic similarity metric to ensure retained meaning in compressed text.
Overall, the neural model showcased promising performance enhancements in compression efficiency and linguistic comprehension, yet there's room for advancement through architectural refinements and broader data exposure.