This project implements a part-of-speech tagger for Ancient Classical Greek, a morphologically rich and low-resource language, using texts from The Ancient Greek and Latin Dependency Treebank and ideas inspired by Speech and Language Processing.
A syllable-based Naive Bayes model with Laplace/Lidstone smoothing is implemented to predict fine-grained part-of-speech tags using likelihood and prior estimates from the training data. A variant called `StupidBayes` (so named for its relation to the Stupid Backoff smoothing method) is introduced that offers similar accuracy with faster performance.
This morphological approach addresses a major difficulty of part-of-speech tagging Ancient Greek with classical methods: namely, the complex system of word endings that produces many words occurring only once, and a highly flexible word order within sentences.[^1]
`MultinomialNaiveBayes`[^2] is trained on n-grams of word syllables generated from The Ancient Greek and Latin Dependency Treebank. N-gram depth is controllable via the `ngram_range` parameter. The first training pass counts syllable occurrences per class, as well as class occurrences. Greek diacritics, which are usually stripped, are preserved to give the model more information.
The second pass normalizes the raw counts into probabilities (represented as log probabilities for numerical stability) for syllable likelihoods and class priors. Laplace/Lidstone smoothing is implemented via an `alpha` parameter that can be adjusted at runtime. Since the likelihood probability distributions are sparse, dictionaries with default values are used to speed up computation.
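For illustration, syllable n-gram generation over a configurable range might look like the following sketch (a hypothetical `ngrams` helper; the repository's actual feature code lives in `src/cgpos/data/features.py`):

```python
def ngrams(syllables, ngram_range=(1, 2)):
    """Generate all contiguous syllable n-grams within ngram_range (inclusive)."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(syllables) - n + 1):
            grams.append(tuple(syllables[i : i + n]))
    return grams


# Syllables of a Greek word, with diacritics preserved:
print(ngrams(["λό", "γος"], ngram_range=(1, 2)))
# [('λό',), ('γος',), ('λό', 'γος')]
```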
At prediction time the model uses Bayes' theorem (assuming conditional independence) to estimate the most likely class via the following relationship:

$$\hat{y} = \operatorname*{arg\,max}_{c}\ \Big[\log P(c) + \sum_{i=1}^{n} \log P(s_i \mid c)\Big],$$

where $s_1, \dots, s_n$ are the syllable n-grams of the input word.
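As a minimal sketch of both training passes and this prediction rule (a simplified, hypothetical `TinyNaiveBayes`, not the repository's exact `MultinomialNaiveBayes` implementation):

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over syllable n-grams with Lidstone smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # First pass: count syllable n-grams per class and class occurrences.
        self.class_counts = Counter(y)
        self.feature_counts = defaultdict(Counter)
        for grams, label in zip(X, y):
            self.feature_counts[label].update(grams)
        # Second pass: normalize raw counts into smoothed log probabilities.
        vocab_size = len({g for c in self.feature_counts.values() for g in c})
        total = sum(self.class_counts.values())
        self.log_prior = {c: math.log(n / total) for c, n in self.class_counts.items()}
        self.log_likelihood, self.log_default = {}, {}
        for c, counts in self.feature_counts.items():
            denom = sum(counts.values()) + self.alpha * vocab_size
            self.log_likelihood[c] = {
                g: math.log((n + self.alpha) / denom) for g, n in counts.items()
            }
            # Default value for unseen n-grams keeps the likelihood table sparse.
            self.log_default[c] = math.log(self.alpha / denom)
        return self

    def predict(self, grams):
        # argmax_c [ log P(c) + sum_i log P(s_i | c) ]
        def score(c):
            likelihood = self.log_likelihood[c]
            return self.log_prior[c] + sum(
                likelihood.get(g, self.log_default[c]) for g in grams
            )

        return max(self.log_prior, key=score)
```

Storing only observed n-grams and falling back to a per-class default value stands in for the sparse dictionary trick described above.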
`StupidBayes`[^3], on the other hand, is a variant of `MultinomialNaiveBayes` that skips probabilities altogether. Its single training pass only stores syllable n-gram counts per class. The maximum n-gram length to be generated is controllable via the `ngram_depth` parameter. A simplified version of n-gram backoff is implemented that generates shorter n-grams only to fill in lookup gaps.
Predictions are generated at test time by simply returning the class with the highest count across all the input n-grams. Formally, this is given by:

$$\hat{y} = \operatorname*{arg\,max}_{c}\ \sum_{i=1}^{n} \operatorname{count}(s_i, c),$$

where $\operatorname{count}(s_i, c)$ is the stored count of n-gram $s_i$ in class $c$.
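In sketch form, under the assumption that backoff means truncating an unseen n-gram until a stored one is found (a simplified, hypothetical reading; the real class lives in `cgpos.models.multinomial_naive_bayes`):

```python
from collections import Counter, defaultdict


class TinyStupidBayes:
    """Count-based classifier over syllable n-grams with simple backoff."""

    def __init__(self, ngram_depth=3):
        self.ngram_depth = ngram_depth

    def fit(self, X, y):
        # Single pass: store raw n-gram counts per class, no probabilities.
        self.counts = defaultdict(Counter)
        for grams, label in zip(X, y):
            for gram in grams:
                self.counts[gram][label] += 1
        return self

    def predict(self, grams):
        votes = Counter()
        for gram in grams:
            # Back off to shorter n-grams to fill lookup gaps.
            while gram and gram not in self.counts:
                gram = gram[1:]
            if gram:
                votes.update(self.counts[gram])
        # Return the class with the highest total count.
        return votes.most_common(1)[0][0] if votes else None
```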
Model training is carried out by `cgpos.eval.train`. Training and test sets are produced using a shuffled split strategy. Training is parallelized via `concurrent.futures` to reduce training time. Results are stored in `runs` with the format `YYYY-mm-dd_HH-MM-SS`.
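A sketch of what that driver could look like (hypothetical names and structure; the actual logic is in `src/cgpos/eval/train.py`):

```python
import concurrent.futures
from datetime import datetime
from pathlib import Path

TARGETS = ["pos", "person", "number", "gender", "case", "tense", "mood", "voice", "degree"]


def train_target(target):
    # Placeholder: fit one classifier for this morphological target.
    return target


if __name__ == "__main__":
    # Timestamped results directory, e.g. runs/2024-01-31_12-00-00
    run_dir = Path("runs") / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Train targets in parallel across processes.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(train_target, TARGETS))
```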
`cgpos.eval.eval` carries out hyperparameter tuning and model selection according to `config.yaml`. A stratified k-fold shuffle strategy (`k=5`) and the F1 score evaluation metric are employed to handle class imbalance. Multioutput predictions are achieved by the simple strategy of fitting one classifier per target, and a `PartOfSpeechTagger` is created from the best model configuration.
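Conceptually, the multioutput wrapper amounts to something like this (an illustrative sketch, not the repository's exact `PartOfSpeechTagger` API):

```python
class TinyTagger:
    """Fit one estimator per morphological target; predict all targets at once."""

    def __init__(self, estimators):
        # estimators: mapping of target name -> unfitted classifier
        self.estimators = estimators

    def fit(self, X, Y):
        # Y: mapping of target name -> labels aligned with X
        for target, clf in self.estimators.items():
            clf.fit(X, Y[target])
        return self

    def predict(self, grams):
        # e.g. {'pos': 'noun', 'case': 'nominative', ...}
        return {target: clf.predict(grams) for target, clf in self.estimators.items()}
```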
The best model is stored in `models/pos_tagger.pkl`, and a text report showing the best model configuration, classification reports, and confusion matrices is saved to `reports/report.txt`.
The overall fine-grained test accuracy for the best model after hyperparameter tuning is 80.19%.
Test accuracy for morphologically-driven elements (nouns, adjectives, and verbs) is 87.32%.
This is the best model architecture after hyperparameter tuning:
{pos: ('StupidBayes', {'ngram_depth': 9}),
 person: ('MultinomialNaiveBayes', {'alpha': 0.5, 'ngram_range': (1, 2)}),
 number: ('StupidBayes', {'ngram_depth': 6}),
 gender: ('MultinomialNaiveBayes', {'alpha': 0.1, 'ngram_range': (1, 3)}),
 case: ('StupidBayes', {'ngram_depth': 7}),
 tense: ('StupidBayes', {'ngram_depth': 7}),
 mood: ('MultinomialNaiveBayes', {'alpha': 0.3, 'ngram_range': (1, 2)}),
 voice: ('MultinomialNaiveBayes', {'alpha': 0.8, 'ngram_range': (1, 5)}),
 degree: ('StupidBayes', {'ngram_depth': 9})}
**pos**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.01 | 0.33 | 0.02 | 3 |
| article | 0.89 | 0.86 | 0.87 | 2983 |
| noun | 0.94 | 0.89 | 0.91 | 11732 |
| adjective | 0.81 | 0.90 | 0.85 | 5316 |
| pronoun | 0.72 | 0.84 | 0.78 | 2921 |
| verb | 0.94 | 0.89 | 0.91 | 10179 |
| adverb | 0.57 | 0.89 | 0.70 | 3089 |
| adposition | 0.96 | 0.91 | 0.93 | 2885 |
| conjunction | 0.81 | 0.72 | 0.76 | 3133 |
| numeral | 0.83 | 0.63 | 0.72 | 71 |
| interjection | 1.00 | 1.00 | 1.00 | 19 |
| particle | 0.89 | 0.72 | 0.80 | 4810 |
| punctuation | 1.00 | 0.99 | 1.00 | 5574 |
| accuracy | | | 0.87 | 52715 |
| macro avg | 0.80 | 0.81 | 0.79 | 52715 |
| weighted avg | 0.89 | 0.87 | 0.87 | 52715 |

**person**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.99 | 0.99 | 0.99 | 46656 |
| first person | 0.76 | 0.77 | 0.77 | 839 |
| second person | 0.65 | 0.64 | 0.64 | 732 |
| third person | 0.93 | 0.87 | 0.90 | 4488 |
| accuracy | | | 0.97 | 52715 |
| macro avg | 0.83 | 0.82 | 0.83 | 52715 |
| weighted avg | 0.97 | 0.97 | 0.97 | 52715 |

**number**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.96 | 0.98 | 0.97 | 20844 |
| singular | 0.96 | 0.92 | 0.94 | 22222 |
| plural | 0.87 | 0.92 | 0.90 | 9516 |
| dual | 0.64 | 0.92 | 0.76 | 133 |
| accuracy | | | 0.94 | 52715 |
| macro avg | 0.86 | 0.94 | 0.89 | 52715 |
| weighted avg | 0.94 | 0.94 | 0.94 | 52715 |

**gender**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.98 | 0.97 | 0.97 | 27309 |
| masculine | 0.90 | 0.91 | 0.90 | 13571 |
| feminine | 0.88 | 0.90 | 0.89 | 6704 |
| neuter | 0.79 | 0.79 | 0.79 | 5131 |
| accuracy | | | 0.93 | 52715 |
| macro avg | 0.89 | 0.89 | 0.89 | 52715 |
| weighted avg | 0.93 | 0.93 | 0.93 | 52715 |

**case**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.98 | 0.95 | 0.96 | 27589 |
| nominative | 0.80 | 0.87 | 0.83 | 7039 |
| genitive | 0.92 | 0.92 | 0.92 | 4870 |
| dative | 0.88 | 0.93 | 0.91 | 3628 |
| accusative | 0.88 | 0.85 | 0.86 | 9309 |
| vocative | 0.55 | 0.87 | 0.68 | 280 |
| accuracy | | | 0.92 | 52715 |
| macro avg | 0.83 | 0.90 | 0.86 | 52715 |
| weighted avg | 0.92 | 0.92 | 0.92 | 52715 |

**tense**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.99 | 0.97 | 0.98 | 44010 |
| present | 0.84 | 0.93 | 0.88 | 3471 |
| imperfect | 0.78 | 0.92 | 0.84 | 1083 |
| perfect | 0.81 | 0.90 | 0.85 | 509 |
| plusquamperfect | 0.67 | 0.59 | 0.63 | 105 |
| future perfect | 0.00 | 0.00 | 0.00 | 1 |
| future | 0.62 | 0.84 | 0.71 | 299 |
| aorist | 0.85 | 0.91 | 0.88 | 3237 |
| accuracy | | | 0.96 | 52715 |
| macro avg | 0.70 | 0.76 | 0.72 | 52715 |
| weighted avg | 0.97 | 0.96 | 0.96 | 52715 |

**mood**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.98 | 0.99 | 0.99 | 42640 |
| indicative | 0.92 | 0.92 | 0.92 | 4707 |
| subjunctive | 0.78 | 0.66 | 0.71 | 479 |
| infinitive | 0.99 | 0.87 | 0.92 | 1372 |
| imperative | 0.54 | 0.61 | 0.57 | 276 |
| participle | 0.95 | 0.89 | 0.92 | 2922 |
| optative | 0.89 | 0.72 | 0.80 | 319 |
| accuracy | | | 0.97 | 52715 |
| macro avg | 0.87 | 0.81 | 0.83 | 52715 |
| weighted avg | 0.97 | 0.97 | 0.97 | 52715 |

**voice**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 0.99 | 0.99 | 0.99 | 42987 |
| active | 0.94 | 0.94 | 0.94 | 6765 |
| passive | 0.77 | 0.91 | 0.84 | 261 |
| middle | 0.84 | 0.81 | 0.83 | 902 |
| medio-passive | 0.93 | 0.84 | 0.89 | 1800 |
| accuracy | | | 0.98 | 52715 |
| macro avg | 0.89 | 0.90 | 0.90 | 52715 |
| weighted avg | 0.98 | 0.98 | 0.98 | 52715 |

**degree**

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| N/A | 1.00 | 1.00 | 1.00 | 52532 |
| positive | 0.00 | 0.00 | 0.00 | 1 |
| comparative | 0.76 | 0.64 | 0.69 | 137 |
| superlative | 0.37 | 0.67 | 0.48 | 45 |
| accuracy | | | 1.00 | 52715 |
| macro avg | 0.53 | 0.58 | 0.54 | 52715 |
| weighted avg | 1.00 | 1.00 | 1.00 | 52715 |
**pos confusion matrix**

true / pred | article | noun | adjective | pronoun | verb | adverb | adposition | conjunction | numeral | interjection | particle | punctuation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
article | 2559 | 3 | 4 | 405 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
noun | 63 | 10393 | 491 | 179 | 404 | 73 | 9 | 33 | 1 | 0 | 63 | 0 |
adjective | 4 | 253 | 4763 | 160 | 74 | 21 | 9 | 15 | 7 | 0 | 2 | 0 |
pronoun | 69 | 20 | 302 | 2456 | 15 | 23 | 0 | 1 | 0 | 0 | 34 | 0 |
verb | 1 | 326 | 242 | 100 | 9031 | 64 | 39 | 151 | 1 | 0 | 211 | 0 |
adverb | 4 | 38 | 36 | 24 | 45 | 2744 | 54 | 65 | 0 | 0 | 74 | 0 |
adposition | 160 | 3 | 5 | 17 | 12 | 50 | 2635 | 2 | 0 | 0 | 0 | 0 |
conjunction | 4 | 2 | 4 | 57 | 3 | 763 | 5 | 2252 | 0 | 0 | 42 | 0 |
numeral | 0 | 1 | 25 | 0 | 0 | 0 | 0 | 0 | 45 | 0 | 0 | 0 |
interjection | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 0 | 0 |
particle | 7 | 7 | 11 | 4 | 24 | 1033 | 6 | 252 | 0 | 0 | 3464 | 0 |
punctuation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5545 |
As expected, the part-of-speech tagger does well classifying highly morphologically driven elements such as nouns and verbs, but it struggles to discern positionally driven elements such as conjunctions and particles. A model architecture that incorporates positional information, such as word order within the sentence, would be necessary to close this gap.
Get project requirements by setting up a `poetry` environment:
- Install poetry
- In the terminal, run:
$ git clone https://github.com/tejomaygadgil/cgpos.git
$ cd cgpos/
$ make activate_poetry
poetry install
Installing dependencies from lock file
No dependencies to install or update
Installing the current project: cgpos (1.0.0)
...
Grab project data:
$ cd /dir/to/repository
$ make make_dataset
Initializing data directory
mkdir data/raw/zip
Grabbing Perseus data
...
Prepare the data for training by tokenizing words into syllables:
$ cd /dir/to/repository
$ make build_features
Building features
python src/cgpos/data/features.py
...
Train the part-of-speech tagger and evaluate performance using:
$ cd /dir/to/repository
$ make train_model
Training model
python src/cgpos/eval/train.py
...
Hyperparameter ranges and model selection are adjustable via the `config.yaml` config file.
You can get a help file of all available `make` options by running:
$ make
Available rules:
activate_poetry Activate poetry environment
build_features Build features
get_data Get raw data
init_data_dir Initialize data directory
install_poetry Install poetry environment
make_dataset Make Perseus dataset
remove_all_data Remove all data
remove_data Remove processed data
tests Run tests
train_model Train model and evaluate performance
This repository uses the following tools:
- `make` for running code
- `poetry` for package management
- `hydra` for code reproducibility
- `black` and `ruff` for code review
Footnotes

[^1]: This is in contrast to analytic languages like English that use word order and special words like "had" and "will" to express part-of-speech.

[^2]: `from cgpos.models.multinomial_naive_bayes import MultinomialNaiveBayes`

[^3]: `from cgpos.models.multinomial_naive_bayes import StupidBayes`