Problem: Even after disambiguation, some morphological analyses in Vabamorf's output remain ambiguous, and these ambiguous analyses are not sorted by probability. So the commonly used strategy of picking the first analysis in case of ambiguity does not work well on Vabamorf's plain output: the first analysis may not be the most likely one.
Solution: We analyse a manually annotated corpus (the Estonian UD treebank) and collect the frequencies of the correct variants of ambiguous morphological analyses. As a result, we get lexicons that can be used for reordering ambiguous analyses by likelihood with the help of MorphAnalysisReorderer (see this tutorial for details). This repository contains the source for creating the lexicons and evaluating their reordering performance.
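The idea of a word_to_analyses frequency lexicon can be illustrated with a minimal pure-Python sketch. It does not use EstNLTK or MorphAnalysisReorderer; the function names, the tuple layout (lemma, partofspeech, form) and the toy observations are assumptions made for illustration only, not the repository's actual code or data.

```python
from collections import Counter, defaultdict

# Hypothetical toy observations (illustrative only, not taken from the treebank):
# (word, candidate analyses in the analyser's order, gold analysis).
# Each analysis is a (lemma, partofspeech, form) tuple.
observations = [
    ("teed", [("tee", "S", "pl n"), ("tee", "S", "sg p"), ("tegema", "V", "d")],
     ("tee", "S", "sg p")),
    ("teed", [("tee", "S", "pl n"), ("tee", "S", "sg p"), ("tegema", "V", "d")],
     ("tee", "S", "sg p")),
    ("teed", [("tee", "S", "pl n"), ("tee", "S", "sg p"), ("tegema", "V", "d")],
     ("tegema", "V", "d")),
]


def build_word_to_analyses_lexicon(observations):
    """Count, for each ambiguous word, how often each analysis was the gold one."""
    lexicon = defaultdict(Counter)
    for word, candidates, gold in observations:
        # Only ambiguous words whose gold analysis is among the candidates count.
        if len(candidates) > 1 and gold in candidates:
            lexicon[word.lower()][gold] += 1
    return lexicon


def reorder(word, candidates, lexicon):
    """Sort candidates by descending frequency; ties keep their original order."""
    counts = lexicon.get(word.lower(), Counter())
    return sorted(candidates, key=lambda analysis: -counts[analysis])


lexicon = build_word_to_analyses_lexicon(observations)
reordered = reorder("teed", observations[0][1], lexicon)
print(reordered[0])  # the analysis most often marked correct in the toy data
```

Because Python's sort is stable, analyses that never occurred as correct keep the analyser's original relative order, so the reordering only moves analyses for which there is evidence.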
Requirements: EstNLTK version 1.6.5+
- 01_convert_ud_corpus_to_vm.ipynb -- converts the Estonian UD treebank's data from UD format to Vabamorf's format. Saves the results as EstNLTK's JSON files into the folder 'UD_converted';
- 02_word_to_analyses_freq_lexicons.ipynb -- re-annotates the data with EstNLTK's Vabamorf, finds matches between automatic analyses and gold standard ones, and creates word_to_analyses frequency lexicons. A word_to_analyses lexicon shows, for each ambiguous word, how many times each of its analyses was the correct one (according to the manual annotations). Evaluates different word_to_analyses frequency lexicons on the test set, and examines how much the reordering improves the chances of the first analysis being the correct one;
- 03_category_freq_lexicons.ipynb -- re-annotates the data with EstNLTK's Vabamorf, finds matches between automatic analyses and gold standard ones, and creates partofspeech and form frequency lexicons. These lexicons show, for each partofspeech and form category, how frequently the category appeared as the correct category of an ambiguous word. Evaluates the category frequency lexicons (in combination with the best word_to_analyses frequency lexicon) on the test set, and examines how much the reorderings improve the chances of the first analysis being the correct one.
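The evaluation described in the notebooks above can be sketched roughly as follows: measure how often the first analysis of an ambiguous word is the correct one, before and after reordering. This is a hedged illustration, not the notebooks' code. The function names, the toy lexicons and test items are hypothetical, and using the partofspeech frequency only as a tie-breaker after the word-level count is an assumption about how the two lexicons might be combined.

```python
def first_analysis_accuracy(test_items, reorder_fn=None):
    """Share of ambiguous words whose first-listed analysis equals the gold one."""
    hits = total = 0
    for word, candidates, gold in test_items:
        if len(candidates) < 2:
            continue  # unambiguous words are not part of this evaluation
        ordered = reorder_fn(word, candidates) if reorder_fn else candidates
        total += 1
        hits += ordered[0] == gold
    return hits / total if total else 0.0


def make_reorderer(word_lexicon, pos_freq):
    """Order by word-level count first, then by partofspeech frequency (assumed scheme)."""
    def reorder_fn(word, candidates):
        counts = word_lexicon.get(word, {})
        return sorted(
            candidates,
            key=lambda a: (-counts.get(a, 0), -pos_freq.get(a[1], 0)),
        )
    return reorder_fn


# Hypothetical lexicons and test items; analyses are (lemma, partofspeech, form).
word_lexicon = {"teed": {("tee", "S", "sg p"): 2, ("tegema", "V", "d"): 1}}
pos_freq = {"S": 10, "V": 4}
test_items = [
    ("teed", [("tee", "S", "pl n"), ("tee", "S", "sg p"), ("tegema", "V", "d")],
     ("tee", "S", "sg p")),
    ("on", [("olema", "V", "b"), ("olema", "V", "vad")], ("olema", "V", "b")),
    ("tuli", [("tulema", "V", "s"), ("tuli", "S", "sg n")], ("tuli", "S", "sg n")),
]

baseline = first_analysis_accuracy(test_items)
reordered = first_analysis_accuracy(test_items, make_reorderer(word_lexicon, pos_freq))
print(f"first analysis correct: {baseline:.2f} -> {reordered:.2f}")
```

Comparing the two numbers on a held-out test set shows how much a given lexicon, or combination of lexicons, actually helps the "pick the first analysis" strategy.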