-
Notifications
You must be signed in to change notification settings - Fork 9
SpellCheckerBuildingTutorialFsmnlp2012
Presented in FSMNLP 2012, Donostia. See also, the slides.
This language model is strongly inspired by Peter Norvig's nice essay about writing English spelling corrector in one day in Python. Specifically, you can use his big.txt as a corpus to replicate his results here. For demonstrational purposes, I am going to build our own "corpus" here online:
cat > demo-corpus.txt
Now we can write some examples examples here.
And press CTRL-D to end the file.
To extract words and frequencies from the corpus:
tr -d '[:punct:]' < demo-corpus.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -nr | awk -f tropicalize-uniq-c.awk --assign=CS=20 | hfst-strings2fst -j -f openfst -o demo-lm.hfst
Note the awk script tropicalize-uniq-c.awk, it just -logs all the frequencies in the uniq -c data and divides by corpus size, which we explicitly specify using CS=20. The script is:
{printf("%s\t%f\n", $2, -log($1/CS));}
You can find beautified version here.
For most error models we need this concept of correctly typed runs of characters, any of those at all. Let's compile it with Xerox notation regexp:
echo '?*' | hfst-regexp2fst -o anystar.hfst
The edit distance error model we create using Xerox notation for now:
echo 'a:0::10 | 0:a::10 | a:b::10 | b:0::10 | 0:b::10 | b:a::10 | ?:0::10 | 0:?::10 | ?:?::10' | hfst-regexp2fst -o edit1.hfst
Or with swaps:
echo 'a:0::10 | 0:a::10 | a:b::10 (b:a) | b:0::10 | 0:b::10 | b:a::10 (a:b) | ?:0::10 | 0:?::10 | ?:?::10' | hfst-regexp2fst | hfst-minimize -o edit1.hfst
This needs to be combined with correctly typed parts before or after the error to be usable (so we have ?* ERROR ?*
):
hfst-concatenate anystar.hfst edit1.hfst | hfst-concatenate - edit1.hfst | errm.edit1.hfst
In practice though we usually use scripts to create the edit distance automata, since it can be scripted easily.
python editdist.py -d 1 -S -a demo-lm.hfst | hfst-txt2fst -o errm.edit1all.hfst
This python script is also here.
To make more edits we can e.g. use repeat (or compose):
hfst-repeat -f 1 -t 7 -i errm.edit1.hfst -o errm.edit7.hfst
In practice you'll want to use edit's between 1 and 3
Just to see how it works, we can see what it does to our words:
hfst-lookup errm.edit1all.hfst
The language-specific errors can be typed in as a list:
hfst-strings2fst -j -o en-orth.hfst
of:ough 5
...
Remember to end the input by CTRL-D.
To combine with edit distance, we use disjunction:
hfst-disjunct en-orth.hfst edit1.hfst | hfst-minimize > edit1+en-orth.hfst
Now we can combine these two errors with the correctly typed accepting automaton with concatenation to form full better edit model:
hfst-concatenate anystar.hfst edit1+en-orth.hfst | hfst-concatenate - anystar.hfst -o errm.edit1+en-orth.hfst
Now for full word errors, we use again just a list of strings to input:
hfst-strings2fst -j -o en-confusables.hfst
teh:the 2.5
mispelled:misspelt 2.5
...
The final combination between the past error models and these fullword errors is now made with disjuncition, since these fullword errors bypass the regular mechanism totally.
hfst-disjunct en-confusables.hfst errm.edit1+en-orth.hfst | hfst-minimize errm.everything.hfst
Turning automata into a hfst ospell/voikko/enchant speller for gtk text boxes and finally libreoffice extensions
...