This project features a custom dictionary of Serbian words, used for diacritic restoration and for transliteration from Latin to Cyrillic script. The dictionary is distributed as a single SQLite database, located at ../resources/dictionary.sqlite.
This dictionary is created using the following Unicode-encoded files:
words.txt - Set of words that contain at least one of the following characters: s, c, z, š, č, ć, ž, đ, or one of the following digraphs: nj, lj, dj, dz. Each word is followed by its relative frequency of occurrence in the Serbian language.
phrases.txt - List of phrases that include words with diacritic characters, used for context disambiguation when there are multiple restoration candidates, e.g. kuca (puppy) vs kuća (house).
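The inclusion rule for words.txt can be sketched in a few lines of Python. This is only an illustration, not code from the project: the function names are hypothetical, and the line layout (a word, whitespace, then a numeric frequency) is an assumption, since the exact file format is not specified here.

```python
from typing import Tuple

# Characters and digraphs listed in the words.txt description.
AMBIGUOUS_CHARS = set("sczšđčćž")
AMBIGUOUS_DIGRAPHS = ("nj", "lj", "dj", "dz")


def belongs_in_words_file(word: str) -> bool:
    """Return True if the word contains a listed character or digraph."""
    lowered = word.lower()
    if any(ch in AMBIGUOUS_CHARS for ch in lowered):
        return True
    return any(digraph in lowered for digraph in AMBIGUOUS_DIGRAPHS)


def parse_word_line(line: str) -> Tuple[str, float]:
    """Split one line into (word, frequency).

    ASSUMPTION: each line is "word<whitespace>frequency"; the real
    words.txt layout may differ.
    """
    word, frequency = line.split()
    return word, float(frequency)
```

Under this rule both kuca and kuća qualify (because of c and ć), while a word such as voda does not, since it contains none of the listed characters or digraphs.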
You can add additional entries to the words.txt and phrases.txt files. After the files are updated, the SQLite database can be recreated by running the following script:
php build-database.php
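To give a rough idea of what such a rebuild might involve, here is a minimal Python sketch. It is not the project's build-database.php: the table layout, the column names, the function names, the sample frequencies, and the mapping of đ to dj are all assumptions made for illustration, and phrases.txt is left out entirely.

```python
import sqlite3

# ASSUMPTION: diacritic-free spelling maps each letter as below (đ -> dj).
STRIP = str.maketrans({"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"})


def build_database(word_lines, db_path=":memory:"):
    """Load "word frequency" lines into a hypothetical words table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE words (plain TEXT, word TEXT, frequency REAL)")
    rows = []
    for line in word_lines:
        if not line.strip():
            continue  # skip blank lines
        word, frequency = line.split()
        # Store the diacritic-free form as the lookup key.
        rows.append((word.translate(STRIP), word, float(frequency)))
    conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_words_plain ON words (plain)")
    conn.commit()
    return conn


def candidates(conn, plain_word):
    """Return restoration candidates, most frequent first."""
    cursor = conn.execute(
        "SELECT word FROM words WHERE plain = ? ORDER BY frequency DESC, word",
        (plain_word,),
    )
    return [row[0] for row in cursor]


# Made-up sample entries; the frequencies are not real corpus values.
conn = build_database(["kuća 0.8", "kuca 0.2", "šuma 0.5"])
print(candidates(conn, "kuca"))
```

Keying the table on the diacritic-free form means a single indexed lookup returns every candidate for an input word. When more than one candidate comes back, as for kuca here, frequency alone cannot settle the choice, which is where the phrase list would come in.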
The list of words used in this dictionary is assembled from various sources:
- Serbian Hunspell spelling dictionary
- Jezička laboratorija
- Serbian dictionary from LanguageTool project
- List of words by user "reader" on mycity.rs forum
- Serbian Language Pipeline for Spacy
- Android LatinIME dictionaries
The relative frequency of words is taken from srWaC, the Serbian Web Corpus.