Dictionary of Serbian Words

This project features a custom dictionary of Serbian words that is used for diacritic restoration and for transliteration from Latin to Cyrillic script. The dictionary is distributed as a single SQLite database, located at ../resources/dictionary.sqlite.
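The database schema is not described in this README, so a cautious first step is to ask SQLite itself which tables exist. The snippet below is a minimal sketch in Python (chosen only for illustration; the project's own tooling is PHP). The helper name `list_tables` is hypothetical and nothing here assumes particular table or column names.

```python
import sqlite3

def list_tables(db_path):
    """Return (name, CREATE statement) pairs for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return rows

# Example call (path taken from the README, relative to this directory):
# for name, sql in list_tables("../resources/dictionary.sqlite"):
#     print(name, sql, sep="\n", end="\n\n")
```

Reading the schema from `sqlite_master` avoids guessing at the structure that the build script produces.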

The dictionary is built from the following Unicode-encoded files:

words.txt - Set of words that contain at least one of the following characters: s, c, z, š, č, ć, ž, đ, or one of the following digraphs: nj, lj, dj, dz. Each word is followed by its relative frequency of occurrence in the Serbian language.

phrases.txt - List of phrases that include words with diacritic characters, used for context disambiguation when dealing with multiple restoration candidates - e.g. kuca (puppy) vs kuća (house).
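To make the role of the word frequencies concrete, here is a small Python sketch of frequency-based candidate ranking. It is not the project's implementation. It assumes that each line of words.txt has the form `word frequency` separated by whitespace, that đ is typed as "dj" in unaccented text, and it uses invented frequencies purely for illustration. The function names are hypothetical.

```python
from collections import defaultdict

# Assumed ASCII fallbacks for Serbian Latin diacritics.
ASCII_FALLBACK = str.maketrans(
    {"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"}
)

def strip_diacritics(word):
    """Lowercase a word and replace diacritic letters with ASCII fallbacks."""
    return word.lower().translate(ASCII_FALLBACK)

def parse_words(lines):
    """Parse 'word frequency' lines (assumed format) into (word, float) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, freq = line.rsplit(None, 1)
        entries.append((word, float(freq)))
    return entries

def build_candidates(entries):
    """Group words by their stripped form, most frequent candidate first."""
    index = defaultdict(list)
    for word, freq in entries:
        index[strip_diacritics(word)].append((word, freq))
    for candidates in index.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return dict(index)

# Illustrative data only; these frequencies are made up.
sample = ["kuća 0.8", "kuca 0.2", "šuma 0.5"]
index = build_candidates(parse_words(sample))
best_for_kuca = index["kuca"][0][0]
```

When a stripped form such as "kuca" maps to several candidates, frequency alone gives only a default guess. That is the case where the phrases in phrases.txt provide the context needed to choose between them.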

Extending the database

You can add entries to the words.txt and phrases.txt files. After the files are updated, the SQLite database can be recreated by running the following script:

php build-database.php
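The contents of build-database.php are not shown in this README. As a rough illustration of what rebuilding a database from the two text files can involve, here is a Python sketch. The table names `words` and `phrases`, their columns, and the `word frequency` line format are all assumptions made for this example and may differ from what the real script produces.

```python
import sqlite3

def build_database(word_lines, phrase_lines, db_path=":memory:"):
    """Recreate a dictionary database from scratch (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    # Drop and recreate so that repeated runs always start clean.
    conn.execute("DROP TABLE IF EXISTS words")
    conn.execute("DROP TABLE IF EXISTS phrases")
    conn.execute(
        "CREATE TABLE words (word TEXT PRIMARY KEY, frequency REAL NOT NULL)"
    )
    conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY)")

    word_rows = []
    for line in word_lines:
        line = line.strip()
        if line:
            word, freq = line.rsplit(None, 1)  # assumed 'word frequency' format
            word_rows.append((word, float(freq)))
    phrase_rows = [(p.strip(),) for p in phrase_lines if p.strip()]

    conn.executemany("INSERT OR REPLACE INTO words VALUES (?, ?)", word_rows)
    conn.executemany("INSERT OR IGNORE INTO phrases VALUES (?)", phrase_rows)
    conn.commit()
    return conn

# Illustrative input only; the frequencies are made up.
conn = build_database(["kuća 0.8", "kuca 0.2"], ["velika kuća"])
word_count = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
```

Dropping and recreating the tables keeps the rebuild idempotent, which matches the README's description of recreating the database after the source files change.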

Acknowledgements

The list of words used in this dictionary is assembled from various sources:

Relative word frequencies are taken from srWaC, the Serbian Web Corpus.