Open-source speech and natural language processing resources for the Tunisian Arabic dialect (work in progress).
The data and resources collected within this project are multi-purpose: named entity recognition, machine translation, language modelling, etc.
Lists of named entities:
- List of Tunisian academics and scientists
- List of Tunisian artists
- List of Tunisian football players
- List of Tunisian media personalities (presenters, directors, producers, etc.)
- List of Tunisian ministries
- List of Tunisian poets
- List of Tunisian politicians
- List of Tunisian trade unionists
- List of Tunisian writers
- List of Tunisian Higher Institutes for Technological Studies (ISET)
- List of Tunisian political parties
- List of Tunisian Universities (private)
- List of Tunisian Universities (public)
- List of Tunisian Unions
- Collect more raw text data in Tunisian Arabic.
- Develop cleaning / spelling correction scripts for Tunisian Arabic.
- Develop CODA-compatible normalization scripts for Tunisian Arabic.
- Develop Arabizi / Arabic conversion scripts.
- Develop scrapers for Tunisian news and forum websites.
- Build parallel datasets for machine translation between Tunisian <-> English / MSA.
- Develop translation systems for Tunisian <-> English and Tunisian <-> MSA.
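
As a possible starting point for the cleaning item in the list above, here is a minimal sketch in Python. It is not one of this project's scripts: the function name `clean_text` and the chosen rules (dropping diacritics and tatweel, collapsing letters repeated three or more times, squeezing whitespace) are illustrative assumptions, and it does not attempt spelling correction or CODA normalization.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), used to stretch words visually.
_TATWEEL = "\u0640"
# A letter (not a digit or underscore) repeated three or more times in a row.
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Apply a few conservative cleaning rules to raw social-media text.

    Hypothetical helper: the rule set is an assumption, not the project's.
    """
    text = _DIACRITICS.sub("", text)       # drop short-vowel marks
    text = text.replace(_TATWEEL, "")      # drop stretching characters
    text = _ELONGATION.sub(r"\1", text)    # collapse long letter runs to one
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("بـــرررشا   مَرْحَبا"))
```

Collapsing a run to a single letter is a deliberately blunt choice: it leaves legitimate doubled letters alone (only runs of three or more are touched), but a word whose correct spelling has a doubled letter will lose one if the writer elongated it.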
CODA: Habash, Nizar, Mona T. Diab, and Owen Rambow. "Conventional Orthography for Dialectal Arabic." LREC. 2012.
Zribi, Inès, et al. "A Conventional Orthography for Tunisian Arabic." LREC. 2014.
Turki, Houcemeddine, et al. "A Conventional Orthography for Maghrebi Arabic." Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portoroz, Slovenia. 2016.
Arabizi: Darwish, Kareem. "Arabizi Detection and Conversion to Arabic." arXiv preprint arXiv:1306.6755 (2013).
Yaghan, Mohammad Ali. "'Arabizi': A Contemporary Style of Arabic Slang." Design Issues 24.2 (2008): 39-52.
Masmoudi, Abir, et al. "Transliteration of Arabizi into Arabic Script for Tunisian Dialect." ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19.2 (2019): 1-21.