Hi,
This is a parallel corpus of slang sentences (sentences that may contain slang words) and formal sentences (sentences that contain only formal words) in Indonesian. The dataset consists of 4,910 parallel sentence pairs and was used by Kurnia and Yulianti (2020) for lexical/text normalization using statistical machine translation. The sentences come from Instagram posts that were collected in previous research (Salsabila et al., 2018) to build an Indonesian colloquial lexicon. In this dataset, the --- symbol is used as a separator between parallel sentence pairs, and the ~~~ symbol is used as a separator between a slang sentence and its corresponding formal sentence.
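As a rough illustration of the separator scheme described above, the following Python sketch splits raw text into (slang, formal) pairs. It is not part of the original release. The function name parse_parallel_corpus and the sample sentences are made up for this example, and the code assumes that neither --- nor ~~~ occurs inside a sentence. It does not depend on whether the separators sit on their own lines or inline.

```python
from typing import List, Tuple

PAIR_SEPARATOR = "---"   # separates one sentence pair from the next
SIDE_SEPARATOR = "~~~"   # separates the slang side from the formal side


def parse_parallel_corpus(text: str) -> List[Tuple[str, str]]:
    """Return a list of (slang, formal) tuples parsed from raw corpus text.

    Hypothetical helper: assumes the separators never appear inside a sentence.
    """
    pairs = []
    for chunk in text.split(PAIR_SEPARATOR):
        chunk = chunk.strip()
        # Skip empty chunks (e.g. after a trailing separator) and
        # chunks that do not contain both sides of a pair.
        if not chunk or SIDE_SEPARATOR not in chunk:
            continue
        slang, formal = chunk.split(SIDE_SEPARATOR, 1)
        pairs.append((slang.strip(), formal.strip()))
    return pairs


# Made-up sample text for illustration only; these are not dataset entries.
sample = (
    "gw lg makan\n~~~\nsaya sedang makan\n---\n"
    "makasih ya ~~~ terima kasih ya\n---\n"
)

for slang, formal in parse_parallel_corpus(sample):
    print(f"{slang} -> {formal}")
```

To use this on the actual data, read the dataset file as UTF-8 text and pass its contents to the function. If the parsing assumptions hold, the returned list should contain 4,910 pairs.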
Please cite the following paper if you use this dataset:
@inproceedings{kurnia2020statistical,
title={Statistical Machine Translation Approach for Lexical Normalization on Indonesian Text},
author={Kurnia, Ajmal and Yulianti, Evi},
booktitle={2020 International Conference on Asian Language Processing (IALP)},
pages={288--293},
year={2020},
organization={IEEE}
}
If you have any questions regarding this dataset, you may contact ajmal.kurnia@ui.ac.id.
Thank you!