Misspelled-KG-dataset

This dataset was prepared from the Kyrgyz_News_Corpus.

The following preprocessing was applied:

1. All symbols that are not in the Kyrgyz or Latin alphabets and are not digits were removed.
2. The various dash/hyphen variants were replaced with a single type of dash, the various quotation mark variants were replaced with a single type of quotation mark, and extra spaces were removed.
3. Long news articles were divided into lines so that mean(len) = 102.45 and std(len) = 56.72.
4. Rows in languages other than Kyrgyz were excluded.
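The cleaning steps above can be sketched roughly as follows. This is an illustration only, not the authors' actual code: the function name `clean_line`, the exact dash and quote code points, and the set of punctuation that is kept are all assumptions, since the README does not list them.

```python
import re

# Assumption: Kyrgyz Cyrillic = Russian alphabet plus the letters ң, ө, ү.
KYRGYZ_LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"
# Assumption: basic punctuation is kept; the exact set is not documented.
PUNCTUATION = " .,!?:;()\"-"

# Hypothetical lists of dash and quotation mark variants to unify.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015\u2212]")
QUOTES = re.compile("[\u00ab\u00bb\u201c\u201d\u201e\u201f]")
DISALLOWED = re.compile(
    "[^" + KYRGYZ_LETTERS + KYRGYZ_LETTERS.upper()
    + "A-Za-z0-9" + re.escape(PUNCTUATION) + "]"
)


def clean_line(text: str) -> str:
    """Normalize dashes, quotes, and spaces, then drop disallowed symbols."""
    text = re.sub(r"\s+", " ", text)   # tabs, newlines, NBSP -> plain space
    text = DASHES.sub("-", text)       # one dash type
    text = QUOTES.sub('"', text)       # one quotation mark type
    text = DISALLOWED.sub("", text)    # remove everything else
    return re.sub(r" {2,}", " ", text).strip()
```

Whitespace is normalized before filtering so that words separated by a tab or non-breaking space are not glued together when the disallowed characters are removed.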

Misspelled (trash) text was created using several approaches:

1. 1 million trash lines were generated using a probabilistic noiser. This noiser was trained on a "golden dataset" of real errors, which is not public.
2. 500 thousand trash lines were generated using a different probabilistic noiser.
3. The remaining trash lines were created using a random noiser which, for words longer than 5 letters, has a 20% probability of deleting a letter, swapping letters, replacing a letter with another letter, or inserting a letter.
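A random noiser of the kind described in the last item might look like the sketch below. It is not the authors' implementation. It assumes that the 20% probability applies once per word, that one of the four operations is then chosen uniformly, and that inserted or replacement letters come from the Kyrgyz alphabet; the names `noise_word` and `noise_line` are hypothetical.

```python
import random

# Assumption: replacement/insert letters are drawn from lowercase Kyrgyz letters.
KYRGYZ_LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"


def noise_word(word: str, rng: random.Random, p: float = 0.2) -> str:
    """Corrupt a word longer than 5 characters with probability p."""
    # Simplification: attached punctuation counts towards the word length.
    if len(word) <= 5 or rng.random() >= p:
        return word
    op = rng.choice(("delete", "swap", "replace", "insert"))
    i = rng.randrange(len(word) - 1)  # leaves room for a right-hand neighbour
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    letter = rng.choice(KYRGYZ_LETTERS)
    if op == "replace":
        return word[:i] + letter + word[i + 1:]
    return word[:i] + letter + word[i:]  # insert


def noise_line(line: str, seed=None, p: float = 0.2) -> str:
    """Apply noise_word independently to every space-separated word."""
    rng = random.Random(seed)
    return " ".join(noise_word(w, rng, p) for w in line.split(" "))
```

Passing a `seed` makes the corruption reproducible, which is convenient when regenerating the same noisy copy of a corpus.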

Text with punctuation errors (punc_trash) was created using a random noiser, which has a 20% probability of deleting or inserting a comma and of replacing the period at the end of a sentence with another punctuation mark, such as "!" or "?".
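One possible reading of this punctuation noiser is sketched below. The README does not say where commas may be inserted or whether the three events are independent, so this sketch assumes each existing comma is deleted with probability p, a comma is added after any other non-final word with probability p, and a final period is swapped with probability p. The function name `noise_punctuation` is hypothetical.

```python
import random


def noise_punctuation(line: str, rng: random.Random, p: float = 0.2) -> str:
    """Randomly delete/insert commas and replace a trailing period."""
    tokens = line.split(" ")
    out = []
    for idx, tok in enumerate(tokens):
        is_last = idx == len(tokens) - 1
        if tok.endswith(","):
            if rng.random() < p:
                tok = tok[:-1]          # delete an existing comma
        elif not is_last and tok and tok[-1].isalnum() and rng.random() < p:
            tok += ","                  # insert a spurious comma
        out.append(tok)
    result = " ".join(out)
    if result.endswith(".") and rng.random() < p:
        result = result[:-1] + rng.choice("!?")  # swap the final period
    return result
```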

The train and test sets were created with train_test_split, using a train size of 2 million:

Train size = 2000000
Test size = 66223
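These sizes imply 2,066,223 lines in total, with about 3.2% held out for testing. If the split was made with scikit-learn's `train_test_split`, an integer `train_size` is interpreted as an absolute number of samples. The standard-library sketch below mimics that behaviour on a scaled-down toy list; the function name `split_rows` and the seed are assumptions, not taken from the repository.

```python
import random


def split_rows(rows, train_size: int, seed: int = 42):
    """Shuffle rows and cut them into train/test by an absolute train size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Toy stand-in for the real corpus: 100 lines, 97 of them used for training.
rows = [f"line {i}" for i in range(100)]
train, test = split_rows(rows, train_size=97)

# Arithmetic for the real dataset sizes quoted above.
total = 2_000_000 + 66_223
test_fraction = 66_223 / total
```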

Dataset

Misspelled-KG-dataset can be downloaded from here.

References

This work was made possible by the robust AI community in Kyrgyzstan and by the contributions of individuals within the AkylAI project (by TheCramer.com). We also thank the Kyrgyz news agencies for their work, which allowed us to create this dataset.

License

The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
