Contains Adhola-English parallel sentences that can be used for Machine Translation.

Pogayo/ADH-EN_MT_Dataset

ADH - ENG LANGUAGE DATASET

About

This repository contains Dhopadhola and English sentences that can be used for Machine Translation. The text comes from several domains and was scraped from different sources online and in print media.

I created this dataset as part of my submission to the AI4D Language Dataset Challenge Round 2. My submission was not selected, but I have decided to make the data open source for anyone to use, as that was my initial goal and that of the challenge.

NLP, Machine Translation, Africa, Uganda

About Dataset

This dataset was created to provide Dhopadhola (ADH) to English parallel sentences, to help make services that rely on Natural Language Processing available to Dhopadhola speakers.

The dataset can be used for Machine Translation. It consists of 2484 parallel (Dhopadhola and English) sentences from different domains and 3386 monolingual Dhopadhola sentences. Both supervised and semi-supervised MT can make use of this dataset.
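As a rough illustration of how the parallel sentences might be prepared for supervised MT, here is a minimal Python sketch that pairs line-aligned sentences and splits them into train, dev and test sets. It assumes the Dhopadhola and English sentences have already been read into two line-aligned lists; that layout, the function names `pair_sentences` and `split_pairs`, and the split fractions are all hypothetical and not taken from this repo. The placeholder strings stand in for real sentences.

```python
import random


def pair_sentences(adh_lines, en_lines):
    """Pair line-aligned Dhopadhola and English sentences, dropping empty pairs."""
    if len(adh_lines) != len(en_lines):
        raise ValueError("parallel inputs must have the same number of lines")
    pairs = []
    for adh, en in zip(adh_lines, en_lines):
        adh, en = adh.strip(), en.strip()
        if adh and en:  # keep only pairs where both sides have text
            pairs.append((adh, en))
    return pairs


def split_pairs(pairs, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle deterministically, then split into (train, dev, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Placeholder data (assumption): replace with lines read from the dataset files.
adh_lines = [f"<Dhopadhola sentence {i}>" for i in range(10)]
en_lines = [f"<English sentence {i}>" for i in range(10)]

pairs = pair_sentences(adh_lines, en_lines)
train, dev, test = split_pairs(pairs)
print(len(train), len(dev), len(test))
```

The fixed seed keeps the split reproducible, so results on the dev and test sets stay comparable across runs. The monolingual sentences are left out here; a semi-supervised setup would handle them separately.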

The dataset can also be used to study transfer learning across related African languages, as Dhopadhola is closely related to Dholuo (spoken in Kenya and Tanzania), to Acholi, Lango and Alur (spoken in Uganda), and to other Luo languages.

Dhopadhola is a very low-resource language; very few resources are publicly available on the internet or even in print media. This dataset will help increase the availability of Dhopadhola in digital media: once the task it is intended for (Machine Translation) is implemented, more resources can be translated into the language, and native speakers will have more incentive to use it online, e.g. on social media, because non-speakers can get translations.

Dataset Composition

Get the most up-to-date information from [the datasheet](./Clean Language Data/Ogayo_documentation_2.pdf)

Repo Structure

This repo contains 3 main folders of interest.

1. Clean language data

Contains all the text combined from the different source files. Datasheets describing the data are also available here.

2. Raw data

Contains the sentences in their individual source files. The data is not entirely raw, as some cleaning has already been done. If you need the original webpage or document without any form of manipulation, let me know.

3. Notebooks

Jupyter Notebooks that I used to scrape and clean the data. They need some clean-up though.
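For readers who want a feel for what a basic cleaning pass over scraped sentences can look like, here is a minimal Python sketch. It is not the code from the notebooks in this repo; the function name `clean_sentences` and the specific steps (Unicode normalisation, whitespace collapsing, dropping empty lines and exact duplicates) are assumptions chosen for illustration.

```python
import unicodedata


def clean_sentences(sentences):
    """Normalise whitespace and Unicode, drop empty lines and exact duplicates."""
    cleaned = []
    seen = set()
    for sentence in sentences:
        text = unicodedata.normalize("NFC", sentence)  # consistent Unicode form
        text = " ".join(text.split())  # collapse runs of whitespace and trim
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)  # first occurrence wins, original order is kept
    return cleaned


# Placeholder input (assumption): stands in for sentences scraped from a source.
raw = ["  first   sentence ", "first sentence", "", "second sentence"]
print(clean_sentences(raw))  # ['first sentence', 'second sentence']
```

For parallel data, a pass like this would need to be applied to sentence pairs rather than to each side independently, so that the Dhopadhola and English lines stay aligned.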

Clone

  • Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Contributing

To get started...

Step 1

  • Option 1

    • 🍴 Fork this repo!
  • Option 2

    • 👯 Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Step 2

  • HACK AWAY! 🔨🔨🔨

Step 3

  • 🔃 Create a new pull request

Team

Perez Ogayo

  • We are a small team. Join us and let's put Africa on the NLP Map together!

Support me

I am in the process of setting up a wallet. Feel free to reach out to me so that I can give you other payment details in the meantime.


License

This work is licensed under a Creative Commons Attribution 4.0 International License.
