ADH - ENG LANGUAGE DATASET

About

This repository contains Dhopadhola and English sentences that can be used for Machine Translation. The text comes from several domains and was scraped from different sources online and in print media.

I did this as part of my submission for the AI4D Language Dataset Challenge Round 2. My submission was not selected, but I have decided to make the data open source for anyone to use, as that was my initial goal and that of the challenge.

NLP, Machine Translation, Africa, Uganda

About Dataset

This dataset was created to provide Dhopadhola (ADH) to English parallel sentences, to help make services that rely on Natural Language Processing available to Dhopadhola speakers.

The dataset can be used for Machine Translation purposes. It consists of 2484 parallel (Dhopadhola and English) sentences from different domains and 3386 monolingual Dhopadhola sentences. Both Supervised and Semi-supervised MT can utilise this dataset.
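As a rough illustration of preparing the parallel sentences for supervised MT, here is a minimal sketch that pairs line-aligned sentences and splits them into train/dev/test sets. It assumes the parallel data has been read into two lists of equal length, one sentence per entry; the function names, split fractions and placeholder strings are hypothetical and are not taken from this repo. Check the datasheet for the actual file layout.

```python
import random


def pair_sentences(adh_lines, en_lines):
    """Zip line-aligned Dhopadhola and English sentences into (adh, en) pairs."""
    if len(adh_lines) != len(en_lines):
        raise ValueError("parallel inputs must have the same number of lines")
    pairs = []
    for adh, en in zip(adh_lines, en_lines):
        adh, en = adh.strip(), en.strip()
        if adh and en:  # skip pairs where either side is empty
            pairs.append((adh, en))
    return pairs


def split_pairs(pairs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle pairs reproducibly and split them into train/dev/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder strings stand in for real sentences (assumption, not real data).
    adh = [f"adh sentence {i}" for i in range(10)]
    en = [f"en sentence {i}" for i in range(10)]
    train, dev, test = split_pairs(pair_sentences(adh, en))
    print(len(train), len(dev), len(test))
```

Fixing the seed keeps the split reproducible, which matters with a dataset this small because different splits can give noticeably different evaluation scores.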

The dataset can also be used to study transfer learning in related African languages, as Dhopadhola is closely related to Dholuo (spoken in Kenya and Tanzania), Acholi, Lango and Alur (spoken in Uganda), and other Luo languages.

Dhopadhola is a very low-resource language: few resources are publicly available on the internet or even in print media. This dataset will help increase the availability of Dhopadhola in digital media. Once the task it is intended for (Machine Translation) is implemented, more resources can be translated into the language, and native speakers will have more incentive to use it online, e.g. on social media, because non-speakers can get translations.

Dataset Composition

For the most up-to-date information, see [the datasheet](./Clean%20Language%20Data/Ogayo_documentation_2.pdf).

Repo Structure

This repo contains 3 main folders of interest.

1. Clean language data

Contains all the text combined from the different source files. Datasheets describing the data are also included.

2. Raw data

Contains the sentences in their individual source files. The data is not completely raw, as some cleaning has already been done. If you need the original webpage or document without any form of manipulation, let me know.

3. Notebooks

Jupyter Notebooks that I used to scrape and clean the data. They need some clean-up though.
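To give an idea of how per-source files might be combined into a single clean collection, here is a minimal sketch. It is not the code from the notebooks in this repo: the function names, the source labels, and the choice to normalise whitespace and drop exact duplicates are all assumptions made for illustration.

```python
import re


def normalize(sentence):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", sentence).strip()


def combine_sources(sources):
    """Merge sentence lists from several sources into one list.

    `sources` maps a source label to a list of raw sentences. Blank lines
    and exact duplicates (after whitespace normalisation) are dropped, and
    the first occurrence of each sentence is kept so the order is stable.
    """
    seen = set()
    combined = []
    for name, sentences in sources.items():
        for raw in sentences:
            clean = normalize(raw)
            if clean and clean not in seen:
                seen.add(clean)
                combined.append((name, clean))
    return combined


if __name__ == "__main__":
    # Hypothetical source labels and placeholder sentences.
    sources = {
        "web": ["sentence  one", " sentence two "],
        "print": ["sentence one", "", "sentence three"],
    }
    for name, sentence in combine_sources(sources):
        print(name, sentence, sep="\t")
```

Keeping the source label next to each sentence makes it possible to trace a sentence in the clean data back to its raw file.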

Clone

Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Contributing

To get started...

Step 1

Option 1
- 🍴 Fork this repo!
Option 2
- 👯 Clone this repo to your local machine using https://github.com/Pogayo/ADH-EN_MT_Dataset

Step 2

HACK AWAY! 🔨🔨🔨

Step 3

🔃 Create a new pull request

Team

Perez Ogayo

We are a small team. Join us and let's put Africa on the NLP Map together!

Support me

I am in the process of setting up a wallet. Feel free to reach out to me so that I can give you other payment details in the meantime.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

