This project processes the MASSIVE dataset. It generates language-specific Excel files, such as en-xx.xlsx, for multiple languages; creates separate JSONL files for English (en), Swahili (sw), and German (de), each split into test, train, and dev data; and produces a single JSON file mapping English training utterances (by id and utt) to their translations in all other languages. The project avoids recursive algorithms to sidestep potential memory and time-complexity issues on a dataset of this size.
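The per-language splitting and the training-set translations map described above can be sketched roughly as follows. The field names `id`, `locale`, `partition`, and `utt` mirror the MASSIVE record schema; the helper function names and the tiny inline sample are illustrative assumptions, not the project's actual code:

```python
import json
from collections import defaultdict

# Minimal records mimicking the MASSIVE schema (sample data is invented).
records = [
    {"id": "1", "locale": "en-US", "partition": "train", "utt": "wake me at five"},
    {"id": "1", "locale": "sw-KE", "partition": "train", "utt": "niamshe saa kumi na moja"},
    {"id": "2", "locale": "en-US", "partition": "test", "utt": "what is the weather"},
]

def split_by_partition(rows):
    """Group one locale's records into test/train/dev buckets
    (each bucket can then be written out as its own JSONL file)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["partition"]].append(row)
    return buckets

def build_train_translations(rows):
    """Map each English training utterance, keyed by id, to the
    matching training utterances in every other locale."""
    en_train = {r["id"]: r["utt"] for r in rows
                if r["locale"].startswith("en") and r["partition"] == "train"}
    others = defaultdict(dict)
    for r in rows:
        if (r["id"] in en_train and r["partition"] == "train"
                and not r["locale"].startswith("en")):
            others[r["id"]][r["locale"]] = r["utt"]
    return {rid: {"en": en_train[rid], **locs} for rid, locs in others.items()}

buckets = split_by_partition([r for r in records if r["locale"].startswith("en")])
print(json.dumps(build_train_translations(records), ensure_ascii=False))
```

Iterating with plain loops and dictionaries, as above, keeps memory and stack usage flat regardless of dataset size, which is the point of avoiding recursion here.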
- Python version 3.11.5
- PyCharm version 2023.2.1
- Clone the repository on your local machine:
git clone https://github.com/mikemwai/massive.git
- Navigate to the project directory and create a virtual environment from the command line (use `py` in place of `python` on Windows if `python` is not on your PATH):
python -m venv myenv
- Activate your virtual environment:
- On Windows:
myenv\Scripts\activate
- On macOS/Linux:
source myenv/bin/activate
- Install project dependencies on your virtual environment:
pip install -r requirements.txt
- Extract the dataset archive into the project folder, using WinRAR or any other archive tool.
Run the project from the IDE terminal, passing the tasks to perform:
python main.py generate_excel_files separate_files train_translations
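Internally, `main.py` presumably dispatches each task name given on the command line to a matching routine. A minimal sketch of such a dispatcher, assuming hypothetical task functions (the real project's internals may differ):

```python
import argparse

# Placeholder task bodies; the real implementations would do the actual work.
def generate_excel_files():
    print("generating en-xx.xlsx files")

def separate_files():
    print("writing per-language JSONL test/train/dev splits")

def train_translations():
    print("building the training-set translations JSON")

# Map command-line task names to their handlers.
TASKS = {
    "generate_excel_files": generate_excel_files,
    "separate_files": separate_files,
    "train_translations": train_translations,
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="MASSIVE dataset processing")
    parser.add_argument("tasks", nargs="+", choices=TASKS,
                        help="one or more tasks to run, in order")
    args = parser.parse_args(argv)
    for name in args.tasks:
        TASKS[name]()

if __name__ == "__main__":
    main()
```

With this shape, `python main.py generate_excel_files separate_files train_translations` runs all three tasks in order, and any unknown task name is rejected by `argparse` with a usage message.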
If you'd like to contribute to this project:
- Please fork the repository
- Create a new branch for your changes
- Submit a pull request
Contributions, bug reports, and feature requests are welcome!
If you run into any problems with the project, feel free to open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.