This project processes the MASSIVE dataset. It generates language-specific Excel files, such as en-xx.xlsx, for multiple languages; creates separate JSONL files for English (en), Swahili (sw), and German (de), each split into test, train, and dev data; and produces a single JSON file mapping English training utterances (by id and utt) to their translations in all other languages. The project avoids recursive algorithms to sidestep potential memory and time-complexity issues on a dataset of this size.
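The per-language splitting and the training-set translations map described above can be sketched roughly as follows. The field names `id`, `locale`, `partition`, and `utt` mirror the MASSIVE record schema; the helper function names and the tiny inline sample are illustrative assumptions, not the project's actual code:

```python
import json
from collections import defaultdict

# Minimal records mimicking the MASSIVE schema (sample data is invented).
records = [
    {"id": "1", "locale": "en-US", "partition": "train", "utt": "wake me at five"},
    {"id": "1", "locale": "sw-KE", "partition": "train", "utt": "niamshe saa kumi na moja"},
    {"id": "2", "locale": "en-US", "partition": "test", "utt": "what is the weather"},
]

def split_by_partition(rows):
    """Group one locale's records into test/train/dev buckets
    (each bucket can then be written out as its own JSONL file)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["partition"]].append(row)
    return buckets

def build_train_translations(rows):
    """Map each English training utterance, keyed by id, to the
    matching training utterances in every other locale."""
    en_train = {r["id"]: r["utt"] for r in rows
                if r["locale"].startswith("en") and r["partition"] == "train"}
    others = defaultdict(dict)
    for r in rows:
        if (r["id"] in en_train and r["partition"] == "train"
                and not r["locale"].startswith("en")):
            others[r["id"]][r["locale"]] = r["utt"]
    return {rid: {"en": en_train[rid], **locs} for rid, locs in others.items()}

buckets = split_by_partition([r for r in records if r["locale"].startswith("en")])
print(json.dumps(build_train_translations(records), ensure_ascii=False))
```

Iterating with plain loops and dictionaries, as above, keeps memory and stack usage flat regardless of dataset size, which is the point of avoiding recursion here.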
- Python version 3.11.5
- PyCharm version 2023.2.1
- Clone the repository on your local machine:
git clone https://github.com/mikemwai/massive.git
- Navigate to the project directory and create a virtual environment from the command line (use `py` in place of `python` on Windows if `python` is not on your PATH):
python -m venv myenv
- Activate your virtual environment:
- On Windows:
myenv\Scripts\activate
- On macOS/Linux:
source myenv/bin/activate
- Install project dependencies on your virtual environment:
pip install -r requirements.txt
- Extract the dataset archive into the project folder, using WinRAR or any other archive tool.
Run the project from the IDE terminal, passing the tasks to perform:
python main.py generate_excel_files separate_files train_translations
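Internally, `main.py` presumably dispatches each task name given on the command line to a matching routine. A minimal sketch of such a dispatcher, assuming hypothetical task functions (the real project's internals may differ):

```python
import argparse

# Placeholder task bodies; the real implementations would do the actual work.
def generate_excel_files():
    print("generating en-xx.xlsx files")

def separate_files():
    print("writing per-language JSONL test/train/dev splits")

def train_translations():
    print("building the training-set translations JSON")

# Map command-line task names to their handlers.
TASKS = {
    "generate_excel_files": generate_excel_files,
    "separate_files": separate_files,
    "train_translations": train_translations,
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="MASSIVE dataset processing")
    parser.add_argument("tasks", nargs="+", choices=TASKS,
                        help="one or more tasks to run, in order")
    args = parser.parse_args(argv)
    for name in args.tasks:
        TASKS[name]()

if __name__ == "__main__":
    main()
```

With this shape, `python main.py generate_excel_files separate_files train_translations` runs all three tasks in order, and any unknown task name is rejected by `argparse` with a usage message.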
If you'd like to contribute to this project:
- Please fork the repository
- Create a new branch for your changes
- Submit a pull request
Contributions, bug reports, and feature requests are welcome!
If you run into any problems with the project, feel free to open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.