Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark
This project aims to produce a deployable tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
The purpose of this week's challenge is to build a data engineering pipeline that can record millions of Amharic and Swahili speakers reading digital texts on app and web platforms. We will use a number of large text corpora, but for testing the backend you can use the recently released Amharic news text classification dataset with baseline performance.
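As a rough illustration of how a text corpus might be turned into short reading prompts for speakers, the sketch below splits raw text into sentences. The function name `split_into_prompts`, the word-count limits, and the choice of sentence terminators (the Ethiopic full stop `።` plus `.`, `!`, `?`) are assumptions made for this example, not part of the project's code.

```python
import re

# Assumption: a sentence ends at the Ethiopic full stop (U+1362) or at . ! ?
_SENTENCE = re.compile(r"[^\u1362.!?]+[\u1362.!?]?")


def split_into_prompts(text, min_words=1, max_words=30):
    """Split raw corpus text into sentence-sized prompts (hypothetical helper)."""
    prompts = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        n_words = len(sentence.split())
        # Skip empty fragments and sentences outside the allowed length range.
        if sentence and min_words <= n_words <= max_words:
            prompts.append(sentence)
    return prompts
```

Capping the prompt length keeps each recording short; the limit of 30 words is an arbitrary placeholder that would need tuning against real data.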
A list of commonly used resources that we find helpful is given in the acknowledgements.
Resources used in this project:
- Boto3
- python kafka
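The sketch below shows one way these two libraries might fit together: text prompts are serialized and sent to a Kafka topic, and audio recordings are uploaded to an S3 bucket acting as the data lake. The function names, the JSON field names, and the `audio/<id>.wav` key layout are hypothetical and are not taken from this repository. The helpers accept any object that behaves like `kafka.KafkaProducer` (assuming "python kafka" refers to the kafka-python package) or like a Boto3 S3 client, so they can be tried without a running broker.

```python
import json


def build_text_message(text_id, text):
    """Serialize a text prompt to UTF-8 JSON bytes (field names are hypothetical)."""
    payload = {"id": text_id, "text": text}
    # ensure_ascii=False keeps Amharic characters readable in the payload.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def publish_text(producer, topic, text_id, text):
    """Send one prompt; `producer` should behave like kafka.KafkaProducer."""
    return producer.send(topic, value=build_text_message(text_id, text))


def upload_audio(s3_client, local_path, bucket, text_id):
    """Upload a recording; `s3_client` should behave like boto3.client("s3")."""
    key = f"audio/{text_id}.wav"  # hypothetical key layout
    s3_client.upload_file(local_path, bucket, key)
    return key
```

With the real libraries installed, the clients could be created with `KafkaProducer(bootstrap_servers="localhost:9092")` and `boto3.client("s3")`; the broker address here is a placeholder.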
To get a local copy up and running, follow these steps.
- Clone the repo
git clone https://github.com/week9-Benkart/Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark.git
- Install the package using setup.py
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Dibora (team lead)
Toyin (deputy team lead)
Elias Andualem
Abreham Gessesse
Euel Fantaye
Yosef Engdawork
Michael Darko Ahwireng
Mubarak Sani
Project Link: https://github.com/week9-Benkart/Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark.git