This project produces a deployable tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

DiboraHaile/Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark

 
 

Speech-to-text data collection with Kafka, Airflow, and Spark

Explore more

dataset on Github · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgements

About The Project

The purpose of this week’s challenge is to build a data engineering pipeline that can record millions of Amharic and Swahili speakers reading digital texts on app and web platforms. Several large text corpora will be used, but for testing the backend you can use the recently released Amharic news text classification dataset with baseline performance.

Commonly used resources that we find helpful are listed in the acknowledgements.
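
As a rough illustration of the "suitable format" the pipeline aims for, the sketch below pairs one prompt text with the recording of a speaker reading it and serializes the pair as a JSON line. The `Utterance` class, its field names, and the duration limits are assumptions made for this example; they are not taken from the project's code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Utterance:
    """One text prompt and the recording of a speaker reading it (hypothetical schema)."""
    text_id: str
    text: str
    language: str      # e.g. "am" for Amharic or "sw" for Swahili
    audio_key: str     # where the recording lives in the data lake
    duration_s: float  # length of the recording in seconds


def is_usable(u: Utterance, min_s: float = 1.0, max_s: float = 30.0) -> bool:
    """Keep only non-empty prompts whose recording length falls within the assumed limits."""
    return bool(u.text.strip()) and min_s <= u.duration_s <= max_s


def to_manifest_line(u: Utterance) -> str:
    """Serialize one utterance as a JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(u), ensure_ascii=False)
```

A file of such lines, one per usable utterance, is one possible shape for the data loaded into the warehouse; the real pipeline may use a different layout.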

Built With

Resources used in this project:

  • Boto3
  • python kafka
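
The following is a minimal sketch of how these two libraries might fit together: a text prompt is published to a Kafka topic, and a received recording is written to an S3 bucket acting as the data lake. The function names, the object key layout, and the message fields are hypothetical. The clients are passed in as arguments, so the sketch assumes `producer` behaves like `kafka.KafkaProducer` and `s3_client` behaves like `boto3.client("s3")`.

```python
import json
from typing import Any, Dict


def audio_object_key(language: str, text_id: str, speaker_id: str) -> str:
    """Build a data-lake object key for one recording (hypothetical layout)."""
    return f"audio/{language}/{text_id}/{speaker_id}.wav"


def encode_prompt(text_id: str, text: str, language: str) -> bytes:
    """Serialize a text prompt as UTF-8 JSON, keeping non-ASCII characters intact."""
    payload: Dict[str, Any] = {"text_id": text_id, "text": text, "language": language}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def publish_prompt(producer: Any, topic: str, text_id: str, text: str, language: str) -> None:
    """Send one prompt to a Kafka topic and wait until it has been handed to the broker."""
    producer.send(topic, value=encode_prompt(text_id, text, language))
    producer.flush()


def store_recording(s3_client: Any, bucket: str, key: str, audio_bytes: bytes) -> None:
    """Write the raw audio bytes into the data lake bucket under the given key."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=audio_bytes)
```

Passing the clients in keeps the functions easy to exercise with stand-in objects. In a real deployment the broker address, topic name, and bucket name would come from the project's own configuration.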

Getting Started

To get a local copy up and running, follow these steps.

Installation

  1. Clone the repo
    git clone https://github.com/week9-Benkart/Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark.git
  2. Install the package using setup.py

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contributors

Dibora (team lead)
Toyin (deputy team lead)
Elias Andualem
Abreham Gessesse
Euel Fantaye
Yosef Engdawork
Michael Darko Ahwireng
Mubarak Sani

Project Link: https://github.com/week9-Benkart/Speech-to-text-data-collection-with-Kafka-Airflow-and-Spark.git

Acknowledgements
