The purpose of this project is to build a data engineering pipeline that can record millions of Amharic and Swahili speakers reading digital texts in-app and on web platforms. The project uses the Amharic News Text Classification Dataset with Baseline Performance as its text source. The aim is to produce a deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
This is the pipeline for the project. It will be used to record millions of Amharic and Swahili speakers reading digital texts in-app and on web platforms.
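The flow described above (post text, receive audio, transform, load) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `DataLake` class, the `Recording` type, and the `transform` and `load` functions are hypothetical in-memory stand-ins, not the repository's actual components, and a real deployment would use distributed storage and processing rather than Python dictionaries and lists.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    """One text prompt paired with the audio a speaker recorded for it."""
    prompt_id: str
    text: str
    audio: bytes


class DataLake:
    """Hypothetical in-memory stand-in for the raw data lake."""

    def __init__(self):
        self._prompts = {}  # prompt_id -> text shown to the speaker
        self._audio = {}    # prompt_id -> raw audio bytes sent back

    def post_text(self, prompt_id, text):
        self._prompts[prompt_id] = text

    def receive_audio(self, prompt_id, audio):
        self._audio[prompt_id] = audio

    def paired_recordings(self):
        # Only prompts that already have audio are ready for transformation.
        for prompt_id, audio in self._audio.items():
            if prompt_id in self._prompts:
                yield Recording(prompt_id, self._prompts[prompt_id], audio)


def transform(recording):
    """Placeholder transformation: normalise whitespace, record audio size."""
    return {
        "prompt_id": recording.prompt_id,
        "text": " ".join(recording.text.split()),
        "audio_num_bytes": len(recording.audio),
    }


def load(rows, warehouse):
    """Append transformed rows to the warehouse (here a plain list)."""
    warehouse.extend(rows)
    return len(warehouse)


lake = DataLake()
lake.post_text("p1", "  ሰላም   ለዓለም ")
lake.post_text("p2", "Habari ya dunia")          # no audio received yet
lake.receive_audio("p1", b"\x00\x01\x02\x03")

warehouse = []
load((transform(r) for r in lake.paired_recordings()), warehouse)
```

Only the prompt that has matching audio reaches the warehouse, which mirrors the idea that a training example for a speech-to-text model needs both the text and its recording.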
The repository contains several kinds of files, including Python scripts, Jupyter notebooks, and text files.
git clone https://github.com/STT-Data-Engineering/Speech_to_text