Speech-to-Text Data Collection



Introduction

The purpose of this project is to build a data engineering pipeline for recording millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms. The text prompts come from the Amharic news text classification dataset with baseline performance. The aim is to produce a deployable tool that posts text and audio files to a data lake and retrieves them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Pipeline

The diagram shows the pipeline used to record millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.
[Figure: Speech-to-text data collection pipeline]
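The stages named in the introduction (post text and audio to a data lake, transform, load into a warehouse) can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the `Recording` type, the function names, and the use of a plain dict and list as stand-ins for the data lake and warehouse are hypothetical and are not taken from the repository's code.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    """Hypothetical record pairing a text prompt with a speaker's audio."""
    text_id: str
    text: str
    audio: bytes  # raw audio bytes uploaded by a speaker


def post_to_lake(lake: dict, rec: Recording) -> None:
    """Store the raw record under its text id (dict stands in for a data lake)."""
    lake[rec.text_id] = rec


def transform(rec: Recording) -> dict:
    """Normalize whitespace in the prompt and attach simple metadata."""
    clean_text = " ".join(rec.text.split())
    return {
        "text_id": rec.text_id,
        "text": clean_text,
        "audio_bytes": len(rec.audio),
    }


def load_to_warehouse(lake: dict, warehouse: list) -> None:
    """Transform every record that has both text and audio, then append it."""
    for rec in lake.values():
        if rec.audio and rec.text.strip():
            warehouse.append(transform(rec))


lake, warehouse = {}, []
post_to_lake(lake, Recording("t1", "  ሰላም   ዓለም ", b"\x00\x01\x02"))
post_to_lake(lake, Recording("t2", "habari ya dunia", b""))  # no audio, skipped
load_to_warehouse(lake, warehouse)
print(warehouse)
```

In a real deployment each stand-in would be replaced by the corresponding service, and the transform step would run on a distributed engine rather than in a single loop.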

Project Structure

The repository contains Python scripts, Jupyter notebooks, and text files.

Installation

git clone https://github.com/STT-Data-Engineering/Speech_to_text

Contributors

contributors list

Made with contrib.rocks