Speech-to-Text Data Collection

Table of contents

Introduction
Pipeline
Architecture
Project Structure
Installation

Introduction

The purpose of this project is to build a data engineering pipeline for recording millions of Amharic and Swahili speakers as they read digital texts on in-app and web platforms. The text source is the "Amharic news text classification dataset with baseline performance". The aim is to produce a deployable tool that posts text and audio files to a data lake and retrieves them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Pipeline

The pipeline for this project is used to record millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.
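
As a rough illustration of the transform step mentioned in the Introduction, the sketch below pairs each uploaded recording with the text that was read and drops unusable entries. It is a minimal in-memory example, not the project's actual code: the names `Recording`, `TrainingRow`, and `transform`, the sample data, and the use of plain Python containers in place of a real data lake and warehouse are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Recording:
    """Hypothetical record for one uploaded audio clip."""
    text_id: str   # id of the text the speaker was asked to read
    audio: bytes   # raw audio bytes as uploaded


@dataclass
class TrainingRow:
    """Hypothetical warehouse row pairing a transcript with its audio size."""
    text_id: str
    transcript: str
    audio_num_bytes: int


def transform(texts: Dict[str, str], recordings: List[Recording]) -> List[TrainingRow]:
    """Pair each non-empty recording with the text that was read."""
    rows = []
    for rec in recordings:
        transcript = texts.get(rec.text_id)
        if transcript is None or not rec.audio:
            continue  # skip clips with no matching text or with empty audio
        rows.append(TrainingRow(rec.text_id, transcript.strip(), len(rec.audio)))
    return rows


# Stand-ins for data-lake contents (assumed sample data).
texts = {"t1": " ሰላም ", "t2": "habari"}
recordings = [
    Recording("t1", b"\x00\x01"),  # valid clip
    Recording("t2", b""),          # empty audio, dropped
    Recording("t3", b"\x00"),      # no matching text, dropped
]

# Stand-in for the warehouse load step.
warehouse = transform(texts, recordings)
```

In a distributed setting the same pairing and filtering logic would run per partition of recordings; the plain list used here only keeps the example self-contained.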

Project Structure

The repository contains Python scripts, Jupyter notebooks, and text files.

Installation

git clone https://github.com/STT-Data-Engineering/Speech_to_text

Contributors

Selam Ayehubirhan
Henok Tilaye
Biruk Getaneh
Kevin Shyaka
Tadesse Kebede
Diye Mark
Danayt Bulom

Made with contrib.rocks
