
AASHISHAG/archimob-swissgerman-deepspeech-importer

 
 


ArchiMob Corpus Preprocessing for Mozilla DeepSpeech

The importer pre-processes the audio and text data so that they can be used with the open-source speech-to-text engine DeepSpeech. This repository is forked from tobiasrordorf/archimob-swissgerman-deepspeech-importer. Please reach out to Tobias Rordorf for specifics.

I have edited the README and a few files so that the importer can be used directly with the deepspeech-swiss-german repository.


How to access

To obtain access to the corpus, contact Dr. Samardžić. More details can be found at The ArchiMob Corpus.

ArchiMob Corpus

The ArchiMob corpus represents German linguistic varieties spoken within the territory of Switzerland. This corpus is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing.

This corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Size: ~70 hours

Pre-processing Steps

  • Once you have acquired the audio files as described above, create a folder called 'audio' and place the files in it.
$ git clone https://github.com/AASHISHAG/archimob-swissgerman-deepspeech-importer.git
$ cd archimob-swissgerman-deepspeech-importer
$ mkdir audio   # then move the audio files into this folder
$ python3 Archimob_DeepSpeech_Importer.py
  • The transcriptions, filenames and file sizes are merged, and files smaller than 10,000 bytes or larger than 318,400 bytes are dropped.

  • The merged transcripts are then cleaned of unwanted characters (e.g. semicolons, commas, etc.).

  • The final transcripts are split into train (75%), dev (15%) and test (10%) files and stored in:

Final_Training_CSV_for_Deepspeech
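
The three steps above (size filter, character cleaning, 75/15/10 split) can be sketched roughly as follows. This is a minimal illustration, not the code in Archimob_DeepSpeech_Importer.py: the function names, the exact set of removed characters, the shuffling with a fixed seed and the inclusive size bounds are all assumptions. The column names wav_filename, wav_filesize and transcript follow the CSV layout that DeepSpeech expects for training data.

```python
import random
import re

# Size limits taken from the README; treating them as inclusive is an assumption.
MIN_BYTES = 10_000
MAX_BYTES = 318_400

# Hypothetical set of unwanted characters; the importer's real list may differ.
UNWANTED = re.compile(r"[;,.:!?]")


def filter_by_size(rows):
    """Drop rows whose audio file is smaller than MIN_BYTES or larger than MAX_BYTES."""
    return [r for r in rows if MIN_BYTES <= r["wav_filesize"] <= MAX_BYTES]


def clean_transcript(text):
    """Replace unwanted characters with spaces, then collapse repeated whitespace."""
    text = UNWANTED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_rows(rows, seed=0):
    """Shuffle and split rows into train (75%), dev (15%) and test (remainder)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = len(rows) * 75 // 100
    n_dev = len(rows) * 15 // 100
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Made-up rows purely for illustration.
    rows = [
        {"wav_filename": f"audio/{i}.wav", "wav_filesize": 20_000 + i,
         "transcript": "grüezi, mitenand;"}
        for i in range(100)
    ]
    rows = filter_by_size(rows)
    for r in rows:
        r["transcript"] = clean_transcript(r["transcript"])
    train, dev, test = split_rows(rows)
    print(len(train), len(dev), len(test))
```

Each of the three resulting lists could then be written out with csv.DictWriter as a separate train, dev and test CSV file.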

Why be SHY to STAR the repository if you use the resources? :D
