Skip to content

Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework

License

Notifications You must be signed in to change notification settings

silenterus/deepspeech-cleaner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deepspeech-cleaner

Multi-Language Dataset Cleaner/Combiner for Mozilla's DeepSpeech Framework

makes the whole process of collecting,cleaning and sorting datasets alot easier

Supported Languages

Supported but not enough Datasets

hu - Hungarian
nn - Norwegian
ro - Romanian
sv - Swedish
tr - Turkish
cs - Czech
da - Danish
fi - Finnish
et - Estonian
el - Greek
is - Icelandic
lv - Latvian
lt - Lithuanian
sr - Serbian
sk - Slovak
sl - Slovenian
sq - Albanian
bs - Bosnian
bg - Bulgarian
hr - Croatian

Installation :

install kenlm
install DeepSpeech
git clone https://github.com/silenterus/deepspeech-cleaner
cd deepspeech-cleaner
pip install -r requirements.txt

Quick Start

Downloader

download/analyze/insert all available corpora for french
python3 deepspeech-cleaner.py download --lang fr

Inserter

insert corpora - in case you download the files by yourself
python3 deepspeech-cleaner.py insert /path/to/corpora/ 

Creator

clean/sort/create all necessary files for training - includes lm.binary/trie if kenlm is installed
python3 deepspeech-cleaner.py create 

clean/sort/create all necessary files for training - no cleaning and no lm.binary/trie creation
python3 deepspeech-cleaner.py create --noclean --notrie

start deepspeech training
bash languages/fr/training/standard/start_train.sh

Other Options

Wiki Crawler

download/extract/clean articles from Wiki Dumps
python3 deepspeech-cleaner.py crawl 

Replacement Tester

Test num2words and your replacement rules
python3 deepspeech-cleaner.py test 1 2 3 is not for me 
python3 deepspeech-cleaner.py test /path/to/textfile.txt

Audio Transformer

convert/trimm/trimmsilence all audio files in your Database
python3 deepspeech-cleaner.py convert 

Autosave

all arguments are saved for each language seperately
autosave off/on
python3 deepspeech-cleaner.py autosave

Hints

change replacer rules in

languages/fr/replacer/..
files contain rules like
'@> '
' Sat > Saturday '
  • only files with a number attached will be used
  • <0 used before number translation
  • =>0 used after number translation
  • replace a word/symbol with '�' and the whole sentence get rejected
  • spaces at the start/end are important for whole words

change the string based sql querys in

languages/fr/sql_query/..
  • files are named like the tables in your "audio.db"
  • '!' at the end of a line functions as NOT

Help

python3 deepspeech-cleaner.py help

Datasets

coming soon

Worth checking out

Results

German Results: [de]

----- options:
<---< size [5-10000]
<---< duration [0.5-15]
<---< bitrate [0]
<---< samplerate [16000-48000]
<---< channels [0]
<---< wordcount [0]
<---< wordsec [0.2-2.0]
<---< lettercount [0]
<---< lettersec [0]
<---< upvotes [0]
<---< downvotes [0]
<---< sectors [0]
<---< wordlength [0]
<---< numbers [False]
<---< upper [False]
<---< lower [0]
----- info:
>---> corpora [forscher-tuda-vox16-zamia-custom-tatoeba-librivox-cv]
>---> gb [57.21]
>---> hours [494.2]
>---> words [3683014]
>---> letters [23751976]
>---> words per sec [2.07]
>---> letters per sec [13.35]
>---> all files [339232]
>---> train files [237463]
>---> test files [50886]
>---> dev files [50886]

Test - WER: 0.098498, CER: 3.228931, loss: 23.721140

WER: 3.500000, CER: 37.000000, loss: 326.320953

src: “eine neue”
res: “einem neuen leben und neuen pflichten entgegen”

WER: 3.000000, CER: 6.000000, loss: 7.963222

src: “ausverkauft”
res: “aus der fast”

WER: 3.000000, CER: 5.000000, loss: 11.577581

src: “riesengebirge”
res: “riesen der berge”

WER: 3.000000, CER: 6.000000, loss: 11.873451

src: “beerdigung”
res: “wer die un”

WER: 3.000000, CER: 8.000000, loss: 17.944910

src: “besuchstermin”
res: “es wuchs der”

WER: 3.000000, CER: 6.000000, loss: 22.410923

src: “beerdigung”
res: “wer die un”

WER: 3.000000, CER: 4.000000, loss: 25.310646

src: “weitermachen”
res: “bei der machen”

WER: 3.000000, CER: 34.000000, loss: 237.857559

src: “misses dent”
res: “es ist mein wunsch vergessen vernachlässigt”

WER: 3.000000, CER: 74.000000, loss: 484.282074

src: “es endigte mit einem”
res: “es endigte mit einem lauten schall welcher in jedem einsamen zimmer in echo zu wecken schienen”

WER: 2.800000, CER: 69.000000, loss: 650.892578

src: “computer alarm in neun minuten”
res: “per definition handelt es sich bei diesen geräten im engeren sinn um personal computer”

Polnish Results: [pl]

----- options:

<---> size [5-10000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.2-1.8]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> corpora [librivox-tatoeba]
<---> gb [5.97]
<---> hours [51.8]
<---> words [420039]
<---> letters [2708869]
<---> words per sec [2.25]
<---> letters per sec [14.53]
<---> all files [25903]
<---> train files [18134]
<---> test files [3886]
<---> dev files [3886]
I Test of Epoch 12 - WER: 0.137465, loss: 29.99004187996005, mean edit distance: 0.058884
I WER: 0.142857, loss: 4.163468, mean edit distance: 0.065217
I - src: "jak w ogóle we wszystkich naszych obliczeniach"
I - res: "a w ogóle we wszystkich naszych obliczeniach "
I WER: 0.142857, loss: 4.163468, mean edit distance: 0.065217
I - src: "jak w ogóle we wszystkich naszych obliczeniach"
I - res: "a w ogóle we wszystkich naszych obliczeniach "
I WER: 0.181818, loss: 6.447145, mean edit distance: 0.025641
I - src: "pomimoto w stosunku wokulskiego do panny izabeli pierwsze lody były przełamane"
I - res: "pomimo to w stosunku wokulskiego do panny izabeli pierwsze lody były przełamane "
I WER: 0.400000, loss: 6.677766, mean edit distance: 0.107143
I - src: "otarła oczy i ciągnęła dalej"
I - res: "otarołaoczy i ciągnęła dalej "
I WER: 0.400000, loss: 6.677766, mean edit distance: 0.107143
I - src: "otarła oczy i ciągnęła dalej"
I - res: "otarołaoczy i ciągnęła dalej "
I WER: 0.500000, loss: 1.875308, mean edit distance: 0.105263
I - src: "niedziela sprowadzą"
I - res: "niedziela prowadzą "
I WER: 0.500000, loss: 1.875308, mean edit distance: 0.105263
I - src: "niedziela sprowadzą"
I - res: "niedziela prowadzą "
I WER: 1.000000, loss: 3.942765, mean edit distance: 0.105263
I - src: "tu będzie licytacya"
I - res: "tubędzielicytacya"
I WER: 1.000000, loss: 3.942765, mean edit distance: 0.105263
I - src: "tu będzie licytacya"
I - res: "tubędzielicytacya"
I WER: 1.000000, loss: 6.762781, mean edit distance: 0.176471
I - src: "jakto z kucharzem"
I - res: "jak to skucharzem"

Spanish Results: [es]

----- options:

<---> corpora [librivox-vox-tatoeba]
<---> size [5-10000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.25-1.9]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> gb [20.02]
<---> hours [190.1]
<---> words [1545009]
<---> letters [8793419]
<---> words per sec [2.26]
<---> letters per sec [12.85]
<---> all files [139265]
<---> train files [97486]
<---> test files [20891]
<---> dev files [20891]
I Test of Epoch 12 - WER: 0.139222, loss: 16.857607432188242, mean edit distance: 0.060826
I WER: 0.250000, loss: 0.047055, mean edit distance: 0.047619
I - src: "tengo que comprar uno"
I - res: "tengo que comprar un "
I WER: 0.500000, loss: 0.039710, mean edit distance: 0.083333
I - src: "sé cuidadoso"
I - res: "se cuidadoso"
I WER: 0.500000, loss: 0.072996, mean edit distance: 0.111111
I - src: "me amabas"
I - res: "me amaba "
I WER: 0.500000, loss: 0.072996, mean edit distance: 0.111111
I - src: "me amabas"
I - res: "me amaba "
I WER: 0.500000, loss: 0.098463, mean edit distance: 0.071429
I - src: "cuándo termina"
I - res: "cuando termina"
I WER: 1.000000, loss: 0.027957, mean edit distance: 0.100000
I - src: "sabías eso"
I - res: "sabíaseso"
I WER: 1.000000, loss: 0.089742, mean edit distance: 0.125000
I - src: "ven solo"
I - res: "vensolo"
I WER: 1.000000, loss: 0.092845, mean edit distance: 0.100000
I - src: "te matarán"
I - res: "tematarán"
I WER: 1.000000, loss: 0.092845, mean edit distance: 0.100000
I - src: "te matarán"
I - res: "tematarán"
I WER: 1.000000, loss: 0.099211, mean edit distance: 0.076923
I - src: "has entendido"
I - res: "hasentendido"

French Results: [fr]

----- options:

<--->size [5-10000]
<--->duration [0.5-15]
<--->bitrate [0]
<--->samplerate [16000-48000]
<--->channels [0]
<--->wordcount [2-1500]
<--->wordsec [0.2-1.8]
<--->lettercount [0]
<--->lettersec [0]
<--->upvotes [0]
<--->downvotes [0]
<--->sectors [0]

----- info:

<---> corpora [librivox-tatoeba-vox16-accent]
<---> gb [26.14]
<---> hours [226.8]
<---> words [1932291]
<---> letters [11767617]
<---> words per sec [2.37]
<---> letters per sec [14.41]
<---> all files [125625]
<---> train files [87938]
<---> test files [18845]
<---> dev files [18845]
I Test of Epoch 11 - WER: 0.227659, loss: 38.279466658148145, mean edit distance: 0.123504
I WER: 0.333333, loss: 0.538573, mean edit distance: 0.166667
I - src: "ceci est bon"
I - res: "ceci est mon "
I WER: 0.333333, loss: 0.656955, mean edit distance: 0.166667
I - src: "pour le tout"
I - res: "pour le tour "
I WER: 0.333333, loss: 0.885854, mean edit distance: 0.062500
I - src: "nous avons gagné"
I - res: "nous avons gagne"
I WER: 0.333333, loss: 0.885854, mean edit distance: 0.062500
I - src: "nous avons gagné"
I - res: "nous avons gagne"
I WER: 0.500000, loss: 0.314220, mean edit distance: 0.333333
I - src: "de qui"
I - res: "ce qui "
I WER: 1.000000, loss: 0.245572, mean edit distance: 1.000000
I - src: "ah"
I - res: ""
I WER: 1.000000, loss: 0.448257, mean edit distance: 1.000000
I - src: "ah"
I - res: ""
I WER: 1.000000, loss: 0.448257, mean edit distance: 1.000000
I - src: "ah"
I - res: ""
I WER: 1.000000, loss: 0.628055, mean edit distance: 0.333333
I - src: "oui"
I - res: "ou "
I WER: 1.000000, loss: 0.628055, mean edit distance: 0.333333
I - src: "oui"
I - res: "ou "

Italian Results: [it]

----- options:

<---> corpora [librivox-vox-tatoeba]
<---> size [5-10000]
<---> duration [0.2-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.2-1.9]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> gb [20.01]
<---> hours [146.5]
<---> words [1144530]
<---> letters [6766871]
<---> words per sec [2.17]
<---> letters per sec [12.83]
<---> all files [83291]
<---> train files [58304]
<---> test files [12495]
<---> dev files [12495]
I Test of Epoch 10 - WER: 0.184894, loss: 28.62499210021505, mean edit distance: 0.075463
I WER: 0.083333, loss: 1.599633, mean edit distance: 0.029851
I - src: "cosí riflettendo su le sue sciagure bruno celèsia si ridusse a casa"
I - res: "così riflettendo su le sue sciagure bruno celèsia si ridusse a casa "
I WER: 0.090909, loss: 1.664164, mean edit distance: 0.033333
I - src: "abbiamo forse fatto male no niente di male rispose il medico"
I - res: "abbiamo forse fatto male no niente di male rispose il medio "
I WER: 0.100000, loss: 1.168548, mean edit distance: 0.033898
I - src: "perchè vedete signora voi siete stata la pietra di paragone"
I - res: "perché vedete signora voi siete stata la pietra di paragone "
I WER: 0.100000, loss: 1.493682, mean edit distance: 0.016129
I - src: "state zitto avaraccio gridò carmaux che slegava il povero uomo"
I - res: "state zitto avaraccio gridò carmaux che slegava il povero uuomo"
I WER: 0.100000, loss: 1.706887, mean edit distance: 0.040816
I - src: "oh esclamò in quel momento toby che si era levato"
I - res: "o esclamò in quel momento toby che si era levato "
I WER: 0.142857, loss: 0.449785, mean edit distance: 0.046512
I - src: "giunsi al paese senza averne fissato alcuno"
I - res: "giunse al paese senza averne fissato alcuno "
I WER: 0.142857, loss: 1.841321, mean edit distance: 0.058824
I - src: "le ricerche durarono più d un mese"
I - res: "le ricerche durarono più di un mese "
I WER: 0.200000, loss: 0.612865, mean edit distance: 0.083333
I - src: "ah e quale filippo ferri"
I - res: "a e quale filippo ferri "
I WER: 0.200000, loss: 0.969935, mean edit distance: 0.086957
I - src: "entrai in un altra sala"
I - res: "entra in un altra sala "
I WER: 0.200000, loss: 0.969935, mean edit distance: 0.086957
I - src: "entrai in un altra sala"
I - res: "entra in un altra sala "

Ukranian Results: [uk]

----- options:

<---> corpora [librivox-vox-tatoeba]
<---> size [5-100000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.2-1.8]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> gb [10.03]
<---> hours [71.8]
<---> words [500062]
<---> letters [3013061]
<---> words per sec [1.94]
<---> letters per sec [11.66]
<---> all files [31930]
<---> train files [22351]
<---> test files [4791]
<---> dev files [4791]
I Test of Epoch 10 - WER: 0.299552, loss: 41.175528268814084, mean edit distance: 0.117625
I WER: 0.250000, loss: 1.425027, mean edit distance: 0.117647
I - src: "але як се зробити"
I - res: "але як це зробити "
I WER: 0.250000, loss: 1.425027, mean edit distance: 0.117647
I - src: "але як се зробити"
I - res: "але як це зробити "
I WER: 0.285714, loss: 2.314395, mean edit distance: 0.066667
I - src: "тож до тебе я зверну свою мову"
I - res: "то ж до тебе я зверну свою мову "
I WER: 0.285714, loss: 2.314395, mean edit distance: 0.066667
I - src: "тож до тебе я зверну свою мову"
I - res: "то ж до тебе я зверну свою мову "
I WER: 0.333333, loss: 2.467164, mean edit distance: 0.250000
I - src: "а чия ти"
I - res: "а чи ти "
I WER: 0.333333, loss: 2.467164, mean edit distance: 0.250000
I - src: "а чия ти"
I - res: "а чи ти "
I WER: 0.500000, loss: 2.119555, mean edit distance: 0.142857
I - src: "ні сину"
I - res: "ні син "
I WER: 0.500000, loss: 2.119555, mean edit distance: 0.142857
I - src: "ні сину"
I - res: "ні син "
I WER: 1.000000, loss: 0.684362, mean edit distance: 0.333333
I - src: "яку"
I - res: "як "
I WER: 1.000000, loss: 0.684362, mean edit distance: 0.333333
I - src: "яку"
I - res: "як "

Russian Results: [ru]

----- options:

<---> size [5-10000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.2-1.8]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> gb [6.8]
<---> hours [59.1]
<---> words [406385]
<---> letters [2507215]
<---> words per sec [1.91]
<---> letters per sec [11.79]
<---> all files [29083]
<---> train files [20360]
<---> test files [4363]
<---> dev files [4363]
I Test of Epoch 12 - WER: 0.369255, loss: 49.01442650910262, mean edit distance: 0.155081
I WER: 0.500000, loss: 0.076582, mean edit distance: 0.200000
I - src: "я том"
I - res: "я то "
I WER: 0.500000, loss: 0.076582, mean edit distance: 0.200000
I - src: "я том"
I - res: "я то "
I WER: 0.500000, loss: 0.199971, mean edit distance: 0.166667
I - src: "я села"
I - res: "я сел "
I WER: 0.500000, loss: 0.199971, mean edit distance: 0.166667
I - src: "я села"
I - res: "я сел "
I WER: 0.500000, loss: 0.276903, mean edit distance: 0.200000
I - src: "это я"
I - res: "это "
I WER: 0.500000, loss: 0.276903, mean edit distance: 0.200000
I - src: "это я"
I - res: "это "
I WER: 0.500000, loss: 0.312152, mean edit distance: 0.142857
I - src: "кто она"
I - res: "кто он "
I WER: 0.500000, loss: 0.312152, mean edit distance: 0.142857
I - src: "кто она"
I - res: "кто он "
I WER: 0.500000, loss: 0.868555, mean edit distance: 0.285714
I - src: "вы одна"
I - res: "в одна "
I WER: 0.500000, loss: 0.868555, mean edit distance: 0.285714
I - src: "вы одна"
I - res: "в одна "

Dutch Results: [nl]

----- options:
<---> corpora [swc-vox-tatoeba]
<---> size [5-10000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.25-1.9]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> gb [10.10]
<---> hours [75.0]
<---> words [598945]
<---> letters [3752358]
<---> words per sec [2.22]
<---> letters per sec [13.9]
<---> all files [43711]
<---> train files [30598]
<---> test files [6558]
<---> dev files [6558]
I Test of Epoch 9 - WER: 0.396161, loss: 92.96824162893921, mean edit distance: 0.193605
I WER: 0.083333, loss: 3.263168, mean edit distance: 0.014706
I - src: "de buurtschap ligt ten zuiden van dasselaar en ten westen van norden"
I - res: "de buurtschap ligt ten zuiden van dasselaar en ten westen van noorden"
I WER: 0.125000, loss: 3.376268, mean edit distance: 0.026316
I - src: "het is een restant van de oude zeedijk"
I - res: "het is een restant van de oude zeedik"
I WER: 0.142857, loss: 2.820412, mean edit distance: 0.025000
I - src: "de herkomst van dit wapen is onduidelijk"
I - res: "de herkomst van dit wapen is onduidenlijk"
I WER: 0.142857, loss: 3.029150, mean edit distance: 0.028571
I - src: "het ligt iets ten noorden van gendt"
I - res: "het ligt iets ten noorden van gent"
I WER: 0.142857, loss: 3.029150, mean edit distance: 0.028571
I - src: "het ligt iets ten noorden van gendt"
I - res: "het ligt iets ten noorden van gent"
I WER: 0.142857, loss: 3.058265, mean edit distance: 0.025641
I - src: "bij het buurtje lag een wierde die in de negentiende eeuw geheel is afgegraven"
I - res: "bij het buurtje lag een wierde die in de negentien e eeuw geheel is afgegraven "
I WER: 0.222222, loss: 2.067109, mean edit distance: 0.023256
I - src: "het dorp ligt op de rechteroever van de lek"
I - res: "het dorp ligt op de rechter oever van de lek"
I WER: 0.285714, loss: 1.334180, mean edit distance: 0.025000
I - src: "het dorp ontstond in de negentiende eeuw"
I - res: "het dorp ontstond in de negentien e eeuw"
I WER: 0.333333, loss: 2.122649, mean edit distance: 0.017857
I - src: "in duizendzeshonderdeenenvijftig wordt een sluis gebouwd"
I - res: "in duizendzeshonderdeenenvijftig wordt een sluisgebouwd"
I WER: 0.333333, loss: 2.648912, mean edit distance: 0.026316
I - src: "hier wordt lesgegeven aan de onderbouw"
I - res: "hier wordt les gegeven aan de onderbouw"

Portuguese Results: [pt]

----- options:

<---> size [5-10000]
<---> duration [0.5-15]
<---> bitrate [0]
<---> samplerate [16000-48000]
<---> channels [0]
<---> wordcount [2-1500]
<---> wordsec [0.2-1.8]
<---> lettercount [0]
<---> lettersec [0]
<---> upvotes [0]
<---> downvotes [0]
<---> sectors [0]

----- info:

<---> corpora [tatoeba-vox16]
<---> gb [1.13]
<---> hours [9.8]
<---> words [66346]
<---> letters [352192]
<---> words per sec [1.88]
<---> letters per sec [9.98]
<---> all files [13684]
<---> train files [9579]
<---> test files [2054]
<---> dev files [2054]
I Test of Epoch 10 - WER: 0.507568, loss: 21.292116564373636, mean edit distance: 0.244271
I WER: 0.200000, loss: 1.065989, mean edit distance: 0.058824
I - src: "não foi tom não é"
I - res: "não foi tom não "
I WER: 0.250000, loss: 1.081908, mean edit distance: 0.200000
I - src: "tom não tem pai"
I - res: "tom não tem "
I WER: 0.250000, loss: 1.081908, mean edit distance: 0.200000
I - src: "tom não tem pai"
I - res: "tom não tem "
I WER: 0.333333, loss: 1.577532, mean edit distance: 0.083333
I - src: "tom é cantor"
I - res: "tom é cantour"
I WER: 0.333333, loss: 1.577532, mean edit distance: 0.083333
I - src: "tom é cantor"
I - res: "tom é cantour"
I WER: 0.500000, loss: 1.114254, mean edit distance: 0.083333
I - src: "estou seguro"
I - res: "estou segura"
I WER: 0.500000, loss: 1.114254, mean edit distance: 0.083333
I - src: "estou seguro"
I - res: "estou segura"
I WER: 0.500000, loss: 1.841137, mean edit distance: 0.333333
I - src: "oi tom"
I - res: "o tom "
I WER: 0.500000, loss: 1.879081, mean edit distance: 0.100000
I - src: "não corras"
I - res: "não coras"
I WER: 0.500000, loss: 1.879081, mean edit distance: 0.100000
I - src: "não corras"
I - res: "não coras"