asr-benchmark

The goal here is to evaluate automatic speech recognition (ASR) systems for Brazilian Portuguese, although the tools developed here can be used to evaluate ASR systems in any language. The databases used are public and can be freely downloaded, so anyone can reproduce the setup and confirm the results.

Download databases

  • LapsBenchMark1.4: wget http://www.laps.ufpa.br/falabrasil/files/LapsBM1.4.rar

  • Voxforge: wget -r -nH -nd -np -R "index.html*" http://www.repository.voxforge1.org/downloads/pt/Trunk/Audio/Original/48kHz_16bit/

After downloading, you must downsample the databases to 16000 Hz and 8000 Hz. Any tool will do; sox is a good choice.
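The downsampling step could be scripted along these lines. This is only a sketch: it assumes sox is installed and on the PATH, that the archives have already been extracted to .wav files, and that they sit in a single directory. The function names and directory layout are hypothetical, not part of this repository.

```python
import subprocess
from pathlib import Path


def sox_resample_command(source, target, rate_hz):
    """Build the sox command line that resamples one file.

    sox applies "-r" to the file that follows it, so placing it just
    before the output file sets the output sampling rate.
    """
    return ["sox", str(source), "-r", str(rate_hz), str(target)]


def resample_directory(source_dir, target_dir, rate_hz):
    """Resample every .wav file in source_dir into target_dir (assumed layout)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(source_dir).glob("*.wav")):
        command = sox_resample_command(source, target_dir / source.name, rate_hz)
        # check=True raises CalledProcessError if sox exits with an error.
        subprocess.run(command, check=True)
```

Calling resample_directory once with 16000 and once with 8000 would produce the two versions of each database used in the benchmark.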

LapsBenchMark1.4 has 700 files. Voxforge has many more files, but in this benchmark, 700 audio files were randomly sampled from this database and used in the evaluation. The chosen files are listed in data/voxforge-{8k,16k}.txt.
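To reproduce the benchmark you should use the file lists already provided in data/voxforge-{8k,16k}.txt. For a different database, a reproducible random selection could look like the sketch below. The function name and the seed are assumptions for illustration; the procedure actually used to pick the Voxforge files is not documented here.

```python
import random


def sample_files(paths, count=700, seed=0):
    """Return a reproducible random subset of `count` paths.

    The input is sorted first so the result does not depend on the
    order in which the file system happened to list the files.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(paths), count))
```

Fixing the seed means anyone who runs the selection again on the same set of files gets the same subset.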

Dependencies

You will need Python 3 to run the benchmark scripts. Optionally, you may also use some Bash scripts I wrote to process the transcriptions generated by the benchmark scripts.

I use Anaconda to manage Python dependencies, which, in this case, are watson-developer-cloud, python-dotenv and SpeechRecognition.

To create my environment, I ran:

conda create -n asr python=3.5
source activate asr
pip install --upgrade watson-developer-cloud
pip install python-dotenv
pip install SpeechRecognition

This benchmark evaluates word error rate (WER) and sentence error rate (SER), so you will need a tool to measure them. sclite, included in the NIST Speech Recognition Scoring Toolkit, may be used for this purpose. An equivalent tool is compute-wer from the Kaldi toolkit. I used the latter simply because I already had Kaldi installed on my machine.
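For readers unfamiliar with the two metrics: WER is the word-level edit distance (insertions, deletions and substitutions) between hypothesis and reference, divided by the number of reference words, and SER is the fraction of sentences with at least one error. The sketch below illustrates the definitions only. It is not the compute-wer or sclite implementation, it does no text normalization, and the function names are made up for this example.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] holds the distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1], len(ref)


def score(references, hypotheses):
    """Return (WER %, SER %) for dicts mapping utterance id to text.

    Only utterances that have a hypothesis are scored. Assumes at
    least one scored utterance with a non-empty reference.
    """
    errors = words = wrong_sentences = sentences = 0
    for utt_id, hyp_text in hypotheses.items():
        if utt_id not in references:
            continue
        utt_errors, utt_words = word_errors(references[utt_id], hyp_text)
        errors += utt_errors
        words += utt_words
        wrong_sentences += utt_errors > 0
        sentences += 1
    return 100.0 * errors / words, 100.0 * wrong_sentences / sentences
```

As a check against the numbers reported in the Results section, 982 errors over 7228 reference words gives 100 × 982 / 7228 ≈ 13.59 %WER.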

You will also need to create credentials to access the IBM and Microsoft speech APIs. You must go to IBM Bluemix and Microsoft Bing to get your keys.

After grabbing your keys, create a .env file in the scripts directory with the following variables and their values:

  • BLUEMIX_USERNAME="XXXXXXXX"
  • BLUEMIX_PASSWORD="YYYYYY"
  • SUBSCRIPTION_KEY="MMMMMM"
  • INSTANCE_ID="ZZZZZZ"
  • REQUEST_ID="QQQQQQQ"

BLUEMIX_USERNAME and BLUEMIX_PASSWORD are required to run the IBM benchmark. The other three keys are only required to run the Microsoft benchmark.
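Before starting a long run, it can help to confirm that the .env file defines the keys for the service you intend to benchmark. The following is a hedged sketch, not code from this repository: the helper names are hypothetical, and the scripts/.env path simply follows the instructions above. load_dotenv is the python-dotenv function that copies the file's variables into the process environment.

```python
import os

# Keys each benchmark needs, as described above.
REQUIRED_KEYS = {
    "ibm": ("BLUEMIX_USERNAME", "BLUEMIX_PASSWORD"),
    "microsoft": ("SUBSCRIPTION_KEY", "INSTANCE_ID", "REQUEST_ID"),
}


def missing_keys(service, env):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS[service] if not env.get(key)]


def load_env_file(path="scripts/.env"):
    """Load the .env file and return the resulting environment mapping."""
    # Imported here so missing_keys stays usable without python-dotenv.
    from dotenv import load_dotenv

    load_dotenv(dotenv_path=path)
    return os.environ
```

For example, missing_keys("ibm", load_env_file()) returns an empty list when both IBM credentials are set.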

Benchmark

source activate asr

python scripts/ibmASR.py 16000 data/laps-16k.txt > results/ibm-laps-16k.tra
python scripts/ibmASR.py 8000  data/laps-8k.txt  > results/ibm-laps-8k.tra
python scripts/ibmASR.py 16000 data/voxforge-16k.txt > results/ibm-voxforge-16k.tra
python scripts/ibmASR.py 8000  data/voxforge-8k.txt  > results/ibm-voxforge-8k.tra

python scripts/microsoftASR.py 16000 data/laps-16k.txt > results/microsoft-laps-16k.tra
python scripts/microsoftASR.py 8000  data/laps-8k.txt  > results/microsoft-laps-8k.tra
python scripts/microsoftASR.py 16000 data/voxforge-16k.txt > results/microsoft-voxforge-16k.tra
python scripts/microsoftASR.py 8000  data/voxforge-8k.txt  > results/microsoft-voxforge-8k.tra

python scripts/googleASR.py data/laps-16k.txt > results/google-laps-16k.tra
python scripts/googleASR.py data/laps-8k.txt  > results/google-laps-8k.tra
python scripts/googleASR.py data/voxforge-16k.txt > results/google-voxforge-16k.tra
python scripts/googleASR.py data/voxforge-8k.txt  > results/google-voxforge-8k.tra

./scripts/buildLapsHyp.sh results/ibm-laps-16k.tra > hypotheses/ibm-laps-16k.hyp
./scripts/buildLapsHyp.sh results/ibm-laps-8k.tra  > hypotheses/ibm-laps-8k.hyp
./scripts/buildVoxforgeHyp.sh results/ibm-voxforge-8k.tra  > hypotheses/ibm-voxforge-8k.hyp
./scripts/buildVoxforgeHyp.sh results/ibm-voxforge-16k.tra > hypotheses/ibm-voxforge-16k.hyp

./scripts/buildLapsHyp.sh results/microsoft-laps-16k.tra > hypotheses/microsoft-laps-16k.hyp
./scripts/buildLapsHyp.sh results/microsoft-laps-8k.tra  > hypotheses/microsoft-laps-8k.hyp
./scripts/buildVoxforgeHyp.sh results/microsoft-voxforge-8k.tra  > hypotheses/microsoft-voxforge-8k.hyp
./scripts/buildVoxforgeHyp.sh results/microsoft-voxforge-16k.tra > hypotheses/microsoft-voxforge-16k.hyp

compute-wer --mode=present ark:references/laps.ref ark:hypotheses/ibm-laps-16k.hyp
compute-wer --mode=present ark:references/laps.ref ark:hypotheses/ibm-laps-8k.hyp
compute-wer --mode=present ark:references/voxforge.ref ark:hypotheses/ibm-voxforge-16k.hyp
compute-wer --mode=present ark:references/voxforge.ref ark:hypotheses/ibm-voxforge-8k.hyp

compute-wer --mode=present ark:references/laps.ref ark:hypotheses/microsoft-laps-16k.hyp
compute-wer --mode=present ark:references/laps.ref ark:hypotheses/microsoft-laps-8k.hyp
compute-wer --mode=present ark:references/voxforge.ref ark:hypotheses/microsoft-voxforge-16k.hyp
compute-wer --mode=present ark:references/voxforge.ref ark:hypotheses/microsoft-voxforge-8k.hyp
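The recognition commands above follow a regular pattern over system, database and sampling rate, so they could also be generated rather than typed out. The sketch below only rebuilds the command lines and output paths listed above; the helper names are hypothetical, and it does not cover the hypothesis-building or scoring steps.

```python
import subprocess

RATES = {"16k": 16000, "8k": 8000}
# In the commands above, the IBM and Microsoft scripts take the sampling
# rate as their first argument; the Google script does not.
TAKES_RATE_ARGUMENT = {"ibm": True, "microsoft": True, "google": False}


def recognition_job(system, database, rate_tag):
    """Return (command, output path) for one recognition run."""
    command = ["python", "scripts/{}ASR.py".format(system)]
    if TAKES_RATE_ARGUMENT[system]:
        command.append(str(RATES[rate_tag]))
    command.append("data/{}-{}.txt".format(database, rate_tag))
    output = "results/{}-{}-{}.tra".format(system, database, rate_tag)
    return command, output


def all_jobs():
    """List every system / database / rate combination used above."""
    return [
        recognition_job(system, database, rate_tag)
        for system in ("ibm", "microsoft", "google")
        for database in ("laps", "voxforge")
        for rate_tag in ("16k", "8k")
    ]


def run_job(command, output):
    """Run one job, redirecting its standard output to the .tra file."""
    with open(output, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

Looping over all_jobs() and passing each pair to run_job would reproduce the twelve recognition runs, provided the asr environment is active and the results directory exists.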

Results

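Results are reported as word error rate (WER) and sentence error rate (SER), in the format printed by compute-wer. Because compute-wer was run with --mode=present, only utterances present in each hypothesis file are scored, which is likely why the Voxforge word and sentence totals differ between systems.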

| Database | IBM | Microsoft |
|---|---|---|
| Laps 16 kHz | %WER 13.59 [ 982 / 7228, 110 ins, 217 del, 655 sub ] | %WER 15.88 [ 1148 / 7228, 96 ins, 248 del, 804 sub ] |
| | %SER 64.14 [ 449 / 700 ] | %SER 68.00 [ 476 / 700 ] |
| Laps 8 kHz | %WER 13.89 [ 1004 / 7228, 106 ins, 242 del, 656 sub ] | %WER 16.03 [ 1159 / 7228, 97 ins, 248 del, 814 sub ] |
| | %SER 64.57 [ 452 / 700 ] | %SER 67.29 [ 471 / 700 ] |
| Voxforge 16 kHz | %WER 31.23 [ 1067 / 3417, 134 ins, 313 del, 620 sub ] | %WER 18.28 [ 616 / 3370, 46 ins, 186 del, 384 sub ] |
| | %SER 54.74 [ 375 / 685 ] | %SER 39.73 [ 269 / 677 ] |
| Voxforge 8 kHz | %WER 28.62 [ 995 / 3477, 115 ins, 284 del, 596 sub ] | %WER 18.05 [ 611 / 3385, 46 ins, 197 del, 368 sub ] |
| | %SER 53.58 [ 374 / 698 ] | %SER 39.21 [ 267 / 681 ] |

These results were obtained on 5 February 2017. The systems may be upgraded over time, so these rates may change.