Quantum Machine Learning for Automatic Spoken-Term Recognition.
- NEW (2022) A new M.Sc. thesis by Juha Korvenaho (supervised by Prof. Tomi Kinnunen) provides further investigation. Feel free to have a look at the thesis and their code.
- NEW Our paper has been accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2021.
We would like to thank the reviewers and committee members of the Speech Processing and Quantum Signals communities.
(Dec 2020) Released the quantum speech processing code! A Colab demo is also provided. ICASSP Video | Slides
- ICASSP 21 Paper | arXiv "Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition"
- option 1: from conda and pip install
conda install -c anaconda tensorflow-gpu=2.0
conda install -c conda-forge scikit-learn
conda install -c conda-forge librosa
pip install pennylane --upgrade
- option 2: from environment.yml (for 2080 Ti with CUDA 10.0)
conda env create -f environment.yml
Originally built with TensorFlow 2.0 and CUDA 10.0.
We use the Google Speech Commands Dataset V1 for limited-vocabulary speech recognition.
mkdir ../dataset
cd ../dataset
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
tar -xf speech_commands_v0.01.tar.gz
We provide 2000 pre-processed features in ./data_quantum, which include both Mel features and (2,2) quanvolution features, with 1500 samples for training and 500 for testing. You can reach 90.6% test accuracy with the provided data. You can use np.load to load these features and train your own quantum speech processing model as in 3.1.
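For example, a minimal sketch of loading the provided features with np.load (the .npy file names under ./data_quantum are assumptions here for illustration; check the directory for the actual names):

import numpy as np

# Hypothetical file names; see ./data_quantum for the actual ones.
x_mel_train = np.load("data_quantum/x_train_mel.npy")     # Mel features (1500 training samples)
x_quanv_train = np.load("data_quantum/x_train_quanv.npy") # (2,2) quanvolution features
y_train = np.load("data_quantum/y_train.npy")             # labels
print(x_mel_train.shape, x_quanv_train.shape, y_train.shape)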
Please set the sampling rate sr and the data ratio (--port N for 1/N of the data; --port 1 for all data) when extracting Mel features.
python main_qsr.py --sr 16000 --port 100 --mel 1 --quanv 1
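For example, to extract Mel and quanvolution features from the full dataset, use the same flags as above with --port 1 (all data):

python main_qsr.py --sr 16000 --port 1 --mel 1 --quanv 1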
If you have pre-loaded audio features from 2.2, you can set the quantum convolution kernel size in helper_q_tool.py, function quanv. We provide an example for kernel size = 3 at line 57. You will see a message like the one below during quanvolution encoding when running the feature extraction command from 2.2.
===== Shape 60 126
Kernal = 2
Quantum pre-processing of train Speech:
2/175
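For reference, below is a minimal quanvolution sketch in PennyLane with a 2x2 kernel sliding over a Mel feature map. The circuit, stride, and function names are illustrative assumptions, not the exact quanv implementation in helper_q_tool.py:

import numpy as np
import pennylane as qml

n_wires = 4  # one qubit per value in a 2x2 patch
dev = qml.device("default.qubit", wires=n_wires)
rand_params = np.random.uniform(0, np.pi, size=(1, n_wires))

@qml.qnode(dev)
def circuit(patch):
    # Encode the 2x2 patch values as rotation angles.
    for i in range(n_wires):
        qml.RY(np.pi * patch[i], wires=i)
    # Random quantum kernel, as in the standard quanvolution demo.
    qml.RandomLayers(rand_params, wires=list(range(n_wires)))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

def quanv_2x2(mel):
    # Slide a 2x2 window with stride 2 over the (freq, time) Mel feature map.
    out = np.zeros((mel.shape[0] // 2, mel.shape[1] // 2, n_wires))
    for r in range(0, mel.shape[0] - 1, 2):
        for c in range(0, mel.shape[1] - 1, 2):
            patch = [mel[r, c], mel[r, c + 1], mel[r + 1, c], mel[r + 1, c + 1]]
            out[r // 2, c // 2] = circuit(patch)
    return out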
Spoken-term recognition with the additional U-Net encoder discussed in our work.
python main_qsr.py
Training runs for 25 epochs. One way to improve recognition performance is to encode more data for training; refer to 2.2 and 2.3.
1500/1500 [==============================] - 3s 2ms/sample - val_loss: 0.4408 - val_accuracy: 0.9060
- Alternatively, train without the U-Net, following the method proposed in Douglas C. de Andrade et al., similar to their implementation but without kapre layers. Please set use_Unet = False in model.py.
def attrnn_Model(x_in, labels, ablation = False):
    # simple LSTM
    rnn_func = L.LSTM
    use_Unet = False
python cam_sp.py
We also provide a CTC model with Word Error Rate (WER) evaluation for future studies by the community; refer to the discussion. For example, an output "y-e--a" for the input "yes" is counted as an incorrect word under the CTC alignment. Note that this Quantum ASR CTC version only supports tensorflow-gpu==2.3. Please create a new environment for running this experiment.
- Unzip the features for ASR
cd data_quantum/asr_set
bash unzip.sh
- Run the CTC model in ./speech_quantum_dl
python qsr_ctc_wer.py
Epoch 32/50
107/107 [==============================] - 5s 49ms/step - loss: 0.1191 - val_loss: 0.7115
Epoch 33/50
107/107 [==============================] - 5s 49ms/step - loss: 0.1547 - val_loss: 0.6701
=== WER: 9.895833333333334 %
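For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of the metric, not the repo's qsr_ctc_wer.py implementation:

def wer(ref, hyp):
    # Word-level Levenshtein distance between reference and hypothesis.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("yes no up down", "yes know up"))  # 50.0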
Tutorial Link.
- For academic purposes only. Feel free to contact the author for other purposes.
If this work helps your research or you use the code, please consider citing our paper. Thank you!
@inproceedings{yang2021decentralizing,
title={Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition},
author={Yang, Chao-Han Huck and Qi, Jun and Chen, Samuel Yen-Chi and Chen, Pin-Yu and Siniscalchi, Sabato Marco and Ma, Xiaoli and Lee, Chin-Hui},
booktitle={2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={6523--6527},
year={2021},
organization={IEEE}
}
See PySyft and PyVertical for the vertical federated learning setup. Please refer to a vertical learning example for virtualization.
We would like to thank Xanadu AI for providing PennyLane and IBM Research for providing Qiskit and quantum hardware to the community. There is no conflict of interest.
Since the intersection of speech and quantum ML is still quite new, please feel free to open an issue for discussion.
Feel free to use this implementation for other speech processing or sequence modeling tasks (e.g., speaker recognition, speech separation, event detection, ...) given the quantum advantages discussed in the paper.