Pingchuan Ma, Stavros Petridis, Maja Pantic.
This is the repository of Visual Speech Recognition for Multiple Languages, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. The repository is mainly based on ESPnet. We provide state-of-the-art algorithms for end-to-end visual speech recognition in the wild.
Major features
- Modular Design: the repository is composed of face tracking, pre-processing, and acoustic/visual encoder backbones.
- Support of Benchmarks for Speech Recognition: our models provide state-of-the-art performance on speech recognition datasets.
- Support of Extraction of Representations or Mouth Regions of Interest: our models directly support the extraction of speech representations or mouth regions of interest (ROIs).
- Support of Recognition of Your Own Videos: we provide support for performing visual speech recognition on your own videos.
Supported languages: English, Mandarin, Spanish, French, Portuguese, and Italian.
- Clone the repository into a directory. We refer to that directory as `${lipreading_root}`.

git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages

- Install PyTorch (>=1.8.0).

- Install other packages.

pip install -r requirements.txt
- Model. Download a model from the Model Zoo (see the directory-layout sketch after this list).
  - For models trained on the CMU-MOSEAS dataset, which contains multiple languages, please unzip them into `${lipreading_root}/models/${dataset}/${language_code}` (e.g. `${lipreading_root}/models/CMUMOSEAS/pt`).
  - For models trained on a dataset with one language, please unzip them into `${lipreading_root}/models/${dataset}`.
- Language Model. In most cases, performance can be improved by incorporating an external language model. Please download a language model from the Model Zoo.
  - For a language model trained on the CMU-MOSEAS dataset, please unzip it into `${lipreading_root}/language_models/${dataset}/${language_code}`.
  - For a language model trained on a dataset with one language, please unzip it into `${lipreading_root}/language_models/${dataset}`.
- Tracker [optional]. If you intend to test your own videos, additional packages for face detection and face alignment need to be pre-installed; they are provided in the tools folder.

- Landmarks [optional]. If you want to evaluate on benchmarks, there is no need to install the tracker. Please download pre-computed landmarks from the Model Zoo and unzip them into `${lipreading_root}/landmarks/${dataset}`.
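As an illustration, the sketch below unpacks a Portuguese CMU-MOSEAS model, its language model, and the matching landmarks into the locations described above. The zip filenames are placeholders for whichever archives you downloaded from the Model Zoo.

```bash
# Hedged sketch only: <MODEL-ZIP>, <LM-ZIP> and <LANDMARKS-ZIP> stand for the archives
# you actually downloaded; the target paths follow the layout described above.
mkdir -p ${lipreading_root}/models/CMUMOSEAS/pt
unzip <MODEL-ZIP> -d ${lipreading_root}/models/CMUMOSEAS/pt

mkdir -p ${lipreading_root}/language_models/CMUMOSEAS/pt
unzip <LM-ZIP> -d ${lipreading_root}/language_models/CMUMOSEAS/pt

mkdir -p ${lipreading_root}/landmarks/CMUMOSEAS
unzip <LANDMARKS-ZIP> -d ${lipreading_root}/landmarks/CMUMOSEAS
```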
- We refer to the path of a configuration file (.ini) as `<CONFIG-FILENAME-PATH>`. We put configuration files in `${lipreading_root}/configs` by default.

- We refer to the path of a labels file (.ref) as `<LABELS-FILENAME-PATH>`.
  - For the CMU-MOSEAS dataset and the Multilingual TEDx dataset, which include multiple languages, we put labels files (.ref) in `${lipreading_root}/labels/${dataset}/${language_code}`.
  - For datasets with one language, we put labels files in `${lipreading_root}/labels/${dataset}`.
- We refer to the original dataset directory as `<DATA-DIRECTORY-PATH>`, and to the path of a single original video as `<DATA-FILENAME-PATH>`.

- We refer to the landmarks directory as `<LANDMARKS-DIRECTORY-PATH>`. We assume the default directory is `${lipreading_root}/landmarks/${dataset}/${dataset}_landmarks`.
- We use the CPU for inference by default. If you want to speed up decoding, please consider:
  - adding the GPU command-line argument (e.g. `--gpu-idx <GPU_ID>`), where `<GPU_ID>` is the ID of your selected GPU, a 0-based integer (see the example after the evaluation commands below);
  - setting `beam_size` in the configuration file (.ini) `<CONFIG-FILENAME-PATH>` to a small value (e.g. 5) in case the maximum GPU memory is exceeded.
- We assume original videos from the desired dataset have been downloaded to the dataset directory `<DATA-DIRECTORY-PATH>` and landmarks have been unzipped to the landmarks directory `${lipreading_root}/landmarks/${dataset}`.

- The frame rate (fps) of your video should match `v_fps` in the configuration file; one way to resample a video is sketched below.
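One possible way to match the frame rate is to resample the video with ffmpeg (not part of this repository); the value 25 below is only a placeholder, so check `v_fps` in your configuration file first.

```bash
# Hedged example: resample a video to 25 fps (replace 25 with the v_fps value from
# your configuration file) while copying the audio stream unchanged.
ffmpeg -i my_video.mp4 -filter:v fps=25 -c:a copy my_video_resampled.mp4
```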
- To evaluate the performance on the desired dataset.
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH>
- To lip read from a single video file.
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--data-filename <DATA-FILENAME-PATH>
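If a GPU is available, either command above can be sped up by appending the `--gpu-idx` argument described in the notes, for example:

```bash
# Same dataset evaluation command as above, run on GPU 0 instead of the CPU.
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --labels-filename <LABELS-FILENAME-PATH> \
               --data-dir <DATA-DIRECTORY-PATH> \
               --landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
               --gpu-idx 0
```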
- Mouth ROIs can be extracted by setting `<FEATS-POSITION>` to `mouth`. The mouth ROIs will be saved to `<OUTPUT-FILENAME-PATH>` with the .avi file extension.

- The `${lipreading_root}/outputs` folder can be used to save the mouth ROIs.
- To extract mouth ROIs from the desired dataset.
python main.py --labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
--dst-dir <OUTPUT-DIRECTORY-PATH> \
--feats-position <FEATS-POSITION>
- To extract mouth ROIs from a single video file.
python main.py --data-filename <DATA-FILENAME-PATH> \
--dst-filename <OUTPUT-FILENAME-PATH> \
--feats-position <FEATS-POSITION>
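For instance, a single-video run with `mouth` as the feature position might look like the following; the input and output paths are hypothetical placeholders.

```bash
# Hedged example (hypothetical paths): crop the mouth ROI from one video and save it
# as an .avi file in the outputs folder mentioned above.
python main.py --data-filename /path/to/my_video.mp4 \
               --dst-filename ${lipreading_root}/outputs/my_video_mouth.avi \
               --feats-position mouth
```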
- Speech representations can be extracted from the top of the ResNet-18 (512-D) or the Conformer (256-D) by setting `<FEATS-POSITION>` to `resnet` or `conformer`, respectively. The representations will be saved to `<OUTPUT-DIRECTORY-PATH>` or `<OUTPUT-FILENAME-PATH>` with the .npz file extension.

- The `${lipreading_root}/outputs` folder can be used to save the speech representations.
- To extract speech representations from the desired dataset.
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
--dst-dir <OUTPUT-DIRECTORY-PATH> \
--feats-position <FEATS-POSITION>
- To extract speech representations from a single video file.
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--data-filename <DATA-FILENAME-PATH> \
--dst-filename <OUTPUT-FILENAME-PATH> \
--feats-position <FEATS-POSITION>
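The saved .npz files can then be inspected with NumPy; the filename below is a hypothetical example of an output produced by the command above.

```bash
# Hedged check (hypothetical filename): list the arrays stored in a saved .npz file
# together with their shapes.
python -c "import numpy as np; d = np.load('outputs/my_video.npz'); [print(k, d[k].shape) for k in d.files]"
```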
We support a number of datasets for speech recognition:
- Lip Reading Sentences 2 (LRS2)
- Lip Reading Sentences 3 (LRS3)
- Chinese Mandarin Lip Reading (CMLR)
- CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- GRID
- Lombard GRID
- TCD-TIMIT
We provide landmarks, language models, and models for each dataset. Please see the models page for details.
If you find this code useful in your research, please consider citing the following papers:
@article{ma2022visual,
title={{Visual Speech Recognition for Multiple Languages in the Wild}},
author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
journal={{Nature Machine Intelligence}},
volume={4},
pages={930--939},
year={2022},
url={https://doi.org/10.1038/s42256-022-00550-z},
doi={10.1038/s42256-022-00550-z}
}
Note that the code can only be used for comparative or benchmarking purposes. The code supplied under the License may only be used for non-commercial purposes.
Pingchuan Ma (pingchuan.ma16[at]imperial.ac.uk)