Forced Alignment for Chinese Speech

This repository shows how Chinese speech datasets can be aligned using the Montreal Forced Aligner. Annotations and models are provided for several popular datasets (AISHELL-3, biaobei).

What is Forced Alignment?

Forced alignment takes as input an audio file and a transcript of what is said in it, and determines the start and end time of each unit of the transcript (paragraph, sentence, word, or phoneme). For example, given an audiobook and its text, forced alignment can match each sentence to the corresponding stretch of audio.

Montreal Forced Aligner

Montreal Forced Aligner (MFA) is a popular library for creating such alignments. MFA was notably used in FastSpeech-2.

The library takes as input the text (English text, pinyin transcription, Chinese characters, ...) and the audio. The inputs are ".lab" files containing the text and ".wav" audio files. The outputs are ".TextGrid" files containing the start and end time of each word and phoneme.
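To make the output format concrete, here is a minimal sketch that pulls the intervals out of a TextGrid using only the standard library. It assumes the file uses Praat's long text format and that interval labels contain no escaped quotes; the names `Interval` and `read_tiers` are made up for this illustration. For real work a dedicated library such as tgt (used in the post-processing step) is the safer choice.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float
    text: str


# One tier: its name, then everything up to the next "item [n]:" or end of file.
_TIER = re.compile(r'name = "(?P<name>[^"]*)"(?P<body>.*?)(?=item \[\d+\]:|\Z)', re.S)
# One interval entry inside a tier (long TextGrid format).
_INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*xmin = (?P<start>[\d.eE+-]+)'
    r'\s*xmax = (?P<end>[\d.eE+-]+)\s*text = "(?P<text>[^"]*)"'
)


def read_tiers(textgrid: str) -> dict[str, list[Interval]]:
    """Map each tier name to its list of (start, end, text) intervals."""
    tiers = {}
    for tier in _TIER.finditer(textgrid):
        tiers[tier["name"]] = [
            Interval(float(m["start"]), float(m["end"]), m["text"])
            for m in _INTERVAL.finditer(tier["body"])
        ]
    return tiers


# Hand-written sample in the long format, not taken from the released files.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.9
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 0.9
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "对"
        intervals [2]:
            xmin = 0.4
            xmax = 0.9
            text = "不起"
'''

for interval in read_tiers(SAMPLE)["words"]:
    print(f"{interval.start:.2f} to {interval.end:.2f}: {interval.text}")
```

To read a file from disk, pass `Path(...).read_text(encoding="utf-8")` to `read_tiers`.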

Currently, there is only an acoustic model for the alignment of Chinese characters and no pinyin model. A pinyin model would be preferable, because many datasets ship pinyin transcriptions that carry more accurate pronunciation information than the characters alone. In addition, the MFA dictionary does not handle Erhua and other aspects of Chinese phonology in an optimal way.

Therefore, I trained my own model, using a dictionary based on IPA.
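As a rough illustration of what such a dictionary looks like: an MFA pronunciation dictionary is a plain-text file with one word per line, followed by its space-separated phones. The sketch below formats a few entries that way. The pinyin keys and IPA phone symbols are illustrative assumptions, not the inventory produced by create_dictionary.py, and `format_dictionary` is a hypothetical helper.

```python
# Illustrative entries only: the phone symbols are an assumption and do not
# reproduce the dictionary generated by this repository.
EXAMPLE_PRONUNCIATIONS = {
    "ni3": ["n", "i˨˩˦"],
    "hao3": ["x", "au˨˩˦"],
    "ma1": ["m", "a˥"],
}


def format_dictionary(pronunciations: dict[str, list[str]]) -> str:
    """Render entries as 'word<TAB>phone phone ...', one word per line."""
    lines = [
        f"{word}\t{' '.join(phones)}"
        for word, phones in sorted(pronunciations.items())
    ]
    return "\n".join(lines) + "\n"


print(format_dictionary(EXAMPLE_PRONUNCIATIONS), end="")
```

Keying the dictionary by pinyin syllable (with tone number) rather than by character lets the aligner use the pinyin annotations that the datasets provide.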

Generated TextGrid files and pretrained model

Instead of following the instructions below, you can also download the generated files from the releases.

Dataset	Model	TextGrid
AISHELL-3	url	url
biaobei	url	url
AISHELL-3 + biaobei + more	url	-

Instructions

Download dataset(s)

AISHELL-3: https://www.openslr.org/93/
biaobei: https://en.data-baker.com/datasets/freeDatasets/

Extract the downloaded datasets to the following directories: datasets/aishell3, datasets/biaobei.

For custom datasets, create a directory datasets/general with the following structure:

datasets/general/
├── SPEAKER_NAME_1/
│ ├── text_1.hanzi (with Chinese text, e.g., 对不起)
│ ├── text_1.wav (corresponding audio file)
│ ...
├── SPEAKER_NAME_2/
│ ├── text_2.hanzi
│ ├── text_2.wav
│ ...
├── ...
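Before training, it can help to check that the layout above is complete. The following sketch lists every .hanzi or .wav file that has no partner with the same name in its speaker directory. The function name `find_unpaired` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_unpaired(root: Path) -> list[Path]:
    """Return .hanzi/.wav files that lack a same-named partner file."""
    unpaired = []
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        hanzi = {p.stem for p in speaker_dir.glob("*.hanzi")}
        wav = {p.stem for p in speaker_dir.glob("*.wav")}
        # Transcripts without audio, then audio without transcripts.
        unpaired += [speaker_dir / f"{s}.hanzi" for s in sorted(hanzi - wav)]
        unpaired += [speaker_dir / f"{s}.wav" for s in sorted(wav - hanzi)]
    return unpaired


if __name__ == "__main__":
    root = Path("datasets/general")
    if root.is_dir():
        for path in find_unpaired(root):
            print(f"missing partner for {path}")
```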

Prepare alignment

conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
pip install pinyin_to_ipa
python create_dictionary.py

Perform alignment

In the following commands, replace TEMP_DIR with a temporary directory and set --num_jobs to suit your machine. Make sure that TEMP_DIR is an absolute path; otherwise the training might fail.

mfa train datasets/biaobei biaobei_pinyin_dictionary.txt biaobei_pinyin_acoustic.zip --output_directory datasets/biaobei --num_jobs 1 --temporary_directory TEMP_DIR --clean --use_mp --use_threading --single_speaker
mfa train datasets/aishell3 aishell3_pinyin_dictionary.txt aishell3_pinyin_acoustic.zip --output_directory datasets/aishell3 --num_jobs 32 --temporary_directory TEMP_DIR --clean --use_mp --use_threading
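Since a relative TEMP_DIR can make training fail, one option is to build the command from Python and resolve the path first. This is only a sketch: `build_train_command` and the `mfa_tmp` directory name are hypothetical, and the flags are exactly those used in the commands above.

```python
import shlex
from pathlib import Path


def build_train_command(corpus, dictionary, model, temp_dir,
                        num_jobs=1, single_speaker=False):
    """Assemble an `mfa train` argument list with an absolute temp directory."""
    temp = Path(temp_dir).expanduser().resolve()  # relative -> absolute
    cmd = [
        "mfa", "train", corpus, dictionary, model,
        "--output_directory", corpus,
        "--num_jobs", str(num_jobs),
        "--temporary_directory", str(temp),
        "--clean", "--use_mp", "--use_threading",
    ]
    if single_speaker:
        cmd.append("--single_speaker")
    return cmd


cmd = build_train_command(
    "datasets/biaobei",
    "biaobei_pinyin_dictionary.txt",
    "biaobei_pinyin_acoustic.zip",
    temp_dir="mfa_tmp",  # hypothetical directory name
    num_jobs=1,
    single_speaker=True,
)
print(shlex.join(cmd))
# To actually start training: subprocess.run(cmd, check=True)
```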

Post-processing

pip install tgt
python postprocess.py
