Perform any-2-any speaking rate conversion
Paper: https://arxiv.org/abs/2209.01978
Demo: go.upb.de/interspeech2022
git clone https://github.com/michael-kuhlmann/phoneme_rate_conversion.git
cd phoneme_rate_conversion
pip install -e .
We provide a training script to train a local phoneme rate estimator from forced alignments.
- Prepare the following environment variables:
  - $DB_ROOT: Points to the path where the LibriSpeech corpus and alignments will be stored
  - $STORAGE_ROOT: Points to the path where the trained models will be stored
- Download the LibriSpeech subsets to $DB_ROOT/LibriSpeech
- Download librispeech.json and put it under jsons
- Download librispeech_phone_ali to $DB_ROOT
- We used the Montreal Forced Aligner (MFA) to obtain the alignments. To create the alignments yourself, see here.
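Before starting the training, it can help to verify that the data is where the training script expects it. The following is a minimal sketch (not part of the repository) that checks the layout described in the steps above; the path names are taken from those steps and may need adjusting for your setup.

import os
from pathlib import Path

db_root = Path(os.environ['DB_ROOT'])            # corpus and alignments
storage_root = Path(os.environ['STORAGE_ROOT'])  # trained models end up here

for path in [
    db_root / 'LibriSpeech',             # downloaded LibriSpeech subsets
    db_root / 'librispeech_phone_ali',   # MFA phone alignments
    Path('jsons') / 'librispeech.json',  # database description
]:
    assert path.exists(), f'missing: {path}'
storage_root.mkdir(parents=True, exist_ok=True)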
We use sacred for easily configurable training runs. To start the training with the default values, use
python -m phoneme_rate_conversion.train
To customize the configuration, use
python -m phoneme_rate_conversion.train with shift_ms=10 window_size_ms=20
This will change the shift and window size of the STFT from 12.5 ms and 50 ms to 10 ms and 20 ms, respectively. You can print the full list of configuration options with
python -m phoneme_rate_conversion.train print_config
We provide a pretrained model that was trained on LibriSpeech and TIMIT and showed good generalization.
During inference, we can perform speaking rate conversion between two utterances without requiring any text labels.
from phoneme_rate_conversion.inference import SpeakingRateConverter
from scipy.io import wavfile
converter = SpeakingRateConverter.from_config(
SpeakingRateConverter.get_config(dict(model_dir='pretrained/')))
c_sample_rate, content_wav = wavfile.read('/path/to/content/wav')
s_sample_rate, style_wav = wavfile.read('/path/to/style/wav')
assert c_sample_rate == s_sample_rate
content_wav_time_scaled = converter(content_wav, style_wav, in_rate=c_sample_rate)
This will imprint the speaking rate of style_wav onto content_wav. The quality of the conversion depends on the choice of the utterances, the quality of the speaking rate estimator, and the voice activity detection (VAD) algorithm.

SpeakingRateConverter supports different time scaling and VAD algorithms, which can be customized by overwriting the time_scale_fn and vad arguments:
import phoneme_rate_conversion as prc
converter = prc.inference.SpeakingRateConverter(
model_dir='pretrained/',
time_scale_fn=prc.modules.time_scaling.WSOLA(sample_rate=c_sample_rate),
vad=prc.utils.vad.WebRTCVAD(sample_rate=c_sample_rate),
)
To deactivate the VAD, pass vad=None.
In our paper, we also proposed a completely unsupervised approach based on unsupervised phoneme segmentation. We slightly modified the code to work with this repository:
git clone https://github.com/michael-kuhlmann/UnsupSeg.git
cd UnsupSeg
pip install -e .
The inference works similarly:
from phoneme_rate_conversion.inference import UnsupSegSpeakingRateConverter
converter = UnsupSegSpeakingRateConverter.from_config(
UnsupSegSpeakingRateConverter.get_config(dict(model_dir='pretrained/')))
content_wav_time_scaled = converter(content_wav, style_wav, in_rate=c_sample_rate)
You can find a pretrained model in the same pretrained/ directory or use one from the original repository.