Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech. Audio samples are available on our demo page.
We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
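The masking and softening idea described above can be illustrated with a small NumPy sketch. This is not the code used in this repository: the function names `soft_mask` and `blend`, the moving-average kernel, and the kernel width are all assumptions chosen for illustration. It only shows how a binary mask over the target region can be smoothed so that an edit fades in and out at the boundaries.

```python
import numpy as np


def soft_mask(num_frames, start, end, kernel_width=9):
    """Binary mask over frames [start, end), smoothed by a moving average.

    Hypothetical helper: the kernel shape and width are illustrative
    assumptions, not values taken from the EdiTTS implementation.
    """
    mask = np.zeros(num_frames, dtype=np.float64)
    mask[start:end] = 1.0
    kernel = np.ones(kernel_width) / kernel_width
    # mode="same" keeps the output aligned with the input frames.
    return np.convolve(mask, kernel, mode="same")


def blend(original, edited, mask):
    """Per-frame convex combination of two (n_mels, num_frames) arrays."""
    weights = mask[None, :]
    return weights * edited + (1.0 - weights) * original


if __name__ == "__main__":
    original = np.zeros((80, 100))
    edited = np.ones((80, 100))
    mixed = blend(original, edited, soft_mask(100, 40, 60))
    # Frames far from the edit keep the original; frames inside take the edit.
    print(mixed[0, 0], mixed[0, 50])
```

In this sketch, frames well inside the target region take the edited values, frames far outside keep the original values, and frames near the boundary receive a weighted mix of the two.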
Please cite this work as follows.
@inproceedings{tae22_interspeech,
author={Jaesung Tae and Hyeongju Kim and Taesu Kim},
title={{EdiTTS: Score-based Editing for Controllable Text-to-Speech}},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={421--425},
doi={10.21437/Interspeech.2022-6}
}
- Create a Python virtual environment (`venv` or `conda`) and install package requirements as specified in `requirements.txt`.

      python -m venv venv
      source venv/bin/activate
      pip install -U pip
      pip install -r requirements.txt

- Build the monotonic alignment module.

      cd model/monotonic_align
      python setup.py build_ext --inplace

For more information, refer to the official repository of Grad-TTS.
Checkpoints are already included as part of this repository, under `checkpts`. The inference commands below use `checkpts/grad-tts-old.pt`.
- For pitch editing, prepare an input file containing samples for speech generation. Mark the segment to be edited with vertical bar separators, `|`. For instance, a single sample might look like

      In | the face of impediments confessedly discouraging |

  We provide a sample input file in `resources/filelists/edit_pitch_example.txt`.
- To run inference, type

      CUDA_VISIBLE_DEVICES=0 python edit_pitch.py \
          -f resources/filelists/edit_pitch_example.txt \
          -c checkpts/grad-tts-old.pt -t 1000 \
          -s out/pitch/wavs

  Adjust `CUDA_VISIBLE_DEVICES` as appropriate.
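To make the input format concrete, here is a minimal sketch of how a line with a marked segment could be split into the text before, inside, and after the vertical bars. The function `parse_pitch_line` is hypothetical and is not part of this repository; it assumes each sample contains exactly two `|` separators, as in the example above.

```python
def parse_pitch_line(line, sep="|"):
    """Split a sample into (before, target, after) around the marked segment.

    Hypothetical helper for illustration only. Assumes exactly two
    separators per line, which matches the sample shown in this README.
    """
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '{sep}' separators: {line!r}")
    before, target, after = parts
    return before, target, after


if __name__ == "__main__":
    sample = "In | the face of impediments confessedly discouraging |"
    print(parse_pitch_line(sample))
```

An empty string in the first or last position means the marked segment starts at the beginning or runs to the end of the sentence.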
- For content editing, prepare an input file containing pairs of sentences. Concatenate each pair with `#` and mark the parts to be replaced with vertical bar separators. For instance, a single pair might look like

      Three others subsequently | identified | Oswald from a photograph. #Three others subsequently | recognized | Oswald from a photograph.

  We provide a sample input file in `resources/filelists/edit_content_example.txt`.
- To run inference, type

      CUDA_VISIBLE_DEVICES=0 python edit_content.py \
          -f resources/filelists/edit_content_example.txt \
          -c checkpts/grad-tts-old.pt -t 1000 \
          -s out/content/wavs

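The paired format can be sketched the same way. The helpers below are hypothetical and not part of this repository; they assume one `#` joins the two sentences and that each sentence contains exactly two `|` separators, as in the example pair above.

```python
def split_marked(sentence, sep="|"):
    """Return (before, target, after) for a sentence with one marked segment."""
    parts = [part.strip() for part in sentence.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '{sep}' separators: {sentence!r}")
    return tuple(parts)


def parse_content_pair(line, pair_sep="#"):
    """Split a '#'-joined pair into two (before, target, after) triples.

    Hypothetical helper for illustration only; the first triple describes
    the source sentence and the second the sentence with replaced content.
    """
    source, found, replacement = line.partition(pair_sep)
    if not found:
        raise ValueError(f"missing '{pair_sep}' between sentences: {line!r}")
    return split_marked(source), split_marked(replacement)


if __name__ == "__main__":
    pair = (
        "Three others subsequently | identified | Oswald from a photograph. "
        "#Three others subsequently | recognized | Oswald from a photograph."
    )
    source, replacement = parse_content_pair(pair)
    print(source[1], "->", replacement[1])
```

Comparing the middle elements of the two triples shows which words are swapped, while the surrounding context is expected to match.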
Released under the modified GNU General Public License.