This repository includes two main components: a shell script for multiprocess batched inference and a Python script for single-process batched inference of the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model.
All the code in this repository is adapted from the original VITS repository.
Please follow the installation instructions from the original VITS repository before running the scripts.
```shell
./batched_vits_multiprocess_inference.sh \
    --csv_file <csv_file> \
    --gpu_ids <gpu_ids> \
    --max_process <max_process> \
    --batch_size <batch_size> \
    --vits_config <vits_config> \
    --vits_checkpoint <vits_checkpoint> \
    [--audio_save_dir <audio_save_dir>] \
    [--noise_scale <noise_scale>] \
    [--noise_scale_w <noise_scale_w>] \
    [--length_scale <length_scale>] \
    [--vits_multispeaker true]
```
- `--csv_file`: Path to the CSV or TSV file containing the input data. It must contain two columns, `text` and `filename`: the `text` column holds the text for which audio has to be generated, and the generated audio is saved to the path in `filename`. An optional `speaker_id` column can be added for multispeaker models (see the note below the multispeaker example). See the example file: `test_data.csv`.
- `--gpu_ids`: Comma-separated GPU IDs to use for multiprocessing.
- `--max_process`: Maximum number of parallel processes.
- `--batch_size`: Batch size for each process.
- `--vits_config`: Path to the VITS model configuration file.
- `--vits_checkpoint`: Path to the VITS model checkpoint file.
- `--audio_save_dir`: Directory to save generated audio (default: `./VITS_TTS_samples/`).
- `--noise_scale`: Noise scale factor (default: 0.667).
- `--noise_scale_w`: Noise scale weight (default: 0.8).
- `--length_scale`: Length scale factor (default: 1).
- `--vits_multispeaker`: Optional flag indicating whether a multispeaker model is used (default: false).
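For illustration, an input file with the column layout described above can be written with Python's standard `csv` module. The rows here are made up for the example; the repository's actual `test_data.csv` may differ:

```python
import csv

# Hypothetical example rows; the real test_data.csv in the repo may differ.
rows = [
    {"text": "Hello world.", "filename": "sample_0001.wav"},
    {"text": "VITS synthesizes speech from text.", "filename": "sample_0002.wav"},
]

with open("test_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "filename"])
    writer.writeheader()
    writer.writerows(rows)
# For a multispeaker model, add an optional "speaker_id" column as well.
```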
```shell
bash batched_vits_multiprocess_inference.sh --csv_file test_data.csv --gpu_ids 2,4 --max_process 4 --batch_size 2 --vits_config ./configs/ljs_base.json --vits_checkpoint ../pretrained_ljs.pth --audio_save_dir ./TTS_samples/test
```

Note: The `speaker_id` column is not needed in `test_data.csv` for a single-speaker model.
```shell
bash batched_vits_multiprocess_inference.sh --csv_file test_data.csv --gpu_ids 2,4 --max_process 4 --batch_size 2 --vits_config ./configs/vctk_base.json --vits_checkpoint ../pretrained_vctk.pth --audio_save_dir ./TTS_samples/test_sid --vits_multispeaker true
```

Note: If you use the `--vits_multispeaker` option but the `speaker_id` column is absent from your dataset, the script compensates by generating random speaker IDs, chosen from the range 0 to `hyperparameters.data.n_speakers - 1`.
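The random-ID fallback described above can be sketched as follows. This is an illustrative sketch, not the script's actual code; in practice `n_speakers` would be read from `hyperparameters.data.n_speakers` in the config file:

```python
import random

def assign_speaker_ids(num_rows: int, n_speakers: int, seed: int = 1) -> list[int]:
    """Sketch of the fallback: draw one random speaker ID per row
    when the input CSV lacks a speaker_id column."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [rng.randint(0, n_speakers - 1) for _ in range(num_rows)]
```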
- Log files (in the `./logs` directory) for each process: `log_1.txt`, `log_2.txt`, ..., `log_<max_process>.txt`.
- Generated audio saved in the specified directory.
Note: If audio files with the same name already exist in the output directory, they will not be regenerated.
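The skip rule above amounts to a simple existence check before synthesis. A minimal sketch (the function name is illustrative, not taken from the script):

```python
import os

def needs_generation(audio_path: str) -> bool:
    """Sketch of the skip rule described above: an output file that
    already exists in the output directory is not regenerated."""
    return not os.path.exists(audio_path)
```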
Multiple runs of this script with different arguments can be launched via `batched_vits_multiprocessing_inference.sh`.
```shell
python batched_vits_inference.py \
    --vits_config <vits_config_path> \
    --vits_checkpoint <vits_checkpoint_path> \
    --audio_saving_dir <audio_save_dir> \
    --data_file <data_file_path> \
    [--seed <seed>] \
    [--start_idx <start_index>] \
    [--end_idx <end_index>] \
    --batch_size <batch_size> \
    [--noise_scale <noise_scale>] \
    [--noise_scale_w <noise_scale_w>] \
    [--length_scale <length_scale>] \
    [--vits_multispeaker True]
```
- `--vits_config`: Path to the VITS model configuration file (default: `../configs/vctk_base.json`).
- `--vits_checkpoint`: Path to the VITS model checkpoint file (default: `./pretrained_ljs.pth`).
- `--audio_saving_dir` or `-v`: Directory to save the TTS samples generated by the VITS model (default: `./VITS_TTS_samples/`).
- `--data_file`: Path to the CSV or TSV file containing `text` and `audio_filename` columns. The `text` column contains the text for which audio will be generated, and the `audio_filename` column contains the path where the generated audio will be saved (default: `./test_data.csv`).
- `--seed`: Seed for reproducibility (default: 1).
- `--start_idx`: Start index (inclusive) into the dataframe, used for multiprocessing (default: 0).
- `--end_idx`: End index (exclusive) into the dataframe, so the processed range is `[start_idx, end_idx)` (default: None, meaning the full length of the dataset).
- `--batch_size`: Batch size for inference.
- `--noise_scale`: Noise scale used for inference (default: 0.667).
- `--noise_scale_w`: Noise scale weight used for inference (default: 0.8).
- `--length_scale`: Length (duration) scale used for inference (default: 1).
- `--vits_multispeaker`: Optional flag indicating whether a multispeaker model is used (default: False).
```shell
python batched_vits_inference.py \
    --vits_config ./configs/ljs_base.json \
    --vits_checkpoint pretrained_ljs.pth \
    --audio_saving_dir ./VITS_TTS_samples/ \
    --data_file test_data.csv \
    --seed 1 \
    --start_idx 0 \
    --end_idx 100 \
    --batch_size 4 \
    --noise_scale 0.5 \
    --noise_scale_w 0.9 \
    --length_scale 1.2 \
    --vits_multispeaker True
```
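`--batch_size` groups the input rows into fixed-size chunks before inference. A minimal sketch of that grouping (illustrative, not the script's actual code):

```python
def batches(items: list, batch_size: int):
    """Yield successive chunks of at most batch_size items,
    the grouping implied by --batch_size above."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```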
Feel free to adjust the parameters based on your specific needs. Contributions and feedback are welcome!