Batched Inference for VITS TTS Model

This repository includes two main components: a shell script for multiprocess batched inference and a Python script for single-process inference of the VITS (Variational Inference Text-to-Speech) model.

Installation

All the code in this repository is adapted from the original VITS repository (https://github.com/jaywalnut310/vits).

Please follow the installation instructions from the original VITS repository before running the scripts.

Batched Inference (Multiprocessing)

Usage

./batched_vits_multiprocess_inference.sh --csv_file <csv_file> --gpu_ids <gpu_ids> --max_process <max_process> --batch_size <batch_size> --vits_config <vits_config> --vits_checkpoint <vits_checkpoint> [--audio_save_dir <audio_save_dir> --noise_scale <noise_scale> --noise_scale_w <noise_scale_w> --length_scale <length_scale> --vits_multispeaker true]

Parameters

  • --csv_file: Path to the CSV or TSV file containing the input data. It must contain two columns, text and filename: text holds the text to synthesize, and filename is the name under which the generated audio is saved. An optional speaker_id column is used with multispeaker models (see the note below the Multispeaker Model example). An example file is provided in test_data.csv, and a sketch of the format follows this list.
  • --gpu_ids: Comma-separated GPU IDs to use for multiprocessing.
  • --max_process: Maximum number of parallel processes.
  • --batch_size: Batch size for each process.
  • --vits_config: Path to the VITS model configuration file.
  • --vits_checkpoint: Path to the VITS model checkpoint file.
  • --audio_save_dir: Directory to save generated audio (default: "./VITS_TTS_samples/").
  • --noise_scale: Noise scale factor (default: 0.667).
  • --noise_scale_w: Noise scale weight (default: 0.8).
  • --length_scale: Length scale factor (default: 1).
  • --vits_multispeaker: Optional flag for indicating whether a multispeaker model is used (default: false).
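
As a reference for the expected input format, here is a minimal sketch that writes such a file with pandas (the rows below are illustrative, not part of the repository):

import pandas as pd

rows = [
    {"text": "Hello world.", "filename": "sample_0001.wav"},
    {"text": "Batched inference is faster than one utterance at a time.", "filename": "sample_0002.wav"},
]
df = pd.DataFrame(rows)

# For a multispeaker model, an optional speaker_id column can be added, e.g.:
# df["speaker_id"] = [0, 3]

df.to_csv("test_data.csv", index=False)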

Example

Single Speaker Model

bash batched_vits_multiprocess_inference.sh --csv_file test_data.csv --gpu_ids 2,4 --max_process 4 --batch_size 2 --vits_config ./configs/ljs_base.json --vits_checkpoint ../pretrained_ljs.pth --audio_save_dir ./TTS_samples/test

Note: A speaker_id column is not needed in test_data.csv for a single-speaker model.

Multispeaker Model

bash batched_vits_multiprocess_inference.sh --csv_file test_data.csv --gpu_ids 2,4 --max_process 4 --batch_size 2 --vits_config ./configs/vctk_base.json --vits_checkpoint ../pretrained_vctk.pth --audio_save_dir ./TTS_samples/test_sid --vits_multispeaker true

Note: If you use the vits_multispeaker option but the speaker_id column is absent from your dataset, the script compensates by assigning each row a random speaker ID drawn from the range 0 to hyperparameters.data.n_speakers - 1.
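
A minimal sketch of that fallback (variable names are illustrative, not the script's actual code):

import random

n_speakers = 109   # illustrative; in practice read from hyperparameters.data.n_speakers
num_rows = 8       # number of input rows that lack a speaker_id
speaker_ids = [random.randint(0, n_speakers - 1) for _ in range(num_rows)]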

Output

  • Log files (in the ./logs directory) for each process: log_1.txt, log_2.txt, ..., log_<max_process>.txt.
  • Generated audio saved in the specified directory.

Note: If audio files with the same name already exist in the output directory, they will not be regenerated.
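
The skip amounts to an existence check before synthesis; a sketch of that behaviour (not the script's exact code):

import os

def needs_generation(audio_save_dir: str, filename: str) -> bool:
    """Return True only if the target audio file has not been generated yet."""
    return not os.path.exists(os.path.join(audio_save_dir, filename))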

Single Process Inference (batched_vits_inference.py)

This script runs batched inference in a single process; batched_vits_multiprocess_inference.sh launches multiple runs of it with different arguments.
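
Conceptually, the shell script splits the rows of the data file into contiguous [start_idx, end_idx) chunks and launches one run per chunk on the available GPUs. A rough Python sketch of that splitting (the real script does this in bash, and its exact scheduling may differ):

import math

num_rows = 1000    # total rows in the CSV/TSV (illustrative)
max_process = 4    # --max_process

chunk = math.ceil(num_rows / max_process)
for i in range(max_process):
    start_idx = i * chunk
    end_idx = min((i + 1) * chunk, num_rows)
    if start_idx >= end_idx:
        break
    # Each command below would be launched as a separate process on one of the GPUs.
    print(f"python batched_vits_inference.py --start_idx {start_idx} --end_idx {end_idx} ...")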

Usage

python batched_vits_inference.py \
    --vits_config <vits_config_path> \
    --vits_checkpoint <vits_checkpoint_path> \
    --audio_saving_dir <audio_save_dir> \
    --data_file <data_file_path> \
    [--seed <seed>] \
    [--start_idx <start_index>] \
    [--end_idx <end_index>] \
    --batch_size <batch_size> \
    [--noise_scale <noise_scale>] \
    [--noise_scale_w <noise_scale_w>] \
    [--length_scale <length_scale>] \
    [--vits_multispeaker True]

Parameters

  • --vits_config: Path to the VITS model configuration file (default: "../configs/vctk_base.json").
  • --vits_checkpoint: Path to the VITS model checkpoint file (default: "./pretrained_ljs.pth").
  • --audio_saving_dir or -v: Directory to save the TTS samples generated by the VITS model (default: "./VITS_TTS_samples/").
  • --data_file: Path to the CSV or TSV file containing text and audio_filename columns. The text column contains the text for which audio will be generated, and the audio_filename column contains the path where the generated audio will be saved (default: ./test_data.csv).
  • --seed: Seed for reproducibility (default: 1).
  • --start_idx: Index in the dataframe at which to start processing (inclusive); used for splitting work across processes (default: 0).
  • --end_idx: Index in the dataframe at which to stop processing (exclusive, so the processed range is [start_idx, end_idx)); used for splitting work across processes (default: None, which means the end of the dataset).
  • --batch_size: Batch size for inference.
  • --noise_scale: Noise scale used for inference (default: 0.667).
  • --noise_scale_w: Noise scale weight used for inference (default: 0.8).
  • --length_scale: Duration scale factor used for inference; values above 1 slow the speech down (default: 1). See the sketch after this list for how the three scale parameters are passed to the model.
  • --vits_multispeaker: Optional flag for indicating whether a multispeaker model is used (default: False).
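
The three scale parameters are forwarded to the model's infer call. Here is an illustrative per-utterance sketch based on the inference example from the original VITS repository; this repository's script performs the equivalent call in batches, and its exact loading code may differ:

import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("./configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("pretrained_ljs.pth", net_g, None)

# Convert text to token IDs, as in the original inference example.
text_norm = text_to_sequence("Batched inference for VITS.", hps.data.text_cleaners)
if hps.data.add_blank:
    text_norm = commons.intersperse(text_norm, 0)
x = torch.LongTensor(text_norm).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()

with torch.no_grad():
    # noise_scale, noise_scale_w and length_scale correspond to the CLI flags above.
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8,
                        length_scale=1.0)[0][0, 0].cpu().float().numpy()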

Example

python batched_vits_inference.py \
    --vits_config ./configs/ljs_base.json \
    --vits_checkpoint pretrained_ljs.pth \
    --audio_saving_dir ./VITS_TTS_samples/ \
    --data_file test_data.csv \
    --seed 1 \
    --start_idx 0 \
    --end_idx 100 \
    --batch_size 4 \
    --noise_scale 0.5 \
    --noise_scale_w 0.9 \
    --length_scale 1.2 \
    --vits_multispeaker True

Feel free to adjust the parameters based on your specific needs. Contributions and feedback are welcome!
