Grad-SVC based on Grad-TTS from HUAWEI Noah's Ark Lab

This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs considerably from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so the features from whisper-vits-svc are used in this project. Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up of Grad-TTS.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

The framework of grad-svc-v1

The framework of grad-svc-v2 & v3, encoder:768->512, diffusion:64->96

Elysia_Grad_SVC.mp4

Features

Clean, easy-to-read code from Grad-TTS
Multi-speaker support based on a speaker encoder
No speaker leakage, based on perturbation, instance normalization, and GRL (gradient reversal layer)

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
No electronic sound
Integrated DPM Solver-k, for fewer steps
Integrated Fast Maximum Likelihood Sampling Scheme, for fewer steps
Conditional Flow Matching (V3), first used in SVC
Rectified Flow Matching (TODO)
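
The instance normalization item above can be illustrated with a minimal sketch. It is not taken from this repository; it only shows the general idea, assuming features are stored as one list of frame values per channel. Each channel is normalized over time, which removes utterance-level statistics. The function name `instance_norm` and the `eps` value are illustrative assumptions.

```python
import math
from statistics import fmean


def instance_norm(features, eps=1e-5):
    """Normalize each channel over time to zero mean and unit variance.

    `features` is a list of channels; each channel is a list of frame values.
    This layout is an assumption made for illustration only.
    """
    normalized = []
    for channel in features:
        mean = fmean(channel)
        var = fmean([(v - mean) ** 2 for v in channel])
        scale = 1.0 / math.sqrt(var + eps)  # eps avoids division by zero
        normalized.append([(v - mean) * scale for v in channel])
    return normalized
```

After normalization, every channel has roughly zero mean and unit variance regardless of who is speaking, which is why this operation is commonly used to suppress speaker information in content features.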

Setup Environment

Install project dependencies
```
pip install -r requirements.txt
```
Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/.
Download the hubert_soft model, and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
Download pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.

Performance bottleneck: the generator and discriminator together are 116 MB, while the generator alone is only 22 MB.

Download the pretrained model gvc.pretrain.pth, and put it into grad_pretrain/.
```
python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
```
For this pretrained model, temperature is set to 1.015 in gvc_inference.py to get good results.
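
Before running inference, it can help to confirm that every downloaded file landed where the setup steps above expect it. The following is a small sketch, not part of the repository; the helper name `missing_pretrained` is hypothetical, and the paths are copied from the steps above.

```python
from pathlib import Path

# Paths as listed in the setup steps, relative to the repository root.
PRETRAINED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "bigvgan_pretrain/nsf_bigvgan_pretrain_32K.pth",
    "grad_pretrain/gvc.pretrain.pth",
]


def missing_pretrained(root="."):
    """Return the expected pretrained files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in PRETRAINED_FILES if not (root / rel).is_file()]
```

Calling `missing_pretrained()` from the repository root returns an empty list once all four files are in place.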

Dataset preparation

Put the dataset into the data_raw directory following the structure below.

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
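
The layout above can be checked with a short script before preprocessing. This is a sketch under the assumption that every direct subdirectory of data_raw is one speaker and that only .wav files matter; the function name `list_speakers` is hypothetical.

```python
from pathlib import Path


def list_speakers(data_raw="data_raw"):
    """Map each speaker directory name to its sorted list of .wav file names."""
    speakers = {}
    for speaker_dir in sorted(Path(data_raw).iterdir()):
        if speaker_dir.is_dir():
            speakers[speaker_dir.name] = sorted(
                p.name for p in speaker_dir.glob("*.wav")
            )
    return speakers
```

A speaker that maps to an empty list has no usable audio and is worth fixing before the re-sampling step.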

Data preprocessing

After preprocessing, the output has the following structure.

data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
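
The tree above follows one naming rule per feature type, which can be written down as a small lookup. This is an illustrative sketch, not code from the repository; `FEATURE_LAYOUT` and `feature_paths` are hypothetical names, and the directory names and suffixes are copied from the tree.

```python
from pathlib import Path

# (subdirectory, file suffix) pairs as shown in the tree above.
FEATURE_LAYOUT = {
    "wave_16k": ("waves-16k", ".wav"),
    "wave_32k": ("waves-32k", ".wav"),
    "mel": ("mel", ".mel.pt"),
    "pitch": ("pitch", ".pit.npy"),
    "hubert": ("hubert", ".vec.npy"),
    "speaker": ("speaker", ".spk.npy"),
}


def feature_paths(speaker, utterance, root="data_gvc"):
    """Return the expected per-utterance feature paths for one recording."""
    root = Path(root)
    return {
        name: root / subdir / speaker / f"{utterance}{suffix}"
        for name, (subdir, suffix) in FEATURE_LAYOUT.items()
    }
```

For example, `feature_paths("speaker0", "000001")["mel"]` points at `data_gvc/mel/speaker0/000001.mel.pt`, which makes it easy to spot utterances whose features were not generated.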

Re-sampling

Generate audio with a sampling rate of 16000Hz in ./data_gvc/waves-16k

python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000

Generate audio with a sampling rate of 32000Hz in ./data_gvc/waves-32k

python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000

Use 16k audio to extract pitch

python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch

Use 32k audio to extract mel

python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel

Use 16k audio to extract hubert

python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert

Use 16k audio to extract timbre code

python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker

Extract the average value of the timbre code for inference

python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
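
Conceptually, the averaging step above reduces all per-utterance timbre codes of one speaker to a single vector. The sketch below assumes this is an element-wise mean and uses plain Python lists; the actual script works on the .spk.npy files and may differ in detail. The function name `average_embedding` is hypothetical.

```python
def average_embedding(embeddings):
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```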

Use 32k audio to generate training index
```
python prepare/preprocess_train.py
```
Training file debugging
```
python prepare/preprocess_zzz.py
```
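
The preprocessing commands above must run in order, so it may be convenient to drive them from one script. The following is a sketch, not part of the repository; `PREPROCESS_STEPS` and `run_preprocessing` are hypothetical names, and the arguments are copied from the commands above.

```python
import subprocess
import sys

# One entry per command listed above, in the same order.
PREPROCESS_STEPS = [
    ["prepare/preprocess_a.py", "-w", "./data_raw", "-o", "./data_gvc/waves-16k", "-s", "16000"],
    ["prepare/preprocess_a.py", "-w", "./data_raw", "-o", "./data_gvc/waves-32k", "-s", "32000"],
    ["prepare/preprocess_f0.py", "-w", "data_gvc/waves-16k/", "-p", "data_gvc/pitch"],
    ["prepare/preprocess_spec.py", "-w", "data_gvc/waves-32k/", "-s", "data_gvc/mel"],
    ["prepare/preprocess_hubert.py", "-w", "data_gvc/waves-16k/", "-v", "data_gvc/hubert"],
    ["prepare/preprocess_speaker.py", "data_gvc/waves-16k/", "data_gvc/speaker"],
    ["prepare/preprocess_speaker_ave.py", "data_gvc/speaker/", "data_gvc/singer"],
    ["prepare/preprocess_train.py"],
    ["prepare/preprocess_zzz.py"],
]


def run_preprocessing(dry_run=True):
    """Return the commands in order; execute them only when dry_run is False."""
    commands = [[sys.executable, *step] for step in PREPROCESS_STEPS]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return commands
```

With the default `dry_run=True` nothing is executed, so the returned list can be printed and reviewed first. Run it from the repository root with `dry_run=False` to execute the steps.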

Train

Start training
```
python gvc_trainer.py
```

Resume training

python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
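
To resume from the newest checkpoint without typing its name, the file can be picked by modification time. This is a sketch under the assumption that checkpoints live in logs/grad_svc and match grad_svc_*.pth, as in the command above; the helper `latest_checkpoint` is hypothetical.

```python
import glob
import os


def latest_checkpoint(log_dir="logs/grad_svc", pattern="grad_svc_*.pth"):
    """Return the most recently modified checkpoint path, or None if none exist."""
    candidates = glob.glob(os.path.join(log_dir, pattern))
    return max(candidates, key=os.path.getmtime) if candidates else None
```

The returned path can then be passed to `gvc_trainer.py` through the `-p` option.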

Log visualization
```
tensorboard --logdir logs/
```

Train Loss

Inference

Export inference model

python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth

Inference

python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --rature 1.015 --shift 0

The temperature (1.015 here) needs to be adjusted to get good results; the recommended range is (1.001, 1.035).
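
Because the best temperature has to be found by trial, a small sweep over the recommended range can save typing. The sketch below is an assumption-laden helper, not repository code: `temperature_grid` and `inference_command` are hypothetical names, and the command line reuses the flags exactly as written in the inference command above.

```python
def temperature_grid(low=1.001, high=1.035, steps=5):
    """Evenly spaced temperatures strictly inside the open interval (low, high)."""
    width = (high - low) / (steps + 1)
    return [round(low + width * (i + 1), 4) for i in range(steps)]


def inference_command(temperature, model="gvc.pth",
                      spk="./data_gvc/singer/your_singer.spk.npy",
                      wave="test.wav"):
    """Build the inference command shown above for one temperature value."""
    return ["python", "gvc_inference.py", "--model", model, "--spk", spk,
            "--wave", wave, "--rature", str(temperature), "--shift", "0"]
```

Each command can be run in turn and the outputs compared by ear; keep in mind that successive runs may write to the same output file, so rename the result between runs.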

Inference step by step

Extract hubert content vector

python hubert/inference.py -w test.wav -v test.vec.npy

Extract pitch to the csv text format

python pitch/inference.py -w test.wav -p test.csv

Convert hubert & pitch to wave

python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
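
The `--shift` option in the commands above is not documented in this README. Assuming it transposes the pitch in semitones, which is a common convention in SVC tools, the effect on extracted F0 values can be sketched as follows. The function name `shift_pitch` is hypothetical and the code is not taken from the repository.

```python
def shift_pitch(f0_hz, semitones):
    """Transpose F0 values by `semitones`; unvoiced frames (0.0) stay at 0.0.

    Assumes equal temperament: one semitone is a factor of 2 ** (1 / 12).
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio for f in f0_hz]
```

Under this assumption, a shift of 12 doubles every frequency (one octave up), a shift of -12 halves it, and a shift of 0 leaves the pitch unchanged.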

Data

| Name | URL |
| --- | --- |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

Code sources and references

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS

https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC

https://github.com/facebookresearch/speech-resynthesis

https://github.com/cantabile-kwok/VoiceFlow-TTS

https://github.com/shivammehta25/Matcha-TTS

https://github.com/shivammehta25/Diff-TTSG

https://github.com/majidAdibian77/ResGrad

https://github.com/LuChengTHU/dpm-solver

https://github.com/gmltmd789/UnitSpeech

https://github.com/zhenye234/CoMoSpeech

https://github.com/seahore/PPG-GradVC

https://github.com/thuhcsi/LightGrad

https://github.com/lmnt-com/wavegrad

https://github.com/naver-ai/facetts

https://github.com/jaywalnut310/vits

https://github.com/NVIDIA/BigVGAN

https://github.com/bshall/soft-vc

https://github.com/mozilla/TTS

https://github.com/ubisoft/ubisoft-laforge-daft-exprt

https://github.com/yl4579/StyleTTS-VC

https://github.com/MingjieChen/DYGANVC

https://github.com/sony/ai-research-code/tree/master/nvcnet
