
Outline of recipes

Here we introduce the outline of the recipes.

If you want to learn step-by-step, you can try the demo recipe in Google Colab!


Supported database

Type of recipe

sd: speaker-dependent model

  • build a speaker-dependent model
  • the speaker of the training data is the same as that of the evaluation data
  • auxiliary features are based on World analysis
  • noise shaping with World mel-cepstrum is applied

si-open: speaker-independent model in open condition

  • build a speaker-independent model in the speaker-open condition
  • the speakers of the evaluation data are not included in the training data
  • auxiliary features are based on World analysis
  • noise shaping with World mel-cepstrum is applied

si-close: speaker-independent model in speaker-closed condition

  • build a speaker-independent model in the speaker-closed condition
  • the speakers of the evaluation data are included in the training data
  • auxiliary features are based on World analysis
  • noise shaping with World mel-cepstrum is applied

*-melspc: model with mel-spectrogram

  • build the model with mel-spectrogram
  • auxiliary features are mel-spectrograms
  • noise shaping with STFT-based mel-cepstrum is applied
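
Each recipe type above corresponds to its own directory under egs/<database_name>/. A rough sketch of the layout for the arctic database used in the examples below (the melspc directory name is an assumption):

egs/
  arctic/
    sd/          # speaker-dependent recipe
    si-open/     # speaker-independent recipe, speaker-open condition
    si-close/    # speaker-independent recipe, speaker-closed condition
    sd-melspc/   # sd recipe with mel-spectrogram auxiliary features (assumed name)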

Flow of recipe

  1. data preparation (stage 0)
  2. auxiliary feature extraction (stage 1)
  3. statistics calculation (stage 2)
  4. noise weighting (stage 3)
  5. WaveNet training (stage 4)
  6. WaveNet decoding (stage 5)
  7. noise shaping (stage 6)
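
Each stage can be run selectively through the --stage option of run.sh, so the flow above does not have to be executed in one go. A minimal sketch, assuming the leading stages are selected the same way as the trailing ones shown in the next section:

# run only data preparation and auxiliary feature extraction (stages 0 and 1)
$ ./run.sh --stage 01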

How-to-run

# change directory to one of the recipes
$ cd arctic/sd

# run the recipe
$ ./run.sh

# you can skip some stages (in this case, only stages 4, 5, and 6 will be run)
$ ./run.sh --stage 456

# you can also change hyperparameters via command line
$ ./run.sh --lr 1e-3 --batch_length 10000

# multi-GPU training / decoding is supported (batch size should be no less than #gpus)
$ ./run.sh --n_gpus 3 --batch_size 3
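
# the options above can also be combined (a sketch using only the flags shown above)
$ ./run.sh --stage 456 --n_gpus 2 --batch_size 4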

Run recipe with slurm

If Slurm is installed on your servers, you can run the recipes with Slurm.

$ cd egs/arctic/sd

# edit configuration
$ vim cmd.sh
# please edit as follows
-- cmd.sh --
# for local
# export train_cmd="run.pl"
# export cuda_cmd="run.pl --gpu 1"

# for slurm (you can change configuration file "conf/slurm.conf")
export train_cmd="slurm.pl --config conf/slurm.conf"
export cuda_cmd="slurm.pl --gpu 1 --config conf/slurm.conf"

$ vim conf/slurm.conf
# edit <your_partition_name>
-- slurm.conf --
command sbatch --export=PATH  --ntasks-per-node=1
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0 --ntasks-per-node=1
option num_threads=1 --cpus-per-task 1  --ntasks-per-node=1
default gpu=0
option gpu=0 -p <your_partition_name>
option gpu=* -p <your_partition_name> --gres=gpu:$0 --time 10-00:00:00

# run the recipe
$ ./run.sh

If you want to know more about run.pl and slurm.pl, see https://kaldi-asr.org/doc/queue.html.
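
Jobs submitted through slurm.pl are ordinary Slurm jobs, so the usual Slurm commands can be used to monitor them (a short sketch; <job_id> is a placeholder):

# check your queued / running jobs
$ squeue -u $USER

# cancel a job if something goes wrong
$ scancel <job_id>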

Use pre-trained model to decode your own data

To synthesize your own data, you need the following:

- checkpoint-final.pkl (model parameter file)
- model.conf (model configuration file)
- stats.h5 (feature statistics file)
- *.wav (your own wav files, which should be sampled at 16000 Hz; see the sketch below to check or convert the rate)
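
If your wav files are not sampled at 16000 Hz, you can check and convert them beforehand, for example with sox (this assumes sox is installed; file names are placeholders):

# check the sampling rate
$ soxi -r <your_wav_file>.wav

# convert to 16000 Hz
$ sox <your_wav_file>.wav <converted_wav_file>.wav rate 16000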

The procedure is as follows:

$ cd egs/arctic/si-close

# download a pre-trained model trained with 6 arctic speakers and World features
$ wget "https://www.dropbox.com/s/xt7qqmfgamwpqqg/si-close_lr1e-4_wd0_bs20k_ns_up.zip?dl=0" -O si-close_lr1e-4_wd0_bs20k_ns_up.zip

# unzip
$ unzip si-close_lr1e-4_wd0_bs20k_ns_up.zip

# make filelist of your own wav files
$ find <your_wav_dir> -name "*.wav" > wav.scp

# feature extraction
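# note: minf0/maxf0 are speaker-dependent; rough values are usually chosen from an F0 analysis (e.g. a histogram) of your own data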
$ . ./path.sh
$ feature_extract.py \
    --waveforms wav.scp \
    --wavdir wav/test \
    --hdf5dir hdf5/test \
    --feature_type world \
    --fs 16000 \
    --shiftms 5 \
    --minf0 <set_appropriate_value> \
    --maxf0 <set_appropriate_value> \
    --mcep_dim 24 \
    --mcep_alpha 0.41 \
    --highpass_cutoff 70 \
    --fftl 1024 \
    --n_jobs 1
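
# (optional) check that the feature files were generated
$ ls hdf5/test
# each .h5 file can also be inspected with h5ls if the HDF5 command-line tools are installed (an assumption)
$ h5ls hdf5/test/<one_of_your_files>.h5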

# make filelist of feature file
$ find hdf5/test -name "*.h5" > feats.scp

# decode with pre-trained model
$ decode.py \
    --feats feats.scp \
    --stats si-close_lr1e-4_wd0_bs20k_ns_up/stats.h5 \
    --outdir si-close_lr1e-4_wd0_bs20k_ns_up/wav \
    --checkpoint si-close_lr1e-4_wd0_bs20k_ns_up/checkpoint-final.pkl \
    --config si-close_lr1e-4_wd0_bs20k_ns_up/model.conf \
    --fs 16000 \
    --batch_size 32 \
    --n_gpus 1

# make filelist of generated wav file
$ find si-close_lr1e-4_wd0_bs20k_ns_up/wav -name "*.wav" > wav_generated.scp

# perform noise shaping
$ noise_shaping.py \
    --waveforms wav_generated.scp \
    --stats si-close_lr1e-4_wd0_bs20k_ns_up/stats.h5 \
    --outdir si-close_lr1e-4_wd0_bs20k_ns_up/wav_nsf \
    --feature_type world \
    --fs 16000 \
    --shiftms 5 \
    --mcep_dim_start 2 \
    --mcep_dim_end 27 \
    --mcep_alpha 0.41 \
    --mag 0.5 \
    --inv false \
    --n_jobs 1

Finally, you can listen to the generated wav files in si-close_lr1e-4_wd0_bs20k_ns_up/wav_nsf.
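
To quickly check the results by ear, any audio player works, for example the play command that comes with sox (this assumes sox is installed; the file name is a placeholder):

$ play si-close_lr1e-4_wd0_bs20k_ns_up/wav_nsf/<generated_wav_file>.wav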

Author

Tomoki Hayashi @ Nagoya University
e-mail: hayashi.tomoki@g.sp.m.is.nagoya-u.ac.jp