Code accompanying the paper SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (arXiv:1612.07837). Samples are available here.
Extensively tested with:
- cuDNN 5105
- Python 2.7.12
- Numpy 1.11.1
- Theano 0.8.2 (0.9 for WaveNet re-implementation)
- Lasagne 0.2.dev1
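To verify the environment before training, a quick version check can help (a minimal sketch; it assumes only that the packages listed above are importable):

import numpy, theano, lasagne
print("NumPy   %s" % numpy.__version__)    # expect 1.11.1
print("Theano  %s" % theano.__version__)   # expect 0.8.2 (0.9 for WaveNet)
print("Lasagne %s" % lasagne.__version__)  # expect 0.2.dev1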
The music dataset was created from all 32 of Beethoven's piano sonatas, publicly available on archive.org. The datasets/music directory contains scripts to preprocess and build this dataset; it is also available here for download. Extract the tar file and put all the numpy files in the datasets/music directory.
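After extraction, a quick sanity check of the numpy files is possible (a minimal sketch; the exact file names, e.g. music_train.npy, are an assumption -- verify against the contents of the tar file):

import numpy as np
train = np.load('datasets/music/music_train.npy')  # hypothetical file name
print(train.shape, train.dtype)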
To train a model on an existing dataset with accelerated GPU processing, run the following lines from the root of the sampleRNN_ICLR2017 folder; each command corresponds to the best set of hyper-parameters found for that model.
Mission control center:
$ pwd
/u/mehris/sampleRNN_ICLR2017
$ python models/two_tier/two_tier.py -h
usage: two_tier.py [-h] [--exp EXP] --n_frames N_FRAMES --frame_size
FRAME_SIZE --weight_norm WEIGHT_NORM --emb_size EMB_SIZE
--skip_conn SKIP_CONN --dim DIM --n_rnn {1,2,3,4,5}
--rnn_type {LSTM,GRU} --learn_h0 LEARN_H0 --q_levels
Q_LEVELS --q_type {linear,a-law,mu-law} --which_set
{ONOM,BLIZZ,MUSIC} --batch_size {64,128,256} [--debug]
[--resume]
two_tier.py No default value! Indicate every argument.
optional arguments:
-h, --help show this help message and exit
--exp EXP Experiment name
--n_frames N_FRAMES How many "frames" to include in each Truncated BPTT
pass
--frame_size FRAME_SIZE
How many samples per frame
--weight_norm WEIGHT_NORM
Adding learnable weight normalization to all the
linear layers (except for the embedding layer)
--emb_size EMB_SIZE Size of embedding layer (0 to disable)
--skip_conn SKIP_CONN
Add skip connections to RNN
--dim DIM Dimension of RNN and MLPs
--n_rnn {1,2,3,4,5} Number of layers in the stacked RNN
--rnn_type {LSTM,GRU}
GRU or LSTM
--learn_h0 LEARN_H0 Whether to learn the initial state of RNN
--q_levels Q_LEVELS Number of bins for quantization of audio samples.
Should be 256 for mu-law.
--q_type {linear,a-law,mu-law}
Quantization in linear scale, a-law companding, or mu-
law companding. With mu-/a-law, quantization levels
should be set to 256
--which_set {ONOM,BLIZZ,MUSIC}
ONOM, BLIZZ, or MUSIC
--batch_size {64,128,256}
size of mini-batch
--debug Debug mode
--resume Resume the same model from the last checkpoint. Order
of params is important. [for now]
To run:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
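A note on the quantization flags: --q_type selects how raw amplitudes are binned into --q_levels classes for the model's softmax output. The following rough NumPy sketch of mu-law companding followed by uniform binning illustrates why q_levels should be 256 for mu-law (mu = 255); it is an illustration, not the repo's exact implementation:

import numpy as np

def mu_law_quantize(audio, q_levels=256):
    # Compand a float waveform in [-1, 1] with mu = q_levels - 1 = 255,
    # then bin the result uniformly into q_levels integer classes.
    mu = float(q_levels - 1)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    bins = ((companded + 1.0) / 2.0 * mu + 0.5).astype('int64')
    return np.clip(bins, 0, q_levels - 1)

x = np.sin(np.linspace(0.0, 100.0, 16000))
print(mu_law_quantize(x).min(), mu_law_quantize(x).max())  # bins lie in {0, ..., 255}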
$ python models/three_tier/three_tier.py -h
usage: three_tier.py [-h] [--exp EXP] --seq_len SEQ_LEN --big_frame_size
BIG_FRAME_SIZE --frame_size FRAME_SIZE --weight_norm
WEIGHT_NORM --emb_size EMB_SIZE --skip_conn SKIP_CONN
--dim DIM --n_rnn {1,2,3,4,5} --rnn_type {LSTM,GRU}
--learn_h0 LEARN_H0 --q_levels Q_LEVELS --q_type
{linear,a-law,mu-law} --which_set {ONOM,BLIZZ,MUSIC}
--batch_size {64,128,256} [--debug] [--resume]
three_tier.py No default value! Indicate every argument.
optional arguments:
-h, --help show this help message and exit
--exp EXP Experiment name
--seq_len SEQ_LEN How many samples to include in each Truncated BPTT
pass
--big_frame_size BIG_FRAME_SIZE
How many samples per big frame in tier 3
--frame_size FRAME_SIZE
How many samples per frame in tier 2
--weight_norm WEIGHT_NORM
Adding learnable weight normalization to all the
linear layers (except for the embedding layer)
--emb_size EMB_SIZE Size of embedding layer (> 0)
--skip_conn SKIP_CONN
Add skip connections to RNN
--dim DIM Dimension of RNN and MLPs
--n_rnn {1,2,3,4,5} Number of layers in the stacked RNN
--rnn_type {LSTM,GRU}
GRU or LSTM
--learn_h0 LEARN_H0 Whether to learn the initial state of RNN
--q_levels Q_LEVELS Number of bins for quantization of audio samples.
Should be 256 for mu-law.
--q_type {linear,a-law,mu-law}
Quantization in linear scale, a-law companding, or mu-
law companding. With mu-/a-law, quantization levels
should be set to 256
--which_set {ONOM,BLIZZ,MUSIC}
ONOM, BLIZZ, or MUSIC
--batch_size {64,128,256}
size of mini-batch
--debug Debug mode
--resume Resume the same model from the last checkpoint. Order
of params is important. [for now]
To run:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/three_tier/three_tier.py --exp BEST_3TIER --seq_len 512 --big_frame_size 8 --frame_size 2 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 1 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
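For reference, in the two-tier model each truncated-BPTT pass spans n_frames * frame_size samples, while in the three-tier model --seq_len is given directly in samples and should tile evenly into big frames, which in turn tile into frames. A small sanity check of the two best configurations above (an illustration, not code from the repo):

# BEST_2TIER: 64 frames of 16 samples each per TBPTT pass
print(64 * 16)  # 1024 samples

# BEST_3TIER: the tiers should tile evenly
seq_len, big_frame_size, frame_size = 512, 8, 2
assert seq_len % big_frame_size == 0     # 64 big frames per pass
assert big_frame_size % frame_size == 0  # 4 frames per big frame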
If you are using this code, please cite the paper:
@article{mehri2016samplernn,
  author  = {Soroush Mehri and Kundan Kumar and Ishaan Gulrajani and Rithesh Kumar and Shubham Jain and Jose Sotelo and Aaron Courville and Yoshua Bengio},
  title   = {SampleRNN: An Unconditional End-to-End Neural Audio Generation Model},
  year    = {2016},
  journal = {arXiv preprint arXiv:1612.07837},
}
Thanks to Richard Assar, a Torch implementation is now available:
https://github.com/richardassar/SampleRNN_torch
- Talk by Yoshua Bengio at CBMM, MIT: Deep Generative Models for Speech and Images
- Follow-up project: Char2Wav: End-To-End Speech Synthesis
If you need help or have interesting related projects/results, please don't hesitate to contact us.