Scene2Wav

Emotional Scene Musicalization with Deep Neural Networks (MTAP 2020)

Contents

  • Requirements
  • How to Use
  • Results
  • How to Cite

Requirements

This code was tested with Python 3.5+ and PyTorch 0.4.1 (or 0.4.1.post2).

The remaining dependencies can be installed with pip install -r requirements.txt.

How to Use

0. Dataset and Pre-Processing

  • Data: the .npz dataset should be copied into a subfolder of the datasets/ folder in the root of the repository, following the layout below (a quick way to sanity-check the archive is sketched after the layout).

      Scene2Wav
      +-- datasets
      |   +-- data_npz
      |       +-- my_data_train.npz
      |       +-- my_data_test.npz
      |   +-- custom_data_npz
      |       > Your custom `npz` dataset can go in here
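
To verify that a pre-processed .npz file is readable before starting a long training run, you can list the arrays it contains. The snippet below is only a minimal sketch: the file path follows the example layout above, and since the array keys depend on how AnnotatedMV-PreProcessing wrote the archive, it reads data.files rather than assuming specific names.

    # inspect_npz.py -- quick sanity check of a pre-processed .npz dataset.
    import numpy as np

    # Path follows the example layout above; point this at your own file.
    data = np.load("datasets/data_npz/my_data_train.npz", allow_pickle=True)

    # Print every array stored in the archive with its shape and dtype.
    for key in data.files:
        arr = data[key]
        print(key, getattr(arr, "shape", None), getattr(arr, "dtype", None))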
    

1. Training

CUDA_VISIBLE_DEVICES=0 python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset data_npz --npz_filename video_feats_HSL_10fps_3secs_intAudio_pad_train.npz --npz_filename_test video_feats_HSL_10fps_3secs_pad_test.npz --cnn_pretrain cnnseq/cnn4_3secs_res_vanilla_HSL_bin_1D_CrossEntropy_ep_40_bs_30_lr_0.001_we_0.0001_asgd/ --cnn_seq2seq_pretrain cnnseq/cnnseq2seq4_3secs_HSL_bin_1D_res_stepPred_8_ep_20_bs_30_relu_layers_2_size_128_lr_0.001_we_1e-05_adam_asgdCNN_trainSize_3182_testSize_1139_cost_audio/
  • If you need to train the Scene2Wav encoder on a custom dataset (instead of using a pre-trained one), run the pre-training steps below (the full order is sketched after this list):
    • Pre-train CNN with Scene frames and Emotion scores
    python CNN_main.py --mode=train
    • Pre-train CNN-Seq2Seq end-to-end with the Scene frames and Audio
    python CNNSeq2Seq_main.py --mode=train
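
When starting from a custom dataset, the three commands run in the order below. This is only a sketch of the sequence: the checkpoint directory names passed to --cnn_pretrain and --cnn_seq2seq_pretrain are placeholders, so substitute the folders that your own pre-training runs create.

    # run_pipeline.py -- minimal sketch of the training order for a custom dataset.
    # The checkpoint directory names below are placeholders, not real outputs.
    import subprocess

    # 1. Pre-train the CNN on scene frames and emotion scores.
    subprocess.run(["python", "CNN_main.py", "--mode=train"], check=True)

    # 2. Pre-train the CNN-Seq2Seq end-to-end on scene frames and audio.
    subprocess.run(["python", "CNNSeq2Seq_main.py", "--mode=train"], check=True)

    # 3. Train Scene2Wav with the checkpoints produced above.
    subprocess.run([
        "python", "train.py",
        "--exp", "TEST", "--frame_sizes", "16", "4", "--n_rnn", "2",
        "--dataset", "data_npz",
        "--npz_filename", "my_data_train.npz",
        "--npz_filename_test", "my_data_test.npz",
        "--cnn_pretrain", "cnnseq/<your_cnn_checkpoint_dir>/",
        "--cnn_seq2seq_pretrain", "cnnseq/<your_cnnseq2seq_checkpoint_dir>/",
    ], check=True)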

2. Generating Samples

Generates the target audio, the baseline output (CNNSeq2Seq), and our output (Scene2Wav):

python generate_audio_scene2wav.py

The checkpoint path, target emotion, and number of samples to generate are set inside the script.
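
The variable names below are purely hypothetical (the actual names inside generate_audio_scene2wav.py may differ); they only illustrate the kind of values to edit before running the script.

    # Hypothetical settings -- edit the corresponding variables in
    # generate_audio_scene2wav.py; the names and values here are placeholders.
    CHECKPOINT_PATH = "results/TEST/checkpoints/<best_checkpoint>"  # trained Scene2Wav model
    EMOTION = 1                                                     # emotion condition for generation
    N_SAMPLES = 5                                                   # number of clips to generate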

3. Evaluation

  • Emotion evaluation

    1. Install requirements
    pip install music21 vamp librosa midiutil
    2. Melodia plugin
      • Download the MTG-MELODIA 1.0 Vamp plugin
      • Install:
        • macOS: copy all files in MTG-MELODIA 1.0 (OSX universal).zip to: /Library/Audio/Plug-Ins/Vamp
        • Linux: copy all files in MTG-MELODIA 1.0 (Linux 32/64-bit).zip to: /usr/local/lib/vamp
    3. Transform wav to midi and detect chords (a minimal melody-extraction sketch is shown after this list)
    python emotion_evaluation.py --data_dir [data dirname] --infile [filename].wav --outfile [filename].mid
  • Human evaluation: Amazon MTurk

  • Perceptual audio metric

    1. Clone the metric's code and install its requirements
    2. Copy perceptual_audio_metric.sh to metric_code/ and run it

      P.S.: Modify the audio path and the filenames you wish to compare inside the script
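
For the emotion evaluation step above, the Melodia plugin can be queried directly from Python once it is installed. The sketch below is not part of the repository's scripts; it only checks that the vamp/librosa setup works by printing the extracted pitch contour for one file, and the input filename is just an example.

    # melodia_check.py -- verify that the MTG-MELODIA Vamp plugin is visible
    # from Python (a standalone sanity check, not emotion_evaluation.py).
    import librosa
    import vamp

    # Example input; replace with one of your generated .wav files.
    audio, sr = librosa.load("results_generated_samples/sample.wav", sr=44100, mono=True)

    # "mtg-melodia:melodia" is the plugin key registered by the MTG-MELODIA install.
    melody = vamp.collect(audio, sr, "mtg-melodia:melodia")
    hop, f0 = melody["vector"]

    print("hop size:", hop)
    print("first f0 values (negative = unvoiced):", f0[:10])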

Results

  • Results are saved in results/: training log, loss plots, model checkpoints, and generated samples.

  • You can check some generated samples in results_generated_samples/ (tested with VLC Media Player).

Acknowledgement

If you wish to use this code, please credit this repository or send me an email with any requests or questions. To cite the paper:

@article{sergio2020jmta,
    author={Sergio, G. C. and Lee, M.},
    title={Scene2Wav: A Deep Convolutional Sequence-to-Conditional SampleRNN for Emotional Scene Musicalization},
    journal={Multimedia Tools and Applications},
    year={2020},
    pages={1--20},
    doi={10.1007/s11042-020-09636-5},
    issn={1573-7721},
    volume={2020}
}

Please also cite the pre-processing repository AnnotatedMV-PreProcessing as:

@software{gwena_cunha_2020_3910918,
  author       = {Gwenaelle Cunha Sergio},
  title        = {{gcunhase/AnnotatedMV-PreProcessing: Pre-Processing 
                   of Annotated Music Video Corpora (COGNIMUSE and
                   DEAP)}},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v2.0},
  doi          = {10.5281/zenodo.3910918},
  url          = {https://doi.org/10.5281/zenodo.3910918}
}

If you use the COGNIMUSE database:

@article{zlatintsi2017cognimuse,
  title={COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization},
  author={Zlatintsi, Athanasia and Koutras, Petros and Evangelopoulos, Georgios and Malandrakis, Nikolaos and Efthymiou, Niki and Pastra, Katerina and Potamianos, Alexandros and Maragos, Petros},
  journal={EURASIP Journal on Image and Video Processing},
  volume={2017},
  number={1},
  pages={54},
  year={2017},
  publisher={Springer}
}

If you use the DEAP database:

@article{koelstra2011deap,
  title={Deap: A database for emotion analysis; using physiological signals},
  author={Koelstra, Sander and Muhl, Christian and Soleymani, Mohammad and Lee, Jong-Seok and Yazdani, Ashkan and Ebrahimi, Touradj and Pun, Thierry and Nijholt, Anton and Patras, Ioannis},
  journal={IEEE transactions on affective computing},
  volume={3},
  number={1},
  pages={18--31},
  year={2011},
  publisher={IEEE}
}

Code based on deepsound-project's PyTorch implementation of SampleRNN.
