# Speech Enhancement

Tinkering with speech enhancement models.

Borrowed code, models and techniques from:

- Improved Speech Enhancement with the Wave-U-Net (arXiv)
- Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation (arXiv)
- Speech Denoising with Deep Feature Losses (arXiv, sound examples, GitHub)
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (arXiv, sound examples, GitHub)

## Datasets

The following datasets are used:

- The University of Edinburgh noisy speech database for the speech enhancement problem
- The TUT Acoustic Scenes 2016 dataset, used to train the scene classifier network that provides the loss function (dataset paper)
- The CHiME-Home (Computational Hearing in Multisource Environments) dataset (2015), also used for the scene classifier in some experiments
- The "train-clean-100" subset of LibriSpeech, mixed with the TUT Acoustic Scenes dataset (a sketch of such a mixture follows this list)
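As an illustration of the last point, here is a minimal sketch of mixing a clean utterance with a scene recording at a fixed signal-to-noise ratio. It assumes numpy and soundfile are available and mono 16 kHz input; the file names, the 5 dB SNR, and the `mix_at_snr` helper are placeholders for illustration, not the repository's actual data pipeline.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Placeholder file names: one LibriSpeech utterance, one TUT scene recording.
clean, sr = sf.read("librispeech_utterance.wav", dtype="float32")
noise, _ = sf.read("tut_scene.wav", dtype="float32")

noisy = mix_at_snr(clean, noise, snr_db=5.0)
sf.write("noisy_mixture.wav", noisy, sr, subtype="FLOAT")
```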

## Data format

At the moment, the models require 32-bit floating-point audio files at a 16 kHz sampling rate. You can use sox to convert your files; for example, to convert audiofile.wav to 32-bit floating-point audio at a 16 kHz sampling rate, run:

sox audiofile.wav -r 16000 -b 32 -e float audiofile.float.wav
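
If you would rather do the conversion from Python, the sketch below reads a file with soundfile, resamples it to 16 kHz with scipy, and writes it back as 32-bit float WAV. The file names are placeholders and both libraries are assumed to be installed; the sox command above remains the reference.

```python
from math import gcd

import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000

# Read as 32-bit float; shape is (frames,) for mono or (frames, channels).
audio, sr = sf.read("audiofile.wav", dtype="float32")

if sr != TARGET_SR:
    # Polyphase resampling with integer up/down factors; axis=0 handles
    # both mono and multi-channel arrays.
    g = gcd(TARGET_SR, sr)
    audio = resample_poly(audio, TARGET_SR // g, sr // g, axis=0)

# Write a 32-bit floating-point WAV at 16 kHz, matching the sox command above.
sf.write("audiofile.float.wav", audio, TARGET_SR, subtype="FLOAT")
```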