# Speech Enhancement

Tinkering with speech enhancement models.

Borrowed code, models and techniques from:

- Improved Speech Enhancement with the Wave-U-Net (arXiv)
- Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation (arXiv)
- Speech Denoising with Deep Feature Losses (arXiv, sound examples, GitHub)
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (arXiv, sound examples, GitHub)

## Datasets

The following datasets are used:

- The University of Edinburgh noisy speech database for the speech enhancement problem
- The TUT Acoustic Scenes 2016 dataset, used to train the scene classifier network that provides the loss function (dataset paper)
- The CHiME-Home (Computational Hearing in Multisource Environments) dataset (2015), also used for the scene classifier in some experiments
- The "train-clean-100" subset of LibriSpeech, mixed with the TUT Acoustic Scenes dataset (a sketch of such a mixture follows this list)
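As an illustration of the last point, here is a minimal sketch of mixing a clean utterance with a scene recording at a fixed signal-to-noise ratio. It assumes numpy and soundfile are available and mono 16 kHz input; the file names, the 5 dB SNR, and the `mix_at_snr` helper are placeholders for illustration, not the repository's actual data pipeline.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Placeholder file names: one LibriSpeech utterance, one TUT scene recording.
clean, sr = sf.read("librispeech_utterance.wav", dtype="float32")
noise, _ = sf.read("tut_scene.wav", dtype="float32")

noisy = mix_at_snr(clean, noise, snr_db=5.0)
sf.write("noisy_mixture.wav", noisy, sr, subtype="FLOAT")
```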

## Data format

At the moment, the models require 32-bit floating-point audio files at a 16 kHz sampling rate. You can use sox to convert your files; for example, to convert audiofile.wav to 32-bit floating-point audio at a 16 kHz sampling rate, run:

sox audiofile.wav -r 16000 -b 32 -e float audiofile.float.wav
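
If you would rather do the conversion from Python, the sketch below reads a file with soundfile, resamples it to 16 kHz with scipy, and writes it back as 32-bit float WAV. The file names are placeholders and both libraries are assumed to be installed; the sox command above remains the reference.

```python
from math import gcd

import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000

# Read as 32-bit float; shape is (frames,) for mono or (frames, channels).
audio, sr = sf.read("audiofile.wav", dtype="float32")

if sr != TARGET_SR:
    # Polyphase resampling with integer up/down factors; axis=0 handles
    # both mono and multi-channel arrays.
    g = gcd(TARGET_SR, sr)
    audio = resample_poly(audio, TARGET_SR // g, sr // g, axis=0)

# Write a 32-bit floating-point WAV at 16 kHz, matching the sox command above.
sf.write("audiofile.float.wav", audio, TARGET_SR, subtype="FLOAT")
```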