Mel spectrogram computation and audio pre-processing #269

Answered by jongwook
napulen asked this question in Q&A

The first set of operations is written to be equivalent to librosa's amplitude_to_db with its default top_db value of 80.0, which limits how small a value in the spectrogram can be relative to the largest value.
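To make the top_db connection concrete, here is a minimal sketch in plain Python. It assumes the spectrogram holds power values, so decibels are 10 * log10(power); under that assumption an 80 dB floor below the peak is the same as a floor of 8.0 in log10 units. The helper names and the 1e-10 floor are hypothetical, chosen for illustration, and are not librosa's or Whisper's API.

```python
import math

def power_to_db_clipped(power, top_db=80.0, amin=1e-10):
    """Hypothetical helper: convert power to dB, then floor at (peak - top_db)."""
    db = [10.0 * math.log10(max(p, amin)) for p in power]
    floor = max(db) - top_db
    return [max(v, floor) for v in db]

def log10_clipped(power, span=8.0, amin=1e-10):
    """Hypothetical helper: take log10, then floor at (peak - span)."""
    log_spec = [math.log10(max(p, amin)) for p in power]
    floor = max(log_spec) - span
    return [max(v, floor) for v in log_spec]

# Synthetic power values spanning more than 80 dB of dynamic range.
power = [1.0, 1e-3, 1e-6, 1e-12]

db = power_to_db_clipped(power)
log_spec = log10_clipped(power)

# Each dB value should be 10x the corresponding log10 value.
for d, l in zip(db, log_spec):
    print(f"{d:8.2f} dB  <->  {l:6.2f} log10 units")
```

The smallest input is more than 80 dB below the peak, so both versions clip it to the same floor.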

After those two lines, the values of log_spec are roughly (but not strictly) in [-8.0, 0.0], and L123 maps them to roughly [-1.0, 1.0], which is a typical input range for deep learning models.
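As a rough illustration of that rescaling, the sketch below assumes the step at L123 is the affine map (x + 4.0) / 4.0, which is the simplest map taking [-8.0, 0.0] to [-1.0, 1.0]; the exact expression is not quoted in this thread, so treat the `rescale` helper as hypothetical.

```python
def rescale(log_spec):
    """Hypothetical stand-in for L123: shift by 4.0, then divide by 4.0."""
    return [(v + 4.0) / 4.0 for v in log_spec]

# Peak exactly at 0.0: the range [-8, 0] lands exactly on [-1, 1].
exact = rescale([0.0, -3.0, -6.0, -8.0])

# Peak at 0.5 instead: the 8.0-wide window shifts, so the output
# only roughly covers [-1, 1], which is the "not strictly" caveat.
shifted = rescale([0.5, -7.5])

print(exact)
print(shifted)
```

Whatever the peak is, the output span stays 2.0 wide because the 8.0-wide window is divided by 4.0; only its offset moves.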

A model trained from scratch without L123 would likely work just as well, but the released Whisper models expect inputs in that range. Without L123 the input becomes out-of-distribution for them, so you may see significantly degraded performance.

Answer selected by jongwook