Mel spectrogram computation and audio pre-processing #269

napulen · 2022-10-07T19:36:20Z


A quick question about some code in the audio.py module.

Lines 121 to 122 in 9e653bd

    
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)

Is L122 a form of clipping, like L121? Couldn't you use torch.clamp for both? Just curious.

whisper/whisper/audio.py

Line 123 in 9e653bd

    log_spec = (log_spec + 4.0) / 4.0

Is that some form of normalization, i.e. bounding the output to a range?

I'm also curious about the constants 8.0 and 4.0. I'm assuming these are just reasonable approximations. Would you expect performance to change significantly if these values were changed?

Any hints would be appreciated. Thank you!

jongwook · 2022-10-10T09:51:22Z

Maintainer

The first two operations are written to be equivalent to librosa's amplitude_to_db with its default top_db of 80.0, which limits how small a value in the spectrogram can be relative to the largest value.
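To make the link between 8.0 and 80 dB concrete, here is a minimal plain-Python sketch of the two lines quoted in the question. It assumes the spectrogram holds power values, so that 10 * log10 gives decibels; the helper name `log_compress` and the sample values are made up for illustration, and the real code operates on a torch tensor rather than a list.

```python
import math


def log_compress(power_values, floor=1e-10, dynamic_range=8.0):
    """Mimic L121-L122 on a flat list of power values.

    Hypothetical helper for illustration only; not part of Whisper.
    """
    # Step 1: clamp to a fixed absolute floor so log10 never sees zero.
    log_spec = [math.log10(max(v, floor)) for v in power_values]
    # Step 2: clamp to a data-dependent floor, 8.0 below the largest value.
    peak = max(log_spec)
    return [max(v, peak - dynamic_range) for v in log_spec]


# Made-up power values spanning many orders of magnitude.
power = [1.0, 1e-3, 1e-9, 0.0]
log_spec = log_compress(power)

# In log10 units of power, multiplying by 10 gives decibels,
# so the 8.0-wide window is an 80 dB dynamic range.
db = [10.0 * v for v in log_spec]
print(log_spec)
print(db)
```

Both steps are lower-bound clamps. The difference is that the first bound is a constant, while the second is computed from the data (the maximum minus 8.0).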

After those two lines, the values of log_spec are roughly (but not strictly) in [-8.0, 0.0], and L123 maps them to roughly [-1.0, 1.0], which is a typical input range for deep learning models.
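The "roughly, but not strictly" caveat can be checked with a small plain-Python sketch of the L123 rescaling. The function name `rescale` and the sample peak of 1.5 are assumptions chosen for illustration. Since the clipping window is always 8.0 wide but anchored at the data's own maximum, the output only lands exactly on [-1.0, 1.0] when that maximum is 0.0.

```python
def rescale(log_spec):
    """Apply (x + 4.0) / 4.0 element-wise, as on L123 (illustrative helper)."""
    return [(v + 4.0) / 4.0 for v in log_spec]


# Peak at 0.0: the interval [-8.0, 0.0] maps exactly onto [-1.0, 1.0].
centered = rescale([-8.0, -4.0, 0.0])

# Peak at 1.5 (made-up value): the window is [-6.5, 1.5],
# so the rescaled values fall outside [-1.0, 1.0] at the top end.
shifted = rescale([-6.5, 1.5])

print(centered)
print(shifted)
```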

A model trained from scratch without L123 would likely work just as well. With the released Whisper models, however, you may see significantly degraded performance without those lines, because the models expect inputs in that range; without L123 the input becomes out-of-distribution.

1 reply

napulen Oct 10, 2022
Author

That all makes sense. Thank you very much!
