Running the model many times on a 30-second music segment (amongst others) gives vastly different outputs #188
While testing how well each model (medium, medium.en and large) handles music segments, I noticed that I wasn't getting the same output on each run. Sometimes the difference is small, but sometimes the output doesn't make sense at all, and I was wondering whether other people have run into this. I've also noticed problems with the timestamps, but I've seen a possible fix for that here, so that's not part of this issue for now. Here is the code I used:
Here is the output I got running the code above with a test file:
As you can see, of the 5 runs, only two are identical, the first one makes no sense at all, and the other two differ from each other.
Granted, this is an unusual and perhaps challenging piece of audio, but I've also seen this happen (albeit to a lesser degree) with "easier" (slower) songs. For example, this is the output for Adele's "Easy on Me":
While the texts are not all that different, the timings differ in every run (particularly the final ones). I was wondering if this is a known problem and if there's any way to fix it (maybe with some fine-tuning of the parameters), since getting a completely different (and sometimes wrong) output might be a problem in the future. Also, one final question. For example, here:
It recognizes that "So go easy one me" is a new sentence (which is correct) and capitalizes it, but doesn't start a new line for it. Is that something that can be changed with parameters? It did so on the 4th execution, so sometimes it does and sometimes it doesn't. This would be very useful for verse-by-verse transcription of songs. Here are the audio files used (both on Google Drive):
Replies: 1 comment 1 reply
This is expected behavior. From the paper:
Here's a quick rundown on temperature. I believe it's meant to prevent Whisper from getting stuck in a loop. Since background noise like music makes Whisper less confident, it's more likely to raise the temperature, and at a temperature above 0 the tokens are sampled at random, so runs can differ. If a deterministic result is important to you, you could try forcing the temperature to 0.0.
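To make that suggestion concrete, here is a minimal sketch, assuming the `openai-whisper` Python package and its `transcribe` method. The helper names (`transcribe_runs`, `all_runs_identical`, `main`) and the file name `song.mp3` are made up for illustration. By default `transcribe` takes a tuple of temperatures and can retry at the next one; passing a single `0.0` leaves nothing to fall back to, so decoding should stay greedy.

```python
def transcribe_runs(model, audio_path, n_runs=5, temperature=0.0):
    """Transcribe the same file n_runs times and return the text of each run.

    `model` is assumed to be a loaded Whisper model. Passing a single float
    for `temperature` (instead of the default tuple) leaves no higher
    temperatures to fall back to.
    """
    return [
        model.transcribe(audio_path, temperature=temperature)["text"]
        for _ in range(n_runs)
    ]


def all_runs_identical(texts):
    """Return True if every transcription in `texts` is the same string."""
    return len(set(texts)) <= 1


def main():
    # Imported lazily so the helpers above can be used without Whisper installed.
    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium")
    texts = transcribe_runs(model, "song.mp3")  # hypothetical file name
    print("identical across runs:", all_runs_identical(texts))
```

Calling `main()` runs the comparison on a real file. One trade-off to keep in mind: with the fallback disabled, the repetition loops it is meant to escape may show up more often.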