Running the model many times in 30 second music segment (amongst others) gives vastly different outputs #188

hugoredinho · 2022-09-29T11:37:50Z

While testing how well each model (medium, medium.en, and large) handles music segments, I noticed that I don't get the same output on every run. Sometimes the difference is small, but sometimes the output makes no sense at all. Is this a problem other people have run into?

I've also noticed problems with the timestamps, but I've seen a possible fix for that here, so that's not part of this issue for now.

Here is the code I used:

import whisper
import torch

def extract(file_path, model):
    # Transcribe the file and return only the per-segment results.
    result = model.transcribe(file_path, verbose=True, language="English")
    return result["segments"]

def clean_memory():
    # Release cached GPU memory and wait for pending CUDA work to finish.
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

if __name__ == '__main__':
    file_path = r"whisper_diff_output_test.mp3"

    for i in range(5):
        # Reload the model from scratch on every run.
        clean_memory()
        model = whisper.load_model("medium", device="cuda")

        print("Extracting for %s\n%s time\n" % (file_path, i+1))
        text_segments = extract(file_path, model)

Here is the output I got running the code above with a test file:

Extracting for whisper_diff_output_test.mp3
1 time

[00:00.000 --> 00:26.240] Why is myBig two alsoBidents day?
Extracting for whisper_diff_output_test.mp3
2 time

[00:00.000 --> 00:06.000] Yippie-yi-oh, yippie-yi-oh-kyah, yippie-yippie-yip, da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee-dee-dee.
[00:06.000 --> 00:30.000] Yippie-yi-oh, yippie-yi-oh-kyah, yippie-yi-oh-kyah, where old cow had.
Extracting for whisper_diff_output_test.mp3
3 time

[00:00.000 --> 00:05.820] Yippee-yay-oh, yippee-yay-oh-ky-ay. Yip-yip-yip. Da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee-dee.
[00:05.820 --> 00:27.320] Yippee-yay-oh-ky-ay. Yippee-yay-oh-ky-ay.
[00:27.320 --> 00:30.040] We're Oakcan Hands!
Extracting for whisper_diff_output_test.mp3
4 time

[00:00.000 --> 00:02.000] Oh
[00:02.000 --> 00:28.080] Yippee-yay, oh, hi-yip Yippee-yay, oh, hi-yip
[00:28.080 --> 00:33.080] Yippee-yay, oh, hi-yip
Extracting for whisper_diff_output_test.mp3
5 time

[00:00.000 --> 00:06.000] Yippee-yay-oh, yippee-yay-oh-kayay, yippee-yay-yip, da-da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee.
[00:06.000 --> 00:30.000] Yippee-yay-oh-kayay, yippee-yay-oh-kayay, where old cow had.

As you can see, of the 5 runs, only two are equal, the first one makes no sense at all, and the other two are different.
This also happens if you reuse the same model instance: you still get different outputs, though they seem more closely related. (I commented out the clean_memory() line and moved the model loading to before the loop, and got this output.)

Extracting for whisper_diff_output_test.mp3
1 time

[00:00.000 --> 00:05.500] Yippee-Ai-Yay, ho Yippee-Ai-O-Ki-Yay, Yippee-Ai-Yay, Da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee.
[00:05.500 --> 00:25.040] Yippee-yay, oh, quiet!
[00:25.040 --> 00:27.400] Yippee-yay, oh, quiet!
[00:27.400 --> 00:35.800] We're in Ocotclott Hast.
Extracting for whisper_diff_output_test.mp3
2 time

[00:00.000 --> 00:06.000] Yippee-yay-ho, yippee-yay-o-kay-ay, yip-yip-yip, da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee.
[00:06.000 --> 00:14.000] Turn on the post notifications and get the guys together
[00:14.000 --> 00:15.000] Yippee-yay, oh, quiet!
[00:15.000 --> 00:16.000] Yippee-yay, oh, quiet!
[00:16.000 --> 00:41.000] We're O'Cowhands.
Extracting for whisper_diff_output_test.mp3
3 time

[00:00.000 --> 00:03.280] Yippee-yay, ho, yippee-ay, oh, kai-ay, yip-yip-yip.
[00:03.280 --> 00:05.940] Da-da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee.
[00:05.940 --> 00:30.000] Yippee-yay, oh, kai-ay, yippee-yay, oh, kai-ay, where old cow had.
Extracting for whisper_diff_output_test.mp3
4 time

[00:00.000 --> 00:04.340] Macht family
[00:30.000 --> 00:32.060] you
Extracting for whisper_diff_output_test.mp3
5 time

[00:00.000 --> 00:06.000] Yippee-yay-oh, yippee-yay-oh-kaya, yippee-yay-yip, da-da-da-da-dee-da-da-dee-dee-dee-dee-dee-dee-dee-dee.
[00:06.000 --> 00:30.000] Yippee-yay-oh-kaya, yippee-yay-oh-kaya, where old cow had.

Granted, this is an unusual and probably challenging piece of audio, but I've also seen this happen (albeit to a lesser degree) with "easier" (slower) songs. For example, this is the output for Adele's "Easy on Me":

Extracting for easy on me.m4a
1 time

[00:00.000 --> 00:26.000] There ain't no gold in this river That I've been washing my hands in forever
[00:26.000 --> 00:40.000] I know there is hope in these waters But I can't bring myself to swim when I am drowning
[00:40.000 --> 00:57.000] In this silence baby, let me in Go easy on me baby, I was still a child
[00:57.000 --> 01:13.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[01:13.000 --> 01:27.000] So go easy on me
[01:27.000 --> 01:40.000] There ain't no room for things to change When we are both so deeply stuck in our ways
[01:40.000 --> 01:53.000] You can't deny how hard I've tried I changed who I was to put you both first
[01:53.000 --> 02:07.000] But now I give up, go easy on me baby I was still a child
[02:07.000 --> 02:24.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[02:24.000 --> 02:39.000] So go easy on me
[02:39.000 --> 02:55.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
[02:55.000 --> 03:15.000] Go easy on me baby, I was still a child Didn't get the chance to feel the world around me
[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me
[03:45.000 --> 03:47.000] Thank you for watching!
Extracting for easy on me.m4a
2 time

[00:00.000 --> 00:26.000] There ain't no gold in this river That I've been washing my hands in forever
[00:26.000 --> 00:40.000] I know there is hope in these waters But I can't bring myself to swim when I am drowning
[00:40.000 --> 00:57.000] In this silence baby, let me in Go easy on me baby, I was still a child
[00:57.000 --> 01:13.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[01:13.000 --> 01:27.000] So go easy on me
[01:27.000 --> 01:40.000] There ain't no room for things to change When we are both so deeply stuck in our ways
[01:40.000 --> 01:53.000] You can't deny how hard I've tried I changed who I was to put you both first
[01:53.000 --> 02:07.000] But now I give up, go easy on me baby I was still a child
[02:07.000 --> 02:24.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[02:24.000 --> 02:39.000] So go easy on me
[02:39.000 --> 02:55.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
[02:55.000 --> 03:15.000] Go easy on me baby, I was still a child Didn't get the chance to feel the world around me
[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me
[03:50.000 --> 04:08.000] There ain't no room for things to change When we are both so deeply stuck in our ways
Extracting for easy on me.m4a
3 time

[00:00.000 --> 00:26.000] There ain't no gold in this river That I've been washing my hands in forever
[00:26.000 --> 00:40.000] I know there is hope in these waters But I can't bring myself to swim when I am drowning
[00:40.000 --> 00:57.000] In this silence baby, let me in Go easy on me baby, I was still a child
[00:57.000 --> 01:13.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[01:13.000 --> 01:27.000] So go easy on me
[01:27.000 --> 01:40.000] There ain't no room for things to change When we are both so deeply stuck in our ways
[01:40.000 --> 01:53.000] You can't deny how hard I've tried I changed who I was to put you both first
[01:53.000 --> 02:07.000] But now I give up, go easy on me baby I was still a child
[02:07.000 --> 02:24.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[02:24.000 --> 02:39.000] So go easy on me
[02:39.000 --> 02:55.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
[02:55.000 --> 03:15.000] Go easy on me baby, I was still a child Didn't get the chance to feel the world around me
[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me
[04:03.000 --> 04:14.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
Extracting for easy on me.m4a
4 time

[00:00.000 --> 00:26.000] There ain't no gold in this river That I've been washing my hands in forever
[00:26.000 --> 00:40.000] I know there is hope in these waters But I can't bring myself to swim when I am drowning
[00:40.000 --> 00:57.000] In this silence baby, let me in Go easy on me baby, I was still a child
[00:57.000 --> 01:13.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[01:13.000 --> 01:27.000] So go easy on me
[01:27.000 --> 01:40.000] There ain't no room for things to change When we are both so deeply stuck in our ways
[01:40.000 --> 01:53.000] You can't deny how hard I've tried I changed who I was to put you both first
[01:53.000 --> 02:07.000] But now I give up, go easy on me baby I was still a child
[02:07.000 --> 02:24.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[02:24.000 --> 02:39.000] So go easy on me
[02:39.000 --> 02:55.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
[02:55.000 --> 03:15.000] Go easy on me baby, I was still a child Didn't get the chance to feel the world around me
[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me
[03:45.000 --> 03:56.000] So go easy on me baby, I was still a child
Extracting for easy on me.m4a
5 time

[00:00.000 --> 00:26.000] There ain't no gold in this river That I've been washing my hands in forever
[00:26.000 --> 00:40.000] I know there is hope in these waters But I can't bring myself to swim when I am drowning
[00:40.000 --> 00:57.000] In this silence baby, let me in Go easy on me baby, I was still a child
[00:57.000 --> 01:13.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[01:13.000 --> 01:27.000] So go easy on me
[01:27.000 --> 01:40.000] There ain't no room for things to change When we are both so deeply stuck in our ways
[01:40.000 --> 01:53.000] You can't deny how hard I've tried I changed who I was to put you both first
[01:53.000 --> 02:07.000] But now I give up, go easy on me baby I was still a child
[02:07.000 --> 02:24.000] Didn't get the chance to feel the world around me I had no time to choose what I chose to do
[02:24.000 --> 02:39.000] So go easy on me
[02:39.000 --> 02:55.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show
[02:55.000 --> 03:15.000] Go easy on me baby, I was still a child Didn't get the chance to feel the world around me
[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me
[04:02.000 --> 04:12.000] I had good intentions and the high hopes But I know right now, that probably doesn't even show

While the transcripts are not all that different, the timings differ between runs (particularly for the final segments).
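To put numbers on that comparison, here is a minimal sketch that pairs up the segments of two runs by position and reports text similarity and timestamp drift. The helper name `compare_runs` is made up for illustration; it only assumes each segment is a dict with `start`, `end`, and `text` keys, as in `result["segments"]`. The sample data is the final segment of runs 1 and 2 for "easy on me.m4a", converted to seconds.

```python
from difflib import SequenceMatcher


def compare_runs(run_a, run_b):
    """Compare two lists of segments, paired by position.

    Returns one tuple per pair: (index, text similarity in [0, 1],
    start-time difference, end-time difference). Segments without a
    counterpart in the shorter run are not reported.
    """
    report = []
    for idx, (a, b) in enumerate(zip(run_a, run_b)):
        similarity = SequenceMatcher(None, a["text"], b["text"]).ratio()
        report.append((
            idx,
            round(similarity, 3),
            round(b["start"] - a["start"], 3),
            round(b["end"] - a["end"], 3),
        ))
    return report


# Final segment of run 1 ([03:45.000 --> 03:47.000]) and run 2 ([03:50.000 --> 04:08.000]).
run_1 = [{"start": 225.0, "end": 227.0, "text": "Thank you for watching!"}]
run_2 = [{"start": 230.0, "end": 248.0,
          "text": "There ain't no room for things to change "
                  "When we are both so deeply stuck in our ways"}]

for row in compare_runs(run_1, run_2):
    print(row)
```

Pairing by position is the simplest choice and works here because the earlier segments line up across runs; runs that split segments differently would need alignment by time instead.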

I was wondering if this is a known problem and whether there's any way to fix it (maybe with some fine-tuning of the parameters), as getting a completely different (and sometimes wrong) output might be a problem in the future.

Also, one final question. For example, here:

[03:15.000 --> 03:27.000] I had no time to choose what I chose to do So go easy on me

It recognizes that "So go easy on me" is a new sentence (which is correct) and capitalizes it, but doesn't start a new line for it. Is that something that can be changed with parameters? It did so on the 4th execution, so it sometimes does and sometimes doesn't. This would be very useful for verse-by-verse transcription of songs.

Here are the audio files used (both on Google Drive):
Test_whisper.mp3
Easy on me.m4a

Answered by ANonEntity

ANonEntity · 2022-09-29T12:04:53Z

This is expected behavior. From the paper:

We start with temperature 0, i.e. always selecting the tokens with the highest probability,
and increase the temperature by 0.2 up to 1.0 when [...] the average log probability over
the generated tokens is lower than −1

Here's a quick rundown on temperature. I believe it's meant to prevent Whisper from getting stuck in a loop.
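As a rough illustration of the fallback rule quoted above, here is a minimal sketch in plain Python. It is not Whisper's actual code: `decode` is a hypothetical callable that takes a temperature and returns `(text, avg_logprob)`, and only the average-log-probability condition is modelled (the condition elided by `[...]` in the quote is left out).

```python
def decode_with_fallback(decode,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         logprob_threshold=-1.0):
    """Try each temperature in order and stop at the first acceptable result.

    `decode` is a hypothetical stand-in: temperature -> (text, avg_logprob).
    A result is acceptable when its average log probability is not lower
    than `logprob_threshold`. If none is, the last attempt is returned.
    """
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        result = (text, avg_logprob, temperature)
        if avg_logprob >= logprob_threshold:
            break
    return result


def fake_decode(temperature):
    # Toy decoder: pretends the model is unconfident below temperature 0.4.
    avg_logprob = -1.5 if temperature < 0.4 else -0.5
    return f"text@{temperature}", avg_logprob


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature)
```

Any attempt above temperature 0 samples tokens rather than always taking the most probable one, which is why a segment that triggers the fallback can come out differently on each run.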

Since background noise like music makes Whisper less confident, it's more likely to raise the temperature. If a deterministic result is important to you, you could try forcing the temperature to 0.0.
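A minimal sketch of what forcing the temperature could look like, reusing the `transcribe` call from the original post. It assumes `transcribe` accepts a `temperature` keyword (a single value instead of the default schedule of increasing temperatures) and that each returned segment carries a `temperature` field; the helper names `transcribe_deterministic` and `temperatures_used` are made up for illustration.

```python
def transcribe_deterministic(model, file_path):
    """Transcribe with a single temperature of 0.0, so no fallback to sampling.

    `model` is expected to be a loaded Whisper model, or any object
    exposing a compatible `transcribe` method.
    """
    result = model.transcribe(file_path, language="English", temperature=0.0)
    return result["segments"]


def temperatures_used(segments):
    """Return the distinct temperatures the segments were decoded at.

    Assumes each segment dict has a "temperature" key.
    """
    return sorted({segment["temperature"] for segment in segments})
```

With the model from the original post this would be called as `transcribe_deterministic(whisper.load_model("medium", device="cuda"), "whisper_diff_output_test.mp3")`. Running `temperatures_used` on the segments of a default run is a quick way to check whether the fallback actually kicked in. Two caveats: without the fallback, the repetition loops it is meant to break may show up more often, and small run-to-run differences from GPU floating-point arithmetic may remain even at temperature 0.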

1 reply

hugoredinho Sep 29, 2022
Author

Thank you very much! Will look more into it!
