
Hallucination on silence #1724

Open
pprobst opened this issue Jan 4, 2024 · 23 comments
Labels
bug Something isn't working

Comments

@pprobst
Contributor

pprobst commented Jan 4, 2024

Hello! In some experiments, I've noticed that for audio files with silence at the end (even ~1 s of silence), whisper.cpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I use the evaluate/predict functions from transformers, or transcribe from whisperX (although the latter uses VAD), which makes me think there's a parameter in whisper.cpp that may be making it prone to hallucination in these cases. Note that I'm using a fine-tuned base model converted from h5 to ggml.

I'm using the latest 1.5.3 version, but this also happened in 1.5.2.

An example below:

λ ./main -f 1635687465_8386435.ogg -l pt -m ../eval/ggml-model.bin -pc

whisper_init_from_file_with_params_no_state: loading model from '../eval/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =   147.46 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   14.86 MB
whisper_init_state: compute buffer (encode) =   85.99 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing '1635687465_8386435.wav' (118886 samples, 7.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pt, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:06.300]   ponto parágrafo planos musculares com aspecto habitual a faixa etária
[00:00:06.300 --> 00:00:36.300]   subcutâneo de l cinco e l cinco e l cinco l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco


whisper_print_timings:     load time =   116.86 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.17 ms
whisper_print_timings:   sample time =   325.28 ms /  1212 runs (    0.27 ms per run)
whisper_print_timings:   encode time =   120.70 ms /     2 runs (   60.35 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   555.86 ms /  1208 runs (    0.46 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1176.76 ms

The transcription in [00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária is correct, but after it there is only about 1 s of silence. After transcribing the first segment, the program "hangs" for a second and then hallucinates.

(note that the audio file being passed is OGG, but in code I'm converting it to WAV, 16 kHz mono, with ffmpeg)

pprobst changed the title from "Transcription on silence" to "Hallucination on silence" on Jan 4, 2024
@bobqianic
Collaborator

Indeed, I've noticed that as well. I'll need some time to look into it more thoroughly.

bobqianic added the bug (Something isn't working) label on Jan 4, 2024
@pprobst
Contributor Author

pprobst commented Jan 4, 2024

Also: when the audio has a repetition of sounds, whispercpp also tends to hallucinate. Example:

Ground-truth: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro"

Prediction: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro ínteg"

@mrfragger

mrfragger commented Jan 4, 2024

[ ! -d output ] && mkdir output ; for f in *.mp3 ; do ffmpeg -hide_banner -i "$f" -c:a libopus -b:a 32k -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=1:start_duration=0:stop_duration=3:detection=peak",highpass=200,lowpass=3000,afftdn,volume=12dB,dynaudnorm output/"${f%.*}.opus" ; done

I pretty much remove all silence from audio before transcribing to avoid hallucinations. The command above removes any silence of at least 3 seconds (stop_duration=3), as well as hiss.
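
For anyone preprocessing in code rather than shelling out to ffmpeg, below is a minimal C++ sketch of the same idea: a naive amplitude-threshold trim on the float PCM buffer before it is handed to whisper.cpp. This is not the ffmpeg silenceremove algorithm, and the 0.01 threshold is an illustrative guess you would need to tune:

#include <cmath>
#include <cstddef>
#include <vector>

// Drop leading and trailing samples whose absolute amplitude stays below `threshold`.
// `pcm` is 16 kHz mono f32 audio, the format whisper_full() expects.
std::vector<float> trim_silence(const std::vector<float> & pcm, float threshold = 0.01f) {
    std::size_t begin = 0;
    std::size_t end   = pcm.size();

    while (begin < end && std::fabs(pcm[begin])   < threshold) ++begin;
    while (end > begin && std::fabs(pcm[end - 1]) < threshold) --end;

    return std::vector<float>(pcm.begin() + begin, pcm.begin() + end);
}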

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

Hey guys. I had a good time today benchmarking and comparing different inference backends on the transcription of 3000 Brazilian Portuguese audio files of varying quality. While I had good results in terms of WER (word error rate; lower is better) with HuggingFace's ASR pipeline and whisperX (about 3%), I struggled to achieve acceptable results with faster-whisper or whisper.cpp, which had a ~4x worse WER (about 13%). Furthermore, activating VAD in faster-whisper had minimal impact.

Then, since whisperX uses faster-whisper for its inference, I compared which parameters differed between them. After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True. Since my use case has no need for timestamps, this is OK for me.

I proceeded to repeat the same procedure in whisper.cpp by setting the following line (in the library's default parameters) to true,

/*.no_timestamps =*/ false,

and likewise achieved a 4x reduction in WER, with not a single hallucination like the ones shown above.

I wonder why computing timestamps makes Whisper more prone to hallucinations.
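
For anyone who wants the same effect without patching the library defaults, the flag can also be set through the public API. A minimal sketch, assuming the standard whisper.cpp C API; the model path and PCM buffer are placeholders:

#include "whisper.h"
#include <vector>

int main() {
    struct whisper_context * ctx = whisper_init_from_file_with_params(
        "ggml-model.bin", whisper_context_default_params());

    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.language      = "pt";
    wparams.no_timestamps = true; // skip timestamp computation entirely

    std::vector<float> pcm; // 16 kHz mono f32 samples, loaded elsewhere
    whisper_full(ctx, wparams, pcm.data(), (int) pcm.size());

    whisper_free(ctx);
}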

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

Also: maybe it's a good idea to make -nt in main.cpp not only skip printing timestamps, but also skip computing them:

wparams.no_timestamps = params.no_timestamps;
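
That is, next to the existing print wiring in main.cpp, something like the sketch below (the surrounding line is quoted from memory, so treat it as approximate):

wparams.print_timestamps = !params.no_timestamps; // existing: -nt only hides timestamps
wparams.no_timestamps    =  params.no_timestamps; // proposed: also skip computing them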

@bobqianic
Collaborator

After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True

That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.

https://github.com/openai/whisper

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.

https://github.com/openai/whisper

I have not, but it makes sense to experiment with it. I'll probably do it in the next few days.

@ggerganov
Owner

Also: maybe it's a good idea to make -nt in main.cpp not only skip printing timestamps, but also skip computing them:

wparams.no_timestamps = params.no_timestamps;

Yes, this should be updated. The option to not compute timestamps was added only recently; before that, timestamps were always computed but simply not displayed. Now we can disable them properly.

@pprobst
Contributor Author

pprobst commented Jan 8, 2024

I still have to figure out how to load my fine-tuned model with the official OpenAI implementation. Still, preliminary results on the same dataset using the multilingual base model showed that setting word_timestamps=False and without_timestamps=True when calling the transcribe function improved WER from 64% to 54%.

@Sing303

Sing303 commented Jan 15, 2024

If you set the context to 0, does the problem go away? Parameter: -mc 0
For me, that makes the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?

@pprobst
Contributor Author

pprobst commented Jan 15, 2024

If you set the context to 0, does the problem go away? Parameter: -mc 0 For me, that makes the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?

It does not solve the issue, and the WER increases slightly. I tried a ton of parameters, and the only one that solved the issue was completely disabling timestamps.

@Sing303

Sing303 commented Jan 16, 2024

@pprobst Could you provide a link to the file you are testing this problem on?

@pprobst
Contributor Author

pprobst commented Jan 16, 2024

@pprobst Could you provide a link to the file you are testing this problem on?

Unfortunately, it's a private dataset that I have no permission to share 🫠
Although I have not replicated the experiment on other datasets, I believe the drop in accuracy when computing timestamps can occur with any dataset.

@bobqianic
Collaborator

Give my latest PR #1768 a try. It's still a WIP, but if you compile it yourself, it should significantly reduce the hallucinations towards the end of the audio file.

@jettoblack
Contributor

@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files: whisper.cpp just repeats the previous segment over and over, with a 2-3 s duration each time, until the speech resumes. I have samples I can send you privately via email/Discord/etc., but I'd rather not post them on a public site, if that's OK with you. If necessary, I'll try to come up with some public samples that reproduce the issue.

@bobqianic
Collaborator

@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files. [...]

Discord: bob20231894

@jettoblack
Contributor

Discord: bob20231894

OK, thanks. I sent you a friend request on Discord.

@mrfragger

openai/whisper#1962
Two PRs on openai/whisper, #1808 and #1963, seem the most promising with regard to drastically reducing hallucinations.

@bygreencn

@ggerganov any plans to implement #1838, "Skip silence around hallucinations"?

@bobqianic
Collaborator

@ggerganov any plans to implement #1838, "Skip silence around hallucinations"?

#1768 (comment)

@bobqianic
Collaborator

bobqianic commented Feb 10, 2024

I wonder why computing timestamps makes Whisper more prone to hallucinations.

It's likely true. The way Whisper transcribes audio differs significantly with and without timestamps. Without timestamps, it processes the audio in 30-second chunks, moving sequentially from one chunk to the next. With timestamps, it operates differently: it first determines whether a segment is complete. If so, it proceeds to the next 30-second chunk. If not, it adjusts its position based on the last timestamp token before resuming transcription. For instance, say there's a 30-second chunk and the decoder ends on ...[TT_1264] (incomplete). Since timestamp tokens advance in 0.02 s steps, token 1264 corresponds to 1264 × 0.02 = 25.28 s. So instead of transcribing from 30 to 60 seconds next, the decoder backs up to 25.28 seconds within the segment and transcribes from 25.28 to 55.28 seconds.

This overlap is likely to result in repetition. Additionally, timestamp tokens now take up space in the context, which is sized at 448 tokens; half of that is reserved for the prompt, limiting the longest sequence we can generate to 224 tokens. Consequently, less actual information fits within the context window, which degrades performance.
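
A simplified sketch of that seek logic (illustrative C++, not the actual whisper.cpp implementation; times are in centiseconds, and timestamp tokens advance in 2-centisecond / 20 ms steps):

// How the decoder advances between windows. `seek` is in centiseconds (10 ms units).
int advance_seek(int seek, bool with_timestamps, bool segment_complete, int last_ts_token) {
    const int chunk_cs = 3000; // 30 s window

    if (!with_timestamps || segment_complete) {
        return seek + chunk_cs; // move cleanly to the next 30 s chunk
    }
    // Incomplete segment: resume from the last timestamp token instead.
    // e.g. last_ts_token = 1264 -> 1264 * 0.02 s = 25.28 s into the window,
    // so the next window starts there and re-decodes the overlapping audio.
    return seek + 2 * last_ts_token;
}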

@pprobst
Contributor Author

pprobst commented Feb 10, 2024

Very interesting! I'm thankful you took the time to investigate this further.

@RazeBerry

Same problem here! whisper.cpp (I'm not sure about regular Whisper) has substantial difficulty picking up a conversation after a long period of silence!
