Hallucination on silence #1724
Indeed, I've noticed that as well. I'll need some time to look into it more thoroughly. |
Also: when the audio has a repetition of sounds, whispercpp also tends to hallucinate. Example: Ground-truth: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro" Prediction: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro ínteg" |
I pretty much remove all silence segments from the audio before transcribing to avoid hallucinations. Here I remove any silence lasting at least 3 seconds (stop_duration=3), as well as hiss. |
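A sketch of that preprocessing step, assuming ffmpeg's silenceremove filter is what the comment refers to; the file names, the -50 dB silence threshold, and the afftdn denoise step are placeholders/assumptions:

```python
# Hedged sketch: strip silences of >= 3 seconds (and some hiss) before transcription.
# Thresholds and paths are assumptions and will likely need tuning per dataset.
import subprocess

def remove_silence(src: str, dst: str, stop_duration: float = 3.0) -> None:
    audio_filter = (
        "afftdn,"  # FFT-based denoiser, used here as a stand-in for hiss removal
        f"silenceremove=stop_periods=-1:stop_duration={stop_duration}:stop_threshold=-50dB"
    )
    subprocess.run(["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst], check=True)

remove_silence("input.wav", "input_nosilence.wav")
```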
Hey guys. I had a good time today benchmarking and comparing different inference backends on the transcription of 3000 Brazilian Portuguese audio files of varying quality. While I had good results in terms of WER (word error rate, lower is better) with HuggingFace's ASR pipeline and whisperX (about 3%), I struggled to achieve acceptable results with faster-whisper or whispercpp, which had a ~4x worse WER (about 13%). Furthermore, activating VAD in faster-whisper had minimal impact. Then, since whisperX uses faster-whisper for its inference, I compared which parameters differed between them. After some tests, I achieved a 4x reduction in WER in faster-whisper by disabling timestamp computation. I proceeded to repeat the same procedure in whispercpp, by setting the flag at whisper.cpp line 4322 (commit 022756a) to true, and also achieved a 4x reduction in WER, and not a single hallucination like the ones I showed above. I wonder why computing timestamps makes Whisper more prone to hallucinations. |
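A minimal faster-whisper sketch of the setting described above. The comment does not name the exact parameter, but faster-whisper's transcribe() exposes without_timestamps, which matches the description; the model name and audio path are placeholders:

```python
# Hedged sketch: disable timestamp computation in faster-whisper.
# "base" and "audio.wav" are placeholders; the original benchmark used a
# fine-tuned model and a private Brazilian Portuguese dataset.
from faster_whisper import WhisperModel

model = WhisperModel("base")

segments, info = model.transcribe(
    "audio.wav",
    language="pt",
    without_timestamps=True,  # assumed to be the "disable timestamps" setting discussed above
)

print(" ".join(segment.text.strip() for segment in segments))
```

In whisper.cpp, the analogous switch appears to be the no_timestamps option (exposed as -nt/--no-timestamps on the example CLI).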
Also: maybe it's a good idea to make it so that |
That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps. |
I have not, but it makes sense to experiment with it. I'll probably do it in the next few days. |
Yes, this should be updated. The reason is that the "not computing timestamps" option was added only recently; before that, timestamps were always computed but just not displayed. Now we can disable them properly. |
I still have to figure out how to load my fine-tuned model using the official OpenAI implementation. Still, preliminary results in the same dataset using the multilingual |
If you set the context to 0, does the problem go away? Parameter: -mc 0 |
It does not solve the issue, and the WER increases slightly. I tried a ton of parameters, and the only one that solved the issue was completely disabling timestamps. |
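For anyone reproducing this exchange, a sketch of driving the whisper.cpp example CLI with the two settings discussed here; the binary path, model path, and audio path are placeholders, and the -mc / -nt flags are taken from the CLI help of recent releases:

```python
# Hedged sketch: run the whisper.cpp CLI once with -mc 0 and once with -nt.
# "./main", the model file, and "audio.wav" are placeholders.
import subprocess

def run_whisper_cpp(audio: str, extra_args: list[str]) -> str:
    cmd = ["./main", "-m", "models/ggml-base.bin", "-l", "pt", "-f", audio, *extra_args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(run_whisper_cpp("audio.wav", ["-mc", "0"]))  # limiting context did not help here
print(run_whisper_cpp("audio.wav", ["-nt"]))       # disabling timestamps removed the hallucinations
```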
@pprobst Could you provide a link to the file you are testing this problem on? |
Unfortunately, it's a private dataset that I have no permission to share 🫠 |
Give my latest PR #1768 a try. It's still a WIP, but if you compile it yourself, it should significantly reduce the hallucinations towards the end of the audio file. |
@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files: whisper.cpp just repeats the previous segment over and over, with a 2-3 s duration each time, until the speech resumes. I have samples I can send you privately via email/discord/etc., but I'd rather not post them on a public site, if that's ok with you. If necessary I'll try to come up with some public samples that reproduce the issue. |
Discord: |
Ok thanks I sent you a friend request on discord. |
openai/whisper#1962 |
@ggerganov any plans to implement #1838 (Skip silence around hallucinations)? |
It's likely true. This is because the approach Whisper uses to transcribe audio with and without timestamps varies significantly. When transcribing without timestamps, it processes the audio in 30-second segments, sequentially moving from one chunk to the next. However, when transcribing with timestamps, it operates differently. It first determines whether a segment is complete. If so, it proceeds to the next 30-second segment. If not, it adjusts its position based on the last timestamp token before resuming transcription. For instance, if the last timestamp token in a 30-second segment lands early in the window, the next pass starts from that timestamp and decodes audio it has already seen. This is likely to result in repetition. Additionally, we must now include timestamp tokens in our context, which has a fixed size, leaving less room for text tokens. |
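A simplified, self-contained sketch of the seek behaviour described above; it is illustrative only (not whisper.cpp's actual code), and WindowResult and decode_window are stand-ins:

```python
# Illustrative sketch of the difference described above; NOT whisper.cpp's actual code.
from dataclasses import dataclass

CHUNK_SECONDS = 30.0

@dataclass
class WindowResult:
    ends_cleanly: bool       # decoder judged the segment complete
    last_timestamp_s: float  # offset of the last timestamp token within the window

def decode_window(start_s: float, end_s: float) -> WindowResult:
    # Stand-in: a real implementation would run the Whisper decoder on this window.
    return WindowResult(ends_cleanly=True, last_timestamp_s=end_s - start_s)

def transcribe(audio_len_s: float, with_timestamps: bool) -> None:
    seek = 0.0
    while seek < audio_len_s:
        window = decode_window(seek, seek + CHUNK_SECONDS)
        if not with_timestamps or window.ends_cleanly:
            # Without timestamps the loop always advances by a full chunk.
            seek += CHUNK_SECONDS
        else:
            # With timestamps, an "incomplete" window only advances to its last
            # timestamp token, so overlapping (possibly silent) audio is decoded
            # again -- which is where the repetition can come from.
            seek += window.last_timestamp_s
```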
Very interesting! I'm thankful you took the time to investigate this further. |
Same problem here! Whispercpp (and I am not sure about regular Whisper) has substantial difficulties picking up a conversation after a long period of silence! |
Timestamp computation can degrade WER (ggerganov#1724)
Hello! In some experiments, I've noticed that in audio files that have silence at the end (even ~1s of silence), whispercpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I'm using the evaluate/predict functions from transformers, or transcribe from whisperx (although the latter uses VAD), which makes me think there's a parameter or something in whispercpp that may be making it prone to hallucination in these cases. Note that I'm using a converted fine-tuned base model (h5 to ggml). I'm using the latest 1.5.3 version, but this also happened in 1.5.2.
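For context, a sketch of the kind of transformers ASR pipeline call referenced above; the model id and audio path are placeholders (the issue uses a converted fine-tuned base model):

```python
# Hedged sketch of the transformers ASR pipeline mentioned above.
# "openai/whisper-base" and "audio.wav" are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
print(asr("audio.wav")["text"])
```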
An example below:
The transcription in
[00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária
is correct. But after that there is just about 1s of silence. After transcribing the first segment, it "hangs" for a second and then it hallucinates. (Note that the audio file being passed is OGG, but in code I'm converting it to WAV 16 kHz mono with ffmpeg.)
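A sketch of that OGG to 16 kHz mono WAV conversion; the paths are placeholders and the codec choice is an assumption:

```python
# Hedged sketch of the OGG -> 16 kHz mono WAV conversion described above.
import subprocess

def ogg_to_wav16k_mono(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",       # 16 kHz sample rate
         "-ac", "1",           # mono
         "-c:a", "pcm_s16le",  # 16-bit PCM WAV (assumed; any PCM WAV whisper.cpp accepts works)
         dst],
        check=True,
    )

ogg_to_wav16k_mono("audio.ogg", "audio.wav")
```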