Fix the decoding issues #1768

bobqianic · 2024-01-14T15:15:26Z

revert change

Patch

ukolovda · 2024-02-20T10:10:11Z

I've added an issue about a zero-filled WAV:
#1881

ukolovda · 2024-02-20T13:22:09Z

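The file from #1881 (a zero-filled WAV) produces a hallucination in this version too. The transcribed text in the output below, "Продолжение следует...", is Russian for "To be continued...".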

$ ../whisper.cpp-bobqianic/main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094,86 MB (3 buffers)
whisper_model_load: model size    = 3094,36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220,20 MB
whisper_init_state: kv cross size =  245,76 MB
whisper_init_state: compute buffer (conv)   =   35,50 MB
whisper_init_state: compute buffer (encode) =  233,50 MB
whisper_init_state: compute buffer (cross)  =   10,15 MB
whisper_init_state: compute buffer (decode) =  108,99 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

run: processing 'samples/zeroes.wav' (19200 samples, 1,2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Продолжение следует...


whisper_print_timings:     load time =   781,61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     4,81 ms
whisper_print_timings:   sample time =    28,10 ms /    79 runs (    0,36 ms per run)
whisper_print_timings:   encode time =   162,31 ms /     1 runs (  162,31 ms per run)
whisper_print_timings:   decode time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:   batchd time =   482,89 ms /    77 runs (    6,27 ms per run)
whisper_print_timings:   prompt time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:    total time =  1502,74 ms
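To reproduce this without the attachment from #1881, a zero-filled WAV of the same length can be generated locally. This is a minimal sketch, assuming 16 kHz mono 16-bit PCM (19200 samples in 1.2 sec implies 16 kHz); the real file in #1881 may use a different format, and the function name is made up for illustration.

```python
import wave


def write_silent_wav(path, n_samples=19200, sample_rate=16000):
    """Write a mono 16-bit PCM WAV file that contains only zero samples."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # assumed 16 kHz
        wav.writeframes(b"\x00\x00" * n_samples)


if __name__ == "__main__":
    write_silent_wav("zeroes.wav")
```

The resulting zeroes.wav can be passed to the same main command shown above.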

linmi · 2024-02-21T08:43:26Z

--output-json-full has problems with the output format.

Language: Chinese

thewh1teagle · 2024-03-31T22:38:34Z

What's the status of this PR? Is it safe to use?
I'm experiencing decoding issues:
thewh1teagle/vibe#34

jwijffels · 2024-04-05T07:48:45Z

I'm thinking about including this pull request in the R wrapper audio.whisper. The current approach there for handling some of the hallucinations is to use the R packages audio.vadwebrtc or audio.vadsilero to detect silences or other non-voiced signals, and then either

- loop over the detected non-silence sections in the audio, instead of looping over different files in the main loop, or
- create a new audio file with only the voiced audio and recompute the timestamps later on by adding back what was left out.

I haven't looked at this pull request in detail (I only skimmed the logic that changed in main.cpp and whisper.cpp). Would it make sense to incorporate it into audio.whisper already, or are a lot of changes still to be expected here? Or is this pull request going to be split into a BPE change (#1854) and a change to how non-speech is handled?
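The second option (transcribe a voiced-only file, then add back what was cut) comes down to a small timestamp remapping. A minimal sketch in Python, assuming the VAD yields sorted, non-overlapping (start, end) pairs in seconds; the function names are hypothetical and not part of audio.whisper or whisper.cpp.

```python
from bisect import bisect_right


def build_offsets(voiced):
    """Return where each voiced segment starts inside the concatenated audio."""
    starts = []
    pos = 0.0
    for start, end in voiced:
        starts.append(pos)
        pos += end - start
    return starts


def to_original_time(t, voiced, starts):
    """Map a timestamp in the voiced-only audio back to the original timeline."""
    # Find the last segment whose concatenated start is <= t.
    i = max(bisect_right(starts, t) - 1, 0)
    seg_start, seg_end = voiced[i]
    # Clamp so a timestamp past the end maps to the end of the last segment.
    return min(seg_start + (t - starts[i]), seg_end)


if __name__ == "__main__":
    voiced = [(2.0, 5.0), (10.0, 12.0)]  # speech regions in the original file
    starts = build_offsets(voiced)
    print(to_original_time(1.0, voiced, starts))
    print(to_original_time(3.5, voiced, starts))
```

A timestamp that falls exactly on a segment boundary is assigned to the start of the following segment; that is a design choice of this sketch, not something the thread specifies.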

ronyfadel · 2024-04-30T17:12:55Z

@bobqianic are you pursuing this at the moment?

bobqianic · 2024-04-30T17:45:39Z

> @bobqianic are you pursuing this at the moment?

No, at least not in May. I'm really tied up with a lot of things this month.

bygreencn · 2024-05-14T12:35:18Z

> I'm thinking about including this pull request in the R wrapper audio.whisper. The current approach there for handling some of the hallucinations is to use the R packages audio.vadwebrtc or audio.vadsilero to detect silences or other non-voiced signals, and then either
>
> - loop over the detected non-silence sections in the audio, instead of looping over different files in the main loop, or
> - create a new audio file with only the voiced audio and recompute the timestamps later on by adding back what was left out.
>
> I haven't looked at this pull request in detail (I only skimmed the logic that changed in main.cpp and whisper.cpp). Would it make sense to incorporate it into audio.whisper already, or are a lot of changes still to be expected here? Or is this pull request going to be split into a BPE change (#1854) and a change to how non-speech is handled?

The best way to integrate Silero voice activity detection into whisper.cpp is to add the third-party onnxruntime 1.12.1 DLL and then call the Silero ONNX model. My branch already does this. Even with VAD, hallucinations on silent intervals still happen.

IntendedConsequence · 2024-05-21T07:25:27Z

> The best way to integrate Silero voice activity detection into whisper.cpp is to add the third-party onnxruntime 1.12.1 DLL and then call the Silero ONNX model. My branch already does this. Even with VAD, hallucinations on silent intervals still happen.

I recommend considering a previous Silero VAD version, namely v3.1. The current version v4 (at the moment of writing) often hallucinates speech on lengthy chunks of silent or near-silent audio segments.
snakers4/silero-vad#369
snakers4/silero-vad#396

But you have to add a heavyweight dependency like onnxruntime just to run a 750KB model. The smallest size I could possibly reduce onnxruntime.dll to was about 2.2MB, which is still 3x the size of silero weights, and requires a lengthy custom build of onnxruntime from source with reduced operator set configs and other size reduction options. And prebuilt redistributables are easily 5-9 MB or more.

I have a working Silero v3.1 implementation in pure C, but as much as I would like to suggest it as an option, the code is quite bad; I wrote it as a personal project for learning low-level neural nets.

ziegenberg · 2024-06-24T08:50:22Z

@bobqianic, Could you rebase your changes? I'd like to test those improvements of yours with production data on our setup.

examples/main/main.cpp

Fix compatibility issue

bobqianic · 2024-06-25T14:58:21Z

@ziegenberg I did some testing, and it LGTM. If the CI is mostly green, you can proceed with your testing now.

ziegenberg · 2024-06-25T15:09:23Z

I already did some testing and fixed some of the errors on my own. It looks promising. I see fewer hallucinations, but I need to gather more statistics. I will switch to your branch for the next tests.

Is your PR #1854 also related to this improvement?

bobqianic · 2024-06-25T15:15:41Z

> Is your PR #1854 also related to this improvement?

PR #1854 is a subset of this PR, meaning this PR includes everything in PR #1854.

ziegenberg · 2024-06-25T15:28:41Z

What data/statistics would you need from my side to consider this PR validated and get it merged?

bobqianic · 2024-06-26T07:00:26Z

> What data/statistics would you need from my side to consider this PR validated and get it merged?

Thank you. If you have the ground truth text, please calculate the WER.
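For reference, WER (word error rate) is the word-level edit distance between the ground truth and the transcript, divided by the number of words in the ground truth. A minimal sketch in Python, assuming plain whitespace tokenization with no normalization of case or punctuation (real evaluations usually normalize first); the function name is illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat", "the dog sat"))
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the ground truth, which is typical of hallucinated output.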

Makememo · 2024-06-27T01:30:49Z

I tested on anime audio using medium.en and found a problem with the timestamps partway through the output.

file: https://dropover.cloud/f7

e020

ziegenberg · 2024-07-23T08:22:42Z

Hi @Makememo,
was this a singular incident or does this happen regularly?

Makememo · 2024-07-23T08:28:42Z

> Hi @Makememo,
>
> was this a singular incident or does this happen regularly?

I tested three videos and had this problem.

What these videos have in common is a music section, after which the timeline starts to go wrong.

ronyfadel · 2024-10-15T00:54:53Z

@ziegenberg @bobqianic just wanted to check in on this: is it still needed? Will it land any time soon?

ziegenberg · 2024-11-07T15:25:16Z

We have now tested this patch extensively.

Summary

We see fewer hallucinations and an overall improvement in accuracy.

Details

Hallucinations still happen if there is a period of time with no spoken words, or with music without words. If this period is longer than 30 seconds, it completely messes up the next couple of minutes, and the model hallucinates wildly. We "fixed" this by processing the input with Silero VAD first and letting whisper.cpp analyze only the parts where speech was detected, using the --offset-t N and --duration N options. This works astonishingly well!
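The VAD-first workflow can be sketched as follows. This is an illustration under stated assumptions, not the setup actually used here: the speech segments are taken as (start, end) pairs in seconds from any VAD, both options are assumed to take milliseconds (as listed in the help output of main; worth double-checking against your build), and the helper names, the 0.5 sec merge gap, and the default paths are hypothetical.

```python
def merge_segments(segments, max_gap=0.5):
    """Merge speech segments separated by at most max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def whisper_commands(wav_path, segments,
                     binary="./main", model="./models/ggml-large-v3.bin"):
    """Build one whisper.cpp command line per merged speech segment."""
    commands = []
    for start, end in merge_segments(segments):
        offset_ms = int(round(start * 1000))
        duration_ms = int(round((end - start) * 1000))
        commands.append([binary, "-m", model,
                         "--offset-t", str(offset_ms),
                         "--duration", str(duration_ms),
                         wav_path])
    return commands


if __name__ == "__main__":
    speech = [(1.0, 4.0), (4.25, 6.0), (40.0, 45.5)]  # example VAD output
    for cmd in whisper_commands("lecture.wav", speech):
        print(" ".join(cmd))
```

Each command list can be handed to subprocess.run. Merging nearby segments first avoids launching a separate run for every short pause.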

We have no real statistics to show as we have mostly lecture recordings with heavy use of German and Austrian German dialects. This makes generating a validated transcription very difficult. Whisper.cpp mostly corrects the grammar mistakes from the lecturers, which would result in a higher Word Error Rate, but in reality, the transcription is better.

Conclusion

We will use this patch in production from now on. In my opinion, this can be merged.

itsthisjustin · 2024-11-07T15:43:49Z

We'll test on our end now too. Is there anything "special" that needs to be done in order to test this using the Swift package, @ziegenberg?

ziegenberg · 2024-11-07T21:34:18Z

I have no experience with Swift, sorry.

ziegenberg · 2024-11-11T11:17:12Z

Hi @bobqianic, would you be willing to rebase this once more?

Add files via upload

71a65e7

This was linked to issues Jan 14, 2024

Invalid encoding #1761

Open

Unicode Error for Hindi transcription #1700

Open

bobqianic added the research🔬 label Jan 14, 2024

bobqianic mentioned this pull request Jan 14, 2024

examples: Fix the encoding issues on Windows #1313

Closed

4 tasks

Add files via upload

8301f88

This was linked to issues Jan 15, 2024

Streaming Output Repetition #1702

Open

Duplicate sentences in results and mistake timestamps #1745

Open

Audio less than 1s long silently fails all transcription #1603

Open

Automatically adds "Thank you" #1592

Open

bobqianic added 16 commits January 15, 2024 19:38

Add files via upload

1226204

revert change

c53c33b

Delete server directory

dfef69e

Merge pull request #1 from bobqianic/bobqianic-patch-1

7499e3c

revert change

Add files via upload

6648641

Add files via upload

9d0ebd1

Add files via upload

c8528a7

Merge pull request #2 from bobqianic/patch

7047d32

Patch

Add files via upload

96a9349

Fix ruby and go bindings

4b3a211

Add files via upload

3818acb

Add files via upload

b5c4d5c

Revert some changes

80589d2

Revert some changes

271c321

Merge branch 'ggerganov:master' into fix-decoding

5ea1d91

Remove hallucination by using token_nosp

41df3f0

bobqianic mentioned this pull request Jan 16, 2024

Hallucination on silence #1724

Open

edit some comments

2676819

bobqianic added the decoding Decoding related issues label Jan 17, 2024

bobqianic linked an issue Jan 17, 2024 that may be closed by this pull request

Prompt tokenization does not match openai/whisper #1098

Open

jwijffels mentioned this pull request Mar 25, 2024

Notes on repetitions bnosac/audio.whisper#38

Open

tamo mentioned this pull request May 28, 2024

JSON Output Contains Garbled Characters for Chinese Audio Transcription #2180

Open

bobqianic added 2 commits June 24, 2024 14:01

Merge branch 'master' into fix-decoding

f38b659

Update whisper.cpp

2b61aec

ziegenberg reviewed Jun 24, 2024

examples/main/main.cpp (outdated)

bobqianic added 2 commits June 25, 2024 15:53

Add files via upload

a53175a

Merge pull request #12 from bobqianic/base

7ea8a64

Fix compatibility issue
