fix: remove hallucinations from silent audio #1588
Conversation
Can we get some examples where this change makes a difference in the output?
It does reduce the incidence rate, but does not fix it yet.
openai/whisper skips the current segment if the probability of the no-speech token is high: https://github.com/openai/whisper/blob/e58f28804528831904c3b6f2c0e473f346223433/whisper/transcribe.py#L243-L255. We likely need this as well?
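For reference, the skip heuristic in the linked openai/whisper code boils down to something like the following (a standalone sketch: the default thresholds 0.6 and -1.0 match the upstream `DecodingOptions` defaults, but the function name and signature here are illustrative, not the actual API):

```python
def should_skip_segment(no_speech_prob: float,
                        avg_logprob: float,
                        no_speech_threshold: float = 0.6,
                        logprob_threshold: float = -1.0) -> bool:
    """Return True if a decoded segment should be treated as silence.

    no_speech_prob: probability of the no-speech token at the SOT position.
    avg_logprob:    average log-probability of the decoded tokens.
    """
    should_skip = no_speech_prob > no_speech_threshold
    # If decoding was confident anyway, keep the segment even though
    # the no-speech probability was high.
    if avg_logprob > logprob_threshold:
        should_skip = False
    return should_skip
```

The point is that the no-speech probability alone is not trusted: a confident transcription (high average log-probability) overrides it.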
I have re-tested it, but the improvement feels marginal. I will try to implement ignoring segments with a high probability of silence.
BTW, I am testing on some short segments that I have recorded. I can't get most of them to be reliably detected as silence with most models.
Whisper models?
Yes.
I have tried to log
At least this is heading in the right direction. I'm developing something similar to OpenAI's approach, utilizing their `_main_loop` (an excerpt from `whisper/decoding.py`; note how `no_speech_probs` is captured at the SOT position on the first step):

```python
# Excerpt from openai/whisper, whisper/decoding.py (DecodingTask._main_loop).
# Requires: import numpy as np; import torch; from torch import Tensor
def _main_loop(self, audio_features: Tensor, tokens: Tensor):
    n_batch = tokens.shape[0]
    sum_logprobs: Tensor = torch.zeros(n_batch, device=audio_features.device)
    no_speech_probs = [np.nan] * n_batch

    try:
        for i in range(self.sample_len):
            logits = self.inference.logits(tokens, audio_features)

            if (
                i == 0 and self.tokenizer.no_speech is not None
            ):  # save no_speech_probs
                probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
                no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()

            # now we need to consider the logits at the last token only
            logits = logits[:, -1]

            # apply the logit filters, e.g. for suppressing or applying penalty to repetitions
            for logit_filter in self.logit_filters:
                logit_filter.apply(logits, tokens)

            # expand the tokens tensor with the selected next tokens
            tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)

            if completed or tokens.shape[-1] > self.n_ctx:
                break
    finally:
        self.inference.cleanup_caching()

    return tokens, sum_logprobs, no_speech_probs
```
Sure |
```diff
-logits[vocab.token_nosp] = -INFINITY; // TODO: ignore this token for now
+// logits[vocab.token_nosp] = -INFINITY; // Uncommenting this would produce hallucinations on silent audio
```
Although it is said that `token_nosp` is the direction to solve the hallucination, cancelling the suppression of `token_nosp` is definitely problematic. First, we only want the model's output to contain meaningful, visible tokens (apart from timestamps); removing the suppression means `token_nosp` can now appear in the model's output, which is something we do not want. Second, the key to solving hallucination lies in finding a way to skip silence: `token_nosp` tells you how likely it is that a segment is silent, so that we can skip it. Merely cancelling the suppression of `token_nosp`, without any other action, cannot solve the hallucination.
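One way to reconcile the two concerns above is to read the no-speech probability from the raw logits first, and only then suppress the token before sampling, so it informs the skip decision but can never be emitted. A minimal NumPy sketch (the function name, greedy argmax, and single-step framing are illustrative assumptions, not whisper.cpp's actual decoder):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(logits: np.ndarray, nosp_id: int):
    """Greedy-pick the next token while recording P(no-speech).

    logits:  raw logits over the vocabulary for one position.
    nosp_id: index of the no-speech token in the vocabulary.
    """
    # Record P(no-speech) from the unsuppressed distribution first...
    no_speech_prob = float(softmax(logits)[nosp_id])

    # ...then suppress the token so it can never appear in the output.
    logits = logits.copy()
    logits[nosp_id] = -np.inf
    next_token = int(np.argmax(logits))
    return next_token, no_speech_prob
```

The caller can compare `no_speech_prob` against a threshold to drop the segment entirely, while the sampled output stays free of `token_nosp`.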
This line is very important: without it, the model hallucinates heavily on silent audio.