fix: remove hallucinations from silent audio #1588

Closed
wants to merge 2 commits into from

@ex3ndr ex3ndr commented Dec 3, 2023

This line is very important since on silent audio it would heavily hallucinate.

@ggerganov
Owner

Can we get some examples where this change makes a difference in the output?

@bygreencn

bygreencn commented Dec 4, 2023

It does reduce the incidence rate, but it does not fix the problem yet.
Moreover, could anyone check the information at openai 914 and openai 1155?

@jxy
Contributor

jxy commented Dec 5, 2023

openai/whisper skips the current segment if the probability of the no-speech token is high: https://github.com/openai/whisper/blob/e58f28804528831904c3b6f2c0e473f346223433/whisper/transcribe.py#L243-L255

We likely need this?
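
For reference, the check at the linked lines can be sketched roughly as follows. This is a paraphrase rather than the upstream code: `should_skip_segment` is a hypothetical helper name, and the default thresholds (0.6 and -1.0) are assumed to mirror the `no_speech_threshold` and `logprob_threshold` defaults of `whisper.transcribe`.

```python
from typing import Optional


def should_skip_segment(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,   # assumed default
    logprob_threshold: Optional[float] = -1.0,    # assumed default
) -> bool:
    """Return True when a segment looks silent and should not be transcribed."""
    if no_speech_threshold is None:
        return False  # silence detection disabled
    skip = no_speech_prob > no_speech_threshold
    # A sufficiently confident decode overrides the no-speech signal.
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False
    return skip
```

Under these assumptions a segment is dropped only when the no-speech probability is high and the decoded text is also low-confidence, so a high no-speech probability alone is not enough to skip it.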

@ex3ndr
Author

ex3ndr commented Dec 6, 2023

I have re-tested it, but it feels like only a marginal improvement. I will try to implement ignoring segments with a high probability of silence.

@ex3ndr
Author

ex3ndr commented Dec 6, 2023

BTW, I am testing on these short segments that I have recorded. I can't get most of them to be reliably detected as silence in most models.
silent.zip

@bobqianic
Collaborator

> in most models.

Whisper models?

@ex3ndr
Author

ex3ndr commented Dec 7, 2023

> > in most models.
>
> Whisper models?

Yes.

@ex3ndr
Author

ex3ndr commented Dec 7, 2023

I have tried logging the NOSP probabilities instead of the token probabilities, and they are almost always zero in most cases, so I don't think this is the reason it doesn't perform well.

@ex3ndr
Author

ex3ndr commented Dec 7, 2023

[image]
I have printed out each token, then the probability of that token, then the probability of the NOSP token at the same position.

@bobqianic
Collaborator

bobqianic commented Jan 16, 2024

At least this is heading in the right direction. I'm developing something similar to OpenAI's approach, utilizing token_nosp to detect periods of silence.

    def _main_loop(self, audio_features: Tensor, tokens: Tensor):
        n_batch = tokens.shape[0]
        sum_logprobs: Tensor = torch.zeros(n_batch, device=audio_features.device)
        no_speech_probs = [np.nan] * n_batch

        try:
            for i in range(self.sample_len):
                logits = self.inference.logits(tokens, audio_features)

                if (
                    i == 0 and self.tokenizer.no_speech is not None
                ):  # save no_speech_probs
                    probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
                    no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()

                # now we need to consider the logits at the last token only
                logits = logits[:, -1]

                # apply the logit filters, e.g. for suppressing or applying penalty to
                for logit_filter in self.logit_filters:
                    logit_filter.apply(logits, tokens)

                # expand the tokens tensor with the selected next tokens
                tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)

                if completed or tokens.shape[-1] > self.n_ctx:
                    break
        finally:
            self.inference.cleanup_caching()

        return tokens, sum_logprobs, no_speech_probs
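
The detail worth noting in this loop is that the no-speech probability is read once, on the first step, from the logits at the start-of-transcript position, not at every decoded token. A minimal pure-Python sketch of that one step follows; the toy vocabulary of five tokens, the position indices, and the id used for the no-speech token are all made up for illustration.

```python
import math
from typing import List, Sequence


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def no_speech_prob_at(logits: Sequence[Sequence[float]], position: int, no_speech_id: int) -> float:
    """Probability of the no-speech token at one sequence position."""
    return softmax(logits[position])[no_speech_id]


# Toy logits: 3 positions x 5 tokens; id 4 stands in for the no-speech token.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0, 10.0],  # SOT position strongly predicts no-speech
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0, 0.0],  # last position predicts an ordinary token
]
p_sot = no_speech_prob_at(toy_logits, position=0, no_speech_id=4)
p_last = no_speech_prob_at(toy_logits, position=2, no_speech_id=4)
```

In this toy case `p_sot` is close to 1 while `p_last` is close to 0, which illustrates how the probability read at later token positions can look negligible even when the SOT position signals silence.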

@bobqianic
Collaborator

> the openai/whisper skips the current segment if the probability of no speech token is high https://github.com/openai/whisper/blob/e58f28804528831904c3b6f2c0e473f346223433/whisper/transcribe.py#L243-L255
>
> we likely need this?

Sure

@bobqianic bobqianic closed this Jan 16, 2024
- logits[vocab.token_nosp] = -INFINITY; // TODO: ignore this token for now
+ // logits[vocab.token_nosp] = -INFINITY; // Uncommenting this would produce hallucinations on silent audio
Collaborator

Although token_nosp does point toward a solution for the hallucinations, removing the suppression of token_nosp is definitely problematic. First, we want the model's output to contain only meaningful, visible tokens (apart from timestamps). Without the suppression, token_nosp can appear in the model's output, which is something we do not want. Second, the key to solving hallucination lies in finding a way to skip silence. token_nosp tells us how likely it is that a segment is silent, so that we can skip it. Therefore, merely removing the suppression of token_nosp, without any other change, cannot solve the hallucination.
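
One way to reconcile the two points in this comment is to read the no-speech probability from the raw logits first, and only then suppress the token so that it can never be emitted. The sketch below is illustrative Python, not whisper.cpp code: `NO_SPEECH_ID`, `NO_SPEECH_THRESHOLD`, and `pick_token_or_skip` are hypothetical names, and the whole decision is collapsed into a single decoding step for brevity.

```python
import math
from typing import List, Optional, Tuple

NO_SPEECH_ID = 3           # hypothetical token id
NO_SPEECH_THRESHOLD = 0.6  # assumed threshold


def pick_token_or_skip(logits: List[float]) -> Tuple[Optional[int], float]:
    """Return (token_id, no_speech_prob); token_id is None if the segment is skipped."""
    # 1. Measure the no-speech probability on the unmodified logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    no_speech_prob = exps[NO_SPEECH_ID] / sum(exps)

    # 2. Treat the segment as silence instead of decoding text for it.
    if no_speech_prob > NO_SPEECH_THRESHOLD:
        return None, no_speech_prob

    # 3. Otherwise suppress the token so it cannot be emitted, then pick greedily.
    masked = list(logits)
    masked[NO_SPEECH_ID] = -math.inf
    token_id = max(range(len(masked)), key=masked.__getitem__)
    return token_id, no_speech_prob
```

With logits such as `[2.0, 0.0, 0.0, 2.5]` the no-speech token has the largest raw logit but stays below the threshold, so the sketch returns token 0: the suppression keeps the token out of the output while its probability remains available for the skip decision.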


5 participants