fix: remove hallucinations from silent audio #1588
Conversation
Can we get some examples where this change makes a difference in the output?
It does reduce the incidence rate, but does not fix it yet.
openai/whisper skips the current segment if the probability of the no-speech token is high: https://github.com/openai/whisper/blob/e58f28804528831904c3b6f2c0e473f346223433/whisper/transcribe.py#L243-L255. We likely need this as well?
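For reference, the skip heuristic in the linked openai/whisper code boils down to something like the following (a standalone sketch: the default thresholds 0.6 and -1.0 match the upstream `DecodingOptions` defaults, but the function name and signature here are illustrative, not the actual API):

```python
def should_skip_segment(no_speech_prob: float,
                        avg_logprob: float,
                        no_speech_threshold: float = 0.6,
                        logprob_threshold: float = -1.0) -> bool:
    """Return True if a decoded segment should be treated as silence.

    no_speech_prob: probability of the no-speech token at the SOT position.
    avg_logprob:    average log-probability of the decoded tokens.
    """
    should_skip = no_speech_prob > no_speech_threshold
    # If decoding was confident anyway, keep the segment even though
    # the no-speech probability was high.
    if avg_logprob > logprob_threshold:
        should_skip = False
    return should_skip
```

The point is that the no-speech probability alone is not trusted: a confident transcription (high average log-probability) overrides it.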
I have re-tested it, but the improvement feels marginal. I will try to implement ignoring segments with a high probability of silence.
BTW, I am testing on some short segments that I have recorded. I can't get most of them to be reliably detected as silence with most models.
Whisper models?
Yes.
I have tried to log
At least this is heading in the right direction. I'm developing something similar to OpenAI's approach, utilizing their `_main_loop` (an excerpt from `whisper/decoding.py`; note how `no_speech_probs` is captured at the SOT position on the first step):

```python
# Excerpt from openai/whisper, whisper/decoding.py (DecodingTask._main_loop).
# Requires: import numpy as np; import torch; from torch import Tensor
def _main_loop(self, audio_features: Tensor, tokens: Tensor):
    n_batch = tokens.shape[0]
    sum_logprobs: Tensor = torch.zeros(n_batch, device=audio_features.device)
    no_speech_probs = [np.nan] * n_batch

    try:
        for i in range(self.sample_len):
            logits = self.inference.logits(tokens, audio_features)

            if (
                i == 0 and self.tokenizer.no_speech is not None
            ):  # save no_speech_probs
                probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
                no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()

            # now we need to consider the logits at the last token only
            logits = logits[:, -1]

            # apply the logit filters, e.g. for suppressing or applying penalty to repetitions
            for logit_filter in self.logit_filters:
                logit_filter.apply(logits, tokens)

            # expand the tokens tensor with the selected next tokens
            tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)

            if completed or tokens.shape[-1] > self.n_ctx:
                break
    finally:
        self.inference.cleanup_caching()

    return tokens, sum_logprobs, no_speech_probs
```
Sure |
```diff
-logits[vocab.token_nosp] = -INFINITY; // TODO: ignore this token for now
+// logits[vocab.token_nosp] = -INFINITY; // Uncommenting this would produce hallucinations on silent audio
```
Although it is said that `token_nosp` is the direction to solve the hallucination, cancelling the suppression of `token_nosp` is definitely problematic. First, we only want the model's output to contain meaningful, visible tokens (apart from timestamps); removing the suppression means `token_nosp` can now appear in the model's output, which is something we do not want. Second, the key to solving hallucination lies in finding a way to skip silence: `token_nosp` tells you how likely it is that a segment is silent, so that we can skip it. Merely cancelling the suppression of `token_nosp`, without any other action, cannot solve the hallucination.
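One way to reconcile the two concerns above is to read the no-speech probability from the raw logits first, and only then suppress the token before sampling, so it informs the skip decision but can never be emitted. A minimal NumPy sketch (the function name, greedy argmax, and single-step framing are illustrative assumptions, not whisper.cpp's actual decoder):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(logits: np.ndarray, nosp_id: int):
    """Greedy-pick the next token while recording P(no-speech).

    logits:  raw logits over the vocabulary for one position.
    nosp_id: index of the no-speech token in the vocabulary.
    """
    # Record P(no-speech) from the unsuppressed distribution first...
    no_speech_prob = float(softmax(logits)[nosp_id])

    # ...then suppress the token so it can never appear in the output.
    logits = logits.copy()
    logits[nosp_id] = -np.inf
    next_token = int(np.argmax(logits))
    return next_token, no_speech_prob
```

The caller can compare `no_speech_prob` against a threshold to drop the segment entirely, while the sampled output stays free of `token_nosp`.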
This line is very important: without it, the model hallucinates heavily on silent audio.