Make whisper transcribe numbers in the actual spoken words #1041
-
Hi, is there a way to get Whisper to transcribe numbers the way they are actually spoken rather than converting them to numeric format? I know I can postprocess the output, but that is a suboptimal solution: something like 2015 can be said several ways ('two thousand and fifteen', 'two thousand fifteen', or 'twenty fifteen'), and Whisper converts them all to the same number, while I need the actual spoken words. There's obviously some conversion process going on here, so is there a way to turn it off?
Replies: 3 comments 14 replies
-
It's not an explicit conversion; the model predicts the most likely textual output end-to-end. You can try the following, which blocks all numeric tokens and encourages the model to transcribe them literally.

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=False)  # use multilingual=True if using multilingual model
# ids of every regular token that decodes to digits only (optionally after one leading space)
number_tokens = [
    i
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]
...
# -1 keeps Whisper's default suppression of non-speech tokens; the numeric tokens are added on top
model.transcribe("audio.mp3", suppress_tokens=[-1] + number_tokens, ...)
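To make the filter above easier to check without loading a model, here is a minimal sketch of the same digit-only test run against a toy vocabulary. The names `is_digit_only`, `digit_token_ids`, `TOY_VOCAB`, and `toy_decode` are hypothetical and not part of Whisper. The sketch also adds one assumption of its own: it skips tokens that are empty after the leading space is removed, since `all(...)` over an empty string is `True` and such a token would otherwise be counted as numeric.

```python
DIGITS = "0123456789"


def is_digit_only(text: str) -> bool:
    """True if text, minus one leading space, is a non-empty run of ASCII digits."""
    stripped = text.removeprefix(" ")  # str.removeprefix needs Python 3.9+
    # all() over an empty string is True, so guard against empty text explicitly.
    return bool(stripped) and all(c in DIGITS for c in stripped)


def digit_token_ids(decode, vocab_size):
    """Collect ids in range(vocab_size) whose decoded text is digit-only."""
    return [i for i in range(vocab_size) if is_digit_only(decode([i]))]


# Hypothetical toy vocabulary standing in for the real tokenizer.
TOY_VOCAB = [" the", "20", " 15", " twenty", " ", "3rd", "7"]


def toy_decode(ids):
    """Mimic a tokenizer's decode(): join the text of the given ids."""
    return "".join(TOY_VOCAB[i] for i in ids)


number_tokens = digit_token_ids(toy_decode, len(TOY_VOCAB))
print(number_tokens)  # [1, 2, 6]
```

With the real tokenizer, the equivalent call would presumably be `digit_token_ids(tokenizer.decode, tokenizer.eot)`. Mixed tokens such as "3rd" are left alone, so the model can still emit digits through them.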
-
@jongwook Hi, I tried this method, but I got this error: I also checked the Tokenizer, and the eot API returns int, int array. Is that correct?
-
In case anyone is looking for how to do this for Hugging Face Transformers: