forcing whisper to transcribe in a specific language is NOT working, it always translates #2285

welliX · 2024-07-31T13:04:05Z

I have a WAV file in which the following is spoken: "This is a small TTS test sentenced to show general quality." When I run
whisper X.wav
the result is OK:
Detected language: English
[00:00.000 --> 00:03.600] This is a small TTS test sentenced to show general quality.

However, if I want recognition/transcription (NOT translation) in a different language, I get:
whisper --language fr --task transcribe X.wav
[00:00.000 --> 00:03.600] C'est un test de TTS pour montrer la qualité générale.

But I don't want a French translation of the English transcription; I want a French transcription of the English utterance (even if that does not result in a meaningful sequence of French words).

Same with German:
whisper --language de --task transcribe X.wav
[00:00.000 --> 00:03.500] Das ist ein kleiner TTS-Test, die die Generalqualität zeigt.

How can Whisper be forced to transcribe in the language that is set by --language?

(I'm using openai-whisper-20230314; maybe this has been corrected in newer versions?)
Many thanks for any help!

ryanheise · 2024-07-31T13:43:18Z

Whisper wasn't trained to do that task. From its training on the transcribe task, it learns to predict the transcript when given just the audio and the language of that audio. So you give it the English audio file together with the language "English", and Whisper predicts the corresponding English transcript.

If you give Whisper the wrong language, for example an English audio file while telling Whisper that --language is French, the resulting behaviour is not documented, and it wouldn't be accurate to describe it as a bug: the behaviour is simply unspecified. If it ends up translating, that is probably because Whisper was also independently trained on the translate task to a limited degree (only into English), and because the training data probably also contained some unclean data in which audio was labelled with the wrong language, so it's getting its wires crossed. And because --language is set to French, Whisper is primed to output French words, which has probably tipped the odds in favour of this sort of behaviour.
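
To make the "primed" part a bit more concrete: as far as I know from the published Whisper design, decoding starts from a short prefix of special tokens that names the language and the task. The sketch below only builds the textual form of that prefix. `decoder_prefix` is a hypothetical helper written for illustration, not a function of the whisper package.

```python
# Illustrative sketch only: this does not call the whisper package.
# It spells out the special-token prefix the decoder is conditioned on,
# under the assumption that the prefix is
# <|startoftranscript|>, a language token, then a task token.

def decoder_prefix(language_code: str, task: str = "transcribe") -> str:
    """Return the textual form of the decoder prefix for a language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    return f"<|startoftranscript|><|{language_code}|><|{task}|>"


# Prefix when the language matches the English audio:
print(decoder_prefix("en"))
# Prefix produced by `--language fr --task transcribe`:
print(decoder_prefix("fr"))
```

With `--language fr` the prefix names French, so the decoder starts out in a state that favours French text, whatever language the audio is actually in.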

If you want different behaviour, Whisper would need to be trained for that. It would need a training dataset where your audio file containing the utterance "This is a small TTS test sentenced to show general quality." is labelled as a French audio file, not an English one, and then you would need to have in the dataset the corresponding "French" transcript so that Whisper can train to be able to produce what you want. I'll assume that you probably don't want to retrain Whisper for your task since it is expensive to train models, but that would also mean you'd have to limit yourself to using Whisper within the capabilities it has from its current training.

P.S. I'm not actually sure what a French transcript of an English audio file should look like. I can imagine what a Japanese transcript of an English audio file might look like, maybe with the English words spelled out in Japanese characters like katakana. But French already contains all the letters of the English alphabet, so I'd expect that if you are in France and you asked for a transcription of this "English" audio utterance, you should actually want the English transcription, and you should therefore instruct Whisper accordingly, by giving the --language English option.

If the real issue is that you have multiple languages in the same audio recording, say it's mostly French but this one sentence is English, that is called code switching, and Whisper wasn't specifically trained on code switching either. It may have learnt how to do this sometimes where the training data incidentally happened to contain examples of code switching, but generally that is not the case. You would need to cut the audio up into the English part and the French part, then run Whisper on the English part with --language English and on the French part with --language French. You can search this discussion board for "code switching" to find discussions about that.
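
A minimal sketch of that cut-and-run-per-language workflow, using only Python's standard `wave` module. It assumes you already know where each language starts and stops (the timestamps in the usage note are made up), and `cut_wav`, `whisper_command` and `plan_segments` are hypothetical helper names. The commands are only built here, not executed.

```python
import os
import wave


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp both boundaries so they never run past the end of the file.
        start_frame = min(int(start_s * rate), total)
        end_frame = min(int(end_s * rate), total)
        src.setpos(start_frame)
        frames = src.readframes(max(0, end_frame - start_frame))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and rate
        dst.writeframes(frames)  # the frame count in the header is fixed on close
    return dst_path


def whisper_command(wav_path, language):
    """Build the CLI call for one segment, naming the language actually spoken in it."""
    return ["whisper", wav_path, "--language", language, "--task", "transcribe"]


def plan_segments(src_path, segments):
    """Cut src_path into (start_s, end_s, language) pieces and return one command per piece."""
    base = os.path.splitext(src_path)[0]
    commands = []
    for i, (start_s, end_s, language) in enumerate(segments):
        part_path = f"{base}_part{i}_{language}.wav"
        cut_wav(src_path, part_path, start_s, end_s)
        commands.append(whisper_command(part_path, language))
    return commands
```

For example, `plan_segments("X.wav", [(0.0, 12.5, "fr"), (12.5, 16.0, "en")])` would write two files next to `X.wav` and return two argument lists, each of which could then be passed to `subprocess.run`.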

welliX Jul 31, 2024
Author

Dear Ryan,
many thanks for your speedy and profound answer!
Guess I got it. It seems I'm still somewhat stuck in an old-fashioned view of automatic speech recognition (ASR), where you run a French ASR on an English utterance and expect French words that resemble the phonetics of the English input words.
kind regards!
