Replies: 7 comments 24 replies
-
The short answer is no, and this comment from the maintainers sums it up:
The behaviour when multiple languages are present seems to be unpredictable with the current models. These past discussions may also be helpful.
-
One way of doing this is to combine speaker diarization with Whisper. Here is standalone example code showing how this works.
Notes
I tested this with a short audio clip (narrated in Portuguese with segments of Polish) and it worked well. I don't know how well pyannote will identify speakers across different languages; this is my first time trying it. This is just a demo, and obviously more code would be necessary to merge transcripts. Partial output sample showing the start/stop times, the speaker id, the language identified, and the transcript of each audio block:
From Euronews, "Parlamento polaco investiga utilização do software Pegasus" ("Polish parliament investigates use of the Pegasus software").
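A minimal sketch of the approach (assuming pyannote/speaker-diarization-3.1 and openai-whisper; the token, file name, and model size below are placeholders, not the original script):

# Minimal sketch: diarize with pyannote, then transcribe each speaker turn with whisper.
# "HF_TOKEN", "audio.mp3", and the model size are placeholders for illustration.
import whisper
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")
model = whisper.load_model("medium")

diarization = pipeline("audio.mp3")       # who spoke when
audio = whisper.load_audio("audio.mp3")   # 16 kHz mono float32 array
SAMPLE_RATE = 16_000

for turn, _, speaker in diarization.itertracks(yield_label=True):
    chunk = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    result = model.transcribe(chunk)      # language is auto-detected per chunk
    print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker} "
          f"[{result['language']}]: {result['text'].strip()}")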
-
Here is updated code that supports output of SRT or VTT files from the diarized transcripts. This is a little tricky because multiple whisper transcriptions are done to create a single file. large-v3 should also work fine now. At the bottom there is a sample main() showing how to use the new classes.

import os
from typing import Any, Optional, TextIO
from pyannote.audio import Pipeline, Audio
import whisper
from whisper.utils import WriteSRT, WriteVTT
from whisper import Whisper
import torch
from math import ceil, floor


def diarize_audio(HF_AUTH_TOKEN, AUDIO_FILE):
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=HF_AUTH_TOKEN)
    # Send pyannote pipeline to GPU (when available)
    device: str = ""
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    pipeline.to(torch.device(device))
    print(f"Diarize audio on {device}")
    ### diarization = pipeline(AUDIO_FILE)
    io = Audio(mono='downmix', sample_rate=16000)
    waveform, sample_rate = io(AUDIO_FILE)
    diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
    return diarization


class AppendResultsMixin:
    """Class to return srt or vtt file path and open mode of write or append
    to allow incremental writing.
    """
    first_call: bool = True
    output_path: str = ''

    def get_path_and_open_mode(self, *, audio_path: str, dir: str, ext: str) -> tuple[str, str]:
        mode: str
        if self.first_call:
            audio_basename = os.path.basename(audio_path)
            audio_basename = os.path.splitext(audio_basename)[0]
            self.output_path: str = os.path.join(dir, audio_basename + "." + ext)
            self.first_call = False
            mode = 'w'  # open for write initially
        else:
            mode = 'a'  # open for append after
        return self.output_path, mode


class WriteSRTIncremental(AppendResultsMixin, WriteSRT):
    """Incrementally create an SRT file with multiple calls appending new entries
    to the file.
    """
    srt_index: int = 1  # index for srt blocks retained across multiple calls

    def __call__(self, result: dict, audio_path: str, options: Optional[dict] = None, **kwargs):
        path, mode = self.get_path_and_open_mode(audio_path=audio_path, dir=self.output_dir, ext=self.extension)
        with open(path, mode, encoding="utf-8") as f:
            self.write_result(result, file=f, options=options, **kwargs)  # type: ignore

    def write_result(
        self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
    ):
        for (start, end, text) in self.iterate_result(result, options, **kwargs):
            print(f"{self.srt_index}\n{start} --> {end}\n{text}\n", file=file, flush=True)
            self.srt_index += 1


class WriteVTTIncremental(AppendResultsMixin, WriteVTT):
    """Incrementally create a VTT file with multiple calls appending new entries
    to the file.
    """

    def __call__(self, result: dict, audio_path: str, options: Optional[dict] = None, **kwargs):
        path, mode = self.get_path_and_open_mode(audio_path=audio_path, dir=self.output_dir, ext=self.extension)
        with open(path, mode, encoding="utf-8") as f:
            if mode != 'a':
                print("WEBVTT\n", file=f)
            self.write_result(result, file=f, options=options, **kwargs)  # type: ignore

    def write_result(
        self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
    ):
        for start, end, text in self.iterate_result(result, options, **kwargs):
            print(f"{start} --> {end}\n{text}\n", file=file, flush=True)


class WhisperFacade:
    wmodel: Whisper

    def __init__(self, model: str, *, quantize=False) -> None:
        """Load the Whisper model and optionally quantize."""
        print("Initialize whisper")
        whisper_model = whisper.load_model(model)
        if quantize:
            print("Quantize")
            DTYPE = torch.qint8
            qmodel: Whisper = torch.quantization.quantize_dynamic(
                whisper_model, {torch.nn.Linear}, dtype=DTYPE)
            del whisper_model
            self.wmodel = qmodel
        else:
            self.wmodel = whisper_model

    def _set_timing_for(self, segment: dict[str, float],  # simplified typing
                        offset: float) -> None:
        """For speech fragments in different parts of an audio file, patch the
        whisper segment and word timing using the offset (typically the diarization offset)
        in seconds. This makes the timing accurate for subtitles when multiple
        calls to whisper are used for various parts of the audio.
        """
        s = segment
        s['start'] += offset
        s['end'] += offset
        # Update word start/stop times, if present
        if 'words' in s:
            w: dict[str, float]  # simplified typing
            for w in s['words']:  # type: ignore
                w['start'] += offset
                w['end'] += offset

    def load_audio(self, file_path: str):
        self.audio = whisper.load_audio(file_path)

    def transcribe(self, *, start: float, end: float, options: dict[str, Any]) -> dict[str, Any]:
        """Transcribe from start time to end time (both in seconds)."""
        SAMPLE_RATE = 16_000  # 16 kHz audio
        start_index = floor(start * SAMPLE_RATE)
        end_index = ceil(end * SAMPLE_RATE)
        audio_segment = self.audio[start_index:end_index]
        result = whisper.transcribe(self.wmodel, audio_segment, **options)
        #
        segments = result['segments']
        s: dict[str, float]  # simplified typing
        for s in segments:  # type: ignore
            self._set_timing_for(segment=s, offset=start)
        return result


# Demonstration of creating srt or vtt subtitles from multilanguage audio
HF_AUTH_TOKEN = "INSERT YOUR TOKEN"
AUDIO_FILE = "INSERT YOUR FILE.mp3"
torch.set_num_threads(6)  # change as appropriate


def main():
    diarization = diarize_audio(HF_AUTH_TOKEN, AUDIO_FILE)
    model = WhisperFacade(model='medium', quantize=True)
    model.load_audio(AUDIO_FILE)
    #
    writer = WriteSRTIncremental('.')
    # writer = WriteVTTIncremental('.')
    whisper_options = {"verbose": None, "word_timestamps": True,
                       "task": "transcribe", "suppress_tokens": ""}
    writer_options = {"max_line_width": 55, "max_line_count": 2, "highlight_words": False}
    print("Process diarized blocks")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.end - turn.start < 0.5:  # Suppress short utterances (pyannote artifact)
            print(f"start={turn.start:.1f}s stop={turn.end:.1f}s IGNORED")
            continue
        result = model.transcribe(start=turn.start, end=turn.end, options=whisper_options)
        language = result['language']
        print(f"start={turn.start:.1f}s stop={turn.end:.1f}s lang={language} {speaker}")
        writer(result, AUDIO_FILE, writer_options)


if __name__ == '__main__':
    main()
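For reference, the incremental writer emits standard SRT blocks, one batch per diarized turn; the entries below are purely illustrative, not actual output:

1
00:00:00,000 --> 00:00:04,500
First subtitle text from the first diarized block

2
00:00:04,500 --> 00:00:09,000
Second subtitle text, possibly in a different language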
-
Does the diarization model support the Spanish language if my audio is exclusively in that language?
-
I have audio with a single speaker switching between their native language (Marathi) and English. Speaker diarization algorithms will not work, as it is the same speaker switching languages. Has anyone encountered this problem and found a solution?
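One possible workaround (a sketch only, not a tested solution) is to skip diarization and run Whisper's language detection on fixed-length windows, then transcribe each window with the detected language; the window length, model size, and file name below are assumptions:

# Sketch: per-window language detection for a single speaker who code-switches.
# Window length, model size, and file name are assumptions for illustration.
import whisper

model = whisper.load_model("medium")
audio = whisper.load_audio("speech.mp3")
SAMPLE_RATE = 16_000
WINDOW = 30 * SAMPLE_RATE  # whisper's native 30-second context

for i in range(0, len(audio), WINDOW):
    chunk = whisper.pad_or_trim(audio[i:i + WINDOW])
    mel = whisper.log_mel_spectrogram(chunk, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    result = model.transcribe(audio[i:i + WINDOW], language=lang, task="transcribe")
    print(f"{i / SAMPLE_RATE:.1f}s [{lang}] {result['text'].strip()}")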
-
Whisper is at its worst when it comes to Bengali.
-
For mixed (multiple) languages in an audio file, I think Whisper should learn from Microsoft Cognitive Services, which supports mixed languages.
-
Hey Community!
I've been experimenting with Whisper (locally installed) for a project involving multi-language audio transcription. My audio samples contain several languages. However, the results have been somewhat perplexing, and I'm hoping to gain insights on what I am doing wrong.
I created a sample audio file that starts with a sentence in German, followed by two sentences in English, and concludes with two sentences in Spanish. Given that the default task for Whisper is transcribe (as per the whisper --help documentation), my expectation was a straightforward transcription, not a translation. However, what I get back from Whisper is a translation into English.
Observed Behavior:
Default Model with No Explicit Task or Language Arguments: All output text was translated into English, contrary to my expectation of a transcription retaining the original languages.
Explicit Transcription Task with No Language Argument: Similar to the first scenario, the output was entirely in English, ignoring the multi-language nature of the audio. Using --task transcribe had no effect.
Setting the Language Argument to "de": Interestingly, this produced the correct transcription, preserving all three languages (German, English, and Spanish) as spoken in the audio.
It seems that Whisper can accurately detect and switch between languages when a language argument is specified. However, the output does not align with what one might expect based on the command-line arguments used. This discrepancy leaves me unsure about the output's nature: whether it is a direct transcription or a translation.
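For what it's worth, the same three scenarios can be reproduced through the Python API, which takes the same task and language options as the CLI; the file name and model size below are placeholders:

# Sketch of the three scenarios via the Python API (file name and model size are placeholders).
import whisper

model = whisper.load_model("small")

# 1) No task or language argument: the language is detected from the first 30 seconds.
r1 = model.transcribe("mixed_langs.mp3")

# 2) Explicit transcribe task, still no language argument.
r2 = model.transcribe("mixed_langs.mp3", task="transcribe")

# 3) Language pinned to German, the API equivalent of --language de on the CLI.
r3 = model.transcribe("mixed_langs.mp3", language="de", task="transcribe")

for r in (r1, r2, r3):
    print(r["language"], r["text"][:80])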
I have noticed that when switching to the medium or large model (using the --model medium or --model large argument), the transcription detects the different languages correctly and outputs them as spoken, independent of the --task argument.
My Questions:
I appreciate any insights, experiences, or advice you can share.
Thank you in advance for your help!
Phil
System Info: