How can we train the model and tokenizer on a new language (wasn't part of the model training)? #2388

NazimHAli · 2024-10-13T18:44:48Z


I'm looking for a guide or example on how to train the model and tokenizer on a new language, meaning any language that wasn't part of pre-training and isn't in the tokenizer's list of supported languages. There's a similar thread, but it covers fine-tuning on a locale of a language that's already pre-trained.

Edit: I tried fine-tuning, but it fails when the language isn't in the list of supported languages. I also can't substitute one of the existing languages, because none of them is the same as mine.

[/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/tokenization_whisper.py](https://localhost:8080/#) in prefix_tokens()
    418             else:
    419                 is_language_code = len(self.language) == 2
--> 420                 raise ValueError(
    421                     f"Unsupported language: {self.language}. Language should be one of:"
    422                     f" {list(TO_LANGUAGE_CODE.values()) if is_language_code else list(TO_LANGUAGE_CODE.keys())}."
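
For context, the error seems to come from a lookup against a fixed language table. Below is a minimal sketch of that kind of check, not the library's actual code: `resolve_language` is a hypothetical helper, the two-entry `TO_LANGUAGE_CODE` dict is a stand-in for the real table, and the two lookup branches before the `else` are assumed because the traceback doesn't show them.

```python
# Hypothetical two-entry stand-in for the tokenizer's TO_LANGUAGE_CODE table.
TO_LANGUAGE_CODE = {"english": "en", "german": "de"}


def resolve_language(language: str) -> str:
    """Return a language code, or raise in the same way as the traceback."""
    language = language.lower()
    # Assumed branch: the full language name is a key in the table.
    if language in TO_LANGUAGE_CODE:
        return TO_LANGUAGE_CODE[language]
    # Assumed branch: the value passed in is already a known code.
    if language in TO_LANGUAGE_CODE.values():
        return language
    # Mirrors the lines visible in the traceback: pick which list to show.
    is_language_code = len(language) == 2
    options = (
        list(TO_LANGUAGE_CODE.values())
        if is_language_code
        else list(TO_LANGUAGE_CODE.keys())
    )
    raise ValueError(
        f"Unsupported language: {language}. Language should be one of: {options}."
    )
```

In this sketch, `resolve_language("German")` returns `"de"`, while any name or code missing from the table raises `ValueError`. If the real check behaves like this, a new language is rejected during tokenizer setup, before any training step runs.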
