Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model #81

Open

regpath opened this issue Sep 13, 2022 · 0 comments
regpath commented Sep 13, 2022

The English-to-Korean translation produced by the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model does not make sense at all.

```python
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-tc-big-en-ko"

src_text = [
    "2, 4, 6 etc. are even numbers.",
    "Yes."
]

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```

The result is not `['2, 4, 6 등은 짝수입니다.', '그래']` ("2, 4, 6 etc. are even numbers.", "Yes.") as in the model card example, but `['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.']` (roughly "They, we, we are hats. It can be trusted.", "It is ATP."), which does not make sense at all.

I tried a few more sentences and believe that the correct tokenizer or vocab file would fix this problem.
Could you take a look at it?
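To illustrate why a mismatched vocab produces fluent-looking but unrelated text, here is a minimal, hypothetical sketch (plain Python, no model download; the vocab contents are made up for illustration): the model emits token IDs, and the tokenizer only maps those IDs back to strings, so decoding the same IDs against the wrong vocab yields an unrelated sentence.

```python
# Hypothetical illustration: a seq2seq model generates integer token IDs;
# the tokenizer's vocab maps each ID back to a surface token. If the vocab
# shipped with the checkpoint is not the one the model was trained with,
# every ID resolves to an unrelated token and the output becomes nonsense.
correct_vocab = {0: "2,", 1: "4,", 2: "6", 3: "etc.", 4: "are", 5: "even", 6: "numbers."}
wrong_vocab = {0: "They,", 1: "we,", 2: "we", 3: "are", 4: "hats.", 5: "It's", 6: "ATP."}

token_ids = [0, 1, 2, 3, 4, 5, 6]  # IDs the model generated

def decode(ids, vocab):
    """Map token IDs back to text using the given vocab."""
    return " ".join(vocab[i] for i in ids)

print(decode(token_ids, correct_vocab))  # coherent sentence
print(decode(token_ids, wrong_vocab))    # unrelated tokens, like the output above
```

This is consistent with the symptom here: the decoded Korean is grammatical-looking fragments that have nothing to do with the source, which points at the vocab/tokenizer files rather than the model weights.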

@regpath regpath changed the title Wrong tokenizer for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Sep 13, 2022