Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model #81

Open

regpath opened this issue Sep 13, 2022 · 0 comments
regpath commented Sep 13, 2022

The English-to-Korean translation produced by the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model does not make sense at all.

```python
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-tc-big-en-ko"

src_text = [
    "2, 4, 6 etc. are even numbers.",
    "Yes."
]

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```

The result is not `['2, 4, 6 등은 짝수입니다.', '그래']` ("2, 4, 6 etc. are even numbers.", "Yes.") as in the model card example, but `['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.']` (roughly "They, we, we are hats. It can be trusted.", "It is ATP."), which does not make sense at all.

I tried a few more sentences and believe that the correct tokenizer or vocab file would fix this problem.
Could you take a look at it?
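To illustrate why a mismatched vocab produces fluent-looking but unrelated text, here is a minimal, hypothetical sketch (plain Python, no model download; the vocab contents are made up for illustration): the model emits token IDs, and the tokenizer only maps those IDs back to strings, so decoding the same IDs against the wrong vocab yields an unrelated sentence.

```python
# Hypothetical illustration: a seq2seq model generates integer token IDs;
# the tokenizer's vocab maps each ID back to a surface token. If the vocab
# shipped with the checkpoint is not the one the model was trained with,
# every ID resolves to an unrelated token and the output becomes nonsense.
correct_vocab = {0: "2,", 1: "4,", 2: "6", 3: "etc.", 4: "are", 5: "even", 6: "numbers."}
wrong_vocab = {0: "They,", 1: "we,", 2: "we", 3: "are", 4: "hats.", 5: "It's", 6: "ATP."}

token_ids = [0, 1, 2, 3, 4, 5, 6]  # IDs the model generated

def decode(ids, vocab):
    """Map token IDs back to text using the given vocab."""
    return " ".join(vocab[i] for i in ids)

print(decode(token_ids, correct_vocab))  # coherent sentence
print(decode(token_ids, wrong_vocab))    # unrelated tokens, like the output above
```

This is consistent with the symptom here: the decoded Korean is grammatical-looking fragments that have nothing to do with the source, which points at the vocab/tokenizer files rather than the model weights.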

@regpath regpath changed the title Wrong tokenizer for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Sep 13, 2022