Marian's Sentencepiece Not Passing Case Encoding Command Through Marian #420

Open
Kiryukhasemenov opened this issue Mar 10, 2024 · 1 comment
@Kiryukhasemenov

Summary
I am trying to reproduce the case-encoding feature of your sentencepiece version that was presented in the paper. It works when I run your sentencepiece on its own, but not within Marian's built-in sentencepiece pipeline. The parameters appear to be picked up by Marian but lost on the way to sentencepiece.

Bug description

I was running Marian training with the built-in sentencepiece vocabulary.

In the training configuration, I put the following parameters into the sentencepiece options:

sentencepiece-options: "--treat_whitespace_as_suffix --encode_unicode_case --remove_extra_whitespaces=false --encode_case --decode_case --character_coverage=0.988"

Marian detected the parameters (see stdout.txt):

[2024-03-06 00:23:24] [config] sentencepiece-options: --treat_whitespace_as_suffix --encode_unicode_case

However, when sentencepiece is invoked, these parameters seem to be lost:

  encode_case: 0
  decode_case: 0
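
A mismatch like this can be checked mechanically against a saved training log. The following is only a minimal sketch and not part of Marian or sentencepiece: `requested_but_off` is a hypothetical helper, and it assumes that the log prints settings as `key: integer` lines (as in the excerpt above) and that a flag's name matches its key in the log.

```python
import re
import shlex


def requested_but_off(options: str, log_text: str) -> list[str]:
    """Return requested flags that appear in the log with value 0 everywhere.

    Hypothetical helper: assumes "key: integer" log lines and that the
    flag name (without the leading "--") equals the logged key.
    """
    # Collect flag names from the option string, e.g. "--encode_case" -> "encode_case".
    requested = []
    for token in shlex.split(options):
        if token.startswith("--"):
            name, _, value = token[2:].partition("=")
            # Skip flags explicitly switched off, e.g. --remove_extra_whitespaces=false.
            if value.lower() not in ("false", "0"):
                requested.append(name)

    # Collect every "key: integer" line from the log; a key may occur several times.
    logged: dict[str, list[int]] = {}
    for key, value in re.findall(r"^\s*(\w+):\s*(\d+)\s*$", log_text, flags=re.MULTILINE):
        logged.setdefault(key, []).append(int(value))

    # Report only flags that are present in the log and never non-zero.
    return [name for name in requested if name in logged and not any(logged[name])]
```

With the option string from the configuration and the two log lines shown above, this returns `["encode_case", "decode_case"]`. Flags that do not appear in the log text at all are skipped rather than reported.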

Additional notes:

  1. I tried passing other parameters through the sentencepiece options (such as --character_coverage), as well as explicit True values for the --treat_whitespace_as_suffix and --encode_unicode_case parameters. Finally, I tried various orderings of these parameters. Every attempt gave the same result.
  2. I installed Marian's sentencepiece separately and ran this command:
spm_train --encode_unicode_case --treat_whitespace_as_suffix --input csuk_toy1M.txt --model_prefix case_encoded

This worked, and the setting was reflected in the log:

normalizer_spec {
...
  encode_case: 1
  decode_case: 0
}
denormalizer_spec {
...
  encode_case: 0
  decode_case: 1
}
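
Because the same key carries different values in `normalizer_spec` and `denormalizer_spec`, comparing two logs is easier when the values are grouped by block. The sketch below is an illustration under assumptions, not sentencepiece's own parser: `parse_spec_blocks` is a hypothetical name, and it assumes the `name {` ... `}` layout with `key: integer` fields shown in the log above.

```python
import re


def parse_spec_blocks(log_text: str) -> dict[str, dict[str, int]]:
    """Group "key: integer" fields by the "name { ... }" block they appear in.

    Hypothetical helper; lines outside a block and lines that are not
    integer fields (such as an elided "...") are ignored.
    """
    blocks: dict[str, dict[str, int]] = {}
    current = None  # name of the block currently being read, if any
    for line in log_text.splitlines():
        stripped = line.strip()
        opened = re.fullmatch(r"(\w+)\s*\{", stripped)
        if opened:
            current = opened.group(1)
            blocks[current] = {}
        elif stripped == "}":
            current = None
        elif current is not None:
            field = re.fullmatch(r"(\w+):\s*(\d+)", stripped)
            if field:
                blocks[current][field.group(1)] = int(field.group(2))
    return blocks
```

Applied to the log above, the result maps `normalizer_spec` to `{"encode_case": 1, "decode_case": 0}` and `denormalizer_spec` to `{"encode_case": 0, "decode_case": 1}`, which can then be compared with the same structure parsed from the Marian run.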

Context

  • Marian version: v1.12.14 ba5df660 2023-11-22 02:00:31 -0800 (commit ba5df6606f96eaab26a18a668317072a2d6742e4, (HEAD -> master, origin/master, origin/HEAD))
  • sentencepiece version: sentencepiece 0.1.94 (commit fb6f8e408d2078ebfedc8ccc33985fef03c50b0e (HEAD))

I would appreciate any help!

@snukky
Member

snukky commented Apr 1, 2024

Thanks for reporting. Cc @rjai
