[TokenizerFast] can_save_slow_tokenizer as a property for when vocab_file's folder was removed #25626

Merged
merged 11 commits into from
Aug 31, 2023

Collaborator

@ArthurZucker ArthurZucker commented Aug 21, 2023

What does this PR do?

Fixes #25602 by making `can_save_slow_tokenizer` a property rather than an attribute, as we need to check whether the `vocab_file` still exists!
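
To illustrate why a property matters here, the following is a minimal sketch rather than the actual transformers code. `EagerTokenizer`, `LazyTokenizer`, and `demo` are hypothetical names made up for this example; only `vocab_file` and `can_save_slow_tokenizer` come from this PR. An attribute computed in `__init__` keeps its initial value after the file disappears, whereas a property re-checks the filesystem on every access.

```python
import os
import shutil
import tempfile


class EagerTokenizer:
    """Hypothetical stand-in: the flag is computed once, at construction."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file
        # Snapshot of the filesystem state; never refreshed afterwards.
        self.can_save_slow_tokenizer = bool(vocab_file) and os.path.isfile(vocab_file)


class LazyTokenizer:
    """Hypothetical stand-in: the flag is recomputed on every access."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file

    @property
    def can_save_slow_tokenizer(self) -> bool:
        # Re-check the file each time, so a deleted folder is noticed.
        return os.path.isfile(self.vocab_file) if self.vocab_file else False


def demo():
    """Create a vocab file, build both tokenizers, then delete the folder."""
    folder = tempfile.mkdtemp()
    vocab_file = os.path.join(folder, "tokenizer.model")
    with open(vocab_file, "wb") as f:
        f.write(b"dummy")

    eager = EagerTokenizer(vocab_file)
    lazy = LazyTokenizer(vocab_file)
    before = (eager.can_save_slow_tokenizer, lazy.can_save_slow_tokenizer)

    shutil.rmtree(folder)  # simulate the vocab_file's folder being removed
    after = (eager.can_save_slow_tokenizer, lazy.can_save_slow_tokenizer)
    return before, after
```

In this sketch the eager flag still claims the slow tokenizer can be saved after the folder is gone, which is the kind of stale state that would lead to a `FileNotFound` error at save time.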

HuggingFaceDocBuilderDev commented Aug 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review August 21, 2023 11:34
@ArthurZucker ArthurZucker changed the title [TokenizerFast] Warn when vocab_file folder was removed [TokenizerFast] can_save_slow as a property for when vocab_file's folder was removed Aug 30, 2023
@ArthurZucker ArthurZucker changed the title [TokenizerFast] can_save_slow as a property for when vocab_file's folder was removed [TokenizerFast] can_save_slow_tokenizer as a property for when vocab_file's folder was removed Aug 30, 2023
Collaborator

@amyeroberts amyeroberts left a comment

Thanks for updating!

@@ -189,6 +189,10 @@ def __init__(
        for k in self.fairseq_tokens_to_ids.keys():
            self.unique_no_split_tokens.append(k)

    @property
    def can_save_slow_tokenizer(self) -> bool:
Collaborator

Do we know why the tokenizer didn't have this as an attribute before?

Collaborator Author

Haha good question, I think it's because the problem was never really detected. In this special case the folder containing the sentencepiece model was deleted, which never or rarely happens when the file lives in the transformers cache.

@ArthurZucker ArthurZucker merged commit 3b39b90 into huggingface:main Aug 31, 2023
3 checks passed
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
…ocab_file`'s folder was removed (huggingface#25626)

* pad token should be None by default

* fix tests

* nits

* check if isfile vocabfile

* add warning if sp model folder was deleted

* save SPM when missing folder for slow

* update the `can_save_slow_tokenizer` to be a property

* first batch

* second batch

* missing one
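
The commit messages above mention checking `isfile` on the vocab file, warning when the sentencepiece folder was deleted, and still saving the SPM. A rough sketch of such a fallback is given below under stated assumptions: `save_vocabulary_sketch` and its parameters are hypothetical names, `"tokenizer.model"` is an assumed output filename, and `serialized_model` stands in for the bytes that an in-memory sentencepiece model could provide. It is not the merged implementation.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)


def save_vocabulary_sketch(vocab_file, serialized_model, save_directory):
    """Hypothetical helper: copy the vocab file if it still exists on disk,
    otherwise rewrite it from the in-memory serialized model bytes."""
    os.makedirs(save_directory, exist_ok=True)
    out_vocab_file = os.path.join(save_directory, "tokenizer.model")

    if os.path.isfile(vocab_file):
        # Normal case: the original file is still there, so copy it
        # (unless source and destination are the same path).
        if os.path.abspath(vocab_file) != os.path.abspath(out_vocab_file):
            shutil.copyfile(vocab_file, out_vocab_file)
    else:
        # The folder holding the model was removed after loading:
        # warn, then fall back to the bytes kept in memory.
        logger.warning(
            "%s no longer exists; writing the vocabulary from memory instead.",
            vocab_file,
        )
        with open(out_vocab_file, "wb") as f:
            f.write(serialized_model)

    return out_vocab_file
```

The design choice in this sketch is to degrade gracefully: a missing source file produces a warning and a rewritten file instead of an exception.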
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
Successfully merging this pull request may close these issues.

llama fast tokenizer: FileNotFound error when saving model checkpoint and self.vocab_file does not exist