llama fast tokenizer: FileNotFound error when saving model checkpoint and self.vocab_file does not exist #25602

Closed
ZhangShiyue opened this issue Aug 18, 2023 · 3 comments · Fixed by #25626

Comments

@ZhangShiyue

System Info

transformers==4.31.0
torch==2.0.1

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Traceback (most recent call last):
....
  File "python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "python3.9/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "python3.9/site-packages/transformers/trainer.py", line 2237, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "python3.9/site-packages/transformers/trainer.py", line 2294, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "python3.9/site-packages/transformers/trainer.py", line 2749, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "python3.9/site-packages/transformers/trainer.py", line 2832, in _save
    self.tokenizer.save_pretrained(output_dir)
  File "python3.9/site-packages/transformers/tokenization_utils_base.py", line 2221, in save_pretrained
    save_files = self._save_pretrained(
  File "python3.9/site-packages/transformers/tokenization_utils_fast.py", line 595, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "python3.9/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 186, in save_vocabulary
    copyfile(self.vocab_file, out_vocab_file)
  File "/opt/bb/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: './model/tokenizer.model'

Expected behavior

When I fine-tuned llama, it threw this error while saving the first checkpoint, because the original model directory had been deleted.
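For reference, a minimal sketch of the failure mode (paths mirror the traceback; the exact fine-tuning setup is omitted):

```python
import shutil
from transformers import LlamaTokenizerFast

# Load the fast tokenizer from a local directory that still contains
# tokenizer.model (the sentencepiece file).
tokenizer = LlamaTokenizerFast.from_pretrained("./model")

# The original model directory gets deleted at some point during training.
shutil.rmtree("./model")

# Trainer._save eventually calls tokenizer.save_pretrained(output_dir);
# in 4.31.0 LlamaTokenizerFast.save_vocabulary calls
# copyfile(self.vocab_file, out_vocab_file) unconditionally, so this raises
# FileNotFoundError: './model/tokenizer.model'.
tokenizer.save_pretrained("./checkpoint-1")
```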

I noticed that in tokenization_llama.py (the slow tokenizer), save_vocabulary checks whether self.vocab_file exists before copying it:

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):

Could this check be added to tokenization_llama_fast.py too?
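A rough sketch of that guard applied to LlamaTokenizerFast.save_vocabulary (illustrative only, not the actual patch from #25626):

```python
import os
from shutil import copyfile

def save_vocabulary(self, save_directory, filename_prefix=None):
    out_vocab_file = os.path.join(
        save_directory,
        (filename_prefix + "-" if filename_prefix else "") + "tokenizer.model",
    )
    # Skip the copy when the original sentencepiece file is gone,
    # mirroring the check already present in the slow tokenizer.
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
        copyfile(self.vocab_file, out_vocab_file)
    return (out_vocab_file,)
```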

@ArthurZucker
Collaborator

ArthurZucker commented Aug 21, 2023

Sure. The problem is that with the fast tokenizer we cannot recover the content of vocab_file if the repo was deleted. We can, however, produce a warning mentioning that you won't be able to initialize a slow tokenizer. Opening a PR to fix this! Thanks for reporting.
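Roughly, the warning-based behaviour described here could look like this (a sketch under the same assumptions as above, not the exact change merged in #25626):

```python
import os
from transformers.utils import logging

logger = logging.get_logger(__name__)

def save_vocabulary(self, save_directory, filename_prefix=None):
    # If the sentencepiece file is missing we cannot copy it; warn instead
    # of crashing, and note that the checkpoint won't load as a slow tokenizer.
    if not (self.vocab_file and os.path.isfile(self.vocab_file)):
        logger.warning(
            f"Cannot copy {self.vocab_file} to {save_directory}: the file does not "
            "exist. You will not be able to load a slow tokenizer from this checkpoint."
        )
        return ()
    ...  # otherwise copy tokenizer.model as before
```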

@ZhangShiyue
Author

Thanks a lot! Does it mean a fast tokenizer can still be initialized if vocab_file does not exist?

@ArthurZucker
Collaborator

It depends: if you have a tokenizer.json file, then yes; if not, you cannot convert from the slow tokenizer, since the vocab_file (which in this case is the sentencepiece model) was deleted.
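In other words, once a checkpoint contains tokenizer.json, the fast tokenizer reloads from it on its own; only the slow (sentencepiece-based) tokenizer needs tokenizer.model. A small illustration (the checkpoint path is a placeholder):

```python
from transformers import AutoTokenizer

# Checkpoint saved with tokenizer.json but without tokenizer.model:
# the fast tokenizer loads fine from the serialized tokenizers file.
tok_fast = AutoTokenizer.from_pretrained("./checkpoint-1", use_fast=True)

# The slow tokenizer needs the sentencepiece model, so this fails
# if tokenizer.model was never copied into the checkpoint.
tok_slow = AutoTokenizer.from_pretrained("./checkpoint-1", use_fast=False)
```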
