llama fast tokenizer: FileNotFound error when saving model checkpoint and self.vocab_file does not exist #25602

Closed
ZhangShiyue opened this issue Aug 18, 2023 · 3 comments · Fixed by #25626

Comments

@ZhangShiyue

System Info

transformers==4.31.0
torch==2.0.1

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Traceback (most recent call last):
....
  File "python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "python3.9/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "python3.9/site-packages/transformers/trainer.py", line 2237, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "python3.9/site-packages/transformers/trainer.py", line 2294, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "python3.9/site-packages/transformers/trainer.py", line 2749, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "python3.9/site-packages/transformers/trainer.py", line 2832, in _save
    self.tokenizer.save_pretrained(output_dir)
  File "python3.9/site-packages/transformers/tokenization_utils_base.py", line 2221, in save_pretrained
    save_files = self._save_pretrained(
  File "python3.9/site-packages/transformers/tokenization_utils_fast.py", line 595, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "python3.9/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 186, in save_vocabulary
    copyfile(self.vocab_file, out_vocab_file)
  File "/opt/bb/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: './model/tokenizer.model'

Expected behavior

When I fine-tuned llama, it threw this error while saving the first checkpoint, because the original model directory had been deleted.
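For reference, a minimal sketch of the failure mode (paths mirror the traceback; the exact fine-tuning setup is omitted):

```python
import shutil
from transformers import LlamaTokenizerFast

# Load the fast tokenizer from a local directory that still contains
# tokenizer.model (the sentencepiece file).
tokenizer = LlamaTokenizerFast.from_pretrained("./model")

# The original model directory gets deleted at some point during training.
shutil.rmtree("./model")

# Trainer._save eventually calls tokenizer.save_pretrained(output_dir);
# in 4.31.0 LlamaTokenizerFast.save_vocabulary calls
# copyfile(self.vocab_file, out_vocab_file) unconditionally, so this raises
# FileNotFoundError: './model/tokenizer.model'.
tokenizer.save_pretrained("./checkpoint-1")
```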

I noticed that in tokenization_llama.py (the slow tokenizer), save_vocabulary checks whether self.vocab_file exists before copying it:

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):

Could this check be added to tokenization_llama_fast.py too?
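A rough sketch of that guard applied to LlamaTokenizerFast.save_vocabulary (illustrative only, not the actual patch from #25626):

```python
import os
from shutil import copyfile

def save_vocabulary(self, save_directory, filename_prefix=None):
    out_vocab_file = os.path.join(
        save_directory,
        (filename_prefix + "-" if filename_prefix else "") + "tokenizer.model",
    )
    # Skip the copy when the original sentencepiece file is gone,
    # mirroring the check already present in the slow tokenizer.
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
        copyfile(self.vocab_file, out_vocab_file)
    return (out_vocab_file,)
```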

@ArthurZucker
Collaborator

ArthurZucker commented Aug 21, 2023

Sure. The problem is that with the fast tokenizer we cannot recover the content of vocab_file if the repo was deleted. We can, however, produce a warning mentioning that you won't be able to initialize a slow tokenizer. Opening a PR to fix this! Thanks for reporting.
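Roughly, the warning-based behaviour described here could look like this (a sketch under the same assumptions as above, not the exact change merged in #25626):

```python
import os
from transformers.utils import logging

logger = logging.get_logger(__name__)

def save_vocabulary(self, save_directory, filename_prefix=None):
    # If the sentencepiece file is missing we cannot copy it; warn instead
    # of crashing, and note that the checkpoint won't load as a slow tokenizer.
    if not (self.vocab_file and os.path.isfile(self.vocab_file)):
        logger.warning(
            f"Cannot copy {self.vocab_file} to {save_directory}: the file does not "
            "exist. You will not be able to load a slow tokenizer from this checkpoint."
        )
        return ()
    ...  # otherwise copy tokenizer.model as before
```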

@ZhangShiyue
Author

Thanks a lot! Does it mean a fast tokenizer can still be initialized if vocab_file does not exist?

@ArthurZucker
Collaborator

It depends: if you have a tokenizer.json file, then yes; if not, you cannot convert from the slow tokenizer, since the vocab_file (which in this case is the sentencepiece model) was deleted.
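In other words, once a checkpoint contains tokenizer.json, the fast tokenizer reloads from it on its own; only the slow (sentencepiece-based) tokenizer needs tokenizer.model. A small illustration (the checkpoint path is a placeholder):

```python
from transformers import AutoTokenizer

# Checkpoint saved with tokenizer.json but without tokenizer.model:
# the fast tokenizer loads fine from the serialized tokenizers file.
tok_fast = AutoTokenizer.from_pretrained("./checkpoint-1", use_fast=True)

# The slow tokenizer needs the sentencepiece model, so this fails
# if tokenizer.model was never copied into the checkpoint.
tok_slow = AutoTokenizer.from_pretrained("./checkpoint-1", use_fast=False)
```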
