
Adjust HuggingFaceModel token embedding resizing to only occur when necessary #2027

Merged 13 commits into mosaicml:dev on Mar 6, 2023

Conversation

@dakinggg (Contributor) commented Mar 2, 2023

What does this PR do?

Previously, HuggingFaceModel automatically resized the model vocab size to match the tokenizer vocab size. This can cause an issue when the model vocab size is intentionally rounded up to a multiple of 8 or 64 and therefore does not match the tokenizer vocab size. This PR changes the behavior to only resize the model vocab size when necessary, i.e., when the model vocab size is smaller than the tokenizer vocab size. When the tokenizer vocab size is smaller than the model vocab size, we now just raise a warning.
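
As a rough illustration of the rule described above (hypothetical helper name, not the actual composer code):

import warnings

def _maybe_resize_token_embeddings(model, tokenizer):
    # Illustrative sketch only; names and structure are not the composer implementation.
    model_vocab_size = model.config.vocab_size
    tokenizer_vocab_size = len(tokenizer)  # counts added special tokens too
    if model_vocab_size < tokenizer_vocab_size:
        # The tokenizer can emit ids the embedding table cannot index, so resize.
        model.resize_token_embeddings(tokenizer_vocab_size)
    elif model_vocab_size > tokenizer_vocab_size:
        # Possibly intentional (vocab padded to a multiple of 8 or 64), so only warn.
        warnings.warn(
            f'Model vocab size ({model_vocab_size}) > tokenizer vocab size '
            f'({tokenizer_vocab_size}); leaving the token embeddings unchanged.')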

What issue(s) does this change relate to?

Closes CO-1861

Before submitting

  • Have you read the contributor guidelines?
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg requested a review from a team as a code owner on March 2, 2023 22:55
@dskhudia (Contributor) commented Mar 2, 2023

I think this was the issue a customer was running into. They had a checkpoint created by composer but couldn't load state_dict['state']['model'] into the HF model due to vocab size mismatch.

IMO we shouldn't resize at all and should let it fail. We are silently changing the number of params in a model and that creates issues with loading the checkpoint outside of composer.
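
To make the failure mode concrete, a minimal sketch with a bare embedding layer rather than a full HF model (sizes only for illustration):

import torch

# The checkpoint was saved after the embeddings were silently resized to 32100 rows...
resized = torch.nn.Embedding(num_embeddings=32100, embedding_dim=16)
checkpoint_state = resized.state_dict()

# ...but the serving side re-creates the model from its default config (32128 rows).
original = torch.nn.Embedding(num_embeddings=32128, embedding_dim=16)
original.load_state_dict(checkpoint_state)
# RuntimeError: size mismatch for weight: copying a param with shape
# torch.Size([32100, 16]) from checkpoint, the shape in current model is torch.Size([32128, 16]).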

@dakinggg (Contributor, Author) commented Mar 2, 2023

Yup, that is correct @dskhudia. For the case where the embedding vocab size is less than the tokenizer vocab size, the training run will crash at some unknown point with a nasty CUDA error. The biggest issue is that this could happen deep into training, and I'd like to keep users from getting to that point. Does that sound reasonable to you? Or would you still prefer to just raise a warning and let the user deal with it?
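
A minimal sketch of why that crash happens (plain torch, numbers only for illustration):

import torch

embedding = torch.nn.Embedding(num_embeddings=32100, embedding_dim=16)

# A tokenizer with a larger vocab can eventually emit an id the embedding table
# cannot index. On CPU this raises an IndexError right away; on GPU it surfaces as
# an opaque device-side assert (CUDA error), possibly only when the offending token
# finally shows up in a batch deep into training.
token_ids = torch.tensor([5, 42, 32105])
embedding(token_ids)  # IndexError: index out of range in self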

@dakinggg (Contributor, Author) commented Mar 2, 2023

Or are you suggesting we raise an error here instead?

@dskhudia (Contributor) commented Mar 2, 2023

I was suggesting to raise an error.

@alextrott16 (Contributor) commented

For the first time ever, I think I might disagree with Daya :) But I am not usually thinking about the model's life post-composer.

I agree that confusion can arise from having composer change the model parameters without telling you. It would make it difficult to correctly instantiate a HF model that could accept the trained weights outside composer. You'd have to construct the model with the vocab size that composer imposed, which could be easy to lose track of. Happily, that info should be packed right beside the actual weights in the composer checkpoint, though. Because Daniel has HuggingFaceModel use the save_pretrained stuff, all the necessary metadata gets packed into the composer checkpoint. So, the final model config is still available when the user gets the weights for whatever their downstream use is. (@dakinggg correct me if I'm wrong.)

Where I disagree is that I think an error may be too restrictive. An error would put the same burden on the user (just earlier in the process) of always remembering the correct vocab size whenever the default model config doesn't give you a vocab size that matches the tokenizer you want to use, which is a very plausible situation. Plus, it may prevent you from using pre-trained HF weights if you set a non-default vocab size in the model config. That could create hassles when working within composer.

One use case that comes to mind is UL2R, where you may have to add special tokens to the vocab. In this case, it is necessary to get the pre-trained weights from HF and then use resize_token_embeddings to add tokens for the new vocab. Our HuggingFaceModel makes it really easy to handle this use-case simply by passing in a tokenizer that has had the special tokens added. It's great!
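
For concreteness, a rough sketch of that workflow (gpt2 stands in for a real UL2R setup, and the exact HuggingFaceModel arguments may differ by composer version):

import transformers
from composer.models import HuggingFaceModel

tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2')
model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')

# Add UL2-style mode tokens; len(tokenizer) grows past model.config.vocab_size (50257),
# so the embedding table genuinely needs to grow before training.
tokenizer.add_special_tokens({'additional_special_tokens': ['[NLU]', '[NLG]', '[S2S]']})

# Passing the enlarged tokenizer lets the wrapper handle the resize as described in this PR.
composer_model = HuggingFaceModel(model=model, tokenizer=tokenizer)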

To the extent that we should not let this happen without the user's permission, I would be in favor of adding an allow_embedding_resizing flag to the inputs of HuggingFaceModel that determines whether the behavior in this PR occurs or whether an error is raised.

As a side note, I think we should also add similar support for using pre-trained weights fed into the Trainer's load_path argument. We should be able to gracefully load embeddings when the checkpoint and model embeddings have different sizes (possibly subject to a permission flag the user has to set).

@dakinggg (Contributor, Author) commented Mar 4, 2023

Thanks for the input! I'll go forward with the argument to control the behavior, which will default to erroring out rather than changing the model shape. I also made a JIRA ticket to look into the request to let composer gracefully handle differently shaped embeddings.
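
A rough sketch of how the accepted proposal behaves, assuming the flag keeps the allow_embedding_resizing name suggested above (not the literal merged code):

def _resize_or_error(model, tokenizer, allow_embedding_resizing=False):
    # Only the too-small case changes; a larger model vocab still just gets a warning,
    # as sketched in the PR description.
    if model.config.vocab_size < len(tokenizer):
        if allow_embedding_resizing:
            model.resize_token_embeddings(len(tokenizer))
        else:
            raise ValueError(
                'Tokenizer vocab size exceeds model vocab size; pass '
                'allow_embedding_resizing=True to let composer resize the token embeddings.')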

@dakinggg requested a review from dskhudia on March 4, 2023 02:08
@dakinggg (Contributor, Author) commented Mar 4, 2023

ready for re-review @alextrott16 @dskhudia

@dskhudia (Contributor) commented Mar 4, 2023

@alextrott16 I agree that it makes one user's life easier (the one who is training) at the expense of another's (the one who has to prepare for serving). An example supporting your case, though:

>>> import transformers
>>> model_name = 'google/flan-t5-xl'
>>> config = transformers.AutoConfig.from_pretrained(model_name)
>>> config.vocab_size
32128
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
>>> tokenizer.vocab_size
32100

@dakinggg enabled auto-merge (squash) on March 6, 2023 20:08
@dakinggg merged commit b2e4bb0 into mosaicml:dev on Mar 6, 2023
@dakinggg deleted the hf_resize branch on June 1, 2023 00:14