
Adjust HuggingFaceModel token embedding resizing to only occur when necessary #2027

Merged 13 commits into mosaicml:dev on Mar 6, 2023

Conversation

@dakinggg (Contributor) commented Mar 2, 2023

What does this PR do?

Previously, HuggingFaceModel automatically resized the model vocab size to match the tokenizer vocab size. This can cause an issue when the model vocab size is intentionally rounded up to a multiple of 8 or 64 and therefore does not match the tokenizer vocab size. This PR changes the behavior to only resize the model vocab size when necessary, i.e., when the model vocab size is smaller than the tokenizer vocab size. When the tokenizer vocab size is smaller than the model vocab size, we now just raise a warning.
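
As a rough illustration of the rule described above (hypothetical helper name, not the actual composer code):

import warnings

def _maybe_resize_token_embeddings(model, tokenizer):
    # Illustrative sketch only; names and structure are not the composer implementation.
    model_vocab_size = model.config.vocab_size
    tokenizer_vocab_size = len(tokenizer)  # counts added special tokens too
    if model_vocab_size < tokenizer_vocab_size:
        # The tokenizer can emit ids the embedding table cannot index, so resize.
        model.resize_token_embeddings(tokenizer_vocab_size)
    elif model_vocab_size > tokenizer_vocab_size:
        # Possibly intentional (vocab padded to a multiple of 8 or 64), so only warn.
        warnings.warn(
            f'Model vocab size ({model_vocab_size}) > tokenizer vocab size '
            f'({tokenizer_vocab_size}); leaving the token embeddings unchanged.')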

What issue(s) does this change relate to?

Closes CO-1861

Before submitting

  • Have you read the contributor guidelines?
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg requested a review from a team as a code owner on March 2, 2023 22:55
@dskhudia (Contributor) commented Mar 2, 2023

I think this was the issue a customer was running into. They had a checkpoint created by composer but couldn't load state_dict['state']['model'] into the HF model due to vocab size mismatch.

IMO we shouldn't resize at all and should let it fail. We are silently changing the number of params in a model and that creates issues with loading the checkpoint outside of composer.
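
To make the failure mode concrete, a minimal sketch with a bare embedding layer rather than a full HF model (sizes only for illustration):

import torch

# The checkpoint was saved after the embeddings were silently resized to 32100 rows...
resized = torch.nn.Embedding(num_embeddings=32100, embedding_dim=16)
checkpoint_state = resized.state_dict()

# ...but the serving side re-creates the model from its default config (32128 rows).
original = torch.nn.Embedding(num_embeddings=32128, embedding_dim=16)
original.load_state_dict(checkpoint_state)
# RuntimeError: size mismatch for weight: copying a param with shape
# torch.Size([32100, 16]) from checkpoint, the shape in current model is torch.Size([32128, 16]).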

@dakinggg (Contributor, Author) commented Mar 2, 2023

Yup, that is correct @dskhudia. For the case where the embedding vocab size is less than the tokenizer vocab size, the training run will crash at some unknown point with a nasty CUDA error. The biggest issue is that this could happen deep into training, and I'd like to keep users from getting to that point. Does that sound reasonable to you? Or would you still prefer to just raise a warning and let the user deal with it?
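
A minimal sketch of why that crash happens (plain torch, numbers only for illustration):

import torch

embedding = torch.nn.Embedding(num_embeddings=32100, embedding_dim=16)

# A tokenizer with a larger vocab can eventually emit an id the embedding table
# cannot index. On CPU this raises an IndexError right away; on GPU it surfaces as
# an opaque device-side assert (CUDA error), possibly only when the offending token
# finally shows up in a batch deep into training.
token_ids = torch.tensor([5, 42, 32105])
embedding(token_ids)  # IndexError: index out of range in self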

@dakinggg (Contributor, Author) commented Mar 2, 2023

Or are you suggesting we raise an error here instead?

@dskhudia (Contributor) commented Mar 2, 2023

I was suggesting to raise an error.

@alextrott16 (Contributor) commented

For the first time ever, I think I might disagree with Daya :) But I am not usually thinking about the model's life post-composer.

I agree that confusion can arise from having composer change the model parameters without telling you. It would make it difficult to correctly instantiate a HF model that could accept the trained weights outside composer. You'd have to construct the model with the vocab size that composer imposed, which could be easy to lose track of. Happily, that info should be packed right beside the actual weights in the composer checkpoint, though. Because Daniel has HuggingFaceModel use the save_pretrained stuff, all the necessary metadata gets packed into the composer checkpoint. So, the final model config is still available when the user gets the weights for whatever their downstream use is. (@dakinggg correct me if I'm wrong.)

Where I disagree is that I think an error may be too restrictive. An error would put the same burden on the user (just earlier in the process) of always remembering the correct vocab size whenever the default model config doesn't give you a vocab size that matches the tokenizer you want to use, which is a very plausible situation. Plus, it may prevent you from using pre-trained HF weights if you set a non-default vocab size in the model config. That could create hassles when working within composer.

One use case that comes to mind is UL2R, where you may have to add special tokens to the vocab. In this case, it is necessary to get the pre-trained weights from HF and then use resize_token_embeddings to add tokens for the new vocab. Our HuggingFaceModel makes it really easy to handle this use-case simply by passing in a tokenizer that has had the special tokens added. It's great!
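
For concreteness, a rough sketch of that workflow (gpt2 stands in for a real UL2R setup, and the exact HuggingFaceModel arguments may differ by composer version):

import transformers
from composer.models import HuggingFaceModel

tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2')
model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')

# Add UL2-style mode tokens; len(tokenizer) grows past model.config.vocab_size (50257),
# so the embedding table genuinely needs to grow before training.
tokenizer.add_special_tokens({'additional_special_tokens': ['[NLU]', '[NLG]', '[S2S]']})

# Passing the enlarged tokenizer lets the wrapper handle the resize as described in this PR.
composer_model = HuggingFaceModel(model=model, tokenizer=tokenizer)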

To the extent that we should not let this happen without the user's permission, I would be in favor of adding an allow_embedding_resizing flag to the inputs of HuggingFaceModel that determines whether the behavior in this PR occurs or whether an error is raised.

As a side note, I think we should also add similar support for using pre-trained weights fed into the Trainer's load_path argument. We should be able to gracefully load embeddings when the checkpoint and model embeddings have different sizes (possibly subject to a permission flag the user has to set).

@dakinggg (Contributor, Author) commented Mar 4, 2023

Thanks for the input! I'll go forward with the argument to control the behavior, which will default to erroring out rather than changing the model shape. I also made a JIRA ticket to look into the request to let composer gracefully handle differently shaped embeddings.
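
A rough sketch of how the accepted proposal behaves, assuming the flag keeps the allow_embedding_resizing name suggested above (not the literal merged code):

def _resize_or_error(model, tokenizer, allow_embedding_resizing=False):
    # Only the too-small case changes; a larger model vocab still just gets a warning,
    # as sketched in the PR description.
    if model.config.vocab_size < len(tokenizer):
        if allow_embedding_resizing:
            model.resize_token_embeddings(len(tokenizer))
        else:
            raise ValueError(
                'Tokenizer vocab size exceeds model vocab size; pass '
                'allow_embedding_resizing=True to let composer resize the token embeddings.')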

@dakinggg requested a review from dskhudia on March 4, 2023 02:08
@dakinggg (Contributor, Author) commented Mar 4, 2023

ready for re-review @alextrott16 @dskhudia

@dskhudia (Contributor) commented Mar 4, 2023

@alextrott16 I agree that it makes one user's life easier (the one who is training) at the expense of another's (the one who has to prepare for serving). An example supporting your case, though:

>>> import transformers
>>> model_name = 'google/flan-t5-xl'
>>> config = transformers.AutoConfig.from_pretrained(model_name)
>>> config.vocab_size
32128
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
>>> tokenizer.vocab_size
32100

@dakinggg enabled auto-merge (squash) on March 6, 2023 20:08
@dakinggg merged commit b2e4bb0 into mosaicml:dev on Mar 6, 2023
@dakinggg deleted the hf_resize branch on June 1, 2023 00:14