
Correct GPT-J vocab_size #13617

Closed

Conversation

@patil-suraj (Contributor) commented on Sep 17, 2021:

What does this PR do?

Corrects the GPT-J vocab_size. GPT-J's config sets vocab_size to 50400, but the tokenizer's vocabulary size is 50257. The run_clm.py script always resizes the token embeddings to len(tokenizer), so when the model is fine-tuned with that script the embedding size changes, which results in the shape mismatch described in #13499.
The extra tokens exist purely for efficiency on TPU and are not used by the model.
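For illustration, a minimal sketch of the mismatch; the checkpoint name is an assumption, and the numbers reflect the state described in this PR (before any extra tokens were added to the tokenizer):

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name assumed for illustration purposes.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-J uses the GPT-2 BPE vocabulary

print(config.vocab_size)  # 50400 -- rows in the embedding matrix
print(len(tokenizer))     # 50257 -- tokens actually produced by the tokenizer

# run_clm.py unconditionally calls
#     model.resize_token_embeddings(len(tokenizer))
# so a checkpoint fine-tuned with the script ends up with 50257-row embeddings,
# which no longer match a config that still says vocab_size=50400.
```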

This would, however, break all already-downloaded checkpoints, since those still have vocab_size 50400. The workaround there would be to manually change vocab_size to 50257 in the config.json file.

Fixes #13499, Fixes #13581

@patrickvonplaten (Contributor) commented:

Should we update the official GPT-J config then as well?

@patil-suraj (Contributor, Author) commented on Sep 17, 2021:

Yes, I will update the official configs and weights as well if the PR is approved.

@patrickvonplaten (Contributor) commented:

T5 actually has the same issue: a mismatch between the model's vocab size and the tokenizer's vocab size, which has led to quite some confusion. So I'm very much in favor of changing it here, especially since there hasn't been an official release yet.

Note that the official GPT-J repo has two branches, so we should make sure to update both.

@sgugger (Collaborator) left a review comment:


I think there is an optimization reason that this vocab size was picked (having all dimensions of the embedding matrix be a multiple of some nice power of 2). Removing it will remove this optimization, which probably gives a nice speedup.

I believe the language modeling examples should be adapted in some way, not the other way around.

@patil-suraj (Contributor, Author) commented:

Yes, the extra tokens are there for efficiency reasons on TPU, since the original implementation uses model parallelism.
I agree that we could modify the script and maybe add a flag, so that embeddings are only resized if the user explicitly passes it (a rough sketch of what that could look like is below).

But if a user wants to add new tokens to the tokenizer and then resize the embeddings, the vocab_size will actually shrink, which is a bit confusing IMO (that is the case with T5).
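For concreteness, such an opt-in guard could look roughly like the hypothetical helper below; the function and flag names are purely illustrative, not existing arguments of run_clm.py:

```python
def maybe_resize_embeddings(model, tokenizer, force_resize: bool) -> None:
    """Hypothetical helper: only resize the embeddings when explicitly requested
    and when the tokenizer length differs from the model's embedding size."""
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if force_resize and len(tokenizer) != embedding_size:
        model.resize_token_embeddings(len(tokenizer))
```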

@LysandreJik (Member) commented:

I understand the change, and I think the mismatch will indeed lead to painful errors. If modifying the model size is not an option for optimization's sake, and not resizing the model leads to painful errors, how about resizing the tokenizer instead, by adding new (unused) tokens to it?

Resize the tokenizer to 50400 so that it matches the model's vocab size, and put unused tokens in the extra range. WDYT?
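A rough sketch of that suggestion, assuming the padding is done with plain added tokens; the placeholder token names here are only an example:

```python
from transformers import AutoTokenizer

# GPT-J uses the GPT-2 BPE vocabulary, so start from the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(len(tokenizer))  # 50257

# Pad the tokenizer with unused placeholder tokens so it matches the
# checkpoint's 50400-row embedding matrix (50400 - 50257 = 143 extra slots).
extra = 50400 - len(tokenizer)
tokenizer.add_tokens([f"<|extratoken_{i}|>" for i in range(1, extra + 1)])
print(len(tokenizer))  # 50400
```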

@sgugger (Collaborator) commented on Sep 17, 2021:

I think @LysandreJik's solution is the best!

@patrickvonplaten (Contributor) commented:

I think I'm fine with adding unused tokens to the tokenizer.

I'm a bit worried about the community being confused, as GPT-J uses the official GPT-2 tokenizer, which has 50257 tokens (the last one being the EOS token). Reading that GPT-J uses the official GPT-2 tokenizer with 50257 tokens (https://github.com/kingoflolz/mesh-transformer-jax#model-details) and then seeing a different vocab size on the Hub might be a bit confusing for people (maybe with a good warning it's fine?).

I also think, though, that @LysandreJik's suggestion is the best solution here.

@LysandreJik (Member) commented:

Good call! Maybe a note on the model card explaining why the model has a vocab size of 50400 instead of the 50257 tokens it was trained with would be a solution?

@LysandreJik (Member) commented:

cc @StellaAthena

@patil-suraj (Contributor, Author) commented on Sep 17, 2021:

Sounds good to me, @LysandreJik!

> Maybe a note on the model card explaining why the model has a vocab size of 50400 instead of the 50257 tokens it was trained with would be a solution?

I think the model card already mentions this, and maybe we could put it in the docs as well.

@patil-suraj (Contributor, Author) commented:

Added the extra tokens and updated the tokenizer; #13696 adds a note about this in the docs.
Will close this PR now.
