Correct GPT-J vocab_size #13617
Conversation
Should we update the official GPT-J config then as well?
Yes, will update the official configs and weights as well, if the PR is approved.
T5 actually has the same issue, where there is a mismatch between the model's vocab size and the tokenizer's vocab size, which led to quite some confusion, so I'm very much in favor of changing it here (especially since there hasn't been an official release yet). Note that the official GPT-J repo has two branches, so we should make sure to update both.
I think there is an optimization reason that this vocab size was picked (having all dimensions of the embedding matrix be a multiple of some nice power of 2). Removing it will remove this optimization, which probably gives a nice speedup.
The language modeling examples should be adapted in some way, I believe, not the other way around.
Yes, the extra tokens are there for efficiency reasons on TPU, since the original implementation uses model parallelism. But if a user wants to add new tokens to the tokenizer and resize embeddings, it will reduce the embedding size.
I understand the change and think that this will indeed lead to painful errors. If modifying the model size is not an option for optimization's sake and not resizing the model leads to painful errors, how about resizing the tokenizer by adding new (unused) tokens to it? Resize the tokenizer to 50400 so that it matches the model's vocab size, and put unused values in the unused range. WDYT?
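A minimal sketch of that proposal, assuming the padding tokens are plain placeholder strings (the token names below are illustrative, not necessarily the ones ultimately shipped):

```python
from transformers import AutoTokenizer

# GPT-J uses the GPT2 tokenizer, which has 50257 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
target_size = 50400  # the model's vocab_size

# Pad the vocab with unused placeholder tokens (names are illustrative).
extra = [f"<|extratoken_{i}|>" for i in range(1, target_size - len(tokenizer) + 1)]
tokenizer.add_tokens(extra)

assert len(tokenizer) == target_size  # now matches the model's embedding matrix
```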
I think @LysandreJik's solution is the best!
Think I'm fine with adding unused tokens to the tokenizer. I'm a bit worried about the community being confused, as GPT-J uses the official GPT2 tokenizer, which has 50257 tokens (with the last token being the EOS token). So reading that GPT-J uses the official GPT2 tokenizer: https://github.com/kingoflolz/mesh-transformer-jax#model-details with 50257 tokens, and then seeing a different vocab size on the Hub, might be a bit confusing for people (maybe with a good warning it's fine?). Also think, though, that @LysandreJik's is the best solution here.
Good call! Maybe a note on the model card explaining why the model has a vocab size of 50400 instead of the 50257 tokens it was trained with is a solution?
Sounds good to me @LysandreJik!
I think the model card already mentions this, and maybe we could also put this in the docs as well.
Added extra tokens and updated the tokenizer; #13696 adds a note about this in the docs.
What does this PR do?
Corrects GPT-J `vocab_size`. GPT-J has a `vocab_size` of 50400, but the tokenizer's vocab size is 50257. And the `run_clm.py` script always resizes the token embeddings using `len(tokenizer)`, so when the model is fine-tuned with that script, it changes the embedding size, which results in a shape mismatch as described in #13499. These extra tokens are there for the sake of efficiency on TPU and are not used by the model.
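A minimal sketch of that failure path (the checkpoint name is assumed to be the Hub release; `run_clm.py` performs the resize via `resize_token_embeddings`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Before this fix, the tokenizer held 50257 tokens while the model's
# embedding matrix had 50400 rows.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# run_clm.py does this unconditionally; with a 50257-token tokenizer it
# shrinks the 50400-row embedding matrix, so the fine-tuned checkpoint no
# longer matches a config that still says vocab_size=50400.
model.resize_token_embeddings(len(tokenizer))
```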
This, however, would break all existing downloaded models, since those checkpoints will have a `vocab_size` of 50400. The solution would be to manually change the `vocab_size` in the `config.json` file to 50257.
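As a sketch, that manual edit could look like this (the checkpoint path is illustrative; point it at your locally downloaded model):

```python
import json
from pathlib import Path

# Path is hypothetical; use your local checkpoint directory.
config_path = Path("gpt-j-6B/config.json")

config = json.loads(config_path.read_text())
config["vocab_size"] = 50257  # align with the tokenizer's 50257 tokens
config_path.write_text(json.dumps(config, indent=2))
```

Fixes #13499, Fixes #13581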