
Correct GPT-J vocab_size #13617

Closed

Conversation

@patil-suraj (Contributor) commented on Sep 17, 2021:

What does this PR do?

Corrects the GPT-J vocab_size. GPT-J's config sets vocab_size to 50400, but the tokenizer's vocabulary size is 50257. The run_clm.py script always resizes the token embeddings to len(tokenizer), so when the model is fine-tuned with that script the embedding size changes, which results in the shape mismatch described in #13499.
The extra tokens exist purely for efficiency on TPU and are not used by the model.
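For illustration, a minimal sketch of the mismatch; the checkpoint name is an assumption, and the numbers reflect the state described in this PR (before any extra tokens were added to the tokenizer):

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name assumed for illustration purposes.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-J uses the GPT-2 BPE vocabulary

print(config.vocab_size)  # 50400 -- rows in the embedding matrix
print(len(tokenizer))     # 50257 -- tokens actually produced by the tokenizer

# run_clm.py unconditionally calls
#     model.resize_token_embeddings(len(tokenizer))
# so a checkpoint fine-tuned with the script ends up with 50257-row embeddings,
# which no longer match a config that still says vocab_size=50400.
```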

This would, however, break all already-downloaded checkpoints, since those still have vocab_size 50400. The workaround there would be to manually change vocab_size to 50257 in the config.json file.

Fixes #13499, Fixes #13581

@patrickvonplaten (Contributor) commented:

Should we update the official GPT-J config then as well?

@patil-suraj (Contributor, Author) commented on Sep 17, 2021:

Yes, I will update the official configs and weights as well if the PR is approved.

@patrickvonplaten (Contributor) commented:

T5 actually has the same issue: a mismatch between the model's vocab size and the tokenizer's vocab size, which has led to quite some confusion. So I'm very much in favor of changing it here, especially since there hasn't been an official release yet.

Note that the official GPT-J repo has two branches, so we should make sure to update both.

@sgugger (Collaborator) left a review comment:


I think there is an optimization reason that this vocab size was picked (having all dimensions of the embedding matrix be a multiple of some nice power of 2). Removing it will remove this optimization, which probably gives a nice speedup.

I believe the language modeling examples should be adapted in some way, not the other way around.

@patil-suraj (Contributor, Author) commented:

Yes, the extra tokens are there for efficiency reasons on TPU, since the original implementation uses model parallelism.
I agree that we could modify the script and maybe add a flag, so that embeddings are only resized if the user explicitly passes it (a rough sketch of what that could look like is below).

But if a user wants to add new tokens to the tokenizer and then resize the embeddings, the vocab_size will actually shrink, which is a bit confusing IMO (that is the case with T5).
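For concreteness, such an opt-in guard could look roughly like the hypothetical helper below; the function and flag names are purely illustrative, not existing arguments of run_clm.py:

```python
def maybe_resize_embeddings(model, tokenizer, force_resize: bool) -> None:
    """Hypothetical helper: only resize the embeddings when explicitly requested
    and when the tokenizer length differs from the model's embedding size."""
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if force_resize and len(tokenizer) != embedding_size:
        model.resize_token_embeddings(len(tokenizer))
```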

@LysandreJik (Member) commented:

I understand the change, and I think the mismatch will indeed lead to painful errors. If modifying the model size is not an option for optimization's sake, and not resizing the model leads to painful errors, how about resizing the tokenizer instead, by adding new (unused) tokens to it?

Resize the tokenizer to 50400 so that it matches the model's vocab size, and put unused tokens in the extra range. WDYT?
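A rough sketch of that suggestion, assuming the padding is done with plain added tokens; the placeholder token names here are only an example:

```python
from transformers import AutoTokenizer

# GPT-J uses the GPT-2 BPE vocabulary, so start from the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(len(tokenizer))  # 50257

# Pad the tokenizer with unused placeholder tokens so it matches the
# checkpoint's 50400-row embedding matrix (50400 - 50257 = 143 extra slots).
extra = 50400 - len(tokenizer)
tokenizer.add_tokens([f"<|extratoken_{i}|>" for i in range(1, extra + 1)])
print(len(tokenizer))  # 50400
```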

@sgugger (Collaborator) commented on Sep 17, 2021:

I think @LysandreJik's solution is the best!

@patrickvonplaten (Contributor) commented:

I think I'm fine with adding unused tokens to the tokenizer.

I'm a bit worried about the community being confused, as GPT-J uses the official GPT-2 tokenizer, which has 50257 tokens (the last one being the EOS token). Reading that GPT-J uses the official GPT-2 tokenizer with 50257 tokens (https://github.com/kingoflolz/mesh-transformer-jax#model-details) and then seeing a different vocab size on the Hub might be a bit confusing for people (maybe with a good warning it's fine?).

I also think, though, that @LysandreJik's suggestion is the best solution here.

@LysandreJik (Member) commented:

Good call! Maybe a note on the model card explaining why the model has a vocab size of 50400 instead of the 50257 tokens it was trained with would be a solution?

@LysandreJik (Member) commented:

cc @StellaAthena

@patil-suraj (Contributor, Author) commented on Sep 17, 2021:

Sounds good to me, @LysandreJik!

> Maybe a note on the model card explaining why the model has a vocab size of 50400 instead of the 50257 tokens it was trained with would be a solution?

I think the model card already mentions this, and maybe we could put it in the docs as well.

@patil-suraj (Contributor, Author) commented:

Added the extra tokens and updated the tokenizer; #13696 adds a note about this in the docs.
Will close this PR now.
