
Torch size mismatch in GPT-J model (Error) #13499

Closed
MantasLukauskas opened this issue Sep 9, 2021 · 4 comments

Comments


MantasLukauskas commented Sep 9, 2021

Environment info

  • transformers version: 4.11.0.dev0
  • Platform: Linux-4.19.0-10-cloud-amd64-x86_64-with-debian-10.5
  • Python version: 3.7.8
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@patrickvonplaten, @LysandreJik, @patil-suraj

Information

I trained/fine-tuned the GPT-J model (model here: https://huggingface.co/EleutherAI/gpt-j-6B) in fp16, as suggested in #13329. Now, when I try to load the model (as shown below), I get a torch size mismatch error (also shown below). What solutions can be applied here?

from transformers import AutoModelForCausalLM, AutoTokenizer
2021-09-09 13:23:45.897519: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
model = AutoModelForCausalLM.from_pretrained("GPT-J/checkpoint-20000")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 388, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1376, in from_pretrained
    _fast_init=_fast_init,
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1523, in _load_state_dict_into_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GPTJForCausalLM:
    size mismatch for lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50257, 4096]).
    size mismatch for lm_head.bias: copying a param with shape torch.Size([50400]) from checkpoint, the shape in current model is torch.Size([50257]).

Any ideas on how to solve it?
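
For reference, a minimal sketch of a possible workaround (not an official fix), assuming the mismatch comes from the reloaded config building a 50257-row lm_head while the saved weights have the padded 50400 rows shown in the error above; the path is the local checkpoint from this report:

# Workaround sketch (assumption, not the official fix): override the config's
# vocab_size so the freshly built model matches the checkpoint's padded lm_head.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("GPT-J/checkpoint-20000")
config.vocab_size = 50400  # checkpoint's lm_head has 50400 rows per the error message
model = AutoModelForCausalLM.from_pretrained("GPT-J/checkpoint-20000", config=config)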

@patrickvonplaten
Contributor

Hey @MantasLukauskas - could you please provide a reproducible code snippet? :-)

@MantasLukauskas
Author

Hey @patrickvonplaten,

Fine-tuning was done with this command:
deepspeed --num_gpus 2 run_clm.py --model_name_or_path EleutherAI/gpt-j-6B --num_train_epochs 10 --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --train_file train.txt --validation_file test.txt --do_train --do_eval --output_dir GPT-J --save_steps 5000 --block_size=512 --evaluation_strategy "epoch" --logging_steps 200 --logging_dir GPT-J/runs --model_revision float16 --fp16 --deepspeed zero3.json

You can use --save_steps 10 to get a checkpoint faster :)

After that, I try to load my model as in the docs, just using my checkpoint path instead of EleutherAI/gpt-j-6B. Interestingly, if I use the pretrained EleutherAI model everything works, but when I use my model fine-tuned with run_clm.py the error occurs. Maybe that will help you figure out what is wrong there?

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
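
For what it's worth, a small diagnostic sketch (my addition, using the checkpoint path from the first post) to check whether the vocab size saved in the fine-tuned checkpoint's config disagrees with the one the base model reports:

# Diagnostic sketch: compare the vocab_size recorded in the fine-tuned checkpoint's
# config with the base model's config; the error suggests the reloaded model is
# built with a 50257-row head while the saved weights have the padded 50400 rows.
from transformers import AutoConfig

finetuned_config = AutoConfig.from_pretrained("GPT-J/checkpoint-20000")
base_config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
print("fine-tuned config vocab_size:", finetuned_config.vocab_size)
print("base config vocab_size:", base_config.vocab_size)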

@patil-suraj
Contributor

Related: #13581

@LysandreJik
Member

Should be fixed by #13617 (comment)
