
GPT-J-6B in run_clm.py #13329

Closed

MantasLukauskas opened this issue Aug 30, 2021 · 28 comments

@MantasLukauskas

Environment info

  • transformers version: 4.10.0.dev0
  • Platform: Linux-4.19.0-10-cloud-amd64-x86_64-with-debian-10.5
  • Python version: 3.7.8
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:


Information

The model I am using is GPT-J from the Hugging Face Hub. There is a KeyError with this model; the error is listed below:

Traceback (most recent call last):
  File "run_clm.py", line 522, in <module>
    main()
  File "run_clm.py", line 320, in main
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 514, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 263, in __getitem__
    raise KeyError(key)
KeyError: 'gptj'

@LysandreJik
Member

Hello @MantasLukauskas, GPT-J is not yet merged into transformers, see #13022

@MantasLukauskas
Author

@LysandreJik Is there any workaround for fine-tuning it? As far as I can see, the merge could take some time.

@LysandreJik
Member

You could check out the PR directly and try fine-tuning with the GPT-J code:

git remote add StellaAthena https://github.com/StellaAthena/transformers
git fetch StellaAthena
git checkout -b gptj StellaAthena/master

@StellaAthena
Contributor

You can also install directly from my fork with
pip install -e git+https://github.com/StellaAthena/transformers#egg=transformers

@dimaischenko

dimaischenko commented Aug 30, 2021

@StellaAthena I am trying to fine-tune GPT-J from your branch, but neither a Tesla A100 with 40 GB of GPU RAM (Google Cloud) nor a TPU v3-8 can handle it. I get an OOM error in both cases.

I am setting batch_size = 1, enabling gradient_checkpointing, and trying different block_sizes (1024, 512, 256). There is an OOM error in every case.

Is it possible to fine-tune it on such devices?

@MantasLukauskas
Author

@dimaischenko Do you use run_clm.py for fine-tuning, or do you do it another way?

@dimaischenko

dimaischenko commented Aug 30, 2021

@MantasLukauskas Yes, with run_clm.py.

@MantasLukauskas
Author

@dimaischenko I got the error "RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4027105280 bytes. Error code 12 (Cannot allocate memory)". Did you get the same?

100 GB RAM + DeepSpeed Zero 3 + T4 15 GB

@StellaAthena
Contributor

@dimaischenko I got the error "RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4027105280 bytes. Error code 12 (Cannot allocate memory)". Did you get the same?

100 GB RAM + DeepSpeed Zero 3 + T4 15 GB

4,027,105,280 bytes is about 4 GB, far less than 100 GB, so it's hard to see how that's the issue unless you have something else running. Can you print out the amount of free memory during the loading process?
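
For reference, a quick probe like the following prints free CPU memory and allocated GPU memory around the model load. This is a sketch, not part of run_clm.py; it assumes psutil is installed (it is not pulled in by transformers) and, for the GPU number, that a CUDA device is present.

    import psutil  # extra dependency, not installed by transformers
    import torch
    from transformers import AutoModelForCausalLM

    def log_mem(tag):
        free_cpu = psutil.virtual_memory().available / 1e9
        gpu = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
        print(f"[{tag}] free CPU RAM: {free_cpu:.1f} GB | allocated GPU: {gpu:.1f} GB")

    log_mem("before load")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
    log_mem("after load")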

@TiesdeKok

Thanks @StellaAthena and @EricHallahan for all your work on the #13022 GPT-J fork!!

Over the past few days I've been playing around with the current state of the fork and I am running into the same OOM issues that are referenced here by @dimaischenko.

Here is some information from my end in case it is helpful debugging what is happening (I'd be happy to put this in a separate issue if that is desired).

System: I am running everything on a compute cluster (i.e., not Google Colab) with ~384 GB of RAM and 8x RTX 6000 GPUs with 24 GB of VRAM each. I am using the fork by @StellaAthena and a fresh conda environment with Python 3.9.

My observations:

  1. I can't load the float32 model onto my RTX 6000 without running into an OOM error. With model.half().cuda() and/or torch_dtype=torch.float16 when loading the model, it does work. As far as I understand, I should be able to load the float32 model on an RTX 6000 with 24 GB? Given that I can't load the float32 model, it might be that my OOM errors are caused by the issue brought up by @oborchers, even when trying to use fp16 (a loading sketch follows at the end of this comment).
  2. Irrespective of my training parameters (e.g., everything set to the minimum), training always triggers an OOM error when using the trainer or the run_clm.py script.

For example, these are my parameters:

    python run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --model_revision float32 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /mmfs1/gscratch/tdekok/test-clm-j \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --fp16 true \
    --fp16_opt_level O1

This results in about ~55 GB of RAM usage, and I can see in nvidia-smi that it fills up my GPU VRAM beyond the available 23761 MiB.

  3. I noticed when using gpt2 instead of gpt-j-6B that the memory usage on gpu:0 is substantially higher relative to the rest. I wonder whether this might be part of the issue:

[screenshot: nvidia-smi output showing per-GPU memory usage]
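
As a side note on the fp16 loading mentioned in observation 1, here is a minimal loading sketch; it assumes a transformers build that already includes the merged GPT-J code.

    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    # revision="float16" fetches the half-precision branch of the checkpoint
    # (~12 GB of weights instead of ~24 GB); torch_dtype keeps it in fp16 in memory.
    model = GPTJForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B",
        revision="float16",
        torch_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model.to("cuda")  # roughly 13 GB of VRAM for inference; fine-tuning needs far more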

@dimaischenko

This results in about ~55 GB of RAM usage, and I can see in nvidia-smi that it fills up my GPU VRAM beyond the available 23761 MiB.

I think it doesn't matter whether you are using 8 GPUs or 1 GPU, because with batch_size=1 at least one sample has to fit on a single card. I am trying the same parameters on an A100 with 40 GB of GPU VRAM and the OOM still occurs. So I think your RTX 6000 with 24 GB of VRAM is definitely not enough for fine-tuning.

But let's wait for an answer from the creators of the model.

@MantasLukauskas
Author

@dimaischenko Yes, I have the same problem. I tried to use the DeepSpeed ZeRO-3 optimizer for this, but even with batch_size=1 and model_revision=float16 I run out of memory. Interestingly, I have the same problem with gpt2-xl, yet I have seen a lot of people fine-tuning that model with a T4 + DeepSpeed :(

@MantasLukauskas
Author

@dimaischenko I tested a lot of parameters and found that with --block_size 512 I can fine-tune the GPT-J model: RAM usage 100 GB, GPU usage 12 GB (NVIDIA T4, 16 GB total), with the DeepSpeed ZeRO-3 optimizer.

@dimaischenko

dimaischenko commented Aug 31, 2021

@MantasLukauskas Sounds interesting. Maybe it's the optimizer. Which option enables it, or do you need to modify the run_clm.py code?

@MantasLukauskas
Author

@dimaischenko DeepSpeed is a library (https://github.com/microsoft/DeepSpeed) whose integration is built into Hugging Face, so you do not need to modify the run_clm.py code; you fine-tune the model like this:
deepspeed --num_gpus 1 run_clm.py --model_name_or_path EleutherAI/gpt-j-6B --num_train_epochs 10 --model_revision float16 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --train_file train.txt --validation_file test.txt --do_train --do_eval --output_dir GPTJ --save_steps 1000 --logging_steps 100 --logging_dir GPTJ/runs --fp16 --deepspeed zero3ws.json --overwrite_output_dir

My deepspeed config file can be found here: https://github.com/MantasLukauskas/DeepSpeed-Test/blob/main/zero3.json
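
For reference, the config linked above is not reproduced in the thread; a minimal ZeRO-3 configuration with CPU offload, along the lines of the Hugging Face DeepSpeed integration docs, looks roughly like this (the "auto" values are filled in by the Trainer from its own arguments):

    {
      "fp16": { "enabled": "auto" },
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "overlap_comm": true
      },
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "train_batch_size": "auto"
    }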

@dimaischenko

@MantasLukauskas Thanks! I'll try it today and write up the results.

@StellaAthena
Contributor

StellaAthena commented Aug 31, 2021

@dimaischenko

@StellaAthena I am trying to fine-tune GPT-J from your branch, but neither a Tesla A100 with 40 GB of GPU RAM (Google Cloud) nor a TPU v3-8 can handle it. I get an OOM error in both cases.

If you are working on TPUs, I strongly recommend using the mesh-transformer-jax library, which was written for the purpose of producing GPT-J models. The version on Hugging Face is a PyTorch port of the original JAX code, which has been used by many people on TPUs.

I'm not sure why you're having trouble with A100s though, as I have run the code on A100s before. Can you provide further details about how you're running the model? Is it loading the model or the act of fine-tuning that OOMs?

@dimaischenko

@StellaAthena Thanks! I'll try mesh-transformer-jax. It's just that I already have a reliable fine-tuning pipeline for HuggingFace.

About the OOM: I'll repeat my attempts today and post the logs. Loading the model itself was successful, and it even ran validation with perplexity calculation on the validation samples. But when it tried to consume the first sample in training, it crashed with an OOM.

@StellaAthena
Contributor

@StellaAthena Thanks! I'll try mesh-transformer-jax. It's just that I already have a reliable fine-tuning pipeline for HuggingFace.

About the OOM: I'll repeat my attempts today and post the logs. Loading the model itself was successful, and it even ran validation with perplexity calculation on the validation samples. But when it tried to consume the first sample in training, it crashed with an OOM.

This is really weird, given that you've said the batch size is set to 1. How much memory is allocated before you feed the first datum into the model? Does a different architecture that takes up the same amount of memory also fail?
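
One quick way to answer that question is a probe dropped in just before trainer.train(); this is a sketch, not something run_clm.py prints by itself:

    import torch

    # How much of the card do the weights already occupy before the first batch?
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(torch.cuda.memory_summary(abbreviated=True))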

@dimaischenko

dimaischenko commented Sep 1, 2021

@StellaAthena I tried running run_clm.py again from the latest branch on a single A100 GPU (40 GB):

python run_clm_orig.py \
    --model_type gptj \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --model_revision float16 \
    --do_train \
    --do_eval \
    --train_file ./data/train.txt \
    --validation_file ./data/val.txt \
    --evaluation_strategy steps \
    --logging_steps 300 \
    --learning_rate 0.00002 \
    --save_steps 1500 \
    --fp16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 1 \
    --block_size 1024 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --output_dir ./out/test_gptj_orig                                        

and got an OOM error:

[INFO|trainer.py:414] 2021-09-01 11:39:10,987 >> Using amp fp16 backend
[INFO|trainer.py:1168] 2021-09-01 11:39:10,997 >> ***** Running training *****
[INFO|trainer.py:1169] 2021-09-01 11:39:10,997 >>   Num examples = 6011
[INFO|trainer.py:1170] 2021-09-01 11:39:10,997 >>   Num Epochs = 1
[INFO|trainer.py:1171] 2021-09-01 11:39:10,997 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1172] 2021-09-01 11:39:10,997 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1173] 2021-09-01 11:39:10,997 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1174] 2021-09-01 11:39:10,997 >>   Total optimization steps = 6011
  0%|          | 0/6011 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_clm_orig.py", line 522, in <module>
    main()
  File "run_clm_orig.py", line 472, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1284, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1821, in compute_loss
    outputs = model(**inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 780, in forward
    return_dict=return_dict,
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 631, in forward
    output_attentions=output_attentions,
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 286, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 249, in forward
    hidden_states = self.fc_in(hidden_states)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 39.59 GiB total capacity; 37.49 GiB already allocated; 19.19 MiB free; 37.73 GiB reserved in total by PyTorch)
  0%|          | 0/6011 [00:00<?, ?it/s]

Today I will switch to mesh-transformer-jax, try to fine-tune on a TPU v3-8, and then convert the checkpoint to the Hugging Face format.

@sgugger
Collaborator

sgugger commented Sep 1, 2021

You are trying to use the Adam optimizer with a 24 GB model. With Adam, you keep four copies of your model: the weights, the gradients, and, in the optimizer state, the averaged gradients and the averaged squared gradients. Even with fp16, all of that is still stored in FP32 because of mixed-precision training (the optimizer update is done in full precision). So unless you use DeepSpeed to offload the optimizer state and the FP32 gradient copy, you won't be able to fit these 4 x 24 GB even on an 80 GB card.
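
As a back-of-the-envelope check of that arithmetic (the parameter count below is approximate):

    params = 6.05e9           # GPT-J-6B, roughly
    fp32   = 4                # bytes per value
    weights = params * fp32   # master weights              ~24 GB
    grads   = params * fp32   # gradients                   ~24 GB
    adam_m  = params * fp32   # averaged gradients          ~24 GB
    adam_v  = params * fp32   # averaged squared gradients  ~24 GB
    total = weights + grads + adam_m + adam_v
    print(f"{total / 1e9:.0f} GB before activations")  # ~97 GB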

@dimaischenko

@sgugger Thanks for the clarification! I configured DeepSpeed and everything started up on the A100 GPU. However, now I need 80 GB of CPU RAM, but that is solvable 😄

@sgugger
Collaborator

sgugger commented Sep 1, 2021

There is also NVMe offload if CPU RAM becomes a problem :-)
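
For reference, NVMe offload is selected in the same zero_optimization block by switching the offload device; a sketch, where the nvme_path below is a placeholder for a fast local SSD:

      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
        "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
      }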

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Oct 7, 2021
@johndpope

@sgugger - are there any docs on how to do this that you can point to? I have an RTX 3090 and am hitting KeyError: 'gptj'.
(The error is really obscure; it should really be something easier to understand.)
I've got 32 GB of RAM - @dimaischenko, did bumping to 80 GB fix things?

@LysandreJik
Member

@johndpope what is your transformers version? It looks like it is outdated and does not have the GPT-J model available.
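
A quick way to check is to print the installed version and try the import, which fails with an ImportError on releases that predate the GPT-J merge:

    python -c "import transformers; print(transformers.__version__)"
    python -c "from transformers import GPTJForCausalLM; print('GPT-J is available')"
    pip install -U transformers  # upgrade if the import fails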

@dimaischenko

@LysandreJik I agree with you; I think that's the problem. @johndpope Yes, 80 GB of RAM was enough. To be honest, I don't remember the details anymore, but it seems it took even less with DeepSpeed.

@johndpope

I had trouble with RAM, but found this (installing now): it supposedly fits in 17/15 GB of VRAM and uses FastAPI - https://news.ycombinator.com/item?id=27731266
(it uses TensorFlow but keeps the memory footprint lower)
https://gist.githubusercontent.com/kinoc/f3225092092e07b843e3a2798f7b3986/raw/fc0dbe522d09d3797dd2a64e7182003f7d9a7fa8/jserv.py
