
GPT-J-6B in run_clm.py #13329

Closed

MantasLukauskas opened this issue Aug 30, 2021 · 28 comments

@MantasLukauskas

Environment info

  • transformers version: 4.10.0.dev0
  • Platform: Linux-4.19.0-10-cloud-amd64-x86_64-with-debian-10.5
  • Python version: 3.7.8
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:


Information

The model I am using is GPT-J from the Hugging Face Hub. There is a KeyError with this model; the error is listed below:

Traceback (most recent call last):
  File "run_clm.py", line 522, in <module>
    main()
  File "run_clm.py", line 320, in main
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 514, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 263, in __getitem__
    raise KeyError(key)
KeyError: 'gptj'

@LysandreJik
Member

Hello @MantasLukauskas, GPT-J is not yet merged into transformers, see #13022

@MantasLukauskas
Author

@LysandreJik Is there any workaround for fine-tuning it? As far as I can see, the merge could take some time.

@LysandreJik
Member

You could check out the PR directly and try fine-tuning with the GPT-J code:

git remote add StellaAthena https://github.com/StellaAthena/transformers
git fetch StellaAthena
git checkout -b gptj StellaAthena/master

@StellaAthena
Contributor

You can also install directly from my fork with
pip install -e git+https://github.com/StellaAthena/transformers#egg=transformers

@dimaischenko

dimaischenko commented Aug 30, 2021

@StellaAthena I am trying to fine-tune GPT-J from your branch, but neither a Tesla A100 with 40 GB of GPU RAM (Google Cloud) nor a TPU v3-8 can handle it. I get an OOM error in both cases.

I am setting batch_size = 1, enabling gradient_checkpointing, and trying different block_sizes (1024, 512, 256). There is an OOM error in every case.

Is it possible to fine-tune it on such devices?

@MantasLukauskas
Author

@dimaischenko Do you use run_clm.py for fine-tuning, or do you do it another way?

@dimaischenko

dimaischenko commented Aug 30, 2021

@MantasLukauskas Yes, with run_clm.py.

@MantasLukauskas
Author

@dimaischenko I got the error "RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4027105280 bytes. Error code 12 (Cannot allocate memory)". Did you get the same?

100 GB RAM + DeepSpeed Zero 3 + T4 15 GB

@StellaAthena
Contributor

@dimaischenko I got the error "RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4027105280 bytes. Error code 12 (Cannot allocate memory)". Did you get the same?

100 GB RAM + DeepSpeed Zero 3 + T4 15 GB

4,027,105,280 bytes is about 4 GB, far less than 100 GB, so it's hard to see how that's the issue unless you have something else running. Can you print out the amount of free memory during the loading process?
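
For reference, a quick probe like the following prints free CPU memory and allocated GPU memory around the model load. This is a sketch, not part of run_clm.py; it assumes psutil is installed (it is not pulled in by transformers) and, for the GPU number, that a CUDA device is present.

    import psutil  # extra dependency, not installed by transformers
    import torch
    from transformers import AutoModelForCausalLM

    def log_mem(tag):
        free_cpu = psutil.virtual_memory().available / 1e9
        gpu = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
        print(f"[{tag}] free CPU RAM: {free_cpu:.1f} GB | allocated GPU: {gpu:.1f} GB")

    log_mem("before load")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
    log_mem("after load")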

@TiesdeKok

Thanks @StellaAthena and @EricHallahan for all your work on the #13022 GPT-J fork!!

Over the past few days I've been playing around with the current state of the fork and I am running into the same OOM issues that are referenced here by @dimaischenko.

Here is some information from my end in case it is helpful debugging what is happening (I'd be happy to put this in a separate issue if that is desired).

System: I am running everything on a compute cluster (i.e., not Google Colab) with ~384 GB of RAM and 8x RTX 6000 GPUs with 24 GB of VRAM each. I am using the fork by @StellaAthena and a fresh conda environment with Python 3.9.

My observations:

  1. I can't load the float32 model onto my RTX 6000 without running into an OOM error. With model.half().cuda() and/or torch_dtype=torch.float16 when loading the model, it does work. As far as I understand, I should be able to load the float32 model on an RTX 6000 with 24 GB? Given that I can't load the float32 model, it might be that my OOM errors are caused by the issue brought up by @oborchers, even when trying to use fp16 (a loading sketch follows at the end of this comment).
  2. Irrespective of my training parameters (e.g., everything set to the minimum), training always triggers an OOM error when using the trainer or the run_clm.py script.

For example, these are my parameters:

    python run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --model_revision float32 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /mmfs1/gscratch/tdekok/test-clm-j \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --fp16 true \
    --fp16_opt_level O1

This results in about ~55 GB of RAM usage, and I can see in nvidia-smi that it fills up my GPU VRAM beyond the available 23761 MiB.

  3. I noticed when using gpt2 instead of gpt-j-6B that the memory usage on gpu:0 is substantially higher relative to the rest. I wonder whether this might be part of the issue:

[screenshot: nvidia-smi output showing per-GPU memory usage]
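
As a side note on the fp16 loading mentioned in observation 1, here is a minimal loading sketch; it assumes a transformers build that already includes the merged GPT-J code.

    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    # revision="float16" fetches the half-precision branch of the checkpoint
    # (~12 GB of weights instead of ~24 GB); torch_dtype keeps it in fp16 in memory.
    model = GPTJForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B",
        revision="float16",
        torch_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model.to("cuda")  # roughly 13 GB of VRAM for inference; fine-tuning needs far more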

@dimaischenko

This results in about ~55 GB of RAM usage, and I can see in nvidia-smi that it fills up my GPU VRAM beyond the available 23761 MiB.

I think it doesn't matter whether you are using 8 GPUs or 1 GPU, because with batch_size=1 at least one sample has to fit on a single card. I am trying the same parameters on an A100 with 40 GB of GPU VRAM and the OOM still occurs. So I think your RTX 6000 with 24 GB of VRAM is definitely not enough for fine-tuning.

But let's wait for an answer from the creators of the model.

@MantasLukauskas
Author

@dimaischenko Yes, I have the same problem. I tried to use the DeepSpeed ZeRO-3 optimizer for this, but even with batch_size=1 and model_revision=float16 I run out of memory. Interestingly, I have the same problem with gpt2-xl, yet I have seen a lot of people fine-tuning that model with a T4 + DeepSpeed :(

@MantasLukauskas
Author

@dimaischenko I tested a lot of parameters and found that with --block_size 512 I can fine-tune the GPT-J model: RAM usage 100 GB, GPU usage 12 GB (NVIDIA T4, 16 GB total), with the DeepSpeed ZeRO-3 optimizer.

@dimaischenko

dimaischenko commented Aug 31, 2021

@MantasLukauskas Sounds interesting. Maybe it's the optimizer. Which option enables it, or do you need to modify the run_clm.py code?

@MantasLukauskas
Author

@dimaischenko DeepSpeed is a library (https://github.com/microsoft/DeepSpeed) whose integration is built into Hugging Face, so you do not need to modify the run_clm.py code; you fine-tune the model like this:
deepspeed --num_gpus 1 run_clm.py --model_name_or_path EleutherAI/gpt-j-6B --num_train_epochs 10 --model_revision float16 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --train_file train.txt --validation_file test.txt --do_train --do_eval --output_dir GPTJ --save_steps 1000 --logging_steps 100 --logging_dir GPTJ/runs --fp16 --deepspeed zero3ws.json --overwrite_output_dir

My deepspeed config file can be found here: https://github.com/MantasLukauskas/DeepSpeed-Test/blob/main/zero3.json
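
For reference, the config linked above is not reproduced in the thread; a minimal ZeRO-3 configuration with CPU offload, along the lines of the Hugging Face DeepSpeed integration docs, looks roughly like this (the "auto" values are filled in by the Trainer from its own arguments):

    {
      "fp16": { "enabled": "auto" },
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "overlap_comm": true
      },
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "train_batch_size": "auto"
    }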

@dimaischenko

@MantasLukauskas Thanks! I'll try it today and write up the results.

@StellaAthena
Contributor

StellaAthena commented Aug 31, 2021

@dimaischenko

@StellaAthena I am trying to fine-tune GPT-J from your branch, but neither a Tesla A100 with 40 GB of GPU RAM (Google Cloud) nor a TPU v3-8 can handle it. I get an OOM error in both cases.

If you are working on TPUs, I strongly recommend using the mesh-transformer-jax library, which was written for the purpose of producing GPT-J models. The version on Hugging Face is a PyTorch port of the original JAX code, which has been used by many people on TPUs.

I'm not sure why you're having trouble with A100s though, as I have run the code on A100s before. Can you provide further details about how you're running the model? Is it loading the model or the act of fine-tuning that OOMs?

@dimaischenko

@StellaAthena Thanks! I'll try mesh-transformer-jax. It's just that I already have a reliable fine-tuning pipeline for HuggingFace.

About the OOM: I'll repeat my attempts today and post the logs. Loading the model itself was successful, and it even ran validation with perplexity calculation on the validation samples. But when it tried to consume the first sample in training, it crashed with an OOM.

@StellaAthena
Contributor

@StellaAthena Thanks! I'll try mesh-transformer-jax. It's just that I already have a reliable fine-tuning pipeline for HuggingFace.

About the OOM: I'll repeat my attempts today and post the logs. Loading the model itself was successful, and it even ran validation with perplexity calculation on the validation samples. But when it tried to consume the first sample in training, it crashed with an OOM.

This is really weird, given that you've said the batch size is set to 1. How much memory is allocated before you feed the first datum into the model? Does a different architecture that takes up the same amount of memory also fail?
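
One quick way to answer that question is a probe dropped in just before trainer.train(); this is a sketch, not something run_clm.py prints by itself:

    import torch

    # How much of the card do the weights already occupy before the first batch?
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(torch.cuda.memory_summary(abbreviated=True))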

@dimaischenko

dimaischenko commented Sep 1, 2021

@StellaAthena I tried running run_clm.py again from the latest branch on a single A100 GPU (40 GB):

python run_clm_orig.py \
    --model_type gptj \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --model_revision float16 \
    --do_train \
    --do_eval \
    --train_file ./data/train.txt \
    --validation_file ./data/val.txt \
    --evaluation_strategy steps \
    --logging_steps 300 \
    --learning_rate 0.00002 \
    --save_steps 1500 \
    --fp16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 1 \
    --block_size 1024 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --output_dir ./out/test_gptj_orig                                        

and got an OOM error:

[INFO|trainer.py:414] 2021-09-01 11:39:10,987 >> Using amp fp16 backend
[INFO|trainer.py:1168] 2021-09-01 11:39:10,997 >> ***** Running training *****
[INFO|trainer.py:1169] 2021-09-01 11:39:10,997 >>   Num examples = 6011
[INFO|trainer.py:1170] 2021-09-01 11:39:10,997 >>   Num Epochs = 1
[INFO|trainer.py:1171] 2021-09-01 11:39:10,997 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1172] 2021-09-01 11:39:10,997 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1173] 2021-09-01 11:39:10,997 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1174] 2021-09-01 11:39:10,997 >>   Total optimization steps = 6011
  0%|          | 0/6011 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_clm_orig.py", line 522, in <module>
    main()
  File "run_clm_orig.py", line 472, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1284, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/trainer.py", line 1821, in compute_loss
    outputs = model(**inputs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 780, in forward
    return_dict=return_dict,
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 631, in forward
    output_attentions=output_attentions,
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 286, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 249, in forward
    hidden_states = self.fc_in(hidden_states)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/disk/projects/gpt/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 39.59 GiB total capacity; 37.49 GiB already allocated; 19.19 MiB free; 37.73 GiB reserved in total by PyTorch)
  0%|          | 0/6011 [00:00<?, ?it/s]

Today I will switch to mesh-transformer-jax, try to fine-tune on a TPU v3-8, and then convert the checkpoint to the Hugging Face format.

@sgugger
Collaborator

sgugger commented Sep 1, 2021

You are trying to use the Adam optimizer with a 24 GB model. With Adam, you keep four copies of your model: the weights, the gradients, and, in the optimizer state, the averaged gradients and the averaged squared gradients. Even with fp16, all of that is still stored in FP32 because of mixed-precision training (the optimizer update is done in full precision). So unless you use DeepSpeed to offload the optimizer state and the FP32 gradient copy, you won't be able to fit these 4 x 24 GB even on an 80 GB card.
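
As a back-of-the-envelope check of that arithmetic (the parameter count below is approximate):

    params = 6.05e9           # GPT-J-6B, roughly
    fp32   = 4                # bytes per value
    weights = params * fp32   # master weights              ~24 GB
    grads   = params * fp32   # gradients                   ~24 GB
    adam_m  = params * fp32   # averaged gradients          ~24 GB
    adam_v  = params * fp32   # averaged squared gradients  ~24 GB
    total = weights + grads + adam_m + adam_v
    print(f"{total / 1e9:.0f} GB before activations")  # ~97 GB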

@dimaischenko

@sgugger Thanks for the clarification! I configured DeepSpeed and everything started up on the A100 GPU. However, now I need 80 GB of CPU RAM, but that is solvable 😄

@sgugger
Collaborator

sgugger commented Sep 1, 2021

There is also NVMe offload if CPU RAM becomes a problem :-)
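
For reference, NVMe offload is selected in the same zero_optimization block by switching the offload device; a sketch, where the nvme_path below is a placeholder for a fast local SSD:

      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
        "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
      }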

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Oct 7, 2021
@johndpope

@sgugger - are there any docs on how to do this that you can point to? I have an RTX 3090 and am hitting KeyError: 'gptj'.
(The error is really obscure; it should really be something easier to understand.)
I've got 32 GB of RAM - @dimaischenko, did bumping to 80 GB fix things?

@LysandreJik
Member

@johndpope what is your transformers version? It looks like it is outdated and does not have the GPT-J model available.
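
A quick way to check is to print the installed version and try the import, which fails with an ImportError on releases that predate the GPT-J merge:

    python -c "import transformers; print(transformers.__version__)"
    python -c "from transformers import GPTJForCausalLM; print('GPT-J is available')"
    pip install -U transformers  # upgrade if the import fails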

@dimaischenko

@LysandreJik I agree with you; I think that's the problem. @johndpope Yes, 80 GB of RAM was enough. To be honest, I don't remember the details anymore, but it seems it took even less with DeepSpeed.

@johndpope

I had trouble with RAM, but found this (installing now): it supposedly fits in 17/15 GB of VRAM and uses FastAPI - https://news.ycombinator.com/item?id=27731266
(it uses TensorFlow but keeps the memory footprint lower)
https://gist.githubusercontent.com/kinoc/f3225092092e07b843e3a2798f7b3986/raw/fc0dbe522d09d3797dd2a64e7182003f7d9a7fa8/jserv.py
