GPT-J-6B in run_clm.py #13329
Comments
Hello @MantasLukauskas, GPT-J is not yet merged into `transformers`.
@LysandreJik Is there any workaround for fine-tuning it in the meantime? As far as I can see, the merge could take some time.
You could check out the PR directly and try fine-tuning it with the GPT-J code!
You can also install directly from my fork.
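For readers following along, here is a minimal sketch of what "checking out the PR directly" could look like for the GPT-J pull request (#13022, referenced later in this thread); the exact pip command for installing from the fork was cut off above, so the local branch name below is only a placeholder:

```bash
# Sketch: fetch the GPT-J PR (#13022) as a local branch and install it in editable mode.
git clone https://github.com/huggingface/transformers.git
cd transformers
git fetch origin pull/13022/head:gpt-j-pr   # "gpt-j-pr" is an arbitrary local name
git checkout gpt-j-pr
pip install -e .
```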
@StellaAthena I am trying to fine-tune GPT-J from your branch, but neither a Tesla A100 with 40GB of GPU RAM (Google Cloud) nor a TPU v3-8 allows for this: OOM error in both cases. I am setting batch_size = 1, enabling gradient_checkpointing, and trying different block_sizes. Is it possible to fine-tune it on such devices?
@dimaischenko Do you use run_clm.py for fine-tuning, or do you do it another way?
@MantasLukauskas Yes, with run_clm.py.
@dimaischenko I got the error "RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4027105280 bytes. Error code 12 (Cannot allocate memory)". Did you have the same? 100GB RAM + DeepSpeed ZeRO-3 + T4 15GB.
4,027,105,280 <<< 100 GB, so it's hard to see how that's the issue, unless you have something else running. Can you print out the amount of free memory during the loading process?
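One way to answer that question is to log available RAM around the model-loading call. This is only a sketch and assumes the third-party psutil package; the loading line is shown as a comment because the exact script differs per setup:

```python
import psutil  # third-party: pip install psutil

def log_memory(tag: str) -> None:
    """Print available system RAM and this process's resident memory."""
    vm = psutil.virtual_memory()
    rss = psutil.Process().memory_info().rss
    print(f"[{tag}] available: {vm.available / 1e9:.1f} GB, process RSS: {rss / 1e9:.1f} GB")

log_memory("before model load")
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")  # load the checkpoint here
log_memory("after model load")
```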
Thanks @StellaAthena and @EricHallahan for all your work on the #13022 GPT-J fork! Over the past few days I've been playing around with the current state of the fork and I am running into the same OOM issues that @dimaischenko references here. Here is some information from my end in case it is helpful for debugging what is happening (I'd be happy to put this in a separate issue if that is desired).

System: I am running everything on a compute cluster (i.e., not Google Colab) with ~384GB of RAM and 8x RTX 6000 GPUs with 24GB of VRAM each. I am using the fork by @StellaAthena and a fresh conda environment with Python 3.9.

My observations:
For example, these are my parameters:

```bash
python run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --model_revision float32 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /mmfs1/gscratch/tdekok/test-clm-j \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --fp16 true \
    --fp16_opt_level O1
```

This results in about ~55GB of RAM usage, and I can see in …
I think it doesn't matter whether you are using 8 GPUs or 1 GPU, because … But let's wait for an answer from the creators of the model.
@dimaischenko Yes, I have the same problem. I tried to use the DeepSpeed ZeRO-3 optimizer for this, but even with batch_size=1 and model_revision=float16 I run out of memory. Interestingly, I have the same problem with gpt2-xl, yet I have seen a lot of people fine-tune that model with a T4 + DeepSpeed :(
@dimaischenko I tested a lot of parameters and found that with --block_size 512 I can fine-tune the GPT-J model. RAM usage 100GB, GPU usage 12GB (NVIDIA T4, 16GB total), DeepSpeed ZeRO-3 optimizer.
@MantasLukauskas Sounds interesting. Maybe it's the optimizer. Which option enables it, or do you need to modify the run_clm.py code?
@dimaischenko DeepSpeed (https://github.com/microsoft/DeepSpeed) is integrated into the Hugging Face library, so you do not need to modify the run_clm.py code; you fine-tune the model like this: … My DeepSpeed config file can be found here: https://github.com/MantasLukauskas/DeepSpeed-Test/blob/main/zero3.json
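The launch command itself was cut off above. Under the Hugging Face DeepSpeed integration, a typical invocation passes the config file via the --deepspeed argument; the following is only an illustrative sketch (dataset and output paths are placeholders, not the ones used in this thread):

```bash
# Illustrative only: run_clm.py launched through DeepSpeed with the zero3.json config linked above.
deepspeed run_clm.py \
    --deepspeed zero3.json \
    --model_name_or_path EleutherAI/gpt-j-6B \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --per_device_train_batch_size 1 \
    --block_size 512 \
    --output_dir ./gpt-j-finetuned
```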
@MantasLukauskas thanks! I'll try today and write about the results.
If you are working on TPUs, I strongly recommend using the mesh-transformer-jax library, which was written for the purpose of producing GPT-J models. The version on Hugging Face is a PyTorch port of the original JAX code, which has been used by many people on TPUs. I'm not sure why you're having trouble with A100s though, as I have run the code on A100s before. Can you provide further details about how you're running the model? Is it loading the model or the act of fine-tuning that OOMs?
@StellaAthena Thanks! I'll try that. About the OOM: I'll repeat my attempts today and post logs. Loading the model itself was successful, and it even ran validation with perplexity calculation on the validation samples. But as soon as it tried «to eat» the first training sample, it crashed with OOM.
This is really weird, given that you've said the batch size is set to 1. How much memory is allocated before you feed the first datum into the model? Does a different architecture that takes up the same amount of memory also fail?
@StellaAthena I tried again running run_clm.py from the latest branch on a single A100 GPU (40GB) and got an OOM error. Today I will switch to …
You are trying to use the Adam optimizer with a model of 24GB. With Adam, you have four copies of your model: the model itself, the gradients, and, in the optimizer state, the averaged gradients and the averaged squared gradients. Even with fp16, all of that is still stored in FP32 because of mixed-precision training (the optimizer update is done in full precision). So unless you use DeepSpeed to offload the optimizer state and the FP32 gradient copy, you won't be able to fit those 4 x 24GB on your 80GB card.
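A back-of-the-envelope version of that accounting, as a sketch (the parameter count is approximate, and activations, buffers, and fragmentation come on top of this):

```python
# Rough memory accounting for full-precision Adam state on a ~6B-parameter model.
params = 6.05e9          # approximate parameter count of GPT-J-6B
fp32_bytes = 4

weights = params * fp32_bytes   # FP32 master weights
grads   = params * fp32_bytes   # FP32 gradients
adam_m  = params * fp32_bytes   # Adam: averaged gradients (first moment)
adam_v  = params * fp32_bytes   # Adam: averaged squared gradients (second moment)

total_gb = (weights + grads + adam_m + adam_v) / 1e9
print(f"~{total_gb:.0f} GB for model + gradients + optimizer state")  # roughly 4 x 24 GB
```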
@sgugger Thanks for the clarification! I configured DeepSpeed and everything started up on the A100 GPU. However, now I need 80GB of CPU RAM, but this is solvable 😄
There is also the NVMe offload if CPU RAM becomes a problem :-)
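For reference, NVMe offload is set up in the DeepSpeed config file rather than on the command line. A minimal, untested excerpt of the zero_optimization section might look like the following, where the nvme_path values are placeholders for a local NVMe mount:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```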
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger - are there any docs on how to do this that you can point to? I've got an RTX 3090 and am hitting the KeyError: 'gptj'.
@johndpope what is your `transformers` version?
@LysandreJik I agree with you. I think that's the problem. @johndpope Yes, 80GB of RAM was enough. To be honest, I don't remember the details anymore, but it seems that it took even less with …
Had trouble with RAM, but found this (installing now); it supposedly fits in 17/15GB of VRAM and uses FastAPI: https://news.ycombinator.com/item?id=27731266
Environment info
`transformers` version: 4.10.0.dev0

Information
The model I am using is GPT-J from the Hugging Face Hub. There is a KeyError with this model; the error is listed below:
```
Traceback (most recent call last):
  File "run_clm.py", line 522, in <module>
    main()
  File "run_clm.py", line 320, in main
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 514, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 263, in __getitem__
    raise KeyError(key)
KeyError: 'gptj'
```
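The KeyError means the transformers build actually being imported predates GPT-J support, so the model type "gptj" is missing from CONFIG_MAPPING. A quick sanity check (sketch) of which installation and version Python picks up:

```python
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print(transformers.__version__)   # needs a build that already includes GPT-J
print(transformers.__file__)      # shows which installation is actually imported
print("gptj" in CONFIG_MAPPING)   # False on builds without GPT-J support
```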