New GA fix causes training loss multiple times higher across the board (5x to 10x higher) #34263

Closed
4 tasks
JianbangZ opened this issue Oct 19, 2024 · 33 comments · Fixed by #34283

Comments

@JianbangZ

JianbangZ commented Oct 19, 2024

System Info

8xH100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both show the same problem.
SFT after GA fix: [loss curve screenshot]

SFT before GA fix: [loss curve screenshot]

Expected behavior

The training loss value should be aligned with the old values, or be expected to be lower.

@JianbangZ JianbangZ added the bug label Oct 19, 2024
@KeonwooChoi

I also experienced a reduction in loss while training, but the loss value and grad_norm became very large. According to the blog (https://huggingface.co/blog/gradient_accumulation), it seems the code has changed to 'sum' the loss instead of using 'mean', and then divide it by 'num_items'. I'm wondering if this 'sum' value is what is being output. If I'm mistaken, please let me know.

[screenshot of the logged loss and grad_norm values]
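
For reference, a minimal sketch (illustrative only, not the Trainer's actual code) of why summing and dividing by num_items differs from the old per-micro-batch mean when micro-batches have uneven token counts:

```python
# Minimal sketch, illustrative only: two gradient-accumulation micro-batches with
# different numbers of non-padded label tokens.
import torch

per_token_losses = [torch.rand(5), torch.rand(17)]

# Old behavior: mean per micro-batch, then average across accumulation steps.
old = sum(l.mean() for l in per_token_losses) / len(per_token_losses)

# New behavior: sum per micro-batch, then divide once by the global token count.
num_items_in_batch = sum(l.numel() for l in per_token_losses)
new = sum(l.sum() for l in per_token_losses) / num_items_in_batch

print(old.item(), new.item())  # the two differ unless every micro-batch has the same length
```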

@benjamin-marie

Can you share more info?
It would be useful to know at least the training arguments and model used.

@danielhanchen
Contributor

Actually, would it be possible for you both, @JianbangZ and @KeonwooChoi, to try it via Unsloth, using the same dataset if possible? Thanks!

@paulcx

paulcx commented Oct 21, 2024

Here is my experiment; I can confirm the fix still needs to be fixed.

@muellerzr
Contributor

Will take a look since #34198 is really what fixed it on the trainer

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Just an FYI: for a true test it would be better to compare this against a run using no gradient accumulation. I wouldn't be surprised if the LR hyperparameters are effectively different now, but I'm looking at this today/right now.

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

@muellerzr Does transformers need to update all the loss functions in the corresponding modeling files to accept "num_items_in_batch" as an argument?
Such as https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1212C94-L1212C118
Fix is mentioned here (though in Chinese)
hiyouga/LLaMA-Factory#5747 (comment)
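
For context, a hedged, self-contained sketch of what a num_items_in_batch-aware causal-LM loss looks like (illustrative only; the real implementation lives in the transformers loss utilities and modeling files linked above):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, vocab_size, num_items_in_batch=None, ignore_index=-100):
    # Shift so that each position predicts the next token, as in standard CLM training.
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1)
    if num_items_in_batch is None:
        # Old-style reduction: mean over this micro-batch only.
        return F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
    # GA-fix reduction: sum here, divide by the token count of the whole accumulated batch.
    loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index, reduction="sum")
    return loss / num_items_in_batch
```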

@muellerzr
Contributor

Yeah. Will do so today

@muellerzr
Contributor

@JianbangZ you can build off of pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues. It has the rest of the models + the full fix for trainer

@JianbangZ
Author

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

Do you have to give num_items_in_batch as a trainer argument now?

@muellerzr
Contributor

No, a user never has to keep track of that

@JianbangZ
Author

No, a user never has to keep track of that

Tried it; the loss value is still very big. It seems num_items_in_batch is not taking effect.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline if possible.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Also, though: they shouldn't be the same; they will be different from the old version because the two are entirely different calculations. Hence the request for a reproducer and a test with no grad accum.

@JianbangZ
Author

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline if possible.

I found out why. It's that self.model_accepts_loss_kwargs is not set to True (how is it set dynamically?). Anyway, I hard-coded it to True and the loss value seems to become normal. I will run a full 3-hour training (a multimodal training with Llama3.2-3B as the backend), then report the loss curves here.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument
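
For illustration, roughly how such a signature check can be done (a sketch, not necessarily the Trainer's exact code); note that a wrapper such as a PEFT model can hide the base model's forward() signature, which matches the PEFT-related confusion later in this thread:

```python
import inspect

def accepts_loss_kwargs(model) -> bool:
    # True if forward() declares loss_kwargs (or a **kwargs catch-all) and can
    # therefore receive num_items_in_batch from the Trainer.
    params = inspect.signature(model.forward).parameters
    return "loss_kwargs" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
```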

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument

Yes, I have a script that uses Llama-3.2-3B-it as the LLM backend and does LLaVA-style training. I think it's good to test MLLM training beyond just regular LLM SFT to see how things go.
And where is this set, or how is the forward() signature determined? Shouldn't loss_kwargs always be accepted?

@muellerzr
Contributor

To my knowledge this is really only for CLM/LLM training, so the PR limits it to that (e.g. it wouldn't make sense for classification, or for problems where the lengths of the inputs matter w.r.t. the loss function). You can see in the PR's details where/how the models are updated with this new mechanic.

What does model.loss_function give you?

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

Can confirm it seems fixed after I enable loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).
[loss curve screenshot]

@paulcx

paulcx commented Oct 22, 2024

Can confirm it seems fixed after I enable loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).

Would you mind summarizing the solution to this issue?

@man-shar

man-shar commented Oct 22, 2024

@muellerzr Thank you for the awesome work in your PR. This is a noob question, but:

If this wasn't OOTB in the Trainer API, were we intended to provide the compute_loss_fn argument? Could you provide a small example of how that would look? Just the source code for your experiments would also work!

@muellerzr
Contributor

@man-shar the docs in the Trainer are fairly detailed, but there's an example test here: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#740-L834

If I had to guess, the reason I was confused earlier is that it was a PEFT method, which is why it didn't work.

@man-shar

Ah awesome, thanks!

@JianbangZ
Author

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

Then enable loss_kwargs in forward(), or you can hard-code model_accepts_loss_kwargs to True in the Trainer.

@thesillystudent

Hey @JianbangZ, can you please give an example? I'm not able to find a parameter called "loss_kwargs" in the Trainer class.

@muellerzr
Contributor

muellerzr commented Oct 22, 2024

@thesillystudent if you build off my branch again, we now take PEFT models into account, which should fix the issue. You, as a user, do not have access to loss_kwargs in the Trainer; it's internal.

You can, however, create your own compute_loss_func.

(This will be merged shortly)

@paulcx

paulcx commented Oct 22, 2024

compute_loss_func

Would you give an example of how to use compute_loss_func?

@muellerzr
Contributor

@paulcx as mentioned there's an example in the test:

# The snippet below is the test example with imports added for context; the
# ForCausalLMLoss import path reflects the current loss utilities and may differ by version.
from functools import partial

from transformers import AutoModelForCausalLM
from transformers.loss.loss_utils import ForCausalLMLoss

model_name = "gpt2"  # placeholder checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the stock causal-LM loss so the Trainer can hand it num_items_in_batch.
def compute_loss(logits, labels, vocab_size, num_items_in_batch, disable_num_items_in_batch=False):
    return ForCausalLMLoss(
        logits["logits"], labels, vocab_size, num_items_in_batch, disable_num_items_in_batch
    )

loss_fn = partial(compute_loss, vocab_size=model.config.vocab_size, disable_num_items_in_batch=False)
@thusinh1969

thusinh1969 commented Oct 25, 2024

I have re-installed:
pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (forced to ignore the version check).
enable_liger_kernel: true

The loss is still way too high, like 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks,
Steve

@haorannlp

Can anyone summarize what has changed in transformers==4.46.0 in terms of gradient accumulation steps, and does it affect users upgrading from 4.43 to 4.46.0? Thanks!

@muellerzr
Contributor

@haorannlp we made it so that grad accum scaling aligns with training w/o grad accum. There might be some influence on how the end evals look, but it shouldn't be largely different. You'll notice the losses are different (smaller, not larger; the issue here was a bug, iirc), but overall the training is similar enough and "more accurate".

I also wrote a concise blog on it all: https://muellerzr.github.io/blog/gradient_accumulation_part2.html
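
To make "aligns with training w/o grad accum" concrete, here is a toy check (illustrative only, not Trainer code) that summing micro-batch losses and dividing by the global item count reproduces the full-batch gradients:

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def per_item_losses(model, xb, yb):
    return (model(xb) - yb).pow(2).squeeze(-1)  # one loss term per example

# Baseline: one full-batch step, plain mean reduction.
lin.zero_grad()
per_item_losses(lin, x, y).mean().backward()
full_grad = lin.weight.grad.clone()

# Gradient accumulation over two unequal micro-batches, divided by the global count.
lin.zero_grad()
num_items_in_batch = 8
for xb, yb in [(x[:3], y[:3]), (x[3:], y[3:])]:
    (per_item_losses(lin, xb, yb).sum() / num_items_in_batch).backward()

print(torch.allclose(full_grad, lin.weight.grad, atol=1e-6))  # True: same gradients
```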

@haorannlp

@muellerzr Thanks for the reply. Now I understand that 4.46.0 is trying to fix this inaccurate GAS issue.
Just to confirm, is this multiply-by-GAS step in 4.46.0 related to the fix? Is this multiplication (loss*GAS) a bug?

@Yangr116

Yangr116 commented Dec 9, 2024

DeepSpeed ZeRO-3 offload raises the error "Cannot reduce scatter gradients whose size is not same as the params". Has anyone met this issue when using the new transformers?
