New GA fix causes training loss multiple times higher across the board (5x to 10x higher) #34263

Closed
4 tasks
JianbangZ opened this issue Oct 19, 2024 · 33 comments · Fixed by #34283

Comments

@JianbangZ

JianbangZ commented Oct 19, 2024

System Info

8xH100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both show the same problem.
SFT after GA fix: [loss curve screenshot]

SFT before GA fix: [loss curve screenshot]

Expected behavior

The training loss value should be aligned with the old values, or be expected to be lower.

@JianbangZ JianbangZ added the bug label Oct 19, 2024
@KeonwooChoi

I also experienced a reduction in loss while training, but the loss value and grad_norm became very large. According to the blog (https://huggingface.co/blog/gradient_accumulation), it seems the code has changed to 'sum' the loss instead of using 'mean', and then divide it by 'num_items'. I'm wondering if this 'sum' value is what is being output. If I'm mistaken, please let me know.

[screenshot of the logged loss and grad_norm values]
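
For reference, a minimal sketch (illustrative only, not the Trainer's actual code) of why summing and dividing by num_items differs from the old per-micro-batch mean when micro-batches have uneven token counts:

```python
# Minimal sketch, illustrative only: two gradient-accumulation micro-batches with
# different numbers of non-padded label tokens.
import torch

per_token_losses = [torch.rand(5), torch.rand(17)]

# Old behavior: mean per micro-batch, then average across accumulation steps.
old = sum(l.mean() for l in per_token_losses) / len(per_token_losses)

# New behavior: sum per micro-batch, then divide once by the global token count.
num_items_in_batch = sum(l.numel() for l in per_token_losses)
new = sum(l.sum() for l in per_token_losses) / num_items_in_batch

print(old.item(), new.item())  # the two differ unless every micro-batch has the same length
```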

@benjamin-marie

Can you share more info?
It would be useful to know at least the training arguments and model used.

@danielhanchen
Contributor

Actually, would it be possible for you both, @JianbangZ and @KeonwooChoi, to try it via Unsloth, using the same dataset if possible? Thanks!

@paulcx

paulcx commented Oct 21, 2024

Here is my experiment; I can confirm the fix still needs to be fixed.

@muellerzr
Contributor

Will take a look since #34198 is really what fixed it on the trainer

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Just an FYI: for a true test it would be better to compare this against a run using no gradient accumulation. I wouldn't be surprised if the LR hyperparameters are effectively different now, but I'm looking at this today/right now.

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

@muellerzr Does transformers need to update all the loss functions in the corresponding modeling files to accept "num_items_in_batch" as an argument?
Such as https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1212C94-L1212C118
Fix is mentioned here (though in Chinese)
hiyouga/LLaMA-Factory#5747 (comment)
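
For context, a hedged, self-contained sketch of what a num_items_in_batch-aware causal-LM loss looks like (illustrative only; the real implementation lives in the transformers loss utilities and modeling files linked above):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, vocab_size, num_items_in_batch=None, ignore_index=-100):
    # Shift so that each position predicts the next token, as in standard CLM training.
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1)
    if num_items_in_batch is None:
        # Old-style reduction: mean over this micro-batch only.
        return F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
    # GA-fix reduction: sum here, divide by the token count of the whole accumulated batch.
    loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index, reduction="sum")
    return loss / num_items_in_batch
```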

@muellerzr
Contributor

Yeah. Will do so today

@muellerzr
Contributor

@JianbangZ you can build off of pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues. It has the rest of the models + the full fix for trainer

@JianbangZ
Author

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

Do you have to give num_items_in_batch as a trainer argument now?

@muellerzr
Contributor

No, a user never has to keep track of that

@JianbangZ
Author

No, a user never has to keep track of that

Tried it; the loss value is still very big. It seems num_items_in_batch is not taking effect.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline if possible.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

Also, though: they shouldn't be the same; they will be different from the old version because the two are entirely different calculations. Hence the request for a reproducer and a test with no grad accum.

@JianbangZ
Author

Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline if possible.

I found out why. It's that self.model_accepts_loss_kwargs is not set to True (how is it set dynamically?). Anyway, I hard-coded it to True and the loss value seems to become normal. I will run a full 3-hour training (a multimodal training with Llama3.2-3B as the backend), then report the loss curves here.

@muellerzr
Contributor

muellerzr commented Oct 21, 2024

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument
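
For illustration, roughly how such a signature check can be done (a sketch, not necessarily the Trainer's exact code); note that a wrapper such as a PEFT model can hide the base model's forward() signature, which matches the PEFT-related confusion later in this thread:

```python
import inspect

def accepts_loss_kwargs(model) -> bool:
    # True if forward() declares loss_kwargs (or a **kwargs catch-all) and can
    # therefore receive num_items_in_batch from the Trainer.
    params = inspect.signature(model.forward).parameters
    return "loss_kwargs" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
```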

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

@JianbangZ which model are you testing with? llama3.2-3B? We look at the forward() signature to see if it accepts the new loss_kwargs argument

Yes, I have a script that uses Llama-3.2-3B-it as the LLM backend and does LLaVA-style training. I think it's good to test MLLM training beyond just regular LLM SFT to see how things go.
And where is this set, or how is the forward() signature determined? Shouldn't loss_kwargs always be accepted?

@muellerzr
Contributor

To my knowledge this is really only for CLM/LLM training, so the PR limits it to that (e.g. it wouldn't make sense for classification, or for problems where the lengths of the inputs matter w.r.t. the loss function). You can see in the PR's details where/how the models are updated with this new mechanic.

What does model.loss_function give you?

@JianbangZ
Author

JianbangZ commented Oct 21, 2024

Can confirm it seems fixed after I enable loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).
[loss curve screenshot]

@paulcx

paulcx commented Oct 22, 2024

Can confirm it seems fixed after I enable loss_kwargs, and the loss is looking great at bs16_ga4 vs before (1.7 vs 1.74).

Would you mind summarizing the solution to this issue?

@man-shar

man-shar commented Oct 22, 2024

@muellerzr Thank you for the awesome work in your PR. This is a noob question, but:

If this wasn't OOTB in the Trainer API, were we intended to provide the compute_loss_fn argument? Could you provide a small example of how that would look? Just the source code for your experiments would also work!

@muellerzr
Contributor

@man-shar the docs in the Trainer are fairly detailed, but there's an example test here: https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py#740-L834

If I had to guess, the reason I was confused earlier is that it was a PEFT method, which is why it didn't work.

@man-shar

Ah awesome, thanks!

@JianbangZ
Author

pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues

Then enable loss_kwargs in forward(), or you can hard-code model_accepts_loss_kwargs to True in the Trainer.

@thesillystudent

Hey @JianbangZ, can you please give an example? I'm not able to find a parameter called "loss_kwargs" in the Trainer class.

@muellerzr
Contributor

muellerzr commented Oct 22, 2024

@thesillystudent if you build off my branch again, we now take PEFT models into account, which should fix the issue. You, as a user, do not have access to loss_kwargs in the Trainer; it's internal.

You can, however, create your own compute_loss_func.

(This will be merged shortly)

@paulcx

paulcx commented Oct 22, 2024

compute_loss_func

Would you give an example of how to use compute_loss_func?

@muellerzr
Contributor

@paulcx as mentioned there's an example in the test:

# The snippet below is the test example with imports added for context; the
# ForCausalLMLoss import path reflects the current loss utilities and may differ by version.
from functools import partial

from transformers import AutoModelForCausalLM
from transformers.loss.loss_utils import ForCausalLMLoss

model_name = "gpt2"  # placeholder checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the stock causal-LM loss so the Trainer can hand it num_items_in_batch.
def compute_loss(logits, labels, vocab_size, num_items_in_batch, disable_num_items_in_batch=False):
    return ForCausalLMLoss(
        logits["logits"], labels, vocab_size, num_items_in_batch, disable_num_items_in_batch
    )

loss_fn = partial(compute_loss, vocab_size=model.config.vocab_size, disable_num_items_in_batch=False)
@thusinh1969

thusinh1969 commented Oct 25, 2024

I have re-installed:
pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (forced to ignore the version check).
enable_liger_kernel: true

The loss is still way too high, like 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks,
Steve

@haorannlp

Can anyone summarize what has changed in transformers==4.46.0 in terms of gradient accumulation steps, and does it affect users upgrading from 4.43 to 4.46.0? Thanks!

@muellerzr
Contributor

@haorannlp we made it so that grad accum scaling aligns with training w/o grad accum. There might be some influence on how the end evals look, but it shouldn't be largely different. You'll notice the losses are different (smaller, not larger; the issue here was a bug, iirc), but overall the training is similar enough and "more accurate".

I also wrote a concise blog on it all: https://muellerzr.github.io/blog/gradient_accumulation_part2.html
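
To make "aligns with training w/o grad accum" concrete, here is a toy check (illustrative only, not Trainer code) that summing micro-batch losses and dividing by the global item count reproduces the full-batch gradients:

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def per_item_losses(model, xb, yb):
    return (model(xb) - yb).pow(2).squeeze(-1)  # one loss term per example

# Baseline: one full-batch step, plain mean reduction.
lin.zero_grad()
per_item_losses(lin, x, y).mean().backward()
full_grad = lin.weight.grad.clone()

# Gradient accumulation over two unequal micro-batches, divided by the global count.
lin.zero_grad()
num_items_in_batch = 8
for xb, yb in [(x[:3], y[:3]), (x[3:], y[3:])]:
    (per_item_losses(lin, xb, yb).sum() / num_items_in_batch).backward()

print(torch.allclose(full_grad, lin.weight.grad, atol=1e-6))  # True: same gradients
```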

@haorannlp

@muellerzr Thanks for the reply. Now I understand that 4.46.0 is trying to fix this inaccurate GAS issue.
Just to confirm, is this multiply-by-GAS step in 4.46.0 related to the fix? Is this multiplication (loss*GAS) a bug?

@Yangr116

Yangr116 commented Dec 9, 2024

DeepSpeed ZeRO-3 offload raises the error "Cannot reduce scatter gradients whose size is not same as the params". Has anyone met this issue when using the new transformers?
