New GA fix causes training loss multiple times higher across the board (5x to 10x higher) #34263
Comments
I also saw the loss decrease during training, but the loss value and grad_norm became very large. According to the blog (https://huggingface.co/blog/gradient_accumulation), the code now 'sum's the loss instead of taking the 'mean', and then divides by 'num_items'. I'm wondering whether this 'sum' value is what's being reported. If I'm mistaken, please let me know.
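For context, here is a minimal sketch (not the actual transformers implementation) of the two reductions being discussed; `num_items_in_batch` stands for the number of non-padding label tokens across all accumulated micro-batches:

```python
import torch.nn.functional as F

def loss_mean(logits, labels):
    # old behaviour: average over the label tokens of this micro-batch only
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
                           ignore_index=-100, reduction="mean")

def loss_sum_over_items(logits, labels, num_items_in_batch):
    # new behaviour: sum the token losses, then divide by the label-token count
    # of the whole accumulated batch (the division happens once, not per micro-batch)
    summed = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
                             ignore_index=-100, reduction="sum")
    return summed / num_items_in_batch
```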
Can you share more info?
Actually, would it be possible for you both, @JianbangZ and @KeonwooChoi, to try it via Unsloth, using the same dataset if possible? Thanks!
Here is my experiment; I can confirm the fix still needs further fixing.
Will take a look, since #34198 is really what fixed it on the Trainer side.
Just an FYI: for a true test it would be better to compare this against a run with no gradient accumulation. I wouldn't be surprised if the LR hyperparameters are different now, but I'm looking at this today/right now.
@muellerzr Does transformers need to update all the loss functions in the corresponding modeling files to accept "num_items_in_batch" as an argument?
Yeah. Will do so today.
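For reference, here is a simplified, self-contained sketch of the pattern being described (illustrative only, not the actual transformers modeling code): the causal-LM forward accepts `num_items_in_batch` and uses it to normalize a summed loss, so the Trainer can inject the token count of the whole accumulated batch.

```python
import torch
import torch.nn.functional as F

class ToyCausalLM(torch.nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.lm_head = torch.nn.Linear(hidden, vocab_size)
        self.vocab_size = vocab_size

    def forward(self, input_ids, labels=None, num_items_in_batch=None):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            # shift so each position predicts the next token
            shift_logits = logits[:, :-1, :].reshape(-1, self.vocab_size)
            shift_labels = labels[:, 1:].reshape(-1)
            if num_items_in_batch is not None:
                loss = F.cross_entropy(shift_logits, shift_labels,
                                       ignore_index=-100, reduction="sum")
                loss = loss / num_items_in_batch
            else:
                loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
        return {"loss": loss, "logits": logits}

model = ToyCausalLM(vocab_size=100)
ids = torch.randint(0, 100, (2, 8))
out = model(ids, labels=ids, num_items_in_batch=14)  # 2 sequences * 7 shifted labels
```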
@JianbangZ you can build off of |
Do you have to pass num_items_in_batch as a Trainer argument now?
No, a user never has to keep track of that |
Tried it; the loss value is still very big. It seems num_items_in_batch is not taking effect.
Can you give me a reproducer script? I tested this with SFT. Also, the loss could change; I would recommend testing with a larger batch size and no grad accum as a baseline, if possible.
Also, though: they shouldn't be the same; they will differ from the old version because they are two entirely different calculations. Hence the request for a reproducer and a test with no grad accum.
I found out why: self.model_accepts_loss_kwargs is not set to True (how should it be set dynamically?). Anyway, I hard-coded it to True and the loss value seems to return to normal. I will run a full 3-hour training (a multimodal training with Llama3.2-3B as the backend) and then report the loss curves here.
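For anyone hitting the same thing, the workaround described above amounts to a one-line override after the Trainer is constructed (a hedged sketch; `model`, `training_args`, and `train_dataset` are placeholders, and this should only be forced if the model's loss really accepts the extra kwargs):

```python
from transformers import Trainer

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# Force the Trainer to forward num_items_in_batch to the model's loss computation.
trainer.model_accepts_loss_kwargs = True
trainer.train()
```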
@JianbangZ which model are you testing with? llama3.2-3B? We look at the
Yes, I have a script that uses Llama-3.2-3B-it as the LLM backend and does LLaVA-style training. I think it's good to test MLLM training beyond just regular LLM SFT training to see how things go.
To my knowledge this is really only for CLM/LLM training, so the PR limits it to that (e.g. it wouldn't make sense for classification or other problems where the lengths of the inputs matter w.r.t. the loss function). You can see in the details of the PR where/how the models are updated with this new mechanic. What does
@muellerzr Thank you for the awesome work in your PR. This is a noob question, but: if this wasn't OOTB in the Trainer API, were we intended to provide the
@man-shar the docs in the
If I had to guess, the reason it didn't work earlier (and why I was confused) is that it was a PEFT method.
Ah awesome, thanks!
pip install git+https://github.com/huggingface/transformers@fixup-loss_fn_issues
Hey @JianbangZ, can you please give an example? I'm not able to find a parameter called "loss_kwargs" in the Trainer class.
@thesillystudent if you build off my branch again, we take PEFT models into account, which should fix the issue. You, as a user, do not have access to loss_kwargs in the Trainer; it's internal. You can, however, create a compute_loss_func. (This will be merged shortly.)
Would you give an example of how to use compute_loss_func?
@paulcx as mentioned, there's an example in the test:

```python
from functools import partial

from transformers import AutoModelForCausalLM
from transformers.loss.loss_utils import ForCausalLMLoss  # import path may vary by version

model = AutoModelForCausalLM.from_pretrained(model_name)  # model_name defined elsewhere

def compute_loss(logits, labels, vocab_size, num_items_in_batch, disable_num_items_in_batch=False):
    return ForCausalLMLoss(
        logits["logits"], labels, vocab_size, num_items_in_batch, disable_num_items_in_batch
    )

loss_fn = partial(compute_loss, vocab_size=model.config.vocab_size, disable_num_items_in_batch=False)
```
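To wire this up, the resulting `loss_fn` can be handed to the Trainer through its `compute_loss_func` argument (a sketch under assumptions: `train_dataset` is a placeholder for your tokenized dataset, and `compute_loss_func` is only available on recent transformers versions):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                                  gradient_accumulation_steps=8)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
    compute_loss_func=loss_fn,    # called with the model outputs, labels, and num_items_in_batch
)
trainer.train()
```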
I have re-installed: fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (with the version check forcibly ignored). The loss is still way too high, like 10 times higher than before, although the eval loss is as small as before! Do you know why? Thanks.
Can anyone summarize what has happened since transformers==4.46.0 in terms of gradient accumulation steps, and does it affect users upgrading from 4.43 to 4.46.0? Thanks!
@haorannlp we made it so that grad accum scaling aligns with training without grad accum. There might be some influence on end evals, but it shouldn't be largely different. You'll notice the losses are different (smaller, not larger; the issue here was a bug, IIRC), but overall the training is similar enough and "more accurate". I also wrote a concise blog on it all: https://muellerzr.github.io/blog/gradient_accumulation_part2.html
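A toy illustration of the scaling difference described above (numbers are made up; it just shows why averaging per micro-batch diverges from the no-grad-accum result when micro-batches contain different numbers of label tokens):

```python
# Token-level losses for two micro-batches of unequal length (4 tokens vs 2 tokens).
micro_batch_losses = [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0]]

# Old behaviour: mean per micro-batch, then average across micro-batches.
old = sum(sum(l) / len(l) for l in micro_batch_losses) / len(micro_batch_losses)

# Fixed behaviour: sum everything and divide by the total token count,
# which matches a single big batch with no gradient accumulation.
flat = [x for l in micro_batch_losses for x in l]
new = sum(flat) / len(flat)

print(old)  # 3.0
print(new)  # 2.666...
```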
@muellerzr Thanks for the reply. Now I understand that 4.46.0 is trying to fix this inaccurate gradient-accumulation scaling issue.
DeepSpeed ZeRO-3 offload raises the error "Cannot reduce scatter gradients whose size is not same as the params". Has anyone hit this issue with the new transformers?
System Info
8xH100
Who can help?
No response
Reproduction
After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest TRL master), and both show the same problem.
SFT after GA fix
SFT before GA fix
Expected behavior
The training loss value should be aligned with the old values, or be lower than before.