
With the latest transformers library (and the matching trl) that fixes the serious multi-GPU gradient accumulation bug, the DPO training loss becomes several times larger than before #5747

Closed
1 task done
JianbangZ opened this issue Oct 18, 2024 · 12 comments · Fixed by #5852
Labels
solved This problem has been already solved

Comments

@JianbangZ

Reminder

  • I have read the README and searched the existing issues.

System Info

8XH100

Reproduction

After updating to the latest transformers & trl from the master branch, the DPO training loss went from roughly 1.0 → 0.3 before the update to roughly 9 → 3 now.
See huggingface/transformers#34191 for details.

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 18, 2024
@ElementQi

Is the trl library here the main branch?

@JianbangZ
Author

JianbangZ commented Oct 19, 2024 via email

@Arcmoon-Hu

+1. For SFT with batch size 4 and gradient accumulation 4, the initial loss is around 4.x, while with batch size 16 it is only around 1.x.

@aliencaocao
Contributor

If the model still converges normally, isn't that fine? Or is training actually broken?

They changed the denominator normalization of the cross-entropy loss, so a change in the numerical value of the loss is expected.
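
To make the normalization point concrete, here is a small illustrative sketch (invented numbers, not library code) of why averaging each micro-batch separately and dividing once by the total token count give different loss values when micro-batches contain different numbers of target tokens:

```python
# Illustrative only: two gradient-accumulation micro-batches with different
# numbers of target tokens and different per-token losses.
import torch

micro_batch_token_losses = [
    torch.full((4,), 4.0),   # 4 tokens, per-token loss 4.0
    torch.full((16,), 1.0),  # 16 tokens, per-token loss 1.0
]

# Old behaviour: mean per micro-batch, then average across the GA steps.
per_step_means = [t.mean() for t in micro_batch_token_losses]
old_style = sum(per_step_means) / len(per_step_means)            # (4.0 + 1.0) / 2 = 2.5

# Fixed behaviour: sum all token losses and divide once by the total token
# count (this is what passing `num_items_in_batch` achieves).
total_tokens = sum(t.numel() for t in micro_batch_token_losses)
new_style = sum(t.sum() for t in micro_batch_token_losses) / total_tokens  # 32 / 20 = 1.6

print(old_style.item(), new_style.item())
```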

@JianbangZ
Author

If the model still converges normally, isn't that fine? Or is training actually broken?

They changed the denominator normalization of the cross-entropy loss, so a change in the numerical value of the loss is expected.

Convergence is fine, but once the loss values change, the corresponding optimal learning rate also has to be re-tuned, which is inconvenient. I'm just not sure whether this is working as designed.

@Arcmoon-Hu

Arcmoon-Hu commented Oct 21, 2024

+1. For SFT with batch size 4 and gradient accumulation 4, the initial loss is around 4.x, while with batch size 16 it is only around 1.x.

I have solved this for the Qwen models. In the latest transformers code, the Qwen model file fails to pass one argument through when computing the loss, so a few places need to be changed:
1. transformers/models/qwen2/modeling_qwen2.py
[screenshot of the original loss call]
Change it to
[screenshot of the patched loss call]
Remember to add this argument to the forward() signature as well.
2. transformers/trainer.py
[screenshot of trainer.py]
Uncomment the commented-out lines.
I ran a few quick experiments with Qwen1.5 using different batch size and gradient accumulation settings, and the initial losses now match.
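
For readers without the screenshots, here is a hedged, standalone sketch of the loss-side idea being described: the cross-entropy helper optionally takes `num_items_in_batch` and uses it as the denominator, which is what the extra argument threaded through modeling_qwen2.py enables. This paraphrases the idea, not the exact transformers implementation:

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, labels, num_items_in_batch=None, ignore_index=-100):
    # Shift so that each position predicts the next token, as in causal LM training.
    shift_logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
    shift_labels = labels[:, 1:].contiguous().view(-1)

    # Sum the per-token losses instead of averaging per micro-batch.
    loss = F.cross_entropy(
        shift_logits, shift_labels, ignore_index=ignore_index, reduction="sum"
    )

    if num_items_in_batch is not None:
        # Denominator covers the whole accumulated batch (the GA fix).
        return loss / num_items_in_batch
    # Fallback: local mean over this micro-batch's non-ignored tokens.
    return loss / (shift_labels != ignore_index).sum().clamp(min=1)
```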

@hiyouga hiyouga added bug Something isn't working good first issue Good for newcomers labels Oct 21, 2024
@JianbangZ
Author

Why did the original author comment that out in the first place? And does every model now need a corresponding change to pass the loss argument through?

@JianbangZ
Author

A transformers PR has already resolved this issue; see huggingface/transformers#34263 for details.

@hiyouga hiyouga reopened this Oct 22, 2024
@hiyouga
Owner

hiyouga commented Oct 22, 2024

Let's keep this open for now; I'll take a closer look later.

@thusinh1969

thusinh1969 commented Oct 25, 2024

Let's keep this open for now; I'll take a closer look later.

I have re-installed:
pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (version check forcibly ignored).
enable_liger_kernel: true

The loss is still way too high, nearly 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks,
Steve

@techkang

This should have been fixed here:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3635-L3636
Previously, the earlier code only added the `num_items_in_batch` interface for the GA-bug fix without actually calling it, while this line multiplied the loss by the number of GA steps unconditionally, so the loss became larger. Now an `if` check ensures the loss is multiplied by the GA steps only when the GA-bug fix interface is actually enabled.
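
As a hedged paraphrase of what the linked lines do (simplified and restructured as a standalone function, not the exact upstream code), the guard can be thought of like this, where `model_accepts_loss_kwargs`, `compute_loss_func`, and `gradient_accumulation_steps` mirror the Trainer attributes discussed in this thread:

```python
import torch


def scale_loss_for_backward(
    loss: torch.Tensor,
    gradient_accumulation_steps: int,
    model_accepts_loss_kwargs: bool,
    compute_loss_func=None,
) -> torch.Tensor:
    """Undo the per-step division by GA steps only when the loss was already
    normalized by the token count of the full accumulated batch."""
    if model_accepts_loss_kwargs or compute_loss_func is not None:
        # GA fix is active: `num_items_in_batch` was used as the denominator,
        # so re-scale here to compensate for the per-step division applied
        # during accumulation.
        return loss * gradient_accumulation_steps
    # Fix not active: the loss is a per-micro-batch mean; leave it alone so the
    # usual averaging across GA steps still applies. Before the fix, the
    # multiplication above was unconditional, which inflated the loss by the
    # GA factor for models that did not accept loss kwargs.
    return loss
```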

@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working good first issue Good for newcomers pending This problem is yet to be addressed labels Oct 29, 2024
@JianbangZ
Author

Let's keep this open for now; I'll take a closer look later.

I have re-installed: pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (version check forcibly ignored). enable_liger_kernel: true

The loss is still way too high, nearly 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks, Steve

In your forward() function, add a `loss_kwargs` parameter to the signature, and the loss value will become 'normal' again. Alternatively, after installing transformers, go to trainer.py and hardcode `model_accepts_loss_kwargs` to True.
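
A hedged sketch of that workaround (a toy module, not a real transformers model): having `**loss_kwargs` in forward() is what lets the Trainer's `model_accepts_loss_kwargs` detection pass `num_items_in_batch` through, after which the loss scale returns to the old range:

```python
import torch
import torch.nn as nn


class ToyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 32, hidden: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, labels=None, **loss_kwargs):
        # The presence of `loss_kwargs` in this signature is what the
        # Trainer's `model_accepts_loss_kwargs` check looks for.
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            num_items = loss_kwargs.get("num_items_in_batch")
            token_loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum"
            )
            # Normalize by the total token count of the accumulated batch when
            # the Trainer provides it; otherwise fall back to a local mean.
            loss = token_loss / (num_items if num_items is not None else labels.numel())
        return {"loss": loss, "logits": logits}
```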
