
With the latest transformers library (and the matching trl) that fixes the serious multi-GPU gradient accumulation bug, the DPO training loss becomes several times larger than before #5747

Closed
1 task done
JianbangZ opened this issue Oct 18, 2024 · 12 comments · Fixed by #5852
Labels
solved This problem has been already solved

Comments

@JianbangZ

Reminder

  • I have read the README and searched the existing issues.

System Info

8XH100

Reproduction

After updating to the latest transformers & trl from the master branch, the DPO training loss went from roughly 1.0 → 0.3 before the update to roughly 9 → 3 now.
See huggingface/transformers#34191 for details.

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 18, 2024
@ElementQi

Is the trl library here the main branch?

@JianbangZ
Author

JianbangZ commented Oct 19, 2024 via email

@Arcmoon-Hu

+1. For SFT with batch size 4 and gradient accumulation 4, the initial loss is around 4.x, while with batch size 16 it is only around 1.x.

@aliencaocao
Contributor

If the model still converges normally, isn't that fine? Or is training actually broken?

They changed the denominator normalization of the cross-entropy loss, so a change in the numerical value of the loss is expected.
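
To make the normalization point concrete, here is a small illustrative sketch (invented numbers, not library code) of why averaging each micro-batch separately and dividing once by the total token count give different loss values when micro-batches contain different numbers of target tokens:

```python
# Illustrative only: two gradient-accumulation micro-batches with different
# numbers of target tokens and different per-token losses.
import torch

micro_batch_token_losses = [
    torch.full((4,), 4.0),   # 4 tokens, per-token loss 4.0
    torch.full((16,), 1.0),  # 16 tokens, per-token loss 1.0
]

# Old behaviour: mean per micro-batch, then average across the GA steps.
per_step_means = [t.mean() for t in micro_batch_token_losses]
old_style = sum(per_step_means) / len(per_step_means)            # (4.0 + 1.0) / 2 = 2.5

# Fixed behaviour: sum all token losses and divide once by the total token
# count (this is what passing `num_items_in_batch` achieves).
total_tokens = sum(t.numel() for t in micro_batch_token_losses)
new_style = sum(t.sum() for t in micro_batch_token_losses) / total_tokens  # 32 / 20 = 1.6

print(old_style.item(), new_style.item())
```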

@JianbangZ
Author

If the model still converges normally, isn't that fine? Or is training actually broken?

They changed the denominator normalization of the cross-entropy loss, so a change in the numerical value of the loss is expected.

Convergence is fine, but once the loss values change, the corresponding optimal learning rate also has to be re-tuned, which is inconvenient. I'm just not sure whether this is working as designed.

@Arcmoon-Hu

Arcmoon-Hu commented Oct 21, 2024

+1. For SFT with batch size 4 and gradient accumulation 4, the initial loss is around 4.x, while with batch size 16 it is only around 1.x.

I have solved this for the Qwen models. In the latest transformers code, the Qwen model file fails to pass one argument through when computing the loss, so a few places need to be changed:
1. transformers/models/qwen2/modeling_qwen2.py
[screenshot of the original loss call]
Change it to
[screenshot of the patched loss call]
Remember to add this argument to the forward() signature as well.
2. transformers/trainer.py
[screenshot of trainer.py]
Uncomment the commented-out lines.
I ran a few quick experiments with Qwen1.5 using different batch size and gradient accumulation settings, and the initial losses now match.
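
For readers without the screenshots, here is a hedged, standalone sketch of the loss-side idea being described: the cross-entropy helper optionally takes `num_items_in_batch` and uses it as the denominator, which is what the extra argument threaded through modeling_qwen2.py enables. This paraphrases the idea, not the exact transformers implementation:

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, labels, num_items_in_batch=None, ignore_index=-100):
    # Shift so that each position predicts the next token, as in causal LM training.
    shift_logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
    shift_labels = labels[:, 1:].contiguous().view(-1)

    # Sum the per-token losses instead of averaging per micro-batch.
    loss = F.cross_entropy(
        shift_logits, shift_labels, ignore_index=ignore_index, reduction="sum"
    )

    if num_items_in_batch is not None:
        # Denominator covers the whole accumulated batch (the GA fix).
        return loss / num_items_in_batch
    # Fallback: local mean over this micro-batch's non-ignored tokens.
    return loss / (shift_labels != ignore_index).sum().clamp(min=1)
```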

@hiyouga hiyouga added bug Something isn't working good first issue Good for newcomers labels Oct 21, 2024
@JianbangZ
Author

Why did the original author comment that out in the first place? And does every model now need a corresponding change to pass the loss argument through?

@JianbangZ
Author

A transformers PR has already resolved this issue; see huggingface/transformers#34263 for details.

@hiyouga hiyouga reopened this Oct 22, 2024
@hiyouga
Owner

hiyouga commented Oct 22, 2024

Let's keep this open for now; I'll take a closer look later.

@thusinh1969

thusinh1969 commented Oct 25, 2024

Let's keep this open for now; I'll take a closer look later.

I have re-installed:
pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (version check forcibly ignored).
enable_liger_kernel: true

The loss is still way too high, nearly 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks,
Steve

@techkang

This should have been fixed here:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3635-L3636
Previously, the earlier code only added the `num_items_in_batch` interface for the GA-bug fix without actually calling it, while this line multiplied the loss by the number of GA steps unconditionally, so the loss became larger. Now an `if` check ensures the loss is multiplied by the GA steps only when the GA-bug fix interface is actually enabled.
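
As a hedged paraphrase of what the linked lines do (simplified and restructured as a standalone function, not the exact upstream code), the guard can be thought of like this, where `model_accepts_loss_kwargs`, `compute_loss_func`, and `gradient_accumulation_steps` mirror the Trainer attributes discussed in this thread:

```python
import torch


def scale_loss_for_backward(
    loss: torch.Tensor,
    gradient_accumulation_steps: int,
    model_accepts_loss_kwargs: bool,
    compute_loss_func=None,
) -> torch.Tensor:
    """Undo the per-step division by GA steps only when the loss was already
    normalized by the token count of the full accumulated batch."""
    if model_accepts_loss_kwargs or compute_loss_func is not None:
        # GA fix is active: `num_items_in_batch` was used as the denominator,
        # so re-scale here to compensate for the per-step division applied
        # during accumulation.
        return loss * gradient_accumulation_steps
    # Fix not active: the loss is a per-micro-batch mean; leave it alone so the
    # usual averaging across GA steps still applies. Before the fix, the
    # multiplication above was unconditional, which inflated the loss by the
    # GA factor for models that did not accept loss kwargs.
    return loss
```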

@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working good first issue Good for newcomers pending This problem is yet to be addressed labels Oct 29, 2024
@JianbangZ
Author

Let's keep this open for now; I'll take a closer look later.

I have re-installed: pip install git+https://github.com/huggingface/transformers

Fine-tuning Qwen2-VL on LLaMA-Factory with LoRA and gradient accumulation on 8x H100 NVL (version check forcibly ignored). enable_liger_kernel: true

The loss is still way too high, nearly 10 times higher than before, although the eval loss is as small as before!

Do you know why?

Thanks, Steve

In your forward() function, add a `loss_kwargs` parameter to the signature, and the loss value will become 'normal' again. Alternatively, after installing transformers, go to trainer.py and hardcode `model_accepts_loss_kwargs` to True.
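
A hedged sketch of that workaround (a toy module, not a real transformers model): having `**loss_kwargs` in forward() is what lets the Trainer's `model_accepts_loss_kwargs` detection pass `num_items_in_batch` through, after which the loss scale returns to the old range:

```python
import torch
import torch.nn as nn


class ToyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 32, hidden: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, labels=None, **loss_kwargs):
        # The presence of `loss_kwargs` in this signature is what the
        # Trainer's `model_accepts_loss_kwargs` check looks for.
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            num_items = loss_kwargs.get("num_items_in_batch")
            token_loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum"
            )
            # Normalize by the total token count of the accumulated batch when
            # the Trainer provides it; otherwise fall back to a local mean.
            loss = token_loss / (num_items if num_items is not None else labels.numel())
        return {"loss": loss, "logits": logits}
```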
