[XPU] update xdnn adamw_v2 #63108
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
❌ The PR is not created using the PR template. You can refer to this Demo.
LGTM
```diff
@@ -386,22 +379,23 @@ void AdamwDenseKernelKL3(const Context& dev_ctx,
         reinterpret_cast<XPUType*>(dev_ctx.template Alloc<T>(param_out)),
         master_in_data,
         master_out_data,
-        param.numel());
+        param.numel(),
+        round_bf16_output);
     PADDLE_ENFORCE_XDNN_SUCCESS(r, "adamw_v2");
   }
   if (!use_global_beta_pow) {
```
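For context, a minimal sketch of how the `round_bf16_output` flag shown in the hunk above could be derived from the environment variable this PR introduces. The helper name and parsing rule are assumptions for illustration, not the actual Paddle code:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: reads XPU_PADDLE_ADAMW_ROUND_BF16_OUTPUT and treats
// the value "1" as enabled. The real lookup in Paddle may differ.
static bool GetRoundBf16OutputFlag() {
  const char* v = std::getenv("XPU_PADDLE_ADAMW_ROUND_BF16_OUTPUT");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```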
Is `use_global_beta_pow` true now? As far as I can see, llama7b also needs the scale.
No, `use_global_beta_pow` is false. If PR #48626 gets reverted, `beta1_pow` and `beta2_pow` will live on the CPU, and, just as on GPU, those two scalars will be updated on the CPU. We need to run a real model to check whether that PR can actually be reverted.
Tested on a model: PR #48626 needs to be reverted so that `beta1_pow` and `beta2_pow` are placed on the CPU. The new `adamw_v2` can then read these scalars directly for the computation.
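A minimal standalone sketch of the host-side update being discussed, assuming `beta1_pow`/`beta2_pow` are kept on the CPU; names and signature are illustrative, not the actual kernel code:

```cpp
#include <cstdio>

// Sketch (assumed names): with use_global_beta_pow == false and the
// beta-pow accumulators on the CPU, each optimizer step advances them
// as plain host scalars, mirroring the GPU path.
void UpdateBetaPowOnCpu(float beta1, float beta2,
                        const float* beta1_pow_in, const float* beta2_pow_in,
                        float* beta1_pow_out, float* beta2_pow_out) {
  beta1_pow_out[0] = beta1 * beta1_pow_in[0];  // beta1^(t+1)
  beta2_pow_out[0] = beta2 * beta2_pow_in[0];  // beta2^(t+1)
}

int main() {
  float b1p = 0.9f, b2p = 0.999f;  // beta1^1, beta2^1 after the first step
  UpdateBetaPowOnCpu(0.9f, 0.999f, &b1p, &b2p, &b1p, &b2p);
  std::printf("beta1_pow=%f beta2_pow=%f\n", b1p, b2p);
  return 0;
}
```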
Sorry to inform you that 13de79f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
PR Category
Operator Mechanism
PR Types
Performance
Description
Changes:

- Modified the `adamw_v2` function to support passing `beta1_pow` and `beta2_pow` in as scalars rather than as pointers into XPU device memory.
- Added the environment variable `XPU_PADDLE_ADAMW_ROUND_BF16_OUTPUT` to enable the `round_bf16_output` feature. When the data type is bfloat16, this makes `adamw_v2` write its results with round-to-nearest-even instead of the traditional approach of simply truncating the trailing 16 bits (see the sketch after this list).
- This PR has essentially no effect on speed: although it removes two `scale` operations from the `adamw` operator, the measured runtime is on par. It is still worthwhile given that it deletes the "XPU-only special handling" from the optimizer; in principle, model-composition code should not contain anything that is "special-cased for XPU only".
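For illustration, a minimal standalone sketch of the two bfloat16 conversion strategies described above. This is not the xdnn implementation, just the standard bit manipulation (NaN handling omitted for brevity):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// fp32 -> bf16 by truncation: simply drop the low 16 bits of the mantissa.
static uint16_t Fp32ToBf16Truncate(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// fp32 -> bf16 by round-to-nearest-even: add a rounding bias that carries
// into bit 16 when the dropped tail is above half, and on an exact tie
// rounds toward an even low bit of the result.
static uint16_t Fp32ToBf16RoundNearestEven(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  uint32_t lsb = (bits >> 16) & 1u;  // current low bit of the bf16 result
  uint32_t bias = 0x7FFFu + lsb;
  return static_cast<uint16_t>((bits + bias) >> 16);
}

int main() {
  // 1.01171875 lies exactly halfway between the bf16 values 1.0078125
  // (0x3F81) and 1.015625 (0x3F82); truncation keeps the odd 0x3F81,
  // round-to-nearest-even rounds up to the even 0x3F82.
  float x = 1.01171875f;
  std::printf("truncate: 0x%04x  rne: 0x%04x\n",
              Fp32ToBf16Truncate(x), Fp32ToBf16RoundNearestEven(x));
  return 0;
}
```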