add fast_rmsnorm #8680
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #8680      +/-  ##
==========================================
  Coverage    55.74%    55.74%
==========================================
  Files          623       623
  Lines        97454     97457       +3
==========================================
+ Hits         54323     54331       +8
+ Misses      43131     43126        -5

☔ View full report in Codecov by Sentry.
Please show the precision test results in the PR description.
LGTM
PR types
Performance optimization
PR changes
Others
Description
Based on fast_ln, this PR adds support for fast_rms_norm.
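For context, RMSNorm normalizes the activations by their root mean square over the last axis and applies a learned scale; the fused kernel computes this in a single pass. A minimal plain-Paddle sketch of the computation (the function name and epsilon default below are illustrative, not the PR's actual API):

```python
import paddle

def rms_norm_ref(x, weight, epsilon=1e-6):
    # Normalize by the root mean square over the last axis, then apply the learned scale.
    x_fp32 = x.astype("float32")
    variance = paddle.mean(paddle.square(x_fp32), axis=-1, keepdim=True)
    return (x_fp32 * paddle.rsqrt(variance + epsilon) * weight.astype("float32")).astype(x.dtype)
```

A fused kernel avoids the intermediate reads and writes of the plain-op version above, which is where the operator-level speedup comes from.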
Performance impact:
The rms_norm operator becomes roughly twice as fast; model throughput is as follows:
Precision impact:
The fast_ln results are unchanged before and after this change:
Concretely, the md5sum of the operator's forward and backward outputs was printed; the values are identical before and after, as shown below:
Results before the PR:
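As a sketch of what such a check might look like (illustrative only; the PR only states that the forward/backward md5sums were printed and compared, so the names and the stand-in RMSNorm below are hypothetical):

```python
import hashlib
import numpy as np
import paddle

def tensor_md5(t):
    # Hash the raw bytes of the tensor so any bit-level difference changes the digest.
    return hashlib.md5(np.ascontiguousarray(t.numpy()).tobytes()).hexdigest()

paddle.seed(2024)
x = paddle.randn([10, 4096], dtype="float32")
x.stop_gradient = False
weight = paddle.ones([4096], dtype="float32")

# Plain-op RMSNorm standing in for the kernel under test.
out = x * paddle.rsqrt(paddle.mean(paddle.square(x), axis=-1, keepdim=True) + 1e-6) * weight
out.backward(paddle.ones_like(out))

print("forward  md5:", tensor_md5(out))
print("backward md5:", tensor_md5(x.grad))
```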
fast_rms_norm and fused_rms_norm cannot be made bit-wise identical, but this does not affect convergence. Convergence was verified through TE (Transformer Engine), which itself uses fast_rms_norm; it is already known that, at bf16 precision, enabling or disabling TE does not affect convergence.
The detailed precision test results are as follows:
As can be seen, the forward and backward md5sum values do not match, so the tensor values are not exactly identical. Judging from the diff, the two sides are almost equal: for an output tensor of shape [10, 4096], print(paddle.nonzero(output1 - output2)) shows 462 differing elements, about 1.1% of the total, with differences at the 1e-4 level. The backward pass behaves the same way.
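A sketch of this kind of element-wise comparison, wrapped in a hypothetical helper (output1/output2 stand for the fused_rms_norm and fast_rms_norm outputs; the toy call at the bottom only exercises the helper):

```python
import paddle

def report_diff(output1, output2, atol=1e-4):
    # Count the elements that are not bit-identical, then check a loose tolerance.
    diff = output1 - output2
    num_diff = int(paddle.nonzero(diff).shape[0])   # number of differing elements
    total = 1
    for dim in output1.shape:
        total *= dim
    print(f"differing elements: {num_diff} / {total} ({100.0 * num_diff / total:.1f}%)")
    print("max abs diff:", float(paddle.abs(diff).max()))
    print(f"allclose(atol={atol}):", bool(paddle.allclose(output1, output2, atol=atol)))

# Toy call; in the real test output1/output2 would come from the two kernels on the same input.
a = paddle.randn([10, 4096], dtype="float32")
report_diff(a, a + paddle.full(a.shape, 1e-5))
```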
End-to-end impact:
Controlling for identical inputs and parameter initialization,
and looking only at the first loss, the absolute error is about 1e-3 and the relative error is about 1e-5.
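Note that an absolute error of about 1e-3 together with a relative error of about 1e-5 implies a first-step loss magnitude on the order of 1e2. The snippet below only illustrates that arithmetic with placeholder values, not numbers from the PR:

```python
# Placeholder first-step losses, chosen only so the errors land at the stated
# orders of magnitude (abs ~1e-3, rel ~1e-5); they are NOT values from the PR.
loss_baseline = 100.000   # hypothetical run with fused_rms_norm
loss_fast     = 100.001   # hypothetical run with fast_rms_norm

abs_err = abs(loss_fast - loss_baseline)
rel_err = abs_err / abs(loss_baseline)
print(f"abs_err={abs_err:.1e}  rel_err={rel_err:.1e}")  # -> 1.0e-03  1.0e-05
```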