[Inference] Add cutlass gemm dequant op #8909
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #8909      +/-   ##
===========================================
+ Coverage    53.58%   53.76%   +0.18%
===========================================
  Files          652      652
  Lines       105169   104513     -656
===========================================
- Hits         56354    56193     -161
+ Misses       48815    48320     -495

View full report in Codecov by Sentry.
Review thread on paddlenlp/experimental/transformers/fused_transformer_layers.py (resolved):
LGTM
* change gpu name
* add cutlass gemm_dequant op
* add cutlass gemm_dequant op
* fix format
* fix fused_transformer_layers
* fix layer
* fix layer
* fix format
* fix format
* fix review
* fix review
* fix review
* fix review
* fix review
* fix review
* fix review
PR types
New features
PR changes
Add new cutlass op
Description
Add cutlass gemm dequant op
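For context, a fused gemm-dequant op performs the int8 GEMM with int32 accumulation and applies the dequantization scale in the GEMM epilogue, rather than launching a separate elementwise kernel afterward. The following NumPy sketch shows only the intended semantics, not the CUTLASS kernel; all names here are illustrative, not the op's actual API:

```python
import numpy as np

# Reference semantics (a sketch, not the CUTLASS kernel): an int8 GEMM
# accumulates in int32, then the dequant step rescales to floating point
# in the same logical pass instead of a separate elementwise kernel.
def gemm_dequant_ref(a_int8, b_int8, scale):
    """a_int8: [M, K] int8, b_int8: [K, N] int8, scale: [N] float32."""
    acc = a_int8.astype(np.int32) @ b_int8.astype(np.int32)  # int32 accumulation
    return acc.astype(np.float32) * scale  # per-output-channel dequantization

rng = np.random.default_rng(0)
a = rng.integers(-128, 127, size=(4, 8), dtype=np.int8)
b = rng.integers(-128, 127, size=(8, 3), dtype=np.int8)
scale = rng.random(3, dtype=np.float32)
out = gemm_dequant_ref(a, b, scale)
assert out.shape == (4, 3)
```

Fusing the dequant into the epilogue avoids materializing and re-reading the int32 intermediate, which is where the latency win in the numbers below comes from.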
Parameters: --decode_strategy greedy_search --mode dynamic --quant_type a8w8 --inference_model 1 --batch_size 2 --src_length 128 --max_length 256
Output with block attention:
Output without block attention: (without this PR, the second output contains garbled characters)
Test configuration: L20 GPU, batch_size 2, block attention
gemm dequant, unfused:
gemm dequant, fused:
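The garbled output mentioned above is the failure mode that an end-to-end equivalence check catches early. A minimal NumPy sketch (a stand-in for the a8w8 path, not the actual CUDA kernels; the symmetric per-tensor/per-channel quantization scheme here is an assumption for illustration) comparing the int8 GEMM + dequant result against a float baseline:

```python
import numpy as np

rng = np.random.default_rng(42)
M, K, N = 4, 64, 8

# Float "ground truth" activations/weights and a symmetric a8w8 quantization.
x = rng.standard_normal((M, K)).astype(np.float32)
w = rng.standard_normal((K, N)).astype(np.float32)
x_scale = np.abs(x).max() / 127.0                  # per-tensor activation scale
w_scale = np.abs(w).max(axis=0) / 127.0            # per-output-channel weight scale
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
w_q = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

# a8w8 path: int8 GEMM with int32 accumulation, dequant applied to the result.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
y_quant = acc.astype(np.float32) * (x_scale * w_scale)

# Float baseline: quantization error should be small, not garbage.
y_ref = x @ w
rel_err = np.abs(y_quant - y_ref).max() / np.abs(y_ref).max()
assert rel_err < 0.05
```

A check like this distinguishes genuine quantization noise (small relative error) from a broken kernel or mismatched scales (large, structured error that surfaces as garbled tokens).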
3. Tried appending a dequant after qkv_out, but it raised an error.
Details here:
https://ku.baidu-int.com/knowledge/HFVrC7hq1Q/pKzJfZczuc/TK3hw_mluo/1-4J_hgwU8mmJN