[bloom] Add kv cache support for flash attention & fix bugs #7735
Conversation
Thanks for your contribution!
Codecov Report

Attention:

```diff
@@           Coverage Diff            @@
##           develop    #7735   +/-  ##
========================================
  Coverage    57.29%   57.30%
========================================
  Files          584      584
  Lines        87646    87628    -18
========================================
- Hits         50219    50215     -4
+ Misses       37427    37413    -14
```

☔ View full report in Codecov by Sentry.
LGTM
LGTM
```diff
@@ -17,10 +17,10 @@

 def bloom_postprocess_past_key_value(past_key_values):
     # (layer_num, bs, head_num/tensor_parallel_degree, prefixlen, head_dim)*2
-    past_key_values = paddle.transpose(past_key_values, perm=[2, 0, 3, 1, 4]).split(2)
+    keys, values = paddle.transpose(past_key_values, perm=[2, 0, 1, 3, 4]).split(2)
```
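For readers following the layout change: the new perm keeps the sequence axis ahead of the head axis, which matches the `(bs, seq_len, head_num, head_dim)` ordering flash attention kernels expect. Below is a minimal sketch of the shape transformation; the incoming cache layout `(bs, prefix_len, layer_num * 2, head_num, head_dim)` and all sizes are assumptions for illustration, not taken from the diff:

```python
import paddle

# Illustrative sizes only: 2 layers, batch 1, prefix length 4, 8 heads, head_dim 64.
layer_num, bs, prefix_len, head_num, head_dim = 2, 1, 4, 8, 64

# Assumed incoming layout: (bs, prefix_len, layer_num * 2, head_num, head_dim).
past_key_values = paddle.randn([bs, prefix_len, layer_num * 2, head_num, head_dim])

# perm=[2, 0, 1, 3, 4] -> (layer_num * 2, bs, prefix_len, head_num, head_dim);
# split(2) along axis 0 then separates the stacked keys and values.
keys, values = paddle.transpose(past_key_values, perm=[2, 0, 1, 3, 4]).split(2)

print(keys.shape)    # [2, 1, 4, 8, 64]
print(values.shape)  # [2, 1, 4, 8, 64]
```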
@lugimzzz please take a look at this part. We previously trained ptuning with bloom and the accuracy was aligned. Will this adjustment affect the currently aligned version, and does the inference side need to change accordingly?
As long as training has been tested it is fine; confirmed that inference is not affected.
```diff
@@ -3,6 +3,7 @@ inference-predict:
   mode: dynamic
   max_length: 40
   batch_size: 2
+  use_flash_attention: false
```
Should this config be set to true?
I wrote a separate unit test and added the `use_flash_attention: true` configuration there.
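The PR's actual test lives in the repo; purely to illustrate the kind of parity check being described, a sketch might look like the following. The shapes, tolerances, and the naive reference implementation are assumptions, and running the fused path requires a GPU Paddle build with flash attention support:

```python
import unittest

import numpy as np
import paddle
import paddle.nn.functional as F


def naive_attention(q, k, v):
    """Reference softmax attention over (bs, seq_len, head_num, head_dim) inputs."""
    qt, kt, vt = (paddle.transpose(x, [0, 2, 1, 3]) for x in (q, k, v))
    scores = paddle.matmul(qt, kt, transpose_y=True) * (qt.shape[-1] ** -0.5)
    out = paddle.matmul(F.softmax(scores, axis=-1), vt)
    return paddle.transpose(out, [0, 2, 1, 3])


class FlashAttentionParityTest(unittest.TestCase):
    def test_flash_matches_naive(self):
        paddle.seed(42)
        shape = [2, 16, 8, 64]  # (bs, seq_len, head_num, head_dim)
        q, k, v = (paddle.randn(shape, dtype="float16") for _ in range(3))

        # Fused kernel path (dispatches to flash attention on supported GPUs).
        flash_out = F.scaled_dot_product_attention(q, k, v)
        ref_out = naive_attention(q, k, v)

        np.testing.assert_allclose(
            flash_out.astype("float32").numpy(),
            ref_out.astype("float32").numpy(),
            rtol=1e-3,
            atol=1e-3,
        )
```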
LGTM
* Add kv cache support for flash attention
* Update chatglm flash attention version check
* Add test for flash attention
* Fix unitest bug
* Add flash attention to predictor
* Add flash attention2
* Add flash attention unitests
* fix prefix decoder
* remove unused comments
* Update unitest
* Update unitest
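To make the headline change above concrete: kv-cache support means past keys/values are concatenated along the sequence axis before the fused attention call, so each decode step only projects its new token. The helper below is a hypothetical sketch of that pattern, not the PR's actual API:

```python
import paddle
import paddle.nn.functional as F


def attend_with_cache(q, k, v, cache_k=None, cache_v=None):
    # All tensors use the (bs, seq_len, head_num, head_dim) layout that
    # the fused flash attention kernel expects.
    if cache_k is not None:
        # Prepend the cached keys/values along the sequence axis.
        k = paddle.concat([cache_k, k], axis=1)
        v = paddle.concat([cache_v, v], axis=1)
    out = F.scaled_dot_product_attention(q, k, v)
    # Return the grown cache so the next decode step can reuse it.
    return out, k, v
```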
PR types
PR changes
Description
TODO: