Add support for flash attention #3282
Comments
There's #778 - it didn't get merged since it didn't seem to provide an advantage though.
I am interested in what benchmarks are used to check whether flash attention improves the current implementation.
I was testing with single-batch inference back at the time.
I think flash attention will be effective when the load consists of more prefill than decode. Parallel loads may satisfy this.
Superseded by #3365
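To put the prefill-vs-decode comment above in rough numbers: the main thing flash attention avoids is materializing the full n_q x n_kv score matrix, which is large during prefill (many query tokens at once) and tiny during single-token decode. A back-of-the-envelope sketch, assuming a hypothetical 4096-token context, 32 heads, and fp32 scores (illustrative values, not llama.cpp measurements):

```python
# Rough size of the per-head n_q x n_kv attention score matrix that a naive
# implementation materializes and flash attention avoids.
# Context length, head count, and fp32 scores are illustrative assumptions,
# not llama.cpp measurements.

def score_matrix_bytes(n_q, n_kv, n_heads, bytes_per_elem=4):
    return n_q * n_kv * n_heads * bytes_per_elem

ctx, heads = 4096, 32

prefill = score_matrix_bytes(ctx, ctx, heads)  # whole prompt at once: n_q = n_kv = ctx
decode = score_matrix_bytes(1, ctx, heads)     # one new token: n_q = 1

print(f"prefill scores: {prefill / 2**20:8.1f} MiB")  # 2048.0 MiB
print(f"decode  scores: {decode / 2**20:8.3f} MiB")   # 0.500 MiB
```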
Prerequisites
Expected Behavior
The ggml core library should use flash attention (v1 or v2), at least for the NVIDIA runtime.
Refs:
https://github.com/Dao-AILab/flash-attention
https://tridao.me/publications/flash2/flash2.pdf
#2257
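For context, the core idea in the papers linked above is to compute attention block by block with an online softmax, so the full score matrix never has to be written out. Below is a minimal single-head NumPy sketch of that recurrence, not ggml code; the `flash_attention` name, block size, and shapes are illustrative assumptions:

```python
# Minimal NumPy sketch of the FlashAttention online-softmax recurrence
# (single head, no masking, no dropout). Illustrative only; the function
# name and block size are not part of ggml or llama.cpp.
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Compute softmax(Q K^T / sqrt(d)) V one key/value block at a time,
    keeping a running max and running sum so the full n_q x n_kv score
    matrix is never materialized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)  # running max of scores per query row
    row_sum = np.zeros(n)          # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # scores for this key block only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale previous partial results
        P = np.exp(S - new_max[:, None])        # unnormalized probabilities for this block

        out = out * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the naive implementation
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((256, 32))
V = rng.standard_normal((256, 32))
S = (Q @ K.T) / np.sqrt(32)
naive = np.exp(S - S.max(axis=1, keepdims=True))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), naive)
```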
Current Behavior
The ggml core library does not currently use flash attention.
Environment and Context
Operating System, e.g. for Linux:
Any recent Linux with recent NVIDIA drivers/GPUs.
SDK version, e.g. for Linux:
CUDA 11 or 12, no preference.