Demo usage of Flash Attention #778
Conversation
Please merge this because it's amazing on x86 with longer context. I tried generating 1500 tokens with the 7B model (--ignore-eos -c 2048 -n 1500). On the master branch the generation took 1385 seconds; on the flash-attn branch it took 200 seconds.
Edit: Will run some more tests just to make sure this isn't coincidental for my machine.
Alright, #775 clearly contributed to the results I got. I pulled master again with #775 already merged and now I'm getting:
Exactly the same result as with flash attention. Just for reference this is what I got previously:
I guess FA needs more testing.
Wow, that's quite a dramatic change nonetheless! I guess some systems were hit way harder than others by the V transpose on every token.
I couldn't find a measurable difference between this and master on a 9900k.
Yeah, no noticeable difference on a Ryzen 2600 either. But it would be interesting if this can go somewhere.
There's a good chance that the CPU is more bottlenecked by compute than the GPU, and that the original implementation already prefetches cache lines.
Would this implementation also work on GPUs? Has anyone tested how well it performs there?
@ggerganov Thanks. What about merging it but disabling it by default, so people can at least try it on master via a CLI arg?
How do you do the benchmarking?
Is this the correct implementation? I think the effect on the GPU is good because it uses shared memory, which has higher bandwidth. On the CPU, should the block data be kept temporarily in registers to get higher bandwidth?
A faster Metal implementation: https://github.com/philipturner/metal-flash-attention cc: @philipturner
@ggerganov I now see ggml_flash_attn_ext/back ... in recent ggml, and it is already used in llama.cpp (if flash_attn is true), so should this PR now be closed? Thanks.
I just got pinged for this PR. Does LLaMA.cpp even exist anymore? It was a thing like 1.5 years ago.
This is my understanding of how Flash Attention works based on this picture:
ref: https://github.com/HazyResearch/flash-attention
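A minimal scalar sketch of the idea in that picture (one query row, one head; this is just an illustration of the streaming-softmax trick, not the ggml kernel, and all names here are made up for the example):

```c
// Streaming ("online") softmax attention for a single query row.
// The row of n_kv scores is never materialized: only a running maximum,
// a running softmax denominator and a running weighted sum of V are kept.
#include <float.h>
#include <math.h>

// q: [d], k: [n_kv][d] row-major, v: [n_kv][d] row-major, out: [d]
static void attn_row_streaming(const float * q, const float * k, const float * v,
                               float * out, int n_kv, int d) {
    const float scale = 1.0f/sqrtf((float) d);

    float m = -FLT_MAX; // running maximum of the scores (numerical stability)
    float l = 0.0f;     // running softmax denominator

    for (int i = 0; i < d; ++i) out[i] = 0.0f;

    for (int j = 0; j < n_kv; ++j) {
        // score of the current key
        float s = 0.0f;
        for (int i = 0; i < d; ++i) s += q[i]*k[j*d + i];
        s *= scale;

        // rescale everything accumulated so far if a new maximum appears
        const float m_new = s > m ? s : m;
        const float alpha = expf(m - m_new); // correction for previous terms
        const float p     = expf(s - m_new); // weight of the current key

        l = l*alpha + p;
        for (int i = 0; i < d; ++i) out[i] = out[i]*alpha + p*v[j*d + i];

        m = m_new;
    }

    // final normalization: out = softmax(q*K^T/sqrt(d))*V
    for (int i = 0; i < d; ++i) out[i] /= l;
}
```

The real kernels do the same thing over blocks of keys, so that each block fits in cache (or in shared memory on a GPU), and over all heads in parallel.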
The implementation is here:
https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367
I don't plan on merging this because on M1 the performance is the same as without FA.
However, in whisper.cpp I have gained performance from using this same exact call in the Encoder: https://github.com/ggerganov/whisper.cpp/blob/0a2d1210bcb98978214bbf4e100922a413afd39d/whisper.cpp#L1482-L1508
Putting this here if someone wants to play with it or figures out how to implement sparse attention.
The idea is just to merge the ggml operators into a single op and avoid intermediate tensors.
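To illustrate what that fusion looks like at the graph level, here is a rough sketch contrasting the chain of separate ops with one fused node. It assumes the ggml API of that era (ggml_scale taking a tensor scale, ggml_flash_attn(ctx, q, k, v, masked)); shapes, permutes and helper names are approximations, not the actual llama.cpp code:

```c
#include <math.h>
#include "ggml.h"

// unfused: the KQ scores, the scaled/masked scores and the softmax are all
// materialized as intermediate tensors in the graph
static struct ggml_tensor * attn_unfused(struct ggml_context * ctx,
                                         struct ggml_tensor * Q,
                                         struct ggml_tensor * K,
                                         struct ggml_tensor * V, // assumed pre-transposed
                                         int n_past, int head_dim) {
    struct ggml_tensor * KQ        = ggml_mul_mat(ctx, K, Q);
    struct ggml_tensor * KQ_scaled = ggml_scale(ctx, KQ, ggml_new_f32(ctx, 1.0f/sqrtf((float) head_dim)));
    struct ggml_tensor * KQ_masked = ggml_diag_mask_inf(ctx, KQ_scaled, n_past);
    struct ggml_tensor * KQ_soft   = ggml_soft_max(ctx, KQ_masked);
    return ggml_mul_mat(ctx, V, KQ_soft);
}

// fused: a single op with no intermediate tensors - scaling, masking and
// softmax happen inside the kernel while streaming over K/V
static struct ggml_tensor * attn_fused(struct ggml_context * ctx,
                                       struct ggml_tensor * Q,
                                       struct ggml_tensor * K,
                                       struct ggml_tensor * V) {
    return ggml_flash_attn(ctx, Q, K, V, /*masked =*/ true);
}
```

The whisper.cpp Encoder call linked above is the same fused path with masked = false.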