Demo usage of Flash Attention #778
Conversation
Please merge this because it's amazing on x86 with longer context. I tried generating 1500 tokens with the 7B model (--ignore-eos -c 2048 -n 1500). On the master branch the generation took 1385 seconds; on the flash-attn branch it took 200 seconds.
Edit: Will run some more tests just to make sure this isn't coincidental for my machine.
Alright, #775 clearly contributed to the results I got. I pulled master again with #775 already merged and now I'm getting:
Exactly the same result as with flash attention. Just for reference this is what I got previously:
I guess FA needs more testing.
Wow, that's quite a dramatic change nonetheless! I guess some systems were hit way harder than others by the V transpose on every token.
I couldn't find a measurable difference between this and master on a 9900k.
Yeah, no noticeable difference on a Ryzen 2600 either. But it would be interesting if this can go somewhere.
There's a good chance that the CPU is more bottlenecked by compute than the GPU, and that the original implementation already prefetches cache lines.
Would this implementation also work on GPUs? Has anyone tested how well it performs there?
@ggerganov Thanks. What about merging it but disabling it by default, so people can at least try it on master via a CLI arg?
How do you do the benchmarking?
Is this the correct implementation? I think the effect on the GPU is good because it uses shared memory, which has higher bandwidth. On the CPU, should the block data be kept temporarily in registers to get higher bandwidth?
A faster Metal implementation: https://github.com/philipturner/metal-flash-attention cc: @philipturner
@ggerganov I now see ggml_flash_attn_ext/back ... in recent ggml, and it is already used in llama.cpp (if flash_attn is true), so should this PR now be closed? Thanks.
I just got pinged for this PR. Does LLaMA.cpp even exist anymore? It was a thing like 1.5 years ago.
This is my understanding of how Flash Attention works based on this picture:
ref: https://github.com/HazyResearch/flash-attention
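A minimal scalar sketch of the idea in that picture (one query row, one head; this is just an illustration of the streaming-softmax trick, not the ggml kernel, and all names here are made up for the example):

```c
// Streaming ("online") softmax attention for a single query row.
// The row of n_kv scores is never materialized: only a running maximum,
// a running softmax denominator and a running weighted sum of V are kept.
#include <float.h>
#include <math.h>

// q: [d], k: [n_kv][d] row-major, v: [n_kv][d] row-major, out: [d]
static void attn_row_streaming(const float * q, const float * k, const float * v,
                               float * out, int n_kv, int d) {
    const float scale = 1.0f/sqrtf((float) d);

    float m = -FLT_MAX; // running maximum of the scores (numerical stability)
    float l = 0.0f;     // running softmax denominator

    for (int i = 0; i < d; ++i) out[i] = 0.0f;

    for (int j = 0; j < n_kv; ++j) {
        // score of the current key
        float s = 0.0f;
        for (int i = 0; i < d; ++i) s += q[i]*k[j*d + i];
        s *= scale;

        // rescale everything accumulated so far if a new maximum appears
        const float m_new = s > m ? s : m;
        const float alpha = expf(m - m_new); // correction for previous terms
        const float p     = expf(s - m_new); // weight of the current key

        l = l*alpha + p;
        for (int i = 0; i < d; ++i) out[i] = out[i]*alpha + p*v[j*d + i];

        m = m_new;
    }

    // final normalization: out = softmax(q*K^T/sqrt(d))*V
    for (int i = 0; i < d; ++i) out[i] /= l;
}
```

The real kernels do the same thing over blocks of keys, so that each block fits in cache (or in shared memory on a GPU), and over all heads in parallel.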
The implementation is here:
https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367
I don't plan on merging this because on M1 the performance is the same as without FA.
However, in whisper.cpp I have gained performance from using this same exact call in the Encoder: https://github.com/ggerganov/whisper.cpp/blob/0a2d1210bcb98978214bbf4e100922a413afd39d/whisper.cpp#L1482-L1508
Putting this here if someone wants to play with it or figures out how to implement sparse attention.
The idea is just to merge the ggml operators into a single op and avoid intermediate tensors.
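To illustrate what that fusion looks like at the graph level, here is a rough sketch contrasting the chain of separate ops with one fused node. It assumes the ggml API of that era (ggml_scale taking a tensor scale, ggml_flash_attn(ctx, q, k, v, masked)); shapes, permutes and helper names are approximations, not the actual llama.cpp code:

```c
#include <math.h>
#include "ggml.h"

// unfused: the KQ scores, the scaled/masked scores and the softmax are all
// materialized as intermediate tensors in the graph
static struct ggml_tensor * attn_unfused(struct ggml_context * ctx,
                                         struct ggml_tensor * Q,
                                         struct ggml_tensor * K,
                                         struct ggml_tensor * V, // assumed pre-transposed
                                         int n_past, int head_dim) {
    struct ggml_tensor * KQ        = ggml_mul_mat(ctx, K, Q);
    struct ggml_tensor * KQ_scaled = ggml_scale(ctx, KQ, ggml_new_f32(ctx, 1.0f/sqrtf((float) head_dim)));
    struct ggml_tensor * KQ_masked = ggml_diag_mask_inf(ctx, KQ_scaled, n_past);
    struct ggml_tensor * KQ_soft   = ggml_soft_max(ctx, KQ_masked);
    return ggml_mul_mat(ctx, V, KQ_soft);
}

// fused: a single op with no intermediate tensors - scaling, masking and
// softmax happen inside the kernel while streaming over K/V
static struct ggml_tensor * attn_fused(struct ggml_context * ctx,
                                       struct ggml_tensor * Q,
                                       struct ggml_tensor * K,
                                       struct ggml_tensor * V) {
    return ggml_flash_attn(ctx, Q, K, V, /*masked =*/ true);
}
```

The whisper.cpp Encoder call linked above is the same fused path with masked = false.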