Add support for flash attention #3282
Comments
There's #778 - it didn't get merged since it didn't seem to provide an advantage though.
I am interested in what benchmarks are used to check whether flash attention improves the current implementation.
I was testing with single-batch inference back at the time.
I think flash attention will be effective when the load consists of more prefill than decode. Parallel loads may satisfy this.
Superseded by #3365
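To put the prefill-vs-decode comment above in rough numbers: the main thing flash attention avoids is materializing the full n_q x n_kv score matrix, which is large during prefill (many query tokens at once) and tiny during single-token decode. A back-of-the-envelope sketch, assuming a hypothetical 4096-token context, 32 heads, and fp32 scores (illustrative values, not llama.cpp measurements):

```python
# Rough size of the per-head n_q x n_kv attention score matrix that a naive
# implementation materializes and flash attention avoids.
# Context length, head count, and fp32 scores are illustrative assumptions,
# not llama.cpp measurements.

def score_matrix_bytes(n_q, n_kv, n_heads, bytes_per_elem=4):
    return n_q * n_kv * n_heads * bytes_per_elem

ctx, heads = 4096, 32

prefill = score_matrix_bytes(ctx, ctx, heads)  # whole prompt at once: n_q = n_kv = ctx
decode = score_matrix_bytes(1, ctx, heads)     # one new token: n_q = 1

print(f"prefill scores: {prefill / 2**20:8.1f} MiB")  # 2048.0 MiB
print(f"decode  scores: {decode / 2**20:8.3f} MiB")   # 0.500 MiB
```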
Prerequisites
Expected Behavior
The ggml core library should use flash attention (v1 or v2), at least for the NVIDIA runtime.
Refs:
https://github.com/Dao-AILab/flash-attention
https://tridao.me/publications/flash2/flash2.pdf
#2257
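For context, the core idea in the papers linked above is to compute attention block by block with an online softmax, so the full score matrix never has to be written out. Below is a minimal single-head NumPy sketch of that recurrence, not ggml code; the `flash_attention` name, block size, and shapes are illustrative assumptions:

```python
# Minimal NumPy sketch of the FlashAttention online-softmax recurrence
# (single head, no masking, no dropout). Illustrative only; the function
# name and block size are not part of ggml or llama.cpp.
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Compute softmax(Q K^T / sqrt(d)) V one key/value block at a time,
    keeping a running max and running sum so the full n_q x n_kv score
    matrix is never materialized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)  # running max of scores per query row
    row_sum = np.zeros(n)          # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # scores for this key block only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale previous partial results
        P = np.exp(S - new_max[:, None])        # unnormalized probabilities for this block

        out = out * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the naive implementation
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((256, 32))
V = rng.standard_normal((256, 32))
S = (Q @ K.T) / np.sqrt(32)
naive = np.exp(S - S.max(axis=1, keepdims=True))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), naive)
```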
Current Behavior
The ggml core library does not currently use flash attention.
Environment and Context
Operating System, e.g. for Linux:
Any recent Linux with recent NVIDIA drivers/GPUs.
SDK version, e.g. for Linux:
CUDA 11 or 12, no preference.