Flash attention can substantially speed up inference on supported backends. The following screenshot is from a test in one of the PRs that add flash attention support to llama.cpp:
Not only does it improve t/s, it also substantially reduces the size of the compute buffer at large context sizes, as the following screenshots demonstrate:
I think having flash attention available as a preference in the settings, for backends that support it, would be great.
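If the backend wraps llama.cpp, the preference would presumably just toggle the flash attention flag when the model context is created. A minimal sketch, assuming the app uses llama-cpp-python (which exposes a `flash_attn` argument on the `Llama` constructor); the model path and other values here are placeholders, not this app's actual config:

```python
from llama_cpp import Llama  # llama-cpp-python binding, assumed backend

# Hypothetical mapping of the proposed UI preference onto the backend call.
llm = Llama(
    model_path="models/model.gguf",  # placeholder path
    n_ctx=8192,        # large contexts are where the compute-buffer savings show up
    n_gpu_layers=-1,   # offload to a backend that supports flash attention
    flash_attn=True,   # the setting this issue asks to expose in the UI
)
```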
May I request an option to enable flash attention in the UI?
The current model is producing nonsense and requires flash attention to run correctly.