
[Feature] Flash Attention In-app Feature Option Request #2460

Open
DKNTZMN opened this issue Jun 22, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@DKNTZMN

DKNTZMN commented Jun 22, 2024

May I request an option to enable flash attention in the UI?
The current model is producing nonsense output and requires flash attention to run correctly.

@DKNTZMN DKNTZMN added the enhancement New feature or request label Jun 22, 2024
@ThiloteE
Collaborator

ThiloteE commented Jul 9, 2024

llama.cpp related PRs:

Flash attention can substantially speed up inference on supported backends. The following screenshot is from a test in one of the PRs that add flash attention support to llama.cpp:
[screenshot: benchmark from a llama.cpp flash attention PR]
Not only does it improve tokens per second, it also substantially reduces the size of the compute buffer at large context sizes, as the following screenshots demonstrate:
[screenshots: compute buffer size at large context sizes]

I think having flash attention available as a preference in the settings, for backends that support it, would be great.
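For reference, here is a minimal sketch of what such a preference could map to on the llama.cpp side, assuming a recent llama.cpp build whose `llama_context_params` exposes a `flash_attn` flag (field names can differ between versions; this is illustrative, not the actual GPT4All integration):

```cpp
// Minimal sketch: forwarding a UI "flash attention" preference to llama.cpp.
// Assumes a recent llama.cpp where llama_context_params has a `flash_attn`
// field; the helper below is hypothetical.
#include "llama.h"

llama_context * create_context(llama_model * model, bool enable_flash_attn) {
    llama_context_params params = llama_context_default_params();
    params.flash_attn = enable_flash_attn;  // driven by the settings toggle
    return llama_new_context_with_model(model, params);
}
```

llama.cpp's own CLI tools expose a similar switch (the `--flash-attn` flag, if I remember correctly), so a settings checkbox would essentially just forward that boolean to the backend when a context is created.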
