Flash attention can substantially speed up inference on supported backends. The following screenshot is from a test in one of the PRs that add flash attention support to llama.cpp:
Not only does it improve t/s, it also substantially reduces the size of the compute buffer at large context sizes, as the following screenshots demonstrate:
I think having flash attention available as a preference in the settings, for backends that support it, would be great.
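If the backend wraps llama.cpp, the preference would presumably just toggle the flash attention flag when the model context is created. A minimal sketch, assuming the app uses llama-cpp-python (which exposes a `flash_attn` argument on the `Llama` constructor); the model path and other values here are placeholders, not this app's actual config:

```python
from llama_cpp import Llama  # llama-cpp-python binding, assumed backend

# Hypothetical mapping of the proposed UI preference onto the backend call.
llm = Llama(
    model_path="models/model.gguf",  # placeholder path
    n_ctx=8192,        # large contexts are where the compute-buffer savings show up
    n_gpu_layers=-1,   # offload to a backend that supports flash attention
    flash_attn=True,   # the setting this issue asks to expose in the UI
)
```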
May I request an option to enable flash attention in the UI?
The current model is producing nonsense and requires flash attention to run correctly.