Add support for CUDNN attention via CUDNN_FRONTEND Python API? #1123

Open
Skylion007 opened this issue Oct 10, 2024 · 0 comments


Skylion007 commented Oct 10, 2024

🚀 Feature

Motivation

Pitch

  • SDPA supports this, and it can be faster in some scenarios. To keep parity and iterate faster on speed improvements, it may be better to add a backend here via https://github.com/NVIDIA/cudnn-frontend, which also has a Python API for interacting with it. This would also let folks try CUDNN attention improvements without waiting for updated binaries to land in PyTorch or updating their PyTorch version to the absolute latest one.
  • Speed benchmarks available here: https://github.com/NVIDIA/cudnn-frontend/tree/main/benchmark
  • Advantages: it supports custom masks for attnBias, unlike Flash Attention, and is much faster than CutlassF or even Flash Attention.
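
To make the pitch concrete, here is a rough sketch of how an optional backend could be gated on whether the cudnn-frontend Python bindings are installed. It is not xformers code: the backend names and both helper functions are hypothetical, and it assumes the bindings are importable as a top-level module named `cudnn`. The fallback order only reflects the points above (custom masks rule out Flash Attention, leaving CutlassF).

```python
import importlib.util
from typing import Optional


def cudnn_frontend_available() -> bool:
    """Return True if the cudnn-frontend Python bindings appear importable.

    Assumption: the bindings are exposed as a top-level module named `cudnn`.
    """
    return importlib.util.find_spec("cudnn") is not None


def pick_backend(needs_custom_mask: bool, has_cudnn: Optional[bool] = None) -> str:
    """Pick a (hypothetical) attention backend name.

    Prefer the cuDNN backend when its bindings are present. Otherwise fall
    back to the CUTLASS kernel when a custom mask is required, since Flash
    Attention does not accept one, and to Flash Attention in all other cases.
    """
    if has_cudnn is None:
        # Probe the environment only when the caller did not say.
        has_cudnn = cudnn_frontend_available()
    if has_cudnn:
        return "cudnn_frontend"
    if needs_custom_mask:
        return "cutlass"
    return "flash"


if __name__ == "__main__":
    print(pick_backend(needs_custom_mask=True))
```

Because the check happens at import-discovery time, users could pick up a newer cudnn-frontend release with a package upgrade alone, which is the main point of the request.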

Alternatives

  • Leave the current behavior and accept that SDPA could be faster than xformers in some circumstances.

Additional context

Flagging @eqy in case they are interested in helping out.
