Add support for CUDNN attention via CUDNN_FRONTEND Python API? #1123

Skylion007 · 2024-10-10T15:19:18Z

🚀 Feature

SDPA already supports cuDNN attention, and it can be faster in some scenarios. To keep parity and iterate faster on speed improvements, it may be better to add a backend here via https://github.com/NVIDIA/cudnn-frontend, which also has a Python API for interacting with it. This would also let folks try cuDNN attention improvements without waiting for updated binaries to land in PT or updating their PT version to the absolute latest one.
Speed benchmarks available here: https://github.com/NVIDIA/cudnn-frontend/tree/main/benchmark
Advantages: it supports custom masks for attnBias, unlike Flash Attention, and is much faster than CutlassF or even Flash Attention.

Alternative: leave the current behavior and accept that SDPA could be faster than xformers in some circumstances.

Flagging @eqy in case they are interested in helping out.