Releases: NVIDIA/cudnn-frontend
v1.9.0 release
cudnn frontend v1.9 release notes
New API
Enhancements to the flash attention API
- SDPA_attributes and SDPA_bprop_attributes now accept a score_mod function through the set_score_mod and set_score_mod_bprop APIs. The function accepts a custom chain of pointwise operations which operate on the Attention Score Matrix. Some common functors, such as causal mask, sliding window mask, and soft capping, have been added to the headers as a reference. More usage examples have been added to the samples for fprop and bprop. (A sketch follows after this list.)
- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added support for bottom right causal masking combined with sliding window mask.
- Added new parameters set_max_total_seq_len_q/set_max_total_seq_len_kv on the sdpa bprop node. These help reduce the workspace size required when running with THD format.
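As a rough illustration, here is a minimal sketch of a soft-capping score_mod. The exact callback signature is an assumption based on the description above (a function that receives the graph and the score tensor and returns the modified tensor), and the cap value is illustrative; see the fprop/bprop samples for the definitive usage.

```cpp
#include <memory>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Hypothetical soft-capping score_mod: cap * tanh(score / cap).
// Assumed signature: takes the graph and the score tensor, returns the
// modified score tensor.
static auto soft_cap = [](std::shared_ptr<fe::graph::Graph> graph,
                          std::shared_ptr<fe::graph::Tensor_attributes> score) {
    auto cap    = graph->tensor(30.0f);  // illustrative cap value
    auto scaled = graph->pointwise(
        score, cap,
        fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::DIV));
    auto capped = graph->pointwise(
        scaled,
        fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::TANH_FWD));
    return graph->pointwise(
        capped, cap,
        fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
};

void configure(fe::graph::SDPA_attributes &attrs) {
    attrs.set_score_mod(soft_cap);  // assumed setter signature
}
```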
Improvements
- Allow creation of serialized json for dgrad, wgrad, and resample operations.
- Added a more detailed diagnostic message for when the compiled version of cudnn does not match the run-time version of cudnn.
Bug fixes
- Fixed an issue where log messages contained unparseable data at the end.
- Fixed an issue where building the python pip wheel would hang.
- Fixed natively creating cuda graphs for SDPA with alibi masks.
New samples
- Added a new sample for LayerNorm with dynamic shapes and a kernel cache, showcasing the reduced plan build time when the kernel cache is used. (A sketch of the kernel cache flow follows below.)
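A minimal sketch of that flow, assuming the set_dynamic_shape_enabled and set_kernel_cache setters seen in the dynamic-shape samples:

```cpp
#include <memory>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void build_with_cache() {
    // One cache shared across graphs that differ only in shape, so
    // later builds can reuse already compiled kernels.
    auto kernel_cache = std::make_shared<fe::KernelCache>();

    fe::graph::Graph graph;
    graph.set_dynamic_shape_enabled(true)   // assumed setter names,
         .set_kernel_cache(kernel_cache);   // taken from the samples

    // ... define the layernorm ops, then validate and build as usual;
    // plan build time drops for subsequent shapes thanks to the cache.
}
```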
v1.8.0 release
cudnn frontend v1.8 release:
New API
Paged Attention API
SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. SDPA_attributes now accepts set_paged_attention_k_table and set_paged_attention_v_table to input these descriptors. Please refer to the samples for usage: cpp samples, python samples. See the docs for more API details. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph Library and the Paged Attention paper.
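A minimal sketch of wiring the page table descriptors into SDPA_attributes; tensor dimensions, strides, and names are illustrative assumptions, not prescribed by the API:

```cpp
#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void setup_paged_attention() {
    int64_t const b = 4;           // batch size (illustrative)
    int64_t const num_blocks = 32; // pages per sequence (illustrative)

    fe::graph::Graph graph;

    // Page tables map logical sequence positions to pages of the
    // non-contiguous K/V caches.
    auto page_table_k = graph.tensor(fe::graph::Tensor_attributes()
                                         .set_name("page_table_k")
                                         .set_dim({b, 1, num_blocks, 1})
                                         .set_stride({num_blocks, num_blocks, 1, 1})
                                         .set_data_type(fe::DataType_t::INT32));
    auto page_table_v = graph.tensor(fe::graph::Tensor_attributes()
                                         .set_name("page_table_v")
                                         .set_dim({b, 1, num_blocks, 1})
                                         .set_stride({num_blocks, num_blocks, 1, 1})
                                         .set_data_type(fe::DataType_t::INT32));

    fe::graph::SDPA_attributes sdpa_attributes;
    sdpa_attributes.set_paged_attention_k_table(page_table_k)
                   .set_paged_attention_v_table(page_table_v);
}
```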
cuda Graph API
cudnn frontend now allows users to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:
- populate_cuda_graph: adds the cudnn nodes to the empty cuda_graph provided as input.
- update_cuda_graph: updates the populated cuda graph with the necessary data pointers.
See docs and backend documentation for more details.
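A minimal sketch of the intended two-step flow; the exact parameter lists are assumptions (handle, a UID-to-pointer variant pack, workspace, and the CUDA graph), so consult the docs above for the definitive signatures:

```cpp
#include <cstdint>
#include <unordered_map>
#include <cuda_runtime.h>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Assumes `graph` is fully built and `variant_pack` maps tensor UIDs
// to device pointers.
void run_with_cuda_graph(fe::graph::Graph &graph,
                         cudnnHandle_t handle,
                         std::unordered_map<int64_t, void *> &variant_pack,
                         void *workspace) {
    cudaGraph_t cuda_graph;
    cudaGraphCreate(&cuda_graph, 0);

    // Step 1: add the cudnn nodes to the empty CUDA graph.
    graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);

    // Step 2: when data pointers change (e.g. a new batch), patch the
    // already-populated CUDA graph in place instead of rebuilding it.
    graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);

    // Instantiate with cudaGraphInstantiate and launch with
    // cudaGraphLaunch as usual.
}
```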
Enhancements
- Kernel cache for dynamic shapes is now supported in python. Added a sample to showcase its usage.
- graph.deselect_engines(...) now has a python equivalent through pybind11.
- graph.tensor(...) can now accept int64_t scalars directly. (Previously limited to int32_t, float, and fp16 data types; a one-line sketch follows after this list.)
- fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for broadcast operations). For non-unary operands, the broadcast tensor can now be either IN_0 or IN_1.
- SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.
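A tiny illustration of the int64_t scalar overload described above; the value and surrounding context are illustrative:

```cpp
#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void make_scalar(fe::graph::Graph &graph) {
    // int64_t scalars can now be handed to graph.tensor(...) directly,
    // alongside the previously supported int32_t / float / fp16 scalars.
    auto shift = graph.tensor(static_cast<int64_t>(42));
    (void)shift;
}
```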
Bug fixes
- Fixed an issue while querying cudnnGetLastErrorString() from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.
cudnn FE 1.7.0 Release
cudnn FE 1.7.0 Release notes:
New API
- Kernel cache support for dynamic graphs: Added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to the documentation for API details. Added the examples Convolution fprop dynamic shape, CSBR Graph dynamic shape, Matmul dynamic shape, and Bias + Matmul dynamic shape to showcase the use of dynamic shapes and the kernel cache.
- Two new APIs that describe the plan in the form of its engine number and knobs are introduced:
error_t get_plan_name(std::string &name) const;
error_t get_plan_name_at_index(int64_t plan_index, std::string &name) const;
Note: this name can be used later with deselect_plan_by_name if you run into any potential errors.
- Added an API to query tensor attributes from a tensor's UID in a graph:
query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;
(A sketch of both APIs follows below.)
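A minimal sketch of the two APIs, assuming both live on the graph object; error handling is elided and the UID value is illustrative:

```cpp
#include <string>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Assumes `graph` has already gone through build_operation_graph and
// plan creation.
void inspect(fe::graph::Graph &graph) {
    // Query the name (engine number and knobs) of the selected plan;
    // the same string can later be passed to deselect_plan_by_name.
    std::string plan_name;
    graph.get_plan_name(plan_name);

    // Look up a tensor's attributes from its UID.
    fe::graph::Tensor_attributes tensor;
    graph.query_tensor_with_uid(1, tensor);
}
```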
Improvements
- sdpa fp16 bprop node can now compute dbias when padding mask is enabled (requires cudnn 9.4.0 and above).
- sdpa fp8 (forward and bprop) nodes now support optional bias, dropout, and padding mask (requires cudnn 9.4.0 and above).
- Matmul fp8 node can now accept M, N, K overrides.
- Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cuDNN.
- Updated benchmark numbers with cudnn 9.4.0 for fp16 and fp8 datatypes.
- Fixed compilation issues when NV_CUDNN_DISABLE_EXCEPTION is enabled.
Bug fixes
- Fixed a crash when the output dimension of the dgrad node is not specified. This now returns an error message instead.
- Fixed incorrect SDPA stats stride inferencing.
- Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key sequence length (s_kv). This case is not supported.
cudnn FE 1.6.1 release
Bug fixes
- Fixed an issue where a custom dropout mask was not correctly applied.
- Added -fvisibility=hidden for the generated pip wheels to avoid symbol conflicts with other modules that use cudnn frontend.
- Fixed an issue in the sdpa operation which, when deserialized, would lead to numerical mismatches.
- Fixed an issue in the sdpa fp8 fprop operation (in inference mode).
Samples
- Added a new sample to showcase how a custom dropout mask can be applied to an sdpa operation.
- Added a sample to showcase convolutions on large (c * d * h * w > 2 ** 31) tensors.
v1.6.0 release
Release notes:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED. (A short sketch follows below.)
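A minimal sketch of the SM carveout property; the SM count is illustrative:

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void carve_out(fe::graph::Graph &graph) {
    // Restrict kernels built from this graph to 16 SMs; engines that
    // cannot honor SM_COUNT report NOT_SUPPORTED at plan-build time.
    graph.set_sm_count(16);
}
```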
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if it is called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
v1.5.2 release
[Enhancement] Allows a stride value of 0, indicating repetition of the tensor in those dimensions. (A short sketch follows below.)
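A minimal sketch of the stride-0 convention; shapes and names are illustrative:

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void broadcast_bias(fe::graph::Graph &graph) {
    // A per-channel bias logically expanded to [N, C, H, W]: a stride
    // of 0 in a dimension means the tensor repeats along it instead of
    // being materialized in memory.
    auto bias = graph.tensor(fe::graph::Tensor_attributes()
                                 .set_name("bias")
                                 .set_dim({8, 32, 16, 16})
                                 .set_stride({0, 1, 0, 0})
                                 .set_data_type(fe::DataType_t::FLOAT));
    (void)bias;
}
```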
v1.5.1 release
[Bug fix] Fixed an issue where cudnn-frontend (1.5.0), when built with cudnn version 9.1.1 and below, runs into issues when run with 9.2.0 and above.
v1.5.0 release
[New feature] With cudnn backend 9.2.0 and above, Graph::check_support can determine support for runtime engines without invoking the nvrtc compiler. This allows users to check the support surface of cudnn without nvrtc compilation.
[New feature] The python pip wheel now contains the necessary c++ development headers.
[New feature] Sliding window attention is now supported as an attribute on the sdpa forward and bprop nodes. Usage: sdpa_attributes.set_sliding_window_length(window_length). (A sketch follows at the end of this section.)
[New feature] Bottom right aligned causal masking is now supported as an attribute on the sdpa forward and bprop nodes. Usage: sdpa_attributes.use_causal_mask_bottom_right(true)
[New feature] SDPA bprop attributes can choose a deterministic algorithm through the use_deterministic_algorithm API.
[New feature] Allow users to filter candidate execution plans of a graph by their shared memory usage in cudnn 9.2.0 and later.
[Bug fix] Fixed a runtime error that occurred when the chosen execution plan candidate was incorrectly set in the backend. This would happen when check_support did not correctly filter by workspace size.
[Bug fix] Selecting/deselecting by behavior and numerical notes has been fixed and now works as intended.
[Debugging] A new tool for easy reproduction of a failure using the json representation of the graph can be found here.
[Samples] Restructured the cpp samples into categories for easier navigation.
[Samples] Added a sample to showcase how different plans can be built in parallel in separate threads.
[Compilation enhancement] Added a new macro CUDNN_FRONTEND_SKIP_NLOHMANN_JSON as a compilation flag to remove nlohmann::json as a compilation dependency. Users lose access to certain API functions, such as print, key, serialize, and deserialize, that depend on the library.
[Enhancement] Serialization of the resample operation is now supported.
[Enhancement] A bug template has been added for new github issues.
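A minimal sketch combining the two masking attributes above; the window length and the chaining style are illustrative:

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void configure_masking(fe::graph::SDPA_attributes &sdpa_attributes) {
    // Attend only to the 128 most recent keys, with causal masking
    // aligned to the bottom right (useful when s_q != s_kv).
    sdpa_attributes.set_sliding_window_length(128)
                   .use_causal_mask_bottom_right(true);
}
```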
v1.4.0 release
[New] Added a benchmark folder which contains a sample docker file to compare the cudnn implementation of sdpa with the pytorch implementation.
[Enhancement] Once an engine is deselected by name, it will not be built as part of check_support.
[Enhancement] The cudnn backend search order for the wheels is as follows: (a) dlopen libcudnn.so.MAJOR_VERSION in the site packages; (b) try to dlopen the unversioned libcudnn.so. This way the pypi cudnn package nvidia-cudnn-cu* gets priority over the default search path.
[Enhancement] Allow embedding dimension up to 256 (previously limited to 128) in the sdpa fprop operation.
[Bug fix] Updated the scale and bias shapes in the batch norm sample.
v1.3.0 release
[New API] Added new operations sdpa_fp8_forward and sdpa_fp8_backward to perform scaled dot product attention on fp8 tensors. See more details in docs/operations/Attention.md and the cpp sample in samples/cpp/mha.cpp. Pybinds for the fp8 nodes have also been added.
[New API] Added a new operation for resample forward. Added a new sample, samples/cpp/resample.cpp, to show its usage.
[New API] Added a new API deselect_engines(std::vector<std::string> const &engine_names) which blocks certain engine configs from running.
[New API] Added new APIs select_numeric_notes and select_behavior_notes to let users select engine configs which have the chosen numeric and behavior notes, respectively. (A sketch of the engine filtering APIs follows at the end of these notes.)
[Python API] Added a custom exception cudnnGraphNotSupportedException to the python API to distinguish graphs that are actually not supported from programming errors.
[Python API] Added a new backend_version_string which returns the backend version in canonical form (e.g. 9.1.0) instead of a version number.
[Bug Fix] Fixed issues with compilation on clang19 and the c++20 standard.
[Bug Fix] Updated the workspace computation for the sdpa fprop node. Previously, workspace was allocated for alibi slopes irrespective of whether the alibi mask was turned on.
[Bug Fix] Fixed deserialization of fused scalars.
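A minimal sketch of the engine filtering APIs named above; the engine name and the note enum values are illustrative assumptions:

```cpp
#include <string>
#include <vector>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void filter_engines(fe::graph::Graph &graph) {
    // Block specific engine configs from running (name is illustrative).
    graph.deselect_engines({"eng4_k1=2_k2=3"});

    // Keep only engine configs that carry the chosen notes (the enum
    // values shown are assumptions).
    graph.select_numeric_notes({fe::NumericalNote_t::TENSOR_CORE});
    graph.select_behavior_notes({fe::BehaviorNote_t::RUNTIME_COMPILATION});
}
```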