Releases: NVIDIA/cudnn-frontend
v1.2.1
v1.2.1 release:
[Bug Fix] cudnn-frontend pip wheels will now dlopen the fully versioned library (libcudnn.so.8 or libcudnn.so.9) first, before trying to load libcudnn.so. This means the pip wheels in the RUN_PATH will be prioritized over system paths (the default behavior of dlopen). This can be overridden by setting LD_LIBRARY_PATH. Source installations will now automatically look for cudnn in site-packages before the system path.
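The load order described in this fix can be sketched as follows. This is a minimal illustration, not the wheel's actual loader: `load_first`, `cudnn_candidates`, and the injected `loader` callback are hypothetical names, and trying `.so.9` before `.so.8` is an assumption. In a real program the callback would wrap `dlopen(name, RTLD_NOW)` from `<dlfcn.h>`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: try each candidate library name in order and return
// the first handle that the loader reports as successfully opened.
// `loader` stands in for a wrapper around dlopen(name, RTLD_NOW); it is
// injected so the ordering logic can be exercised without cuDNN installed.
inline void* load_first(const std::vector<std::string>& candidates,
                        const std::function<void*(const char*)>& loader,
                        std::string* loaded_name = nullptr) {
    for (const auto& name : candidates) {
        if (void* handle = loader(name.c_str())) {
            if (loaded_name) *loaded_name = name;
            return handle;
        }
    }
    return nullptr;  // none of the candidates could be loaded
}

// Fully versioned names first, the unversioned name last.
// The relative order of .so.9 and .so.8 is an assumption for illustration.
inline std::vector<std::string> cudnn_candidates() {
    return {"libcudnn.so.9", "libcudnn.so.8", "libcudnn.so"};
}
```

Because the unversioned name is tried last, a versioned library found through the wheel's run path wins over a generic system-wide libcudnn.so.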
[Documentation] Fixed the google-colab links in the jupyter notebooks.
[Documentation] Added a jupyter notebook sample, 00_introduction.ipynb, that goes over the basics of the cudnn FE graph API.
v1.2.0
[New artifacts] Pre-built (alpha version) pip-installable wheels for Linux will be made available as part of this release. The pip wheels are compatible with Python 3.8 through 3.12. The source builds will continue to work as expected.
[Documentation] We are updating our contribution policy and will be accepting small PRs aimed at improving cudnn-frontend. For the full contribution guide, refer to our contribution policy.
[API updates] [Python] The graph.execute function in Python now takes an optional handle. This lets the user provide a custom handle to the execute function (and achieves parity with the C++ API).
[API updates] Pointwise ops can now take scalars directly as an argument. This simplifies the graph creation process in general. For example:
auto C = graph.pointwise(A,
graph.tensor(5.0f),
fe::graph::Pointwise_attributes()
.set_mode(fe::PointwiseMode_t::ADD)
.set_compute_data_type(fe::DataType_t::FLOAT));
[Installation] Addresses RFE #64 to provide installation via cmake install.
[Installation] Addresses RFE #63 to provide custom installation of catch2. If catch2 is not found, cudnn frontend fetches it automatically from the upstream github repository.
[Logging] Improved logging to print legible tensor names. We will be working on further improvements in future releases to make the logging more streamlined.
[Samples] Add a sample for showcasing auto-tuning to select the best plan among the ones returned from heuristics.
[Samples] As part of the v1.2 release, we have created new Jupyter notebooks showcasing the Python API usage. At this point, these will work on A100 and H100 cards only, as mentioned in the notebooks. With future releases, we plan to simplify the installation process and elaborate on the API usage. Please refer to the samples/python directory.
[Bug fixes] Fixed issues related to auto-tuning where plan 0 was always executed, even though a different plan was chosen as the best candidate.
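The selection step behind this fix can be illustrated with a small sketch. It is not the library's autotuning code: `best_plan_index` is a hypothetical helper that takes one measured time per candidate plan and returns the index that should be executed afterwards, rather than defaulting to index 0.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: given one measured execution time per candidate plan,
// return the index of the fastest plan, or -1 if there are no candidates.
inline int64_t best_plan_index(const std::vector<float>& times_ms) {
    if (times_ms.empty()) return -1;
    auto best = std::min_element(times_ms.begin(), times_ms.end());
    return static_cast<int64_t>(std::distance(times_ms.begin(), best));
}
```

The returned index is what a caller would then pass to a plan-indexed execute call.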
[Unit Tests] We are adding some unit tests which will provide a way for developers to test parts of their code before submitting pull requests. Adding unit tests and samples before submitting a pull request is highly encouraged.
Note on source installation of python bindings:
On Ubuntu 22.04 and other Debian-based systems, when installing without a virtual environment, set the environment variable DEB_PYTHON_INSTALL_LAYOUT=deb_system. See related issue.
v1.1.2
v1.1.1
v1.1.0
[New API] A new overloaded variant of execute has been added, which allows the variant pack to be specified as pairs of "uid, device pointer". To use this, the user is expected to provide the uid for each tensor created.
error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
std::unordered_map<int64_t, void*>& tensor_to_pointer_map, void *workspace) const;
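A minimal sketch of assembling such a uid-keyed variant pack is shown below. The uid constants and `make_variant_pack` are hypothetical; in real code the uids are whatever the user assigned to the tensors when building the graph, and the pointers are device pointers (plain host pointers work as stand-ins here because the map only stores addresses).

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical uids, chosen by the user when the tensors were created.
constexpr int64_t X_UID = 1;
constexpr int64_t W_UID = 2;
constexpr int64_t Y_UID = 3;

// Build the "uid -> pointer" map that the new execute overload accepts.
inline std::unordered_map<int64_t, void*> make_variant_pack(void* x, void* w, void* y) {
    return {{X_UID, x}, {W_UID, w}, {Y_UID, y}};
}
```

Keying by uid means the caller does not need to keep the tensor objects around just to run the graph.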
[New API] Serialization: The Graph class now supports serialization and deserialization after the final plan is built. Serialization is only supported on runtime-compiled engines in the cuDNN backend as of today, but may be extended to other engines in the future. Deserialization requires a cuDNN handle created for a GPU identical to the one the original graph/plan was created with. New samples showcasing this have been added in samples/cpp/serialization.cpp
error_t
cudnn_frontend::graph::Graph::serialize(std::vector<uint8_t>& data) const;
error_t
cudnn_frontend::graph::Graph::deserialize(cudnnHandle_t handle,
std::vector<uint8_t> const& data);
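Since serialize fills a `std::vector<uint8_t>`, one possible use is persisting that buffer to disk and reading it back before calling deserialize. The sketch below covers only the file round trip; `write_blob` and `read_blob` are hypothetical helpers, not part of cudnn-frontend.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: write a serialized byte buffer to a binary file.
inline bool write_blob(const std::string& path, const std::vector<uint8_t>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}

// Hypothetical helper: read the whole file back into a byte buffer.
inline bool read_blob(const std::string& path, std::vector<uint8_t>& data) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    data.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}
```

The bytes read back would then be handed to deserialize together with a handle for a matching GPU, as the note above requires.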
[New API] Autotuning: If the graph allows multiple engine configs for a given topology, each of these can now be built and executed in parallel. The expected flow is that the user queries the number of plans present and spawns a new thread for each plan to be finalized in parallel. The set of APIs to support this is as follows:
int64_t
Graph::get_execution_plan_count() const;
error_t
Graph::build_plan_at_index(cudnnHandle_t const &handle, int64_t index);
error_t
Graph::execute_plan_at_index(cudnnHandle_t const &handle,
std::unordered_map<int64_t, void*>& ,
void* workspace,
int64_t plan_index) const;
int64_t
Graph::get_workspace_size_plan_at_index(int64_t plan_index) const;
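The one-thread-per-plan flow described above could look roughly like the sketch below. `MockGraph` is a hypothetical stand-in so the example runs without cuDNN: its `build_plan_at_index` takes no handle and returns `bool`, whereas the real call takes a `cudnnHandle_t` and returns `error_t`.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a graph exposing plan-indexed build calls.
struct MockGraph {
    int64_t plan_count;
    std::vector<int> built;  // built[i] == 1 once plan i has been finalized

    explicit MockGraph(int64_t n) : plan_count(n), built(n, 0) {}
    int64_t get_execution_plan_count() const { return plan_count; }
    bool build_plan_at_index(int64_t index) {
        built[index] = 1;  // each thread touches a distinct element
        return true;
    }
};

// Query the plan count, then finalize every plan on its own thread.
// Returns one status flag per plan (1 = built successfully).
template <typename Graph>
std::vector<char> build_all_plans_in_parallel(Graph& graph) {
    const int64_t count = graph.get_execution_plan_count();
    std::vector<char> ok(count, 0);
    std::vector<std::thread> workers;
    workers.reserve(count);
    for (int64_t i = 0; i < count; ++i) {
        workers.emplace_back([&graph, &ok, i] {
            ok[i] = graph.build_plan_at_index(i) ? 1 : 0;
        });
    }
    for (auto& worker : workers) worker.join();
    return ok;
}
```

Status flags are stored in a `std::vector<char>` rather than `std::vector<bool>` so that concurrent writes to different indices do not share storage.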
[New feature] sdpa_node now allows ragged offset to be set in the input and output tensors.
[Bug Fix] Certain parts of the FE code used to throw exceptions even with the DISABLE_EXCEPTION flag set. This has been cleaned up.
[Bug Fix] For the sdpa node, cudnn now correctly returns NOT_SUPPORTED when s_q is not a multiple of 64, padding mask is on, and the cudnn version is less than 9.0.0.
[Bug Fix] For the sdpa backward node, cudnn now correctly returns NOT_SUPPORTED when s_q is less than 64 and the cudnn version is less than 9.0.0.
[Bug Fix] Fixed an issue with pointwise Modulo operation.
[Bug Fix] Fixed an issue in sdpa node, where the intermediate data types were wrong.
[Samples] Added a sample to showcase matmul with int8 and FP8 precisions.
[Cleanup] Python samples have moved from samples/python to tests/python_fe.
[Cleanup] Removed the cudnn_frontend::throw_if function.
v1.0.3 release
[Bug fix] Fixed an issue where in some cases with padding, SDPA backward node can produce NaNs.
[Bug fix] In some older CUDA toolkits, e.g. CUDA 11.4, float to half conversion is not implicit. This was raised in PR-57. Thanks @drisspg for reporting this. A more explicit fix using __float2half has been implemented in this patch.
[Enhancement] Accepting github PR-55. Thanks @r-barnes for the suggestion.
v1.0.2 release
v1.0.2
[Cleanup] Removed the cudnn_backend.h dependency, since the correct header is already included in cudnn.h.
v1.0.1 release
v1.0.1
[Bug Fix] Fixed an issue in the sdpa node when kv-sequence length is not a multiple of 64 and padding mask is not enabled. This allows graphs with kv-sequence length not a multiple of 64 to be executed on cudnn version 8.9.5 onwards. cudnn versions prior to this now correctly return NOT_SUPPORTED as expected.
[Bug Fix] Fixed an issue where creation of graph object leads to compilation error in some compilers.
[Bug Fix] cudnn frontend now correctly sets the stream on the handle. This affected only the python bindings.
[Internal change] Streamlined includes of cudnn graph API header files into cudnn_frontend.h.
v1.0.0 release
cudnn_frontend v1.0 release introduces a new API aimed at simplifying graph construction.
[New API] In the FE v1.0 API, users can describe multiple operations that form a subgraph through a cudnn_frontend::graph::Graph object. Unlike the FE v0.x API, users don't need to worry about specifying shapes and sizes of the intermediate virtual tensors. See README.FE.1.0.md for more details. For more information on historical 1.0 changes, pre-release release notes are here.
The Graph class consists of three types of API:
- APIs that return a reference to the graph itself. This is necessary for chaining. These can be used for setting the global properties of the graph. Example,
graph.set_compute_data_type(...).set_io_data_type(...);
- APIs that return a shared pointer to the tensor. These are required to denote entry tensors or output of nodes which can be exit points of graph or inputs to other nodes. Example,
X = graph.tensor(...);
W = graph.tensor(...);
Y = graph.conv_fprop(X, W, Conv_fprop_attributes(...));
- APIs that return an error type, which is a combination of an error code and an error message. These APIs generally mutate the graph object or are responsible for calling the cudnn backend API. Example,
auto error = graph.validate();
auto error = graph.build_operation_graph(handle);
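The three return-type conventions listed above can be mimicked in a toy class. Everything in this sketch (`MiniGraph`, `Tensor`, `Error`, the string-typed data types) is hypothetical and exists only to show the pattern. It is not the cudnn-frontend implementation.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error type: an error code plus a human-readable message.
struct Error {
    int code;
    std::string message;
    Error(int c = 0, std::string m = "") : code(c), message(std::move(m)) {}
    bool is_good() const { return code == 0; }
};

struct Tensor {
    std::string name;
};

class MiniGraph {
public:
    // Kind 1: return a reference to the graph so calls can be chained.
    MiniGraph& set_io_data_type(const std::string& t) { io_type_ = t; return *this; }
    MiniGraph& set_compute_data_type(const std::string& t) { compute_type_ = t; return *this; }

    // Kind 2: return a shared pointer to a tensor that can feed other nodes.
    std::shared_ptr<Tensor> tensor(const std::string& name) {
        auto t = std::make_shared<Tensor>(Tensor{name});
        tensors_.push_back(t);
        return t;
    }

    // Kind 3: return an error object combining a code and a message.
    Error validate() const {
        if (io_type_.empty() || compute_type_.empty()) return {1, "data types not set"};
        if (tensors_.empty()) return {2, "graph has no tensors"};
        return {};
    }

private:
    std::string io_type_;
    std::string compute_type_;
    std::vector<std::shared_ptr<Tensor>> tensors_;
};
```

Returning `*this` from the setters is what makes `g.set_io_data_type(...).set_compute_data_type(...)` possible, while the error object lets the caller inspect a message instead of only a status code.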
[New Feature] Python bindings for the FE 1.0 API. See the Python API section in README.md for building the python bindings. Details of the python API and its keyword arguments are in README.FE.1.0.md. Python API samples are in samples/python/*.py
[New Feature] Added a compound SDPA op (both forward and back prop). More details in docs/operations/Attention.md
[New Feature] Better error reporting, where in addition to error codes, we also provide error messages which provide more information on specific cause of failure.
[Deprecation] The v0.x APIs are now labelled deprecated and may be removed in v2.0. Consider moving to the v1.0 API. If there are issues or missing features, please create a github issue.
Changes over pre-release-5:
[New Feature] Scaled_Dot_Product_Attention op now supports GQA in Fprop and bprop.
[Breaking change] Output dim and strides of SDPA fprop and bprop outputs are now mandatory, since the inference of output shapes is non-deterministic.
[Samples] Added samples to showcase,
- INT8 convolution ("Conv with Int8 datatypes")
- Mixed precision multiplication ("Mixed Precision Matmul")
- Simple Convolutions, MatMuls and Matmuls with simple epilogues (matmuls.cpp, wgrads.cpp, dgrads.cpp)
[Update] The default value of cudnnNanPropagation_t has been set to CUDNN_PROPAGATE_NAN instead of CUDNN_NOT_PROPAGATE_NAN.
[Update] Added SDPA as a convenience typedef for scaled_dot_product_flash_attention.
Miscellaneous updates to v0.x API and the legacy samples:
[Bug fix] Some tests were failing on Ampere GPUs because no plans with 0 size were available. This has been fixed.
[Bug fix] Median of three sampling was incorrectly sorting the results, when cudnnFind was used. This has been fixed.
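For reference, median-of-three sampling takes three timing measurements and keeps the middle one, which discards a single outlier in either direction. The sketch below shows only that idea: `median_of_three` is a hypothetical helper, not the function that was fixed.

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper: return the median of three timing samples
// (for example, three measured run times in milliseconds).
inline float median_of_three(float a, float b, float c) {
    std::array<float, 3> samples{{a, b, c}};
    std::sort(samples.begin(), samples.end());  // ascending order
    return samples[1];                          // middle element is the median
}
```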
[Bug fix] Thanks to @Riottomsk for pointing out the bug in the port count of Pointwise mode POW in his PR (#49). This fix has been incorporated.
[Bug fix] Fixed a bug in the resample backprop operation, where CUDNN_ATTR_OPERATION_RESAMPLE_BWD_XDESC and CUDNN_ATTR_OPERATION_RESAMPLE_BWD_YDESC were not set correctly.
[Feature] Layer Norm API has been added and can be used with the v0.x API.
cudnn FE 1.0 pre-release-5
Pre-release-5 release notes:
[API change] Based on user feedback, we have removed distinction between the graph and plan objects. With the new API, plan remains embedded in the graph and all operations are performed on the graph object.
Previously,
REQUIRE(graph.validate().is_good());
REQUIRE(graph.build_operation_graph(handle).is_good());
auto plans = graph.get_execution_plan_list({fe::HeurMode_t::A});
REQUIRE(plans.check_support(handle).is_good());
REQUIRE(graph.set_execution_plans(plans).is_good());
Now,
REQUIRE(graph.validate().is_good());
REQUIRE(graph.build_operation_graph(handle).is_good());
REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());
REQUIRE(graph.check_support(handle).is_good());
REQUIRE(graph.build_plans(handle).is_good());
Also, with this change the following new APIs have been introduced on the graph class.
error_t
build_plans(cudnnHandle_t const &handle,
BuildPlanPolicy_t const policy = BuildPlanPolicy_t::HEURISTICS_CHOICE,
bool const do_multithreaded_builds = false);
Graph & deselect_workspace_greater_than(int64_t const workspace);
Graph & deselect_behavior_notes(std::vector<BehaviorNote_t> const &notes);
Graph & deselect_numeric_notes(std::vector<NumericalNote_t> const &notes);
int64_t get_workspace_size() const;
int64_t get_autotune_workspace_size() const;
error_t autotune(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
void *workspace,
void *user_impl = nullptr);
[API change] Removes the implicit validate call made in build_operation_graph. Now, the expectation is that the user explicitly calls validate on the graph before calling build_operation_graph. This helps the user distinguish between errors from malformed graphs and errors occurring due to lowering into cudnn.
[API change] Return error codes from the graph API have now been marked nodiscard.
[New API] Added graph::key() -> int64_t as an API that returns a hash of the graph object. This can be used as a key for graph caching. An example of this usage is shown in the samples.
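A cache keyed by such a hash could be organized as in the sketch below. `GraphCache` and `get_or_build` are hypothetical names, and the key is passed in as a plain `int64_t` so the example runs without cuDNN. In real code it would come from calling key() on the described graph.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical cache: maps a graph key to an already-built graph object.
template <typename Graph>
class GraphCache {
public:
    // Return the cached graph for `key`; on a miss, call `build` once,
    // store the result under `key`, and return it.
    std::shared_ptr<Graph> get_or_build(
        int64_t key, const std::function<std::shared_ptr<Graph>()>& build) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        auto graph = build();
        cache_.emplace(key, graph);
        return graph;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::unordered_map<int64_t, std::shared_ptr<Graph>> cache_;
};
```

On a cache hit the expensive build step is skipped entirely, which is the point of having a cheap hash available before plans are built.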
[New API] Added new python APIs create_handle, destroy_handle, set_stream, and get_stream to allow custom handle and stream management on the graph object.
[New functionality] sdpa backward can now compute dbias if the fprop had a bias operation. This functionality was added in cudnn 8.9.6.
[Enhancement] The behavior of CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT has been extended. This is documented in docs/operation/Attention.md
[Enhancement] Have added better error checks to make sure all the tensors of the node have been created. This prevents unexpected segmentation faults seen earlier.
[Bug Fix] Fix issues in instancenorm, which had caused invalid memory access earlier.
[Enhancement] Moved the v0.9 API samples to the samples/legacy_samples folder for better organization.