[New API] A new overloaded variant of execute has been added, which allows the variant pack to be specified as pairs of "uid, device pointer". To use this, the user is expected to provide the uids for the tensors they create. (#60)

```
error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
            std::unordered_map<int64_t, void*>& tensor_to_pointer_map, void *workspace) const;
```
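As a usage sketch (assuming the tensors were assigned uids with `set_uid(...)` at graph-build time; the uid values, pointer names, and error handling below are illustrative only):

```
// Variant pack keyed by the uids the user assigned during graph construction.
std::unordered_map<int64_t, void*> variant_pack = {
    {1, x_device_ptr},  // input tensor created with .set_uid(1)
    {2, w_device_ptr},  // filter tensor created with .set_uid(2)
    {3, y_device_ptr}   // output tensor created with .set_uid(3)
};

// Workspace sized for the currently selected plan.
void* workspace = nullptr;
cudaMalloc(&workspace, graph.get_workspace_size());

// Execute using the uid -> device pointer map.
auto status = graph.execute(handle, variant_pack, workspace);
// ... check `status` before consuming the outputs.
cudaFree(workspace);
```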

[New API] Serialization: The Graph class can now be serialized once the final plan is built. The corresponding deserialized plan requires the handle to be created on the same device the original graph was built with. Serialization is currently only supported on runtime-compiled engines; this support may be extended to other engines in the future. New samples showcasing this have been added in `samples/cpp/serialization.cpp`.

```
error_t
cudnn_frontend::graph::Graph::serialize(std::vector<uint8_t>& data) const;

error_t
cudnn_frontend::graph::Graph::deserialize(cudnnHandle_t handle,
                   std::vector<uint8_t> const& data);
```
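A sketch of the expected round trip (assuming `graph` has already been fully built, and that the deserializing process creates its handle on the same device):

```
// Serialize the finalized graph into a byte buffer.
std::vector<uint8_t> blob;
auto serialize_status = graph.serialize(blob);

// ... persist `blob`, then later reconstruct the plan from it.
cudnn_frontend::graph::Graph restored;
auto deserialize_status = restored.deserialize(handle, blob);
// `restored` can now be executed like the originally built graph.
```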

[New API] Autotuning: If the graph allows multiple engine configs for a given topology, each of these can now be built and executed in parallel. The expected flow is that the user queries the number of plans present and spawns a new thread for each plan so they are finalized in parallel. The set of APIs to support this is as follows:

```
int64_t
Graph::get_execution_plan_count() const;

error_t
Graph::build_plan_at_index(cudnnHandle_t const &handle, int64_t index);

error_t
Graph::execute_plan_at_index(cudnnHandle_t const &handle,
                         std::unordered_map<int64_t, void*>& ,
                         void* workspace,
                         int64_t plan_index) const;

int64_t
get_workspace_size_plan_at_index(int64_t plan_index) const;
```
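Put together, an autotuning flow might look roughly like the following. This is a sketch only: error checking and timing are omitted, `handle` and the uid-keyed `variant_pack` are assumed to already exist, and `get_workspace_size_plan_at_index` is assumed to be a member of `Graph` as the listing above suggests.

```
// Requires <thread> and <vector>, plus a graph whose execution plans were queried.
int64_t plan_count = graph.get_execution_plan_count();

// Build every candidate plan in parallel, one thread per plan index.
std::vector<std::thread> builders;
for (int64_t i = 0; i < plan_count; ++i) {
    builders.emplace_back([&, i] { graph.build_plan_at_index(handle, i); });
}
for (auto &t : builders) { t.join(); }

// Run each finalized plan (timing each candidate is left to the user) and keep the fastest index.
for (int64_t i = 0; i < plan_count; ++i) {
    void *workspace = nullptr;
    cudaMalloc(&workspace, graph.get_workspace_size_plan_at_index(i));
    graph.execute_plan_at_index(handle, variant_pack, workspace, i);
    cudaFree(workspace);
}
```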

[New feature] sdpa_node now allows a ragged offset to be set on the input and output tensors.
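As an illustrative sketch (assuming `graph` and the dimension variables already exist, and that the C++ `Tensor_attributes::set_ragged_offset(...)` setter mirrors the python API; the offset tensor has shape (B + 1, 1, 1, 1) and holds element offsets):

```
// Per-batch element offsets into the packed (ragged) Q buffer;
// the last entry is the past-the-end offset.
auto q_ragged_offset = graph.tensor(cudnn_frontend::graph::Tensor_attributes()
                                        .set_name("q_ragged_offset")
                                        .set_dim({b + 1, 1, 1, 1})
                                        .set_stride({1, 1, 1, 1})
                                        .set_data_type(cudnn_frontend::DataType_t::INT32));

auto q = graph.tensor(cudnn_frontend::graph::Tensor_attributes()
                          .set_name("Q")
                          .set_dim({b, h_q, s_q, d})
                          .set_stride({h_q * s_q * d, s_q * d, d, 1})
                          .set_ragged_offset(q_ragged_offset));
```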

[Bug Fix] Certain parts of the FE code used to throw exceptions even with the `DISABLE_EXCEPTION` flag set. This has been cleaned up.

[Bug Fix] For sdpa node, cudnn now correctly returns `NOT_SUPPORTED` when s_q is not a multiple of 64 and padding mask is on.

[Bug Fix] For sdpa backward node, cudnn now correctly returns `NOT_SUPPORTED` when s_q is less than 64.

[Bug Fix] Fixed an issue with pointwise Modulo operation.

[Bug Fix] Fixed an issue in sdpa node, where the intermediate data types were wrong.

[Samples] Added a sample to showcase matmul with int8 and FP8 precisions.

[Cleanup] Python samples have moved from `samples/python` to `tests/python_fe`.

[Cleanup] Removed the `cudnn_frontend::throw_if` function.
Anerudhan authored Feb 7, 2024
1 parent a86ad70 commit c29d609
Showing 62 changed files with 3,904 additions and 1,479 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,6 +1,6 @@
cmake_minimum_required(VERSION 3.17)

project(cudnn_frontend VERSION 1.0.3)
project(cudnn_frontend VERSION 1.1.0)

option(CUDNN_FRONTEND_BUILD_SAMPLES "Defines if samples are built or not." ON)
option(CUDNN_FRONTEND_BUILD_UNIT_TESTS "Defines if unittests are built or not." OFF)
70 changes: 59 additions & 11 deletions README.FE.1.0.md
@@ -9,8 +9,8 @@
6. [Miscellaneous](#Miscellaneous)

## Introduction
FE v1.0 API is aimed to extend functionality and usage exposed by the [cuDNN C backend API](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnn-backend-api). Both C++ and python APIs are provided with both having functional parity.
For a general introduction to FE, please first refer README.md
FE v1.0 API is aimed to extend functionality and usage exposed by the [cuDNN C backend API](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnn-backend-api). Both C++ and python APIs are provided, and both have functional parity.
For a general introduction to FE, please start with README.md.

## Workflow
The steps involved in building and running a cudnn graph are as follows:
@@ -97,6 +97,14 @@ This method internally queries the heuristics for engine configs for the given h
cudnn_frontend::error_t cudnn_frontend::graph::Graph::get_execution_plans(std::vector<heur_mode_t>)
```

### Get execution plan count
This method returns the number of execution plans returned by cudnn heuristics. Each plan gets an index from 0 to #plans-1, with 0 having top priority.

```
cudnn_frontend::int64_t
cudnn_frontend::Graph::get_execution_plan_count() const;
```

### Check graph support
This method guarantees that executing the graph using plans queried will succeed.

@@ -105,14 +113,33 @@ cudnn_frontend::error_t cudnn_frontend::graph::Graph::check_support(cudnnHandle_
```

### Build plans
This method builds one or all the engine configs that was queries during the create_execution_plan phase.

This function builds execution plans queried with the `create_execution_plan(...)` API.

There are two flavours of this API:

Use this method to build execution plans according to a policy. Suitable when trusting cudnn heuristics to return the most suitable execution plan with top priority.
```
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::build_plan(
cudnnHandle_t const &handle,
cudnn_frontend::BuildPlanPolicy_t const policy,
bool const do_multithreaded_builds
);
```

Use this method to build individual plan indices. The main use case is to build execution plans in parallel when autotuning.
The plan index to be used here can be queried with the `get_execution_plan_count(...)` API.
```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_plans(cudnnHandle_t const &handle,
cudnn_frontend::BuildPlanPolicy_t const policy,
bool const do_multithreaded_builds);
cudnn_frontend::error_t
cudnn_frontend::Graph::build_plan_at_index(
cudnnHandle_t const &handle,
int64_t plan_index
);
```



### Filter plans (optional)
Users can filter out plans based on numerical notes, behavioral notes, or plans that do not provide the desired functional correctness.

Expand All @@ -139,18 +166,40 @@ cudnn_frontend::graph::Graph::autotune(cudnnHandle_t handle,
### Execute
Executing the graph requires device pointers to all input and output tensors and a user-allocated device workspace pointer.

Two flavours of execute exist, corresponding to the `build_plans(...)` API.

This API already has a candidate execution plan set. The candidate execution plan gets internally set either:
- if `build_policy_t::HEURISTIC_CHOICE` is used, or
- as the last plan that got built.

```
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor>, void *> var_pack,
void* workspace);
cudnn_frontend::graph::Graph::execute(
cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor>, void *> var_pack,
void* workspace
);
```

The execute API also takes a plan index to target a specific plan. This may be used when autotuning, in conjunction with the `build_plan_at_index(...)` API.
```
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::execute(
cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor>, void *> var_pack,
void* workspace,
int64_t plan_index
);
```

### Miscellaneous APIs

Get the workspace required to execute the currently selected execution plan.

This can also take a plan index to query the workspace for. This may be used when autotuning, in conjunction with the `build_plan_at_index(...)` API.

`int64_t get_workspace_size() const`
`int64_t get_workspace_size_plan_index(int64_t plan_index) const`

Get the workspace required to run autotune on all plans.

@@ -167,8 +216,7 @@ Samples are meant to illustrate FE v1.0 API usage to users.
- `samples/cpp` contains samples that use C++ API.
- `samples/python` contains samples that use python API.

C++ samples are written using [Catch2](https://github.com/catchorg/Catch2) test framework.
Python samples are written using [pytest](https://github.com/pytest-dev/pytest) and [pytorch](https://pytorch.org), with both requiring external installation.
Python samples are jupyter notebooks with step by step guide on using FE v1 API.

## Operations

68 changes: 38 additions & 30 deletions README.md
@@ -31,56 +31,63 @@ cudnn can be installed from
The minimum python version needed is 3.6.
The python binding compilation requires the development package, which can be installed by running `apt-get install python-dev`.

To run the python samples, additionally, you will need the following python packages
To run the python samples, additionally, you will need the following python packages:
- pytest
- pytorch-cuda=12.1 (or pytorch-cuda=11.8)
- torchvision
- torchaudio
- pytorch
- torch
- jupyter


### Python API
Install FE python API by running:
```
pip install git+https://github.com/NVIDIA/cudnn-frontend.git
```

The above command picks cuda and cudnn from the default system paths.

To provide a custom CUDA installation path, use environment variable: `CUDAToolkit_ROOT`.
To provide a custom CUDNN installation path, use environment variable: `CUDNN_PATH`.


To test whether installation is successful, run:
```
pytest tests/python_fe
```

NOTE: Only v1.0 API is exposed via python bindings.


### C++ API

C++ API is header only library. The following compilation steps are only required for building the samples and python bindings.
C++ API is header only library.

The root CMakeLists.txt can be used as a reference to include cudnn_frontend in your project's build system.

The CMakeLists.txt can be used reference to include the cudnn_frontend in your project.
#### Building samples
The following compilation steps are only required for building the samples and/or python bindings.

Provide CUDA according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html
Provide CUDA installation path according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html

Provide CUDNN installation path using CUDNN_PATH env variable or cmake parameter.

CUDNN_PATH has the cudnn installation:
- Headers are in CUDNN_PATH/include.
- Libraries are in CUDNN_PATH/lib or CUDNN_PATH/lib64 or CUDNN_PATH/lib/x64.

From project Root,

For an in-source build,
```
mkdir build; cd build
mkdir build
cd build
cmake -DCUDNN_PATH=/path/to/cudnn -DCUDAToolkit_ROOT=/path/to/cuda ../
cmake --build . -j16
bin/samples
```

Skip building samples by providing `CUDNN_FRONTEND_BUILD_SAMPLES=OFF` as cmake parameter.
Skip building python bindings by providing `CUDNN_FRONTEND_BUILD_PYTHON_BINDINGS=OFF` as cmake parameter.

In case, you have a stale cmake cache and want to update the cudnn/cuda paths, please delete the cmake cache (or build directory and redo the above steps).

### Python API
Install FE python API by running:
pip install git+https://github.com/NVIDIA/cudnn-frontend.git

Incase of custom installation of CUDA and CUDNN, the default path can be overriden by:
To skip building samples, use `-DCUDNN_FRONTEND_BUILD_SAMPLES=OFF`.

`CUDAToolkit_ROOT=/path/to/cuda CUDNN_PATH=/path/to/cudnn pip install /path/to/cudnn_frontend`.
To skip building python bindings, use `-DCUDNN_FRONTEND_BUILD_PYTHON_BINDINGS=OFF`.

To provide a custom CUDA, export environment variable: `CUDAToolkit_ROOT`.
To provide a custom CUDNN, export environment variable: `CUDNN_PATH`.

```
pytest samples/python
```

NOTE: Only v1.0 API is exposed via python bindings.
In case you have a stale cmake cache and want to update the cudnn/cuda paths, please delete the cmake cache (or the build directory) and redo the above steps.

## Debugging
For initial debugging, we recommend turning on the cudnn FE logging and checking for warnings and errors.
@@ -108,4 +115,5 @@ No external contribution to this repository is accepted. Please create an issue

## Feedback
Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn.

Also, bugs and RFEs can be reported in the issues section.
24 changes: 13 additions & 11 deletions docs/operations/Attention.md
@@ -27,6 +27,7 @@ using the FlashAttention-2 algorithm as described in the paper [FlashAttention-2
- To use a user-provided dropout mask, users must provide:
- `dropout mask` that matches the attention weights' dimensions, indicating which weights to drop.
- `dropout scale` used to adjust the scale of the remaining weights accordingly, such as $1 / (1 - \text{dropout probability})$.
- Ragged tensor: allows the query, key, value, and output tensor to be [ragged tensors](https://www.tensorflow.org/guide/ragged_tensor), which are tensors with nested variable length lists as inner dimensions. Users must pass another tensor called ragged offset tensor using the `Tensor_attributes.set_ragged_offset()` method as specified in the tensors section below.

When multiple masking options are enabled, they are applied in the listed order above.

@@ -43,6 +44,7 @@ The dimensions that are passed as 1 will apply a broadcasted mask over attention
- (Optional) When philox RNG dropout mask is enabled, the RNG seed and offset tensors should have size $(1, 1, 1, 1)$ with int32 or int64 datatype as either a CPU or GPU tensor.
- (Optional) When a user provided dropout mask is enabled, a dropout mask tensor should have shape $(1, 1, S_{q}, S_{kv})$, $(1, H_{q}, S_{q}, S_{kv})$, $(B, 1, S_{q}, S_{kv})$, or $(B, H_{q}, S_{q}, S_{kv})$ with input/output datatype.
The dimensions that are passed as 1 will apply a broadcasted mask over attention weights.
- (Optional) When query, key, value, and output tensors are ragged tensors, the ragged offset tensor must be a tensor of size $(B + 1, 1, 1, 1)$ that contains the nested tensor's offset in terms of number of elements (not bytes). The last value of the offset tensor specifies the offset of the past-the-end element of the ragged tensor.

Where,

@@ -96,7 +98,7 @@ SDPA_attributes &
set_bias(std::shared_ptr<Tensor_attributes> value);
SDPA_attributes&
set_alibi_mask(bool const value)
set_alibi_mask(bool const value);
SDPA_attributes&
set_padding_mask(bool const value);
Expand All @@ -120,7 +122,7 @@ set_dropout(std::shared_ptr<Tensor_attributes> mask,
std::shared_ptr<Tensor_attributes> scale);
SDPA_attributes &
set_compute_data_type(DataType_t value)
set_compute_data_type(DataType_t value);
```

**Python API:**
@@ -153,7 +155,7 @@ This operation computes gradient tensors for scaled dot product attention using

#### Configurable Options:

All the options mentioned in the forward operation, including GQA and MQA, are applicable in the backward operation as well.
All the options mentioned in the forward operation, including ragged tensors and GQA/MQA, are applicable in the backward operation as well.

#### Tensors:

@@ -181,19 +183,19 @@ The `options` parameter of type `SDPA_backward_attributes` is used to control th
```cpp
SDPA_backward_attributes&
set_attn_scale(std::shared_ptr<Tensor_attributes> value)
set_attn_scale(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes&
set_attn_scale(float const value);
SDPA_backward_attributes&
set_bias(std::shared_ptr<Tensor_attributes> value)
set_bias(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes&
set_dbias(std::shared_ptr<Tensor_attributes> value)
set_dbias(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes&
set_alibi_mask(bool const value)
set_alibi_mask(bool const value);
SDPA_backward_attributes&
set_padding_mask(bool const value);
Expand All @@ -205,20 +207,20 @@ SDPA_backward_attributes&
set_seq_len_kv(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes&
set_causal_mask(bool const value)
set_causal_mask(bool const value);
SDPA_backward_attributes&
set_dropout(float const probability,
std::shared_ptr<Tensor_attributes> seed,
std::shared_ptr<Tensor_attributes> offset)
std::shared_ptr<Tensor_attributes> offset);
SDPA_backward_attributes&
set_dropout(std::shared_ptr<Tensor_attributes> mask,
std::shared_ptr<Tensor_attributes> scale,
std::shared_ptr<Tensor_attributes> scale_inv)
std::shared_ptr<Tensor_attributes> scale_inv);
SDPA_backward_attributes&
set_compute_data_type(DataType_t const value)
set_compute_data_type(DataType_t const value);
```

Python API:
5 changes: 3 additions & 2 deletions include/cudnn_frontend.h
@@ -121,10 +121,11 @@
#include "cudnn_frontend_Resample.h"

#include "cudnn_frontend/graph_interface.h"
#include "cudnn_frontend/utils/serialize.h"

#define CUDNN_FRONTEND_MAJOR_VERSION 1
#define CUDNN_FRONTEND_MINOR_VERSION 0
#define CUDNN_FRONTEND_PATCH_VERSION 3
#define CUDNN_FRONTEND_MINOR_VERSION 1
#define CUDNN_FRONTEND_PATCH_VERSION 0
#define CUDNN_FRONTEND_VERSION \
((CUDNN_FRONTEND_MAJOR_VERSION * 10000) + (CUDNN_FRONTEND_MINOR_VERSION * 100) + CUDNN_FRONTEND_PATCH_VERSION)
