[API change] Based on user feedback, we have removed the distinction between the graph and plan objects. With the new API, the plan remains embedded in the graph and all operations are performed on the graph object.

Previously,

```
    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    auto plans = graph.get_execution_plan_list({fe::HeurMode_t::A});
    REQUIRE(plans.check_support(handle).is_good());
    REQUIRE(graph.set_execution_plans(plans).is_good());
```

Now,

```
    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());
    REQUIRE(graph.check_support(handle).is_good());
    REQUIRE(graph.build_plans(handle).is_good());
```

Also, with this change, the following new APIs have been introduced on the graph class.

```
error_t
build_plans(cudnnHandle_t const &handle,
            BuildPlanPolicy_t const policy     = BuildPlanPolicy_t::HEURISTICS_CHOICE,
            bool const do_multithreaded_builds = false);

Graph & deselect_workspace_greater_than(int64_t const workspace);

Graph & deselect_behavior_notes(std::vector<BehaviorNote_t> const &notes);

Graph & deselect_numeric_notes(std::vector<NumericalNote_t> const &notes);

int64_t get_workspace_size() const;

int64_t get_autotune_workspace_size() const;

error_t autotune(cudnnHandle_t handle,
             std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
             void *workspace,
             void *user_impl = nullptr);

```
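As an illustrative sketch (not taken from the commit itself), the filtering and workspace helpers slot into the flow between `create_execution_plans` and `build_plans`; the 64 MB cap below is an arbitrary placeholder:

```
    // Hypothetical usage of the new graph-level helpers; values are placeholders.
    graph.deselect_workspace_greater_than(64 * 1024 * 1024);  // skip plans needing > 64 MB

    REQUIRE(graph.check_support(handle).is_good());
    REQUIRE(graph.build_plans(handle).is_good());

    // Workspace needed by the selected plan, and by autotuning across all built plans.
    int64_t plan_workspace     = graph.get_workspace_size();
    int64_t autotune_workspace = graph.get_autotune_workspace_size();
```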

[API change] Removes the implicit `validate` call made in `build_operation_graph`. Now, the expectation is that the user explicitly calls `validate` on the graph before calling `build_operation_graph`. This helps the user distinguish between errors caused by a malformed graph and errors that occur when lowering into cudnn.

[API change] Error codes returned from the graph API have now been marked `nodiscard`.

[New API] Added a new API, `graph::key() -> int64_t`, which returns a hash of the graph object. This can be used as a key for graph caching. An example of this usage is shown in the samples.
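A minimal caching sketch (illustrative only, assuming `graph` is a `std::shared_ptr` to a fully described graph, `handle` is a valid `cudnnHandle_t`, and the `fe` namespace alias from the snippets above; the cache itself is not part of the library):

```
    // Hypothetical cache keyed on the graph hash; names are placeholders.
    static std::unordered_map<int64_t, std::shared_ptr<fe::graph::Graph>> graph_cache;

    int64_t cache_key = graph->key();
    auto it = graph_cache.find(cache_key);
    if (it == graph_cache.end()) {
        REQUIRE(graph->validate().is_good());
        REQUIRE(graph->build_operation_graph(handle).is_good());
        REQUIRE(graph->create_execution_plans({fe::HeurMode_t::A}).is_good());
        REQUIRE(graph->check_support(handle).is_good());
        REQUIRE(graph->build_plans(handle).is_good());
        graph_cache.emplace(cache_key, graph);
    } else {
        graph = it->second;  // Reuse the previously built graph instead of rebuilding.
    }
```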

[New API] Added new Python APIs `create_handle`, `destroy_handle`, `set_stream`, and `get_stream` to allow custom handle and stream management on the graph object.

[New functionality] SDPA backward can now compute dbias if the forward pass had a bias operation. This functionality was added in cudnn 8.9.6.
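A hedged sketch using the backward attribute setters documented later in this change (the tensor handles are placeholders, and the attributes object is assumed to be default-constructible here):

```
    // Illustrative: request dbias alongside the regular gradients (requires cudnn 8.9.6+).
    auto sdpa_bwd_options = fe::graph::Scaled_dot_product_flash_attention_backward_attributes()
                                .set_bias(bias)     // the bias tensor used in the forward pass
                                .set_dbias(dbias);  // gradient w.r.t. that bias
```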

[Enhancement] The behavior of `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` has been extended. This is documented in `docs/operations/Attention.md`.

[Enhancement] Added better error checks to make sure all the tensors of a node have been created. This prevents the unexpected segmentation faults seen earlier.

[Bug fix] Fixed issues in instancenorm which had caused invalid memory accesses earlier.

[Enhancement] Moved the v0.9 API samples to the `samples/legacy_samples` folder for better organization.
Anerudhan committed Nov 20, 2023
1 parent d337a3c commit 1c88f0f
Showing 103 changed files with 5,521 additions and 5,791 deletions.
18 changes: 18 additions & 0 deletions CMakeLists.txt
@@ -21,6 +21,24 @@ target_include_directories(
$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)

# Find the cuda compiler
find_package(CUDAToolkit)

# Find cudnn
include(${CMAKE_SOURCE_DIR}/cmake/cuDNN.cmake)

target_link_libraries(
cudnn_frontend INTERFACE

CUDA::cudart
CUDA::nvrtc

# cuDNN dlopen's its libraries
# Add all libraries in link line as NEEDED
# This forces the executable itself to find all cudnn sublibraries initially
CUDNN::cudnn_all
)

target_compile_features(cudnn_frontend INTERFACE cxx_std_17)

if (CUDNN_FRONTEND_BUILD_SAMPLES)
69 changes: 40 additions & 29 deletions README.FE.1.0.md
@@ -19,11 +19,11 @@ The steps involved in building and running a cudnn graph are as follows:
3. Create and add the operation nodes. The outputs of these operations are of tensor type and can be sequentially used as inputs to the next node.
4. Validate the operation graph. This step makes sure the graph is well built and does not have hanging tensors or nodes.
5. Build the cudnn operation graph. This step lowers the graph into cudnn dialect.
6. Get the execution plan, based on the heuristics type of your choice.
6. Create the execution plan, based on the heuristics type of your choice.
7. [Optional] Check support of the operation graph.
8. [Optional] Filter out the plans by your custom criteria.
9. [Optional] Run autotuning on the filter plan (Optional).
10. Set the execution plan of choice back into the graph.
9. Build (one or all) the execution plans.
10. [Optional] Run autotuning on the filtered plans.
11. Execute the graph with the relevant data pointers.

## APIs
@@ -90,27 +90,36 @@ This method creates cudnn backend descriptors for all constituents of the graph.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_operation_graph(cudnnHandle_t handle)
```

### Get Execution plans
This method returns a list of execution plans that can potentially run the FE graph.
### Create Execution plans
This method internally queries the heuristics for engine configs for the given heuristics modes.

```
cudnn_frontend::graph::Plans cudnn_frontend::graph::Graph::get_execution_plans(heur_mode_t)
cudnn_frontend::error_t cudnn_frontend::graph::Graph::create_execution_plans(std::vector<heur_mode_t>)
```

### Filter plans
Users can filter out plans against numerical, behavioral notes, or plans that do not provide desired functional correctness.
### Check graph support
This method guarantees that executing the graph using plans queried will succeed.

```
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_numeric_notes(std::vector<cudnnBackendNumericalNote_t> const&);
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_behavior_notes(std::vector<cudnnBackendBehaviorNote_t> const&);
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_workspace_greater_than(int64_t max_allowed_workspace);
cudnn_frontend::error_t cudnn_frontend::graph::Graph::check_support(cudnnHandle_t h);
```

### Check graph support
This method guarantees that executing the graph using plans queried will succeed.
### Build plans
This method builds one or all of the engine configs that were queried during the create_execution_plans phase.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_plans(cudnnHandle_t const &handle,
cudnn_frontend::BuildPlanPolicy_t const policy,
bool const do_multithreaded_builds);
```
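As a rough sketch (assuming a `BuildPlanPolicy_t::ALL` policy exists alongside `HEURISTICS_CHOICE`), building every queried engine config might look like:

```
// Illustrative: build all candidate engine configs, optionally across multiple threads.
auto status = graph.build_plans(handle, cudnn_frontend::BuildPlanPolicy_t::ALL, /*do_multithreaded_builds=*/true);
```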

### Filter plans (optional)
Users can deselect plans by numerical notes, behavioral notes, or workspace size to exclude plans that do not provide the desired functional correctness.

```
cudnn_frontend::error_t Plans::check_support();
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_numeric_notes(std::vector<cudnn_frontend::NumericalNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_behavior_notes(std::vector<cudnn_frontend::BehaviorNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_workspace_greater_than(int64_t const workspace);
```

### Autotune
@@ -119,23 +128,14 @@ Autotuning provides a way to execute different execution plans for a given graph
This generally helps validate and improve upon the results provided by the heuristics.

The current API to perform the autotuning on the filtered plans:
```
error_t
autotune(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
void *workspace,
void *user_impl = nullptr);
```

### Set Execution plans
After checking support, filtering and/or autotuning, execution plans can be set in descending order of preference.

```
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::set_execution_plans(cudnn_frontend::graph::Plans const&)
```
```
cudnn_frontend::graph::Graph::autotune(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
void *workspace,
void *user_impl = nullptr);
```
### Execute
Executing the graph requires device pointers to all input and output tensors and a user-allocated device workspace pointer.

@@ -146,6 +146,17 @@ cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
void* workspace);
```

### Miscellaneous APIs

Get the workspace size needed to execute the currently selected execution plan.

`int64_t get_workspace_size() const`

Get the workspace size needed to run autotune on all plans.

`int64_t get_autotune_workspace_size() const`
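A hedged end-to-end sketch of querying the workspace and executing; the tensor handles `X`, `W`, `Y` and the device pointers are placeholders, not part of this API:

```
// Illustrative: query the workspace for the selected plan, allocate it, then execute.
int64_t workspace_size = graph.get_workspace_size();
void*   workspace      = nullptr;
cudaMalloc(&workspace, static_cast<size_t>(workspace_size));

std::unordered_map<std::shared_ptr<cudnn_frontend::graph::Tensor_attributes>, void*> variant_pack = {
    {X, x_dev_ptr}, {W, w_dev_ptr}, {Y, y_dev_ptr}};  // placeholder tensors and pointers

auto status = graph.execute(handle, variant_pack, workspace);
cudaFree(workspace);
```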


## Samples
Samples are meant to illustrate FE v1.0 API usage to users.
- `samples/cpp` contains samples that use C++ API.
@@ -156,4 +167,4 @@ Python samples are written using [pytest](https://github.com/pytest-dev/pytest)

## Operations

Please look at docs/operations for APIs of different operation types.
15 changes: 13 additions & 2 deletions README.md
@@ -22,25 +22,36 @@ In order to include the entire library, include the cudnn_frontend header file `
### Dependencies
With the release of v1.0, we are bumping up the minimum supported cudnn version to 8.5.0

CUDA can be downloaded from the [nvidia dev-zone](https://developer.nvidia.com/cuda-downloads).

cudnn can be installed from
- [nvidia dev-zone](https://developer.nvidia.com/cudnn)
- [pypi wheels](https://pypi.org/project/nvidia-cudnn-cu12/)

The minimum Python version needed is 3.6.
Compiling the Python bindings requires the Python development package, which can be installed by running `apt-get install python-dev`.

To run the Python samples, you will additionally need the following Python packages:
- pytest
- pytorch-cuda=11.8 (or pytorch-cuda=12.1)
- pytorch-cuda=12.1 (or pytorch-cuda=11.8)
- torchvision
- torchaudio
- pytorch


### C++ API

The C++ API is a header-only library. The following compilation steps are only required for building the samples and Python bindings.

The CMakeLists.txt can be used as a reference for including cudnn_frontend in your project.

Provide CUDA according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html

CUDNN_PATH points to the cudnn installation:
- Headers are in CUDNN_PATH/include.
- Libraries are in CUDNN_PATH/lib or CUDNN_PATH/lib64 or CUDNN_PATH/lib/x64.

From the project root,

```
mkdir build; cd build
25 changes: 20 additions & 5 deletions docs/operations/Attention.md
@@ -3,7 +3,9 @@
2. [Scaled Dot Product Flash Attention Backward](#scaled-dot-product-flash-attention-backward)

### Scaled Dot Product Flash Attention
Computes the scaled dot product attention for given Query, Key and Value tensors. Optionally, can set dropout probability, causal mask. Can optionally dump stats to be used for the bprop computation.
Computes the scaled dot product attention for given Query, Key and Value tensors. Setting `is_inference` to false configures the operation to output `softmax_stats` to be used for backwards computation.

The user can also optionally configure attention scale, bias mask, alibi mask, padding mask, causal mask, and dropout for this operation.

The dimensions for

@@ -16,8 +18,7 @@ The dimensions for
Where $B$ is the batch size, $H$ is the number of heads, $S_{q}$ is the sequence length of the query, $S_{kv}$ is the sequence length
of the key and value, and $D$ is the embedding dimension per head.

Additionally, the stride for the last dimension corresponding to the embedding dim per head for each of these tensors
must be 1.
Additionally, the stride for the last dimension $D$ corresponding to the embedding dimension per head for each of these tensors must be 1.

**API:**

@@ -95,7 +96,9 @@ Returns:
```
### Scaled Dot Product Flash Attention Backward
Computes the query, key and value gradient tensors for scaled dot product flash attention. Optionally, can set dropout probability, causal mask.
Computes the query, key and value gradient tensors for scaled dot product flash attention.
The user can also optionally configure attention scale, bias mask, alibi mask, padding mask, causal mask, and dropout for this operation.
The dimensions for
@@ -136,6 +139,9 @@ set_attn_scale(std::shared_ptr<Tensor_attributes> value)
Scaled_dot_product_flash_attention_backward_attributes&
set_bias(std::shared_ptr<Tensor_attributes> value)

Scaled_dot_product_flash_attention_backward_attributes&
set_dbias(std::shared_ptr<Tensor_attributes> value)

Scaled_dot_product_flash_attention_backward_attributes&
set_alibi_mask(bool const value)

@@ -177,6 +183,7 @@ Args:
stats (cudnn_tensor): The softmax statistics from the forward pass.
attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
bias (Optional[cudnn_tensor]): The bias data for attention. Default is None.
dBias (Optional[cudnn_tensor]): The dBias output for attention. Default is None.
use_alibi_mask (Optional[bool]): Whether to use alibi mask. Default is False.
use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
dropout (Optional[Union[Tuple[(probability: float, seed: cudnn_tensor, offset: cudnn_tensor)],
@@ -196,4 +203,12 @@ Returns:
- The cudnn backend enums are changed as follows:
- `cudnnBackend<enum_name>` -> `cudnn_frontend::<enum_name>`
- `cudnn<enum_name>` -> `cudnn_frontend::<enum_name>`
- Scaled Dot Product Flash Attention Backward improves computation speed by employing an optional workspace tensor, which consumes quadratically increasing memory usage relative to sequence length. The default GPU memory limit for the workspace tensor is 256MB, but users with enough available GPU memory budget can increase this limit by configuring the CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT environment variable to the desired new limit in bytes.
- Scaled Dot Product Flash Attention Backward improves performance through the use of an optional dP workspace tensor. This tensor's memory consumption increases quadratically with the sequence length. The following describes the behavior of the `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` environment variable, which allows the user to change the GPU memory limit for this workspace tensor (a small example of setting it programmatically follows the list below):
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = unset`
The optimization will utilize workspace memory until reaching the default limit of 256MB.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = -1`
Workspace optimization is always enabled, regardless of memory usage.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = 0`
Workspace optimization is always disabled, avoiding the additional memory usage.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = n`
Allows workspace optimization up to a user-defined limit of n bytes, accommodating systems with varying GPU memory capacities.
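As a hedged illustration (the value and call site are arbitrary, and `setenv` assumes a POSIX system), the limit could be set programmatically before building the backward graph:

```
#include <cstdlib>  // setenv

// Cap the dP workspace at 1 GiB for this process (illustrative value).
setenv("CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT", "1073741824", /*overwrite=*/1);
```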
1 change: 0 additions & 1 deletion docs/operations/Normalizations.md
@@ -43,7 +43,6 @@ set_compute_data_type(DataType_t value)

Python API:
- batchnorm
- norm_forward_phase
- input
- scale
- bias
61 changes: 61 additions & 0 deletions include/cudnn_frontend/context.h
@@ -0,0 +1,61 @@
#pragma once

#include "../cudnn_frontend_utils.h"

namespace cudnn_frontend::detail {

class Context {
DataType_t compute_data_type = DataType_t::NOT_SET;
DataType_t intermediate_data_type = DataType_t::NOT_SET;
DataType_t io_data_type = DataType_t::NOT_SET;

public:
Context&
set_intermediate_data_type(DataType_t const type) {
intermediate_data_type = type;
return *this;
}

Context&
set_io_data_type(DataType_t const type) {
io_data_type = type;
return *this;
}

Context&
set_compute_data_type(DataType_t const type) {
compute_data_type = type;
return *this;
}

DataType_t
get_io_data_type() const {
return io_data_type;
}

DataType_t
get_intermediate_data_type() const {
return intermediate_data_type;
}

DataType_t
get_compute_data_type() const {
return compute_data_type;
}

Context&
fill_missing_properties(Context const& global_context) {
if (get_compute_data_type() == DataType_t::NOT_SET) {
set_compute_data_type(global_context.get_compute_data_type());
}
if (get_intermediate_data_type() == DataType_t::NOT_SET) {
set_intermediate_data_type(global_context.get_intermediate_data_type());
}
if (get_io_data_type() == DataType_t::NOT_SET) {
set_io_data_type(global_context.get_io_data_type());
}
return *this;
}
};

} // namespace cudnn_frontend::detail