[API change] Based on user feedback, we have removed the distinction between the graph and plan objects. With the new API, the plan remains embedded in the graph and all operations are performed on the graph object.

Previously,

```
    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    auto plans = graph.get_execution_plan_list({fe::HeurMode_t::A});
    REQUIRE(plans.check_support(handle).is_good());
    REQUIRE(graph.set_execution_plans(plans).is_good());
```

Now,

```
    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());
    REQUIRE(graph.check_support(handle).is_good());
    REQUIRE(graph.build_plans(handle).is_good());
```

Also, with this change, the following new APIs have been introduced on the graph class.

```
error_t
build_plans(cudnnHandle_t const &handle,
            BuildPlanPolicy_t const policy     = BuildPlanPolicy_t::HEURISTICS_CHOICE,
            bool const do_multithreaded_builds = false);

Graph & deselect_workspace_greater_than(int64_t const workspace);

Graph & deselect_behavior_notes(std::vector<BehaviorNote_t> const &notes);

Graph & deselect_numeric_notes(std::vector<NumericalNote_t> const &notes);

int64_t get_workspace_size() const;

int64_t get_autotune_workspace_size() const;

error_t autotune(cudnnHandle_t handle,
             std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
             void *workspace,
             void *user_impl = nullptr);

```
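As an illustrative sketch (not taken from the commit itself), the filtering and workspace helpers slot into the flow between `create_execution_plans` and `build_plans`; the 64 MB cap below is an arbitrary placeholder:

```
    // Hypothetical usage of the new graph-level helpers; values are placeholders.
    graph.deselect_workspace_greater_than(64 * 1024 * 1024);  // skip plans needing > 64 MB

    REQUIRE(graph.check_support(handle).is_good());
    REQUIRE(graph.build_plans(handle).is_good());

    // Workspace needed by the selected plan, and by autotuning across all built plans.
    int64_t plan_workspace     = graph.get_workspace_size();
    int64_t autotune_workspace = graph.get_autotune_workspace_size();
```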

[API change] Removes the implicit `validate` call made in `build_operation_graph`. Now, the expectation is that the user explicitly calls `validate` on the graph before calling `build_operation_graph`. This helps the user distinguish between errors caused by a malformed graph and errors that occur when lowering into cudnn.

[API change] Error codes returned from the graph API have now been marked `nodiscard`.

[New API] Added a new API, `graph::key() -> int64_t`, which returns a hash of the graph object. This can be used as a key for graph caching. An example of this usage is shown in the samples.
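A minimal caching sketch (illustrative only, assuming `graph` is a `std::shared_ptr` to a fully described graph, `handle` is a valid `cudnnHandle_t`, and the `fe` namespace alias from the snippets above; the cache itself is not part of the library):

```
    // Hypothetical cache keyed on the graph hash; names are placeholders.
    static std::unordered_map<int64_t, std::shared_ptr<fe::graph::Graph>> graph_cache;

    int64_t cache_key = graph->key();
    auto it = graph_cache.find(cache_key);
    if (it == graph_cache.end()) {
        REQUIRE(graph->validate().is_good());
        REQUIRE(graph->build_operation_graph(handle).is_good());
        REQUIRE(graph->create_execution_plans({fe::HeurMode_t::A}).is_good());
        REQUIRE(graph->check_support(handle).is_good());
        REQUIRE(graph->build_plans(handle).is_good());
        graph_cache.emplace(cache_key, graph);
    } else {
        graph = it->second;  // Reuse the previously built graph instead of rebuilding.
    }
```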

[New API] Added new Python APIs `create_handle`, `destroy_handle`, `set_stream`, and `get_stream` to allow custom handle and stream management on the graph object.

[New functionality] SDPA backward can now compute dbias if the forward pass had a bias operation. This functionality was added in cudnn 8.9.6.
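A hedged sketch using the backward attribute setters documented later in this change (the tensor handles are placeholders, and the attributes object is assumed to be default-constructible here):

```
    // Illustrative: request dbias alongside the regular gradients (requires cudnn 8.9.6+).
    auto sdpa_bwd_options = fe::graph::Scaled_dot_product_flash_attention_backward_attributes()
                                .set_bias(bias)     // the bias tensor used in the forward pass
                                .set_dbias(dbias);  // gradient w.r.t. that bias
```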

[Enhancement] The behavior of `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` has been extended. This is documented in `docs/operations/Attention.md`.

[Enhancement] Added better error checks to make sure all the tensors of a node have been created. This prevents the unexpected segmentation faults seen earlier.

[Bug fix] Fixed issues in instancenorm which had caused invalid memory accesses earlier.

[Enhancement] Moved the v0.9 API samples to the `samples/legacy_samples` folder for better organization.
Anerudhan committed Nov 20, 2023
1 parent d337a3c commit 1c88f0f
Showing 103 changed files with 5,521 additions and 5,791 deletions.
18 changes: 18 additions & 0 deletions CMakeLists.txt
@@ -21,6 +21,24 @@ target_include_directories(
$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)

# Find the cuda compiler
find_package(CUDAToolkit)

# Find cudnn
include(${CMAKE_SOURCE_DIR}/cmake/cuDNN.cmake)

target_link_libraries(
cudnn_frontend INTERFACE

CUDA::cudart
CUDA::nvrtc

# cuDNN dlopen's its libraries
# Add all libraries in link line as NEEDED
# This forces the executable itself to find all cudnn sublibraries initially
CUDNN::cudnn_all
)

target_compile_features(cudnn_frontend INTERFACE cxx_std_17)

if (CUDNN_FRONTEND_BUILD_SAMPLES)
69 changes: 40 additions & 29 deletions README.FE.1.0.md
@@ -19,11 +19,11 @@ The steps involved in building and running a cudnn graph are as follows:
3. Create and add the operation nodes. The outputs of these operations are of tensor type and can be sequentially used as inputs to the next node.
4. Validate the operation graph. This step makes sure the graph is well built and does not have hanging tensors or nodes.
5. Build the cudnn operation graph. This step lowers the graph into cudnn dialect.
6. Get the execution plan, based on the heuristics type of your choice.
6. Create the execution plan, based on the heuristics type of your choice.
7. [Optional] Check support of the operation graph.
8. [Optional] Filter out the plans by your custom criteria.
9. [Optional] Run autotuning on the filter plan (Optional).
10. Set the execution plan of choice back into the graph.
9. Build (one or all) the execution plans.
10. [Optional] Run autotuning on the filtered plans.
11. Execute the graph with the relevant data pointers.

## APIs
@@ -90,27 +90,36 @@ This method creates cudnn backend descriptors for all constituents of the graph.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_operation_graph(cudnnHandle_t handle)
```

### Get Execution plans
This method returns a list of execution plans that can potentially run the FE graph.
### Create Execution plans
This method internally queries the heuristics for engine configs for the given heuristics modes.

```
cudnn_frontend::graph::Plans cudnn_frontend::graph::Graph::get_execution_plans(heur_mode_t)
cudnn_frontend::error_t cudnn_frontend::graph::Graph::create_execution_plans(std::vector<heur_mode_t>)
```

### Filter plans
Users can filter out plans against numerical, behavioral notes, or plans that do not provide desired functional correctness.
### Check graph support
This method guarantees that executing the graph using plans queried will succeed.

```
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_numeric_notes(std::vector<cudnnBackendNumericalNote_t> const&);
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_behavior_notes(std::vector<cudnnBackendBehaviorNote_t> const&);
cudnn_frontend::graph::Plans& cudnn_frontend::graph::Plans::filter_out_workspace_greater_than(int64_t max_allowed_workspace);
cudnn_frontend::error_t cudnn_frontend::graph::Graph::check_support(cudnnHandle_t h);
```

### Check graph support
This method guarantees that executing the graph using plans queried will succeed.
### Build plans
This method builds one or all of the engine configs that were queried during the create_execution_plans phase.

```
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_plans(cudnnHandle_t const &handle,
cudnn_frontend::BuildPlanPolicy_t const policy,
bool const do_multithreaded_builds);
```
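As a rough sketch (assuming a `BuildPlanPolicy_t::ALL` policy exists alongside `HEURISTICS_CHOICE`), building every queried engine config might look like:

```
// Illustrative: build all candidate engine configs, optionally across multiple threads.
auto status = graph.build_plans(handle, cudnn_frontend::BuildPlanPolicy_t::ALL, /*do_multithreaded_builds=*/true);
```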

### Filter plans (optional)
Users can deselect plans by numerical notes, behavioral notes, or workspace size to exclude plans that do not provide the desired functional correctness.

```
cudnn_frontend::error_t Plans::check_support();
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_numeric_notes(std::vector<cudnn_frontend::NumericalNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_behavior_notes(std::vector<cudnn_frontend::BehaviorNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_workspace_greater_than(int64_t const workspace);
```

### Autotune
@@ -119,23 +128,14 @@ Autotuning provides a way to execute different execution plans for a given graph
This generally helps validate and improve upon the results provided by the heuristics.

The current API to perform the autotuning on the filtered plans:
```
error_t
autotune(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
void *workspace,
void *user_impl = nullptr);
```

### Set Execution plans
After checking support, filtering and/or autotuning, execution plans can be set in descending order of preference.

```
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::set_execution_plans(cudnn_frontend::graph::Plans const&)
```
```
cudnn_frontend::graph::Graph::autotune(cudnnHandle_t handle,
std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
void *workspace,
void *user_impl = nullptr);
```
### Execute
Executing the graph requires device pointers to all input and output tensors and a user-allocated device workspace pointer.

@@ -146,6 +146,17 @@ cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
void* workspace);
```

### Miscellaneous APIs

Get the workspace size needed to execute the currently selected execution plan.

`int64_t get_workspace_size() const`

Get the workspace size needed to run autotune on all plans.

`int64_t get_autotune_workspace_size() const`
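A hedged end-to-end sketch of querying the workspace and executing; the tensor handles `X`, `W`, `Y` and the device pointers are placeholders, not part of this API:

```
// Illustrative: query the workspace for the selected plan, allocate it, then execute.
int64_t workspace_size = graph.get_workspace_size();
void*   workspace      = nullptr;
cudaMalloc(&workspace, static_cast<size_t>(workspace_size));

std::unordered_map<std::shared_ptr<cudnn_frontend::graph::Tensor_attributes>, void*> variant_pack = {
    {X, x_dev_ptr}, {W, w_dev_ptr}, {Y, y_dev_ptr}};  // placeholder tensors and pointers

auto status = graph.execute(handle, variant_pack, workspace);
cudaFree(workspace);
```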


## Samples
Samples are meant to illustrate FE v1.0 API usage to users.
- `samples/cpp` contains samples that use C++ API.
@@ -156,4 +167,4 @@ Python samples are written using [pytest](https://github.com/pytest-dev/pytest)

## Operations

Please look at docs/operations for APIs of different operation types.
15 changes: 13 additions & 2 deletions README.md
@@ -22,25 +22,36 @@ In order to include the entire library, include the cudnn_frontend header file `
### Dependencies
With the release of v1.0, we are bumping up the minimum supported cudnn version to 8.5.0

CUDA can be downloaded from the [nvidia dev-zone](https://developer.nvidia.com/cuda-downloads).

cudnn can be installed from
- [nvidia dev-zone](https://developer.nvidia.com/cudnn)
- [pypi wheels](https://pypi.org/project/nvidia-cudnn-cu12/)

The minimum Python version needed is 3.6.
Compiling the Python bindings requires the Python development package, which can be installed by running `apt-get install python-dev`.

To run the Python samples, you will additionally need the following Python packages:
- pytest
- pytorch-cuda=11.8 (or pytorch-cuda=12.1)
- pytorch-cuda=12.1 (or pytorch-cuda=11.8)
- torchvision
- torchaudio
- pytorch


### C++ API

The C++ API is a header-only library. The following compilation steps are only required for building the samples and Python bindings.

The CMakeLists.txt can be used as a reference for including cudnn_frontend in your project.

Provide CUDA according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html

CUDNN_PATH points to the cudnn installation:
- Headers are in CUDNN_PATH/include.
- Libraries are in CUDNN_PATH/lib or CUDNN_PATH/lib64 or CUDNN_PATH/lib/x64.

From the project root,

```
mkdir build; cd build
25 changes: 20 additions & 5 deletions docs/operations/Attention.md
@@ -3,7 +3,9 @@
2. [Scaled Dot Product Flash Attention Backward](#scaled-dot-product-flash-attention-backward)

### Scaled Dot Product Flash Attention
Computes the scaled dot product attention for given Query, Key and Value tensors. Optionally, can set dropout probability, causal mask. Can optionally dump stats to be used for the bprop computation.
Computes the scaled dot product attention for given Query, Key and Value tensors. Setting `is_inference` to false configures the operation to output `softmax_stats` to be used for backwards computation.

The user can also optionally configure attention scale, bias mask, alibi mask, padding mask, causal mask, and dropout for this operation.

The dimensions for

@@ -16,8 +18,7 @@ The dimensions for
Where $B$ is the batch size, $H$ is the number of heads, $S_{q}$ is the sequence length of the query, $S_{kv}$ is the sequence length
of the key and value, and $D$ is the embedding dimension per head.

Additionally, the stride for the last dimension corresponding to the embedding dim per head for each of these tensors
must be 1.
Additionally, the stride for the last dimension $D$ corresponding to the embedding dimension per head for each of these tensors must be 1.

**API:**

@@ -95,7 +96,9 @@ Returns:
```
### Scaled Dot Product Flash Attention Backward
Computes the query, key and value gradient tensors for scaled dot product flash attention. Optionally, can set dropout probability, causal mask.
Computes the query, key and value gradient tensors for scaled dot product flash attention.
The user can also optionally configure attention scale, bias mask, alibi mask, padding mask, causal mask, and dropout for this operation.
The dimensions for
@@ -136,6 +139,9 @@ set_attn_scale(std::shared_ptr<Tensor_attributes> value)
Scaled_dot_product_flash_attention_backward_attributes&
set_bias(std::shared_ptr<Tensor_attributes> value)

Scaled_dot_product_flash_attention_backward_attributes&
set_dbias(std::shared_ptr<Tensor_attributes> value)

Scaled_dot_product_flash_attention_backward_attributes&
set_alibi_mask(bool const value)

@@ -177,6 +183,7 @@ Args:
stats (cudnn_tensor): The softmax statistics from the forward pass.
attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
bias (Optional[cudnn_tensor]): The bias data for attention. Default is None.
dBias (Optional[cudnn_tensor]): The dBias output for attention. Default is None.
use_alibi_mask (Optional[bool]): Whether to use alibi mask. Default is False.
use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
dropout (Optional[Union[Tuple[(probability: float, seed: cudnn_tensor, offset: cudnn_tensor)],
@@ -196,4 +203,12 @@ Returns:
- The cudnn backend enums are changed as follows:
- `cudnnBackend<enum_name>` -> `cudnn_frontend::<enum_name>`
- `cudnn<enum_name>` -> `cudnn_frontend::<enum_name>`
- Scaled Dot Product Flash Attention Backward improves computation speed by employing an optional workspace tensor, which consumes quadratically increasing memory usage relative to sequence length. The default GPU memory limit for the workspace tensor is 256MB, but users with enough available GPU memory budget can increase this limit by configuring the CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT environment variable to the desired new limit in bytes.
- Scaled Dot Product Flash Attention Backward improves performance through the use of an optional dP workspace tensor. This tensor's memory consumption increases quadratically with the sequence length. The following describes the behavior of the `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` environment variable, which allows the user to change the GPU memory limit for this workspace tensor (a small example of setting it programmatically follows the list below):
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = unset`
The optimization will utilize workspace memory until reaching the default limit of 256MB.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = -1`
Workspace optimization is always enabled, regardless of memory usage.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = 0`
Workspace optimization is always disabled, avoiding the additional memory usage.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = n`
Allows workspace optimization up to a user-defined limit of n bytes, accommodating systems with varying GPU memory capacities.
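As a hedged illustration (the value and call site are arbitrary, and `setenv` assumes a POSIX system), the limit could be set programmatically before building the backward graph:

```
#include <cstdlib>  // setenv

// Cap the dP workspace at 1 GiB for this process (illustrative value).
setenv("CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT", "1073741824", /*overwrite=*/1);
```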
1 change: 0 additions & 1 deletion docs/operations/Normalizations.md
@@ -43,7 +43,6 @@ set_compute_data_type(DataType_t value)

Python API:
- batchnorm
- norm_forward_phase
- input
- scale
- bias
61 changes: 61 additions & 0 deletions include/cudnn_frontend/context.h
@@ -0,0 +1,61 @@
#pragma once

#include "../cudnn_frontend_utils.h"

namespace cudnn_frontend::detail {

class Context {
DataType_t compute_data_type = DataType_t::NOT_SET;
DataType_t intermediate_data_type = DataType_t::NOT_SET;
DataType_t io_data_type = DataType_t::NOT_SET;

public:
Context&
set_intermediate_data_type(DataType_t const type) {
intermediate_data_type = type;
return *this;
}

Context&
set_io_data_type(DataType_t const type) {
io_data_type = type;
return *this;
}

Context&
set_compute_data_type(DataType_t const type) {
compute_data_type = type;
return *this;
}

DataType_t
get_io_data_type() const {
return io_data_type;
}

DataType_t
get_intermediate_data_type() const {
return intermediate_data_type;
}

DataType_t
get_compute_data_type() const {
return compute_data_type;
}

Context&
fill_missing_properties(Context const& global_context) {
if (get_compute_data_type() == DataType_t::NOT_SET) {
set_compute_data_type(global_context.get_compute_data_type());
}
if (get_intermediate_data_type() == DataType_t::NOT_SET) {
set_intermediate_data_type(global_context.get_intermediate_data_type());
}
if (get_io_data_type() == DataType_t::NOT_SET) {
set_io_data_type(global_context.get_io_data_type());
}
return *this;
}
};

} // namespace cudnn_frontend::detail