nvFuser integration for operation fusion (#357)
* added nvfuser implementation, benchmark for biasReluDropout

* reformatted fuse pattern

* revised benchmarking, nvfused patterns

* adds BiasDropoutRes and BiasDropoutResLayernorm patterns, minor edits

* unit testing for all fused patterns, minor edits

* benchmarking for all nvfused patterns

* mypy wip

* benchmarking nvfuser patterns, adding plots, minor testing changes

* fixing mypy errors

* fixed benchmarking bug, minor test change

* final benchmark plots, benchmark edits

* nvfuser documentation, minor edits

* fixing functorch version error, documentation revisions

* fixing circleci functorch errors, mypy errors

* circleci config wip

* circleci test wip

* wip2

* testing revisions, circleci fixes, minor changes

* changelog changes, fixes functorch flag bug

* circle-ci fix

* circle-ci spacing fix

* build error wip

* revised documentation, reverted circleci config

* Fix functorch errors, circleci issue, testing changes

* updating changelog

Co-authored-by: Chris Yuan <christopheryuan@learnfair1488.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1481.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1483.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1492.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1478.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1479.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1484.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@learnfair1477.h2.fair>
9 people authored Jul 28, 2022
1 parent 3a7b713 commit 089f826
Showing 114 changed files with 864 additions and 22 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -85,7 +85,7 @@ install_dep: &install_dep
# start installing
source activate /home/circleci/venv
-conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch-nightly -q
+conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -q
$CONDA_PYTHON -m pip install -r requirements-benchmark.txt --progress-bar off
# Mark install as complete
@@ -102,7 +102,7 @@ install_dep_exp: &install_dep_exp
# start installing
source activate /home/circleci/venv
-conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch-nightly -q
+conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -q
$CONDA_PYTHON -m pip install -r experimental/requirements.txt --progress-bar off
install_repo: &install_repo
@@ -374,7 +374,7 @@ jobs:
- ~/miniconda
- ~/venv

-key: cache-key-gpu-exp-114-{{ checksum "experimental/requirements.txt"}}-{{ checksum ".circleci/config.yml"}}
+key: cache-key-gpu-exp-114-{{ checksum "experimental/requirements.txt" }}-{{ checksum ".circleci/config.yml" }}

- <<: *install_experimental_repo
- <<: *run_experimental_unittests
17 changes: 9 additions & 8 deletions CHANGELOG.md
@@ -7,18 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## TBD
### Fixed
-- Removed dupliacated biases in the FusedMLP layers [#317]
+- Removed duplicated biases in the FusedMLP layers [#317]
- Rotary embeddings respecting input types [#326]
- Poolformer style instantiating useless projection layers [#349]
-- Fix layer position not being properly tracked, causing extra layernorms for programatic xformers [#348]
+- Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [#348]

### Added
- Four blocksparsity layouts from DeepSpeed [#320]
- Support several initialization options [#312]
- Conv2DFeedforward feedforward part [#321]
- VisualAttention [#329]
- Automatic blocksparse for causal attention [#334]
-- Better hierarchical transformer generation [#345]
+- Better hierarchical transformer generation [#345]
+- Fused operations with AOTAutograd/NVFuser, integration into MLP [#357]

## [0.0.11] - 2022-05-30
### Fixed
@@ -40,7 +41,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.0.10] - 2022-03-14
### Fixed
- Expose bias flag for feedforwards, same default as Timm [#220]
-- Update eps value for layernormm, same default as torch [#221]
+- Update eps value for layernorm, same default as torch [#221]
- PreNorm bugfix, only one input was normalized [#233]
- Fix bug where embedding dimensions that did not match model dim would lead to a crash [#244]

@@ -53,12 +54,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Experimental Ragged attention [#189]
- Mixture of Experts [#181]
- BlockSparseTensor [#202]
-- nd-tensor support for triton softmax [#210]
+- Nd-tensor support for triton softmax [#210]

### Fixed
-- bugfix Favor, single feature map [#183]
-- sanity check blocksparse settings [#207]
-- fixed some pickability [#204]
+- Bugfix Favor, single feature map [#183]
+- Sanity check blocksparse settings [#207]
+- Fixed some picklability [#204]

## [0.0.8] - 2022-01-07
### Fixed
13 changes: 13 additions & 0 deletions HOWTO.md
@@ -12,6 +12,7 @@ Let's present here a couple of code snippets on how to solve a couple of questio
- [Replace all attentions from an existing ViT model with a sparse equivalent ?](#replace-all-attentions-from-an-existing-vit-model-with-a-sparse-equivalent-)
- [Some more examples](#some-more-examples)
- [BlockSparseAttention](#blocksparseattention)
- [How to Enable Fused Operations Using AOTAutograd and NVFuser](#how-to-enable-fused-operations-using-aotautograd-and-nvfuser)
- [From cherry picking attentions to building whole models](#from-cherry-picking-attentions-to-building-whole-models)
- [Testing out an attention mechanism](#testing-out-an-attention-mechanism)
- [Building an encoder, comparing to PyTorch](#building-an-encoder-comparing-to-pytorch)
@@ -295,6 +296,18 @@ On a V100, with PyTorch 1.9, Triton 1.1 and xFormers 0.0.2 this reports somethin

Note that the pattern here is not that sparse (half of the matrix is empty), the more sparse it gets the more biased the result will get towards BlockSparseAttention.

## How to Enable Fused Operations Using AOTAutograd and NVFuser

AOT Autograd is a toolkit from [FuncTorch](https://pytorch.org/functorch/stable/) which can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations and enables deep learning compilers such as [NVFuser](https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md) to perform operator fusion. The [`memory_efficient_fusion`](https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion) wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.
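
As a minimal sketch of how this wrapper is typically used (assuming functorch is installed and a CUDA GPU is available; the function and shapes below are illustrative, not the exact xFormers pattern):

```python
import torch
import torch.nn.functional as F

from functorch.compile import memory_efficient_fusion


def bias_gelu_dropout(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # A chain of pointwise operations: a good candidate for operator fusion
    return F.dropout(F.gelu(x + bias), p=0.1)


# AOTAutograd traces the forward and backward graphs ahead of time and
# hands them to NVFuser, which compiles fused CUDA kernels
fused_fn = memory_efficient_fusion(bias_gelu_dropout)

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
bias = torch.randn(1024, device="cuda", requires_grad=True)
out = fused_fn(x, bias)
out.mean().backward()
```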

xFormers uses `memory_efficient_fusion` to combine sequences of fusable operations into single fused function layers. These parts can be found [here](xformers/components/nvfuser). A notable example is [`NVFusedBiasActivationDropout`](xformers/components/nvfuser/bias_act_dropout.py), which is readily used inside the [`MLP`](xformers/components/feedforward/mlp.py) feedforward component.

A benchmark of these fused patterns across some representative shapes shows significant speedups over the unfused PyTorch eager approach: up to 3.5x for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower, although we see infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our Triton-based implementation of fused Bias, Activation, and Dropout ([see](xformers/triton/dropout.py)). Full benchmark plots can be found [here](docs/plots/nvfuser/).

Note, as described in the README, that the `_is_functorch_available` flag must be enabled for xFormers to use these optimizations. It allows the fused layers to be used and changes the behavior of the `MLP` feedforward component, which then defaults to the fused `NVFusedBiasActivationDropout` layer, as sketched below.
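
For example, a sketch of building an `MLP` with the fused path enabled (the configuration values here are illustrative; `tests/test_nvfuser.py` shows an equivalent setup using the `Activation` enum):

```python
import torch
import xformers

# Enable the functorch/NVFuser code path before building components
xformers._is_functorch_available = True

import xformers.components.feedforward as ff

mlp = ff.build_feedforward(
    {
        "name": "MLP",
        "dim_model": 128,
        "dropout": 0.1,
        "activation": "relu",
        "hidden_layer_multiplier": 4,
        "bias": True,
    }
).cuda()

# The bias + activation + dropout sequence inside the MLP now runs as a
# single fused NVFusedBiasActivationDropout layer
y = mlp(torch.randn(4, 256, 128, device="cuda"))
```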

AOT Autograd offers a great deal of flexibility, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion. Currently, however, xFormers only uses it with Python function inputs, because initial attempts at fusing whole xFormers layers and blocks yielded memory issues and other CUDA errors. Further testing and benchmarking are in progress.

## From cherry picking attentions to building whole models

### Testing out an attention mechanism
19 changes: 19 additions & 0 deletions README.md
@@ -84,6 +84,25 @@ Triton will cache the compiled kernels to `/tmp/triton` by default. If this beco

</p></details>

<details><summary> AOTAutograd/NVFuser </summary><p>

Some parts of xFormers use AOT Autograd from the [FuncTorch](https://pytorch.org/functorch/stable/) library, and are only exposed if functorch is installed and a compatible GPU is present. If functorch was not installed as part of the testing procedure, you can install it directly through pip.

```bash
pip install functorch
```

Once installed, set the flag `_is_functorch_available = True` in `xformers/__init__.py`. You can optionally check that the installation is successful by running one of the functorch-related benchmarks: `python3 xformers/benchmarks/benchmark_nvfuser.py`

If you are importing the xFormers library in a script, you can set the flag directly:

```python
import xformers
xformers._is_functorch_available = True
```

</p></details>

### Testing the installation

This will run a benchmark of the attention mechanisms exposed by xFormers, and generate a runtime and memory plot.
[25 binary files not shown in the diff view (presumably the nvFuser benchmark plot images under docs/plots/nvfuser/).]
32 changes: 32 additions & 0 deletions docs/source/tutorials/aotautograd_nvfuser.rst
@@ -0,0 +1,32 @@
How to Enable Fused Operations Using AOTAutograd and NVFuser
===================================================================

AOT Autograd is a toolkit from FuncTorch_ which can be used to accelerate model training in xFormers.
Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time.
This allows for some joint graph optimizations and enables deep learning compilers such as NVFuser_ to perform operator fusion.
The `memory_efficient_fusion`_ wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.

.. _FuncTorch: https://pytorch.org/functorch/stable/
.. _NVFuser: https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md
.. _memory_efficient_fusion: https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion

xFormers uses `memory_efficient_fusion` to combine sequences of fusable operations into single fused function layers.
These parts can be found inside `xformers/components/nvfuser`. A notable example is `NVFusedBiasActivationDropout`, which is readily used inside the `MLP`_ feedforward component.

.. _MLP: https://github.com/facebookresearch/xformers/blob/main/xformers/components/feedforward/mlp.py

A benchmark of these fused patterns across some representative shapes shows significant speedups over the unfused
PyTorch eager approach: up to 3.5x for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower,
although we see infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our Triton-based implementation of fused Bias,
Activation, and Dropout (see_). Full benchmark plots can be found here_.

.. _see: https://github.com/facebookresearch/xformers/blob/main/xformers/triton/dropout.py
.. _here: https://github.com/facebookresearch/xformers/tree/main/docs/plots/nvfuser

Note, as described in the README, that the `_is_functorch_available` flag must be enabled for xFormers to use these optimizations.
It allows the fused layers to be used and changes the behavior of the `MLP` feedforward component,
which then defaults to the fused `NVFusedBiasActivationDropout` layer.

AOT Autograd offers a great deal of flexibility, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion.
Currently, however, xFormers only uses it with Python function inputs, because initial attempts at fusing whole xFormers layers and blocks yielded memory issues and other CUDA errors.
Further testing and benchmarking are in progress.
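
As an illustration, a minimal sketch of wrapping a plain Python function with `memory_efficient_fusion` (assuming functorch and a CUDA GPU are available; the function is illustrative, not the exact xFormers pattern):

.. code-block:: python

    import torch
    import torch.nn.functional as F

    from functorch.compile import memory_efficient_fusion

    def bias_relu_dropout(x, bias):
        # Pointwise chain that NVFuser can compile into fused kernels
        return F.dropout(F.relu(x + bias), p=0.1)

    fused_fn = memory_efficient_fusion(bias_relu_dropout)

    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    bias = torch.randn(1024, device="cuda", requires_grad=True)
    out = fused_fn(x, bias)
    out.mean().backward()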
1 change: 1 addition & 0 deletions docs/source/tutorials/index.rst
@@ -6,6 +6,7 @@ Tutorials

   sparse_vit
   blocksparse
   aotautograd_nvfuser
   extend_attentions
   use_attention
   pytorch_encoder
4 changes: 4 additions & 0 deletions requirements-test.txt
@@ -28,3 +28,7 @@ fairscale >= 0.4.5

# Dependency for fused layers, optional
triton == 2.0.0.dev20220701

# Dependencies for fused layers using FuncTorch, optional
git+https://github.com/pytorch/functorch@v0.2.0
networkx == 2.8.4
200 changes: 200 additions & 0 deletions tests/test_nvfuser.py
@@ -0,0 +1,200 @@
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.


import logging
from collections import OrderedDict
from contextlib import nullcontext

import pytest
import torch
import torch.nn as nn
from torch.cuda.amp.autocast_mode import autocast

import xformers
from xformers.components import Activation, ResidualNormStyle

# Store original and possible flag setting
flag_orig = xformers._is_functorch_available
flag_new = True
xformers._is_functorch_available = True


_gpu_available = torch.cuda.is_available()

try:
    import xformers.components.feedforward as ff
    from xformers.components.nvfuser import (
        NVFusedBiasActivationDropout,
        NVFusedBiasDropoutRes,
        NVFusedBiasDropoutResLayerNorm,
    )
    from xformers.components.nvfuser.utils import build_nvfused
except ImportError as e:
    logging.warning(f"Functorch is not available to run test_nvfuser.py. \nError {e}")
    flag_new = False

xformers._is_functorch_available = flag_orig

FUSED_PATTERNS = (
    [
        NVFusedBiasActivationDropout,
        NVFusedBiasDropoutRes,
        NVFusedBiasDropoutResLayerNorm,
    ]
    if flag_new
    else []
)

# Testing odd (non-power-of-two for instance) shapes on purpose
SHAPES = [
    (384, 512),
    (8, 384, 128),
    (8, 784, 512),
    (4, 16, 384),
    (4, 16, 1024),
    (2, 16, 2048),
    (2, 16, 4096),
    (1, 16, 12288),
]

BATCH = 4
SEQ = 256
EMBD = 16
LATENT = 128
DEVICES = [torch.device("cuda")]

ACTIVATIONS = [
    Activation.ReLU,
    Activation.GeLU,
    Activation.LeakyReLU,
    Activation.SquaredReLU,
    Activation.SmeLU,
]


@pytest.mark.skipif(not flag_new, reason="Functorch is not available")
@pytest.mark.skipif(not _gpu_available, reason="GPU is not available")
@pytest.mark.parametrize("fused_pattern", FUSED_PATTERNS)
@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("amp", [False, True])
@pytest.mark.parametrize("bias", [False, True])
@pytest.mark.parametrize("activation", ACTIVATIONS)
@pytest.mark.parametrize("p", [0, 0.1, 0.5])
@pytest.mark.parametrize(
    "layer_norm_style", [None, ResidualNormStyle.Pre, ResidualNormStyle.Post]
)
def test_nvfused_pattern_parity(
    fused_pattern: nn.Module,
    shape: tuple,
    amp: bool,
    bias: bool,
    activation: Activation,
    p: float,
    layer_norm_style: ResidualNormStyle,
):
    # Enable global flag
    xformers._is_functorch_available = flag_new

    if (
        fused_pattern != NVFusedBiasDropoutResLayerNorm
        and layer_norm_style != ResidualNormStyle.Pre
    ):
        pytest.skip(
            "Layer norm style doesn't apply, the same relevant params already tested once."
        )

    torch.cuda.manual_seed_all(0)
    torch.random.manual_seed(0)
    x = torch.normal(0, 1, size=shape, device="cuda", requires_grad=True)
    x_cpu = x.clone().cpu()

    with autocast(enabled=amp), pytest.raises(
        ValueError
    ) if layer_norm_style is None else nullcontext():
        fused = build_nvfused(
            fused_pattern, shape, bias, activation, p, layer_norm_style
        )
        fused.train().cuda()
        nvfused_res = fused(x, x) if fused.requires_residual else fused(x)
        fused.cpu()
        torch_res = (
            fused(x_cpu, x_cpu).cuda()
            if fused.requires_residual
            else fused(x_cpu).cuda()
        )

        # Check if operation was actually fused
        assert isinstance(
            nvfused_res.grad_fn, torch.autograd.function.BackwardCFunction
        )

        if p == 0.0:
            # Check fused and unfused paths are the same
            assert torch.allclose(torch_res, nvfused_res, atol=1e-6, rtol=1e-2)

    # Restore original flag configuration
    xformers._is_functorch_available = flag_orig


@pytest.mark.skipif(not flag_new, reason="Functorch is not available")
@pytest.mark.skipif(not _gpu_available, reason="GPU is not available")
@pytest.mark.parametrize("activation", ACTIVATIONS)
@pytest.mark.parametrize("device", DEVICES)
@pytest.mark.parametrize("p", [0, 0.1, 0.5])
def test_nvfused_mlp(activation: Activation, device: torch.device, p: float):
    test_config = {
        "name": "MLP",
        "dim_model": LATENT,
        "dropout": p,
        "activation": activation,
        "hidden_layer_multiplier": 4,
        "bias": False,
    }
    # Enable global flag
    xformers._is_functorch_available = flag_new

    torch.random.manual_seed(0)
    torch.cuda.manual_seed_all(0)

    mlp = ff.build_feedforward(test_config)
    # Creates non-fused default MLP
    xformers._is_functorch_available = False
    mlp_default = ff.build_feedforward(test_config)
    xformers._is_functorch_available = flag_new

    inputs = torch.rand(BATCH, SEQ, LATENT, device=device)
    mlp.train()

    # Check fused pattern w/ unfused default (switch happens within NVFusedBiasActivationDropout)
    mlp.cuda()
    fused_res = mlp(inputs)

    mlp.cpu()
    unfused_res = mlp(inputs.cpu())

    if p == 0.0:
        assert torch.allclose(unfused_res.cuda(), fused_res, atol=1e-6, rtol=1e-2)

    # Check fused pattern w/ unfused default (switch happens within MLP)
    mlp.cuda()
    mlp_default.cuda()

    # Load same weight parameters into both models
    default_param_dict = OrderedDict(
        [
            ("mlp.2.weight", v) if k == "mlp.3.weight" else (k, v)
            for k, v in mlp_default.state_dict().items()
        ]
    )
    mlp.load_state_dict(default_param_dict)
    fused_res = mlp(inputs)
    unfused_res = mlp_default(inputs)

    if p == 0.0:
        assert torch.allclose(unfused_res, fused_res, atol=1e-6, rtol=1e-2)

    # Restore original flag configuration
    xformers._is_functorch_available = flag_orig