
Flash attention #7977

Merged: 12 commits merged into Project-MONAI:dev on Aug 6, 2024

Conversation

@virginiafdez (Contributor) commented Aug 1, 2024

Fixes #7944.

Description

In response to issue #7944, I added the new scaled_dot_product_attention functionality from PyTorch to re-enable flash attention, which was present in the original MONAI Generative Models repository. It is only enabled for torch >= 2.0 and when the argument save_attn is False; errors are raised otherwise. I ran the quick tests and added checks to the test_selfattention and test_crossattention scripts to make sure the outputs are the same as when flash attention is not used.
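For context, here is a minimal sketch of the kind of dispatch this adds. The function and argument names are assumptions for illustration; the actual logic lives inside MONAI's self attention and cross attention blocks and may differ:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash_attention=False, save_attn=False, dropout_p=0.0):
    # q, k, v: (batch, num_heads, seq_len, head_dim); sketch only, not the MONAI code
    if use_flash_attention:
        if save_attn:
            # the fused kernel never materialises the attention matrix, so it cannot be saved
            raise ValueError("save_attn is not compatible with use_flash_attention")
        return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
    # non-flash path: explicit attention matrix, which can be saved if needed
    att_mat = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
    att_mat = att_mat.softmax(dim=-1)
    att_mat = F.dropout(att_mat, p=dropout_p)
    return att_mat @ v
```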

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Virginia Fernandez and others added 3 commits July 31, 2024 23:09
…to issue Project-MONAI#7946.

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
…h >= 2.0) to cross attention and self attention blocks, and addition of parameters to diffusion model unet and to transformer block. Modification of tests to check this functionality.

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
@ericspod (Member) commented Aug 1, 2024

A few minor comments but looks good otherwise. I would like others to test this change and see if memory performance improves. Thanks!
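For anyone testing the memory behaviour, a rough sketch of how the two paths could be compared on a CUDA device (names are assumptions; `block` stands for any attention module and `x` for a suitably shaped input):

```python
import torch

def peak_memory_mb(block, x):
    # Reset CUDA peak-memory statistics, run one forward pass, and report the peak in MB.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        block(x)
    return torch.cuda.max_memory_allocated() / 1024 ** 2
```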

…h >= 2.0) to cross attention and self attention blocks, and addition of parameters to diffusion model unet and to transformer block. Modification of tests to check this functionality.

>>> Implementation of proposed corrections

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
…h >= 2.0) to cross attention and self attention blocks, and addition of parameters to diffusion model unet and to transformer block. Modification of tests to check this functionality.

>>> Implementation of proposed corrections:
- Addition of the causal, dropout and scale arguments to the call to scaled_dot_product_attention
- For this, addition of self.dropout_rate as an attribute
- Raising an error when rel_pos_embedding is not None and use_flash_attention is True (a sketch follows this commit message)
- Fix of docstrings that had gone wrong (in cross attention, self attention and the transformer block)
- Addition of two tests to the self and cross attention block tests to cover the rel_pos_embedding error and to make sure that a causal=True call works.

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
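A minimal sketch of the corrections listed above, using simplified class, attribute and argument names (assumptions, not the exact ones in the MONAI blocks):

```python
import torch.nn.functional as F


class AttentionSketch:  # hypothetical stand-in for the self/cross attention blocks
    def __init__(self, dropout_rate=0.0, causal=False, scale=None,
                 rel_pos_embedding=None, use_flash_attention=False):
        if use_flash_attention and rel_pos_embedding is not None:
            raise ValueError("rel_pos_embedding is not compatible with use_flash_attention")
        self.dropout_rate = dropout_rate  # kept as an attribute so it can be forwarded below
        self.causal = causal
        self.scale = scale
        self.use_flash_attention = use_flash_attention

    def _flash_attention(self, q, k, v):
        # causal masking, dropout and scaling are delegated to PyTorch's fused kernel
        return F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout_rate,
            is_causal=self.causal,
            scale=self.scale,  # the scale keyword needs a recent PyTorch (2.1+)
        )
```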
@Nic-Ma requested a review from yiheng-wang-nv on August 3, 2024 10:18
@Nic-Ma (Contributor) commented Aug 3, 2024

@yiheng-wang-nv Could you please also help review this PR?

Thanks in advance.

…h >= 2.0) to cross attention and self attention blocks, and addition of parameters to diffusion model unet and to transformer block. Modification of tests to check this functionality.

>>>> It was necessary to transpose the query, key and value tensors passed to the PyTorch flash attention function, and then to transpose the result back, so that the behaviour is consistent with the xformers and non-flash implementations (see the sketch below).

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
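A minimal sketch of the transpose handling described above, assuming the xformers-style (batch, seq_len, num_heads, head_dim) layout on the MONAI side (the exact code may differ):

```python
import torch
import torch.nn.functional as F

def flash_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # PyTorch's fused kernel expects (batch, num_heads, seq_len, head_dim),
    # so transpose into that layout and back out again afterwards.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)  # back to (batch, seq_len, num_heads, head_dim)
```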
@KumoLiu (Contributor) commented Aug 5, 2024

Perhaps we also need to include the test @ericspod shared, which compares the result computed without flash attention against the reference attention that PyTorch documents as equivalent to F.scaled_dot_product_attention. The dimensions appear to be highly error prone, and additional comprehensive testing is necessary to catch mistakes. Thanks!
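A sketch of the kind of equivalence check being suggested, using plain tensors rather than the MONAI block API (shapes and tolerances are assumptions):

```python
import torch
import torch.nn.functional as F


def reference_attention(q, k, v):
    # The explicit "math" formulation that F.scaled_dot_product_attention
    # is documented to be equivalent to (no mask, no dropout).
    att_mat = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return att_mat.softmax(dim=-1) @ v


def test_flash_matches_reference():
    torch.manual_seed(0)
    q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))  # (batch, heads, seq, head_dim)
    expected = reference_attention(q, k, v)
    actual = F.scaled_dot_product_attention(q, k, v)
    torch.testing.assert_close(actual, expected, atol=1e-5, rtol=1e-5)
```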

@ericspod (Member) commented Aug 5, 2024

Related issues: #7991 #7992

Virginia Fernandez added 3 commits August 6, 2024 13:33
In particular:
- modified the line <att_mat = att_mat.masked_fill(self.causal_mask[:, :, : x.shape[1], : x.shape[1]] == 0, float("-inf"))>, used in self_attention when causal is True, to <att_mat = att_mat.masked_fill(self.causal_mask[:, :, : x.shape[-2], : x.shape[-2]] == 0, float("-inf"))> (see the sketch below these commit messages)
- added the pertinent transpose calls to cross attention to ensure that the behaviour matches that of xops and that the code also works for flash_attention=False
- added a SkipIfPytorch[...] clause before test_shape in test_cross_attention so that it does not error out for the cases in the case block that use flash_attention=True
- fixed one rogue space that had been added to the docstrings

I ran autofix and mypy; cross_attention was reformatted, and mypy did not suggest changes.

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
>>>> FIX: I forgot to sign!

Signed-off-by: Virginia Fernandez <virginia.fernandez@kcl.ac.uk>
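For illustration, a minimal sketch of the corrected causal masking described in the commit messages above; the mask construction and function name are assumptions, only the masked_fill line reflects the actual fix:

```python
import torch

seq_len = 16
# GPT-style lower-triangular mask, shown here only for illustration
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)

def apply_causal_mask(att_mat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Index the mask by the sequence-length dimension x.shape[-2] (the fix above),
    # rather than x.shape[1].
    return att_mat.masked_fill(
        causal_mask[:, :, : x.shape[-2], : x.shape[-2]] == 0, float("-inf")
    )
```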
@ericspod (Member) commented Aug 6, 2024

@KumoLiu @mingxin-zheng @yiheng-wang-nv we have flash attention added here. There are other changes we want to make in further PRs, such as #7996, but we should merge this one first. I know there are consistency issues between this layer and the one in GenerativeModels; I'm working on narrowing down where they come from. For now I would say we resolve whatever conversations are outstanding, run blossom, and hopefully merge soon. We can come back to these issues if there are any. Also, @virginiafdez will be away from tomorrow and less able to work on things.

@KumoLiu (Contributor) commented Aug 6, 2024

/build

@KumoLiu (Contributor) commented Aug 6, 2024

Ran more tests; the remaining issues will be addressed in the next PR.

@KumoLiu merged commit 6c23fd0 into Project-MONAI:dev on Aug 6, 2024
28 checks passed
Successfully merging this pull request may close this issue: Consideration of Flash Attention in Generative Components