[Attention Mask] Refactor all encoder-decoder attention mask #27086
Conversation
As you have mentioned in your PR description, I would personally be in favor of having `_` prefixes, just to ensure they aren't leveraged by third parties thinking they're public. They're prominently displayed in the forward pass of a significant number of models, so it's worth adding the prefix.
Very cool PR! This is much cleaner this way, IMO
@@ -1522,7 +1499,7 @@ def forward(
             layer_outputs = self.gradient_checkpointing_func(
                 decoder_layer.__call__,
                 hidden_states,
-                combined_attention_mask,
+                None,
Huh, that `combined_attention_mask` was pretty weird.
Thanks for refactoring all this code. It's so much cleaner and great to see all the repeated code being deleted 🙏
Just a few nit comments. My only request is that a handful of our biggest models have their slow tests run as a sense check before merging.
    def __init__(self, is_causal: bool, sliding_window: Optional[int] = None):
        self.is_causal = is_causal
        self.sliding_window = sliding_window
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a check here that the passed `sliding_window` is positive if not None.
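For illustration, a minimal sketch of the suggested check (the class skeleton just mirrors the snippet above; the exact error message is an assumption):

```python
from typing import Optional


class AttentionMaskConverter:
    def __init__(self, is_causal: bool, sliding_window: Optional[int] = None):
        self.is_causal = is_causal
        self.sliding_window = sliding_window

        # Suggested validation: a sliding window only makes sense as a strictly positive integer.
        if self.sliding_window is not None and self.sliding_window <= 0:
            raise ValueError(
                f"`sliding_window` must be a strictly positive integer if not None, got {self.sliding_window}"
            )
```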
-        attention_mask = self.attn_mask_converter.to_causal_4d(
-            batch_size, seq_length, key_value_length, dtype=inputs_embeds.dtype, device=inputs_embeds.device
-        )
+        attention_mask = prepare_4d_causal_attention_mask(
Agreed! :D
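For context, a hedged sketch of calling the shared helper the diff above switches to (the import path and signature reflect transformers' mask utilities around the time of this PR; the leading-underscore name was still under discussion and may differ between versions):

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_length, past_key_values_length, hidden = 2, 5, 0, 8
inputs_embeds = torch.randn(batch_size, seq_length, hidden)
attention_mask_2d = torch.ones(batch_size, seq_length, dtype=torch.long)  # 2d padding mask (or None)

attention_mask = _prepare_4d_causal_attention_mask(
    attention_mask_2d,
    (batch_size, seq_length),
    inputs_embeds,               # only used for dtype/device
    past_key_values_length,
)
print(attention_mask.shape)  # torch.Size([2, 1, 5, 5])
```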
assert mask_4d.shape == (bsz, 1, q_len, kv_len)
        context = mask_converter.sliding_window
        if mask_converter.is_causal and context is None:
I'm personally not a fan of having test functions with lots of if/else statements - it tends to lead to utility functions which try to handle everything and can be error prone.
Hmm, fair. I'd say, though, that for tests it's OK since only we maintainers look at them. For me, the if-else statements actually helped quite a bit to map out all the different scenarios that exist (causal mask + window, causal mask + no window, non-causal mask + window, non-causal mask + no window).
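An illustrative sketch of those four scenarios (not the actual test code; the sliding-window boundary convention used here is an assumption): it builds the boolean mask a converter with the given settings would be expected to allow.

```python
import torch


def build_expected_bool_mask(is_causal: bool, sliding_window, q_len: int, kv_len: int) -> torch.Tensor:
    """Return a [q_len, kv_len] mask that is True where attention is allowed."""
    # Align query positions with key positions (queries correspond to the last q_len keys).
    q_idx = torch.arange(q_len)[:, None] + (kv_len - q_len)
    k_idx = torch.arange(kv_len)[None, :]

    allowed = torch.ones(q_len, kv_len, dtype=torch.bool)
    if is_causal:
        allowed &= k_idx <= q_idx                  # causal: never attend to future keys
    if sliding_window is not None:
        allowed &= k_idx > q_idx - sliding_window  # window: only the last `sliding_window` keys
    return allowed


# causal + window, causal + no window, non-causal + window, non-causal + no window
for is_causal in (True, False):
    for window in (3, None):
        n_allowed = build_expected_bool_mask(is_causal, window, q_len=4, kv_len=6).sum().item()
        print(is_causal, window, n_allowed)
```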
Love it!
@require_torch
class AttentionMaskTester(unittest.TestCase):
I would also advocate for the `self.assert` methods that we use more now 😉, but it's a nit.
Hmm, I really think `assert ...` is much better to use in tests because the error message is much cleaner. E.g. if you do `self.assertTrue(expected_ids == predicted_ids)`, assuming each is a list, your error message will just be "the assertion is not True", whereas doing `assert expected_ids == predicted_ids` gives you a much better error message. OK to change for consistency, but I really think that just doing `assert ...` is better.
Yeah, but `self.assertListEqual()` should give you more info, no? I don't mind, I just want consistency!
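A small self-contained illustration of the error-message point above (hypothetical test, not from the PR):

```python
import unittest


class TinyExample(unittest.TestCase):
    def test_ids(self):
        expected_ids = [1, 2, 3]
        predicted_ids = [1, 2, 3]

        # self.assertTrue(expected_ids == predicted_ids)
        #   -> on a mismatch this only reports "False is not true"
        # assert expected_ids == predicted_ids
        #   -> under pytest, assertion rewriting prints both lists and their diff
        # self.assertListEqual(expected_ids, predicted_ids)
        #   -> also prints an element-wise diff of the two lists
        self.assertListEqual(expected_ids, predicted_ids)


if __name__ == "__main__":
    unittest.main()
```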
Thanks for the fast reviews everyone! Ran slow tests for:

Think this should be good enough!
@patrickvonplaten I know this PR has been merged, but I have a question regarding the PyTorch version of BART. The implementation assumes the decoder is always used in an autoregressive manner in the PyTorch version, unlike the Flax version. There could be cases of the decoder being used as an "encoder" with "cross attention". In this case, the autoregressive behaviour is not required. I think the default should be the autoregressive manner, but if
@DavidAkinpelu I think you linked the FlaxAttention class, not the FlaxDecoder class, above. In PT, the Attention class can also be used in non-causal mode, just like in Flax. If you want to use Bart in non-autoregressive mode, why don't you use BartEncoder?
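A hedged sketch of that suggestion (the checkpoint name is just an example): if no causal masking is wanted, run the bidirectional encoder stack directly instead of the decoder.

```python
from transformers import AutoTokenizer, BartModel

model = BartModel.from_pretrained("facebook/bart-base")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

inputs = tokenizer("Hello world", return_tensors="pt")
encoder = model.get_encoder()  # BartEncoder: bidirectional, no causal mask is applied
encoder_outputs = encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
hidden_states = encoder_outputs.last_hidden_state  # [batch, seq_len, hidden_size]
```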
@patrickvonplaten This paper got me thinking in that direction: Mores+.
…face#27086)

* [FA2 Bart] Add FA2 to all Bart-like
* better
* Refactor attention mask
* remove all customized atteniton logic
* format
* mass rename
* replace _expand_mask
* replace _expand_mask
* mass rename
* add pt files
* mass replace & rename
* mass replace & rename
* mass replace & rename
* mass replace & rename
* Update src/transformers/models/idefics/modeling_idefics.py
* fix more
* clean more
* fix more
* make style
* fix again
* finish
* finish
* finish
* finish
* finish
* finish
* finish
* finish
* finish
* finish
* Apply suggestions from code review
* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* small fix mistral
* finish
* finish
* finish
* finish

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR refactors the attention mask handling of all PT Seq2Seq models. While this is a nice quality-of-life improvement, it is also necessary in order to effectively add FA2 and SDPA to PT Seq2Seq models (without having to change 54+ files).
In a follow-up PR it'll be much easier to add FA2 to just Bart and the most important Bart-like models.
The PR slightly goes against the single-file policy, but attention masks are really the same across models and there is also only so much they can be (causal, non-causal, windowed). I think it doesn't really hurt readability, as the functions are very clearly defined (create a 4d attention mask from a 2d one).
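A minimal sketch of what "create a 4d attention mask from a 2d one" means (illustrative only; the real helpers introduced here handle causal, non-causal and windowed masks and live in a shared module rather than in each model file):

```python
import torch


def expand_2d_to_4d(attention_mask_2d: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Turn a [batch, key_len] padding mask of 1s/0s into an additive
    [batch, 1, 1, key_len] mask of 0 / large-negative values."""
    expanded = attention_mask_2d[:, None, None, :].to(dtype)  # [batch, 1, 1, key_len]
    inverted = 1.0 - expanded                                 # 1.0 where tokens are padding
    return inverted.masked_fill(inverted.to(torch.bool), torch.finfo(dtype).min)


mask_4d = expand_2d_to_4d(torch.tensor([[1, 1, 1, 0]]), torch.float32)
print(mask_4d.shape)  # torch.Size([1, 1, 1, 4])
```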
For some very big exceptions (I found only one, which is LED, see the comment here), we can just write part of the mask creation separately, as is done.
I could also give the mask creation functions a `_` prefix to make it clearer that they are private methods in Transformers. Both keeping as is and changing are fine with me.

@amyeroberts @LysandreJik @ArthurZucker this is ready for a review!