implement dot_product_attention #455

Merged · 15 commits · Feb 3, 2023
Conversation

@CarloLucibello (Member) commented on Jan 3, 2023:

Factored out from FluxML/Flux.jl#2146

Fix #385

In the process, this also extends batched_mul to multiple batch dimensions. Fix #451, fix #391.

We may want to consider hooking into cuDNN in a later PR.
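For orientation, here is a minimal single-head sketch of the computation this PR adds, written against NNlib's public batched_mul and softmax. It is only an illustration of the idea, not the PR's actual code, and the helper name simple_attention is made up for this sketch.

    using NNlib: batched_mul, batched_transpose, softmax

    # Minimal single-head sketch (illustrative only, not the PR's implementation).
    # q, k, v have size (feature_dim, seq_len, batch); mask, if given, must be
    # broadcastable over the (kv_len, q_len, batch) logits.
    function simple_attention(q, k, v; mask=nothing)
        scale = inv(sqrt(eltype(q)(size(q, 1))))
        logits = batched_mul(batched_transpose(k), q) .* scale   # (kv_len, q_len, batch)
        if mask !== nothing
            neginf = typemin(eltype(logits))
            logits = ifelse.(mask, logits, neginf)
        end
        α = softmax(logits; dims=1)           # attention weights over the keys
        return batched_mul(v, α)              # (feature_dim, q_len, batch)
    end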

@mcabbott (Member) left a comment:

I think this mostly looks good, and is about the right level -- fairly simple-to-read implementation with no magic.

The thing to get right now seems to be: Does this function line up fairly nicely with what CUDA provides, so that an overload dot_product_attention(::CuArray, ...) can smoothly provide the same functionality? From a quick look

  1. It seems that it wants a weight array which, if I understand right, would correspond to steps before this function: https://github.com/JuliaGPU/CUDA.jl/blob/8a4cbdee50c716ff642eb3d9268f1a7ea4c29eb0/lib/cudnn/src/multiheadattn.jl#L20

  2. I'm not entirely sure what its masking options are; what am I missing, e.g. here: https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetAttnDescriptor

  3. Would need to read more slowly to see where it does and doesn't include bias, and dropout.

Maybe 1. points towards this function being the core of a larger one, which more closely matches the CUDA one? If so then it probably shouldn't have dropout, as that can trivially be composed by whatever calls it.

Lots of scattered comments below; you should probably ignore some of those on internal details, as they can be fixed later, but the overall interface is the question.

(Several inline review comments on src/attention.jl, since resolved.)
@@ -42,6 +44,16 @@ This will be copied, as doing so is faster than `batched_mul_generic!`.
Both this `copy` and `batched_mul_generic!` produce `@debug` messages,
and setting for instance `ENV["JULIA_DEBUG"] = NNlib` will display them.
"""

function batched_mul(x::AbstractArray{T1,N}, y::AbstractArray{T2,N}) where {T1,T2,N}
Member:
My vote is to make this an internal _batched_mul_4 or something for now. Partly because I think explaining what does and doesn't work becomes more complicated with this method. And that doesn't have to be solved to add attention.

Member Author:
It's a pity to not make things available. Maybe I can leave the previous docstring unchanged and add a new one for the new method?
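For reference, one way an N-dimensional batched_mul can reduce to the existing 3-dimensional method is by collapsing the trailing batch dimensions. This is a sketch of that idea under the assumption of matching batch sizes, not necessarily how the PR implements it, and the name batched_mul_nd is hypothetical.

    using NNlib: batched_mul

    # Sketch: collapse trailing batch dimensions, multiply, then reshape back.
    # Assumes x and y share the same trailing batch sizes (no broadcasting).
    function batched_mul_nd(x::AbstractArray{T1,N}, y::AbstractArray{T2,N}) where {T1,T2,N}
        batch = size(x)[3:end]
        @assert batch == size(y)[3:end]
        z3 = batched_mul(reshape(x, size(x, 1), size(x, 2), :),
                         reshape(y, size(y, 1), size(y, 2), :))
        return reshape(z3, size(z3, 1), size(z3, 2), batch...)
    end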

@CarloLucibello (Member Author) commented on Jan 5, 2023:

Regarding cuDNN, the attention descriptor documentation says:

Weight matrices W_Q,i, W_K,i, W_V,i and W_O,i play similar roles, adjusting vector lengths in the q, K, V inputs and in the multi-head attention final output. The user can disable any or all projections by setting the qProjSize, kProjSize, vProjSize or oProjSize arguments to zero.

so we can deactivate the projections to match the API introduced in this PR. Conversely, we can add overloads here in the future if we also want to support the projections.

For dropout:

The attnDropoutDesc and postDropoutDesc arguments are descriptors that define two dropout layers active in the training mode. The first dropout operation, defined by attnDropoutDesc, is applied directly to the softmax output. The second dropout operation, specified by postDropoutDesc, alters the multi-head attention output, just before the point where residual connections are added.

Support for masking is only available in the form of attention windows, according to the following inputs to cudnnMultiHeadAttnForward:

loWinIdx[], hiWinIdx[]

Input. Two host integer arrays specifying the start and end indices of the attention window for each Q time-step. 
The start index in K, V sets is inclusive, and the end index is exclusive.

Bias in the attention logits doesn't seem to be supported.

@CarloLucibello (Member Author):

I think this is good to go

src/attention.jl (outdated), comment on lines 105 to 107:
if mask === :causal
    mask = make_causal_mask(logits)
end
Member:

I think a cleaner API would be to let the mask keyword be a function. The nothing case is mask = identity and the causal case is mask = make_causal_mask (which I feel should be just causal_mask to be succinct).

Is there a reason to construct the mask on the fly? The calling layer in Flux can probably make and store the mask once. Then the other option is to allow nothing or an array. Then the user passes in mask = causal_mask(ntoken).

@mcabbott (Member), Jan 9, 2023:

What is the function which you pass, in this proposal?

  • mask = identity means this is applied to the array.

  • mask = make_causal_mask means it constructs a boolean matrix.

Agree that constructing the same matrix every time seems a bit wasteful, although it's probably not a big cost; there are quite a few larger copies made in this thing.

With mask = identity, the usual masking could be causal_mask!, which is basically for i,j in ...; if i<j; x[i,j] = -Inf end; i.e. it just mutates the data array. This should be safe, as the gradient of batched_mul does not need the original values.
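A runnable version of this mutating idea might look as follows; the name causal_mask! and the (kv_len, q_len, batch) logits layout are assumptions for this sketch, not part of the PR.

    # Hypothetical mutating mask (sketch): with logits laid out as
    # (kv_len, q_len, batch), query j may only attend to keys i ≤ j,
    # so entries with i > j are set to -Inf before the softmax.
    function causal_mask!(logits::AbstractArray{T,3}) where T
        for b in axes(logits, 3), j in axes(logits, 2), i in axes(logits, 1)
            if i > j
                logits[i, j, b] = typemin(T)
            end
        end
        return logits
    end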

Member:

You're right, it shouldn't be identity; it should be trues_like, though I'd be okay with nothing in order to skip computing a mask at all.

My comment about constructing on the fly was not a performance concern. I just think it is more intuitive to pass in exactly the mask array I want used. It's an easier rule to remember and also scalable to whatever masking scheme is desired.

@mcabbott (Member), Jan 9, 2023:

The downside is that you have to make an array of the right size. If you have several layers and want the same scheme for each, then perhaps it's a pain, whereas a function like trues_like is told the size automatically.

(The implementation can branch on mask === trues_like to avoid work in the default case. We can also branch on the type of const causal_mask = triu ∘ trues_like if necessary.)

While encoding this as a Bool array makes some sense, it's also a little weird in that the implementation doesn't directly consume it. Perhaps better than my mutating idea above, we could modify softmax to take a mask argument and fuse it into the broadcast there, I think.

Member:

That's true, but generally the size of this matrix, which is (number of tokens) × (number of tokens), is known ahead of time. Even so, I agree that not needing to pass in this info is cleaner.

I mostly wanted to avoid "symbol switches" for arguments.

@mcabbott (Member), Jan 9, 2023:

Yes to avoiding symbols. I like this mask = trues_like proposal the best so far.

One question I haven't looked at is what format the cuDNN side is going to want.

Member:

Instead of saying mask is either an array or a callable, could we say it should be either an array or a marker type for which one can override some init_mask(x, k, v) function? This would allow us to shift the conditionals out of the attention functions, while still allowing relatively terse syntax like mask = CausalMask() when users don't want to precompute their own. You could imagine nice party tricks like passing mask = I.

@mcabbott (Member), Jan 9, 2023:

#460 is a go at this masked softmax idea.

With that, the default of no mask can in fact be mask = Returns(true) here, instead of trues_like. And the terse causal mask can be const causal_mask = triu ∘ trues_like, or a function equivalent to this (maybe it can be more efficient, not sure triu works on CuArrays). No conditionals required.

Edit: making #460 work on GPU too won't be just a few lines. But even without that, mask::Function = trues_like as the interface seems nice, instead of having to independently make something of the right size.
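To make the proposal concrete, here is a sketch of what such mask functions could look like. The names trues_like and causal_mask follow the discussion above, but these definitions are illustrative, not the merged API.

    using LinearAlgebra: triu

    # Bool array of trues with the same shape as x (or the given size).
    trues_like(x::AbstractArray, sz=size(x)) = fill!(similar(x, Bool, sz), true)

    # Causal mask for a (kv_len, q_len) slice of the logits:
    # mask[i, j] is true iff key i is visible to query j (i ≤ j).
    causal_mask(x::AbstractArray) = triu(trues_like(x, (size(x, 1), size(x, 2))))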

Member:

triu only works on AbstractMatrix, which is not sufficient for the attention.

Member Author:

For this first implementation, I prefer to keep it more minimal and just accept nothing or arrays (I will remove :causal).

@CarloLucibello (Member Author) commented on Jan 22, 2023:

:causal removed; now we only accept array masks or nothing. Good to go?

@ToucheSir (Member) left a comment:

Mostly LGTM and appears to give us enough surface area for cuDNN. Just a couple final questions:

src/attention.jl (outdated), comment on lines 101 to 108:
if bias !== nothing
    logits = logits .+ bias
end

if mask !== nothing
    neginf = typemin(eltype(logits))
    logits = ifelse.(mask, logits, neginf)
end
@ToucheSir (Member), Jan 24, 2023:

WDYT about making these internal methods that dispatch on nothing? That way there's zero control flow and Zygote is happy. The main question is whether the additional code and complexity introduced would be worth the compile-time and runtime reduction.
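For illustration, the dispatch-on-nothing version could look like this; the helper names are hypothetical and this is a sketch rather than the PR's code.

    # No branches: the nothing case is resolved by dispatch, so Zygote
    # never differentiates through a runtime conditional.
    apply_bias(logits, ::Nothing) = logits
    apply_bias(logits, bias) = logits .+ bias

    apply_mask(logits, ::Nothing) = logits
    apply_mask(logits, mask) = ifelse.(mask, logits, typemin(eltype(logits)))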

@jondeuce (Contributor) commented on Feb 3, 2023:

This looks great; I've long thought that there should be some basic attention mechanisms available here.

Before settling on the API for passing masks, I thought I should mention that "non-boolean" masking such as Attention with Linear Biases (ALiBi) has been used in some pretty substantial projects, e.g. BigScience's BLOOM, and it would be useful to have this option.

This would only be a slight generalization from what is currently implemented here (see Fig. 3 in the ALiBi paper), and could easily be incorporated by dispatching on the type of the mask:

  1. mask::AbstractArray{Bool} acts like ifelse.(mask, logits, -Inf)
  2. mask::AbstractArray{<:Real} acts like logits .+ mask

This would actually be almost exactly what PyTorch does as well, which is a nice bonus. From the torch.nn.MultiheadAttention docs:

Binary, byte, and float masks are supported. For a binary mask, a True value indicates that the corresponding position is not allowed to attend. For a byte mask, a non-zero value indicates that the corresponding position is not allowed to attend. For a float mask, the mask values will be added to the attention weight.

Interesting that PyTorch appears to have the meaning of true and false reversed compared to what is implemented in this PR, but Keras has the same convention as this PR (see the docs and the code). Not sure which meaning is more natural 🤷‍♂️.
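A sketch of the type-based dispatch described in the comment above; the helper name mask_logits is hypothetical, and this is not what the PR ultimately adopts, since the PR keeps a separate additive bias keyword.

    # Boolean mask: disallowed positions get -Inf; real-valued mask: added to the logits.
    mask_logits(logits, mask::AbstractArray{Bool}) = ifelse.(mask, logits, typemin(eltype(logits)))
    mask_logits(logits, mask::AbstractArray{<:Real}) = logits .+ mask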

@CarloLucibello (Member Author):

@jondeuce the attention bias in the PR already does what you suggest. Should we collapse bias into mask or keep the two separate? I was inspired by https://flax.readthedocs.io/en/latest/api_reference/_autosummary/flax.linen.dot_product_attention.html?highlight=attention for the interface.

@jondeuce (Contributor) commented on Feb 3, 2023:

@CarloLucibello Ahh, yes, that is a flexible approach too, and it clearly covers ALiBi-style masks. It's funny: I did notice the bias being added, but my brain did not register that bias would be an additive mask; instead I thought of learnable biases like in Dense layers.

The more I think about it, the more I like the way you have it. They are orthogonal and mutually compatible ways to apply a mask, and I don't think combining them and then doing different operations based on the mask type is conceptually any simpler anyway.
