refactor attn layers #240
Conversation
Force-pushed from c468655 to aa912dd
Great work Vitaliy, this looks clean and ready for caching.
Running a 125M (with clip 6) using flash, triton, and torch in wandb proj attn_refactor; MFU is about where it should be and the training curves look identical. The attn_refactor project also has a 7B run with triton (on 16 GPUs) to verify MFU is where it should be: it's at > 43% MFU, which checks out.
Force-pushed from 3b845b2 to e8707f6
For inference, ideally we should have export tests for ONNX/TorchScript, but those can be added in a follow-up PR.
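A rough sketch of what such an export test could look like (the module here is a placeholder, not code from this repo; names are illustrative):

```python
import io
import torch

model = torch.nn.Linear(8, 8).eval()  # placeholder; would be the refactored attn module / model
example = torch.randn(1, 8)

# TorchScript: trace and check the traced module matches eager execution
traced = torch.jit.trace(model, example)
torch.testing.assert_close(traced(example), model(example))

# ONNX: export to an in-memory buffer just to verify the export succeeds
buf = io.BytesIO()
torch.onnx.export(model, example, buf)
```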
This looks awesome @vchiley! I added a few comments, but they're minor.
Looks great. Much cleaner now.
^ can be done later once we move to 2.0.
This PR refactors all attn_impl options to use one class which calls the different attn fns (flash, triton, torch) internally.
This means the state dict of one attn_impl is guaranteed to be the same as the others (this is now tested).
Note: this PR uses the unpacked attn variants (i.e. q, k, v are chunked out of qkv), since kv caching will concatenate cached tokens onto k and v, so q will have a different seq len from k and v. The overall shape of the refactor is sketched below.
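A minimal sketch of the pattern (not the actual code in this PR; the fn names, signatures, and kv-cache handling here are illustrative, and masking, alibi, clip_qkv, and qk_ln are omitted):

```python
import torch
import torch.nn as nn

def torch_attn_fn(q, k, v, n_heads):
    # plain PyTorch attention; the flash/triton fns would share this signature
    b, s_q, d = q.shape
    s_k = k.shape[1]
    q = q.view(b, s_q, n_heads, d // n_heads).transpose(1, 2)
    k = k.view(b, s_k, n_heads, d // n_heads).transpose(1, 2)
    v = v.view(b, s_k, n_heads, d // n_heads).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / (d // n_heads) ** 0.5
    ctx = scores.softmax(dim=-1) @ v
    return ctx.transpose(1, 2).reshape(b, s_q, d)

# the real fns differ internally; flash/triton are just stand-ins in this sketch
ATTN_FNS = {'torch': torch_attn_fn, 'flash': torch_attn_fn, 'triton': torch_attn_fn}

class MultiheadAttention(nn.Module):
    """One attn cls; attn_impl only selects the inner attn fn, so the params
    (Wqkv, out_proj), and therefore the state dict, are identical across impls."""

    def __init__(self, d_model, n_heads, attn_impl='triton'):
        super().__init__()
        self.n_heads = n_heads
        self.Wqkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.attn_fn = ATTN_FNS[attn_impl]

    def forward(self, x, past_key_value=None):
        qkv = self.Wqkv(x)
        # unpacked variant: chunk q, k, v out of qkv so their seq lens can diverge
        q, k, v = qkv.chunk(3, dim=-1)
        if past_key_value is not None:
            # kv caching concatenates cached tokens onto k/v; q stays at the new-token len
            past_k, past_v = past_key_value
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)
        ctx = self.attn_fn(q, k, v, self.n_heads)
        return self.out_proj(ctx), (k, v)
```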
Tests also now compare all variants to each other with clip, qk_ln, and alibi (when possible). This adds roughly 55 very quick tests.
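The comparison tests have roughly this shape (again illustrative; this reuses the `MultiheadAttention` sketch above, and the real tests also sweep clip, qk_ln, and alibi where the impl supports them):

```python
import pytest
import torch

@pytest.mark.parametrize('impl_0,impl_1', [
    ('flash', 'triton'), ('flash', 'torch'), ('triton', 'torch')])
def test_attn_impls_match(impl_0, impl_1):
    torch.manual_seed(17)
    attn_0 = MultiheadAttention(d_model=64, n_heads=4, attn_impl=impl_0)
    attn_1 = MultiheadAttention(d_model=64, n_heads=4, attn_impl=impl_1)
    # identical state dicts across impls: a checkpoint from one loads into any other
    attn_1.load_state_dict(attn_0.state_dict())

    x = torch.randn(2, 16, 64)
    y_0, _ = attn_0(x)
    y_1, _ = attn_1(x)
    # fused kernels run in lower precision, so compare with a loose tolerance
    torch.testing.assert_close(y_0, y_1, atol=1e-2, rtol=0)
```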
@dskhudia lmk if this works for the inference work
@dskhudia also noted that we'll want to move to torch.nn.functional.scaled_dot_product_attention, either in this PR or a later one.
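For reference, a minimal use of that API (available in torch >= 2.0; q/k/v are (batch, n_heads, seq_len, head_dim)):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

# dispatches to a fused (flash / mem-efficient) kernel when one is available,
# otherwise falls back to the math implementation
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```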
@samhavens re: IFT. Note: the flash path will unpad the input and do the attn calculation unpadded; the other impls do the calculation on the entire padded input, but the batch is only padded up to the longest seq in the batch (not the max seq len of the model).
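A toy illustration of that padding behavior (the numbers are made up; the flash path's unpadding is only described in the comments):

```python
import torch

max_seq_len = 2048                     # model max; no impl pays for this on short batches
lengths = torch.tensor([7, 11, 5, 9])  # real (non-pad) tokens per example
longest = int(lengths.max())           # 11

# collate/pad only up to the longest seq in the batch, not max_seq_len
attention_mask = torch.arange(longest)[None, :] < lengths[:, None]  # (4, 11) bool

padded_tokens = attention_mask.numel()  # 44: what the torch/triton paths attend over (with masking)
real_tokens = int(lengths.sum())        # 32: what the flash path attends over after unpadding
print(longest, padded_tokens, real_tokens)
```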