Flash Decode GQA (and MQA) Improvements (Round 1) #12739

caixunshiren · 2024-09-16T22:09:40Z

Ticket

This PR contains the round 1 improvements outlined in #12330:

Support transpose_q, so that q is accepted in either shape, [1 x qh x b x h] or [1 x b x qh x h]
Support GQA on a shared cache, where KV has shape [1 x kh x s x h]
Add support for tensor indices for GQA
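
For reviewers who want a reference point for the shapes listed above, here is a minimal NumPy sketch of single-token GQA decode attention. It is not the ttnn op and does not mirror its signature: the function name, the `q_heads_first` flag (a stand-in for the two q layouts), and the reading of "tensor indices" as a per-user position array `cur_pos` are all assumptions made for illustration.

```python
import numpy as np


def gqa_decode_reference(q, k, v, cur_pos, q_heads_first=False):
    """Reference GQA decode attention in NumPy (illustrative, not the ttnn API).

    q:       [1, b, qh, h], or [1, qh, b, h] when q_heads_first is True
             (q_heads_first is a hypothetical flag for this sketch)
    k, v:    [b, kh, s, h] per-user cache, or [1, kh, s, h] shared by all users
    cur_pos: integer array of shape [b]; user i attends to positions 0..cur_pos[i]
    returns: [1, b, qh, h]
    """
    if q_heads_first:
        q = q.transpose(0, 2, 1, 3)  # [1, qh, b, h] -> [1, b, qh, h]
    _, b, qh, h = q.shape
    cache_batch, kh, _, _ = k.shape
    assert qh % kh == 0, "query heads must divide evenly into KV heads"
    group = qh // kh  # query heads per KV head; kh == 1 is the MQA case

    out = np.zeros((1, b, qh, h), dtype=np.float64)
    for user in range(b):
        cache_idx = 0 if cache_batch == 1 else user  # shared vs per-user cache
        valid = int(cur_pos[user]) + 1
        for head in range(qh):
            kv_head = head // group
            keys = k[cache_idx, kv_head, :valid]  # [valid, h]
            vals = v[cache_idx, kv_head, :valid]  # [valid, h]
            scores = keys @ q[0, user, head] / np.sqrt(h)
            weights = np.exp(scores - scores.max())  # numerically stable softmax
            weights /= weights.sum()
            out[0, user, head] = weights @ vals
    return out


# Both q layouts should give the same result on a shared cache.
rng = np.random.default_rng(0)
b, qh, kh, s, h = 2, 8, 2, 16, 4
q = rng.standard_normal((1, b, qh, h))
k = rng.standard_normal((1, kh, s, h))
v = rng.standard_normal((1, kh, s, h))
cur_pos = np.array([3, 9])

out_a = gqa_decode_reference(q, k, v, cur_pos)
out_b = gqa_decode_reference(q.transpose(0, 2, 1, 3), k, v, cur_pos, q_heads_first=True)
```

The loop form is deliberately slow and explicit so that the head grouping and the shared-cache indexing are easy to check against the kernel's output.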

FYI @sraizada-tt @cglagovichTT

Post commit pipeline: https://github.com/tenstorrent/tt-metal/actions/runs/10956618476/job/30423043917

caixunshiren · 2024-09-18T14:35:53Z

all post commit: https://github.com/tenstorrent/tt-metal/actions/runs/10951786445

caixunshiren added the kernels, P1, LLM_feature and llama3 labels Sep 16, 2024

caixunshiren self-assigned this Sep 16, 2024

caixunshiren marked this pull request as ready for review September 17, 2024 18:23

caixunshiren requested review from eyonland, patrickroberts, yan-zaretskiy, cfjchu, xanderchin, TT-BrianLiu, ayerofieiev-tt and dmakoviichuk-tt as code owners September 17, 2024 18:23

caixunshiren requested review from cglagovichTT and sraizada-tt September 17, 2024 18:24

caixunshiren force-pushed the page-attention-gqa branch from f7ae607 to a650ec3 Compare September 17, 2024 22:35

cfjchu approved these changes Sep 18, 2024

sraizada-tt approved these changes Sep 18, 2024

caixunshiren force-pushed the page-attention-gqa branch from a650ec3 to bbcb8f7 Compare September 18, 2024 14:34

caixunshiren temporarily deployed to dev September 18, 2024 14:35 — with GitHub Actions Inactive

caixunshiren temporarily deployed to dev September 18, 2024 14:50 — with GitHub Actions Inactive

caixunshiren changed the title ~~Flash Decode GQA (and MQA) Improvements~~ Flash Decode GQA (and MQA) Improvements (Round 1) Sep 18, 2024

caixunshiren temporarily deployed to dev September 18, 2024 14:53 — with GitHub Actions Inactive

caixunshiren added 6 commits September 20, 2024 01:28

#12330: added transpose_q option for gqa and exposed share cache option to user

21c3218

#12330: added share cache option in flash decode gqa

c555949

#12330: fixed regression in paged flash decode

8df5e90

#12330: added tensor index support for flash decode gqa

00b566a

#12330: disabled sdpa gpa test cases

cc1db08

#12330: added reshape in llam3.18b attention to explicitly show padded batch

c11caf8

caixunshiren force-pushed the page-attention-gqa branch from bbcb8f7 to c11caf8 Compare September 20, 2024 01:51

caixunshiren requested review from yieldthought, mtairum and uaydonat as code owners September 20, 2024 01:51

caixunshiren temporarily deployed to dev September 20, 2024 01:52 — with GitHub Actions Inactive

caixunshiren temporarily deployed to dev September 20, 2024 02:01 — with GitHub Actions Inactive

caixunshiren temporarily deployed to dev September 20, 2024 02:03 — with GitHub Actions Inactive

uaydonat approved these changes Sep 20, 2024

Merge branch 'main' into page-attention-gqa

518c63d

sraizada-tt temporarily deployed to dev September 20, 2024 09:18 — with GitHub Actions Inactive

sraizada-tt temporarily deployed to dev September 20, 2024 09:28 — with GitHub Actions Inactive

sraizada-tt temporarily deployed to dev September 20, 2024 09:30 — with GitHub Actions Inactive

sraizada-tt merged commit 72f5ccd into main Sep 20, 2024
105 checks passed

sraizada-tt deleted the page-attention-gqa branch September 20, 2024 10:15
