Add a inplace concat custom op based on CUDA VMM API #9126

lszxb · 2024-09-11T08:49:57Z

PR types

Performance optimization

PR changes

Others

Description

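This PR attempts to add in-place concat support, based on the CUDA VMM API, to the current large-model inference path (the principle is similar to vAttention), so that the entire KV cache no longer has to be copied at every decoding step.
So far only the custom op is implemented; a corresponding pass still needs to be added later so that other models can be adapted automatically.
The PR currently applies this approach to the llama model, where it gives roughly a 10% improvement with 3072 input + 1024 output tokens.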

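The main idea at present is:

Use a special kind of Tensor whose GPU memory is allocated through the VMM API. This Tensor uses a special phi::Allocation that reserves a large virtual address range at creation time and can allocate physical pages and map them into that range when needed.
To stay compatible with the remaining calls, the cache shape is batch x seq_len x num_head x head_dim, but because new states are appended at the tail of the cache, the cache memory layout should be seq_len x batch x num_head x head_dim.
The semantics of the vtensor_reserve_one_token custom op are roughly as follows:
If key_cache is not a VTensor, allocate a new VTensor and copy the data from the original key_cache into it. Then use the VTensor growth mechanism to reserve space for one new token at the tail, and copy key_states into that space.
If key_cache is a VTensor, use the VTensor growth mechanism directly to reserve space for one new token at the tail, and copy key_states into that space.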
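
The two branches of the op can be illustrated with a small pure-Python simulation. This is only a sketch of the semantics described above, not the PR's implementation: `VTensorSim`, `flat_offset`, the element-based sizes, and the flat-list cache representation are all hypothetical, and no CUDA or Paddle API is called. A preallocated list stands in for the reserved virtual range, and `mapped_elems` stands in for the physical blocks mapped so far.

```python
class VTensorSim:
    """Hypothetical stand-in for a VMM-backed cache (seq_len-major layout)."""

    def __init__(self, token_elems, virtual_elems, block_elems):
        self.token_elems = token_elems          # batch * num_head * head_dim
        self.virtual_elems = virtual_elems      # size of the reserved range
        self.block_elems = block_elems          # mapping granularity
        self.storage = [None] * virtual_elems   # reserved once, never moved
        self.mapped_elems = 0                   # portion backed by "physical" blocks
        self.seq_len = 0

    def reserve_one_token(self):
        """Grow by one token and return the offset of the new slot."""
        needed = (self.seq_len + 1) * self.token_elems
        if needed > self.virtual_elems:
            raise MemoryError("virtual reservation exhausted")
        # Map more blocks only when the tail crosses the mapped region.
        while self.mapped_elems < needed:
            self.mapped_elems = min(self.mapped_elems + self.block_elems,
                                    self.virtual_elems)
        start = self.seq_len * self.token_elems
        self.seq_len += 1
        return start


def flat_offset(b, s, h, d, batch, num_head, head_dim):
    """Offset of logical element [b, s, h, d] in the seq_len-major layout."""
    return ((s * batch + b) * num_head + h) * head_dim + d


def vtensor_reserve_one_token(key_cache, key_states,
                              virtual_elems=64, block_elems=8):
    """Simulated op: key_cache is a VTensorSim or a flat seq_len-major list."""
    token_elems = len(key_states)
    if not isinstance(key_cache, VTensorSim):
        # First call: migrate the plain cache into a new VTensorSim (one copy).
        plain = key_cache
        key_cache = VTensorSim(token_elems, virtual_elems, block_elems)
        for t in range(len(plain) // token_elems):
            start = key_cache.reserve_one_token()
            key_cache.storage[start:start + token_elems] = \
                plain[t * token_elems:(t + 1) * token_elems]
    # Both branches: reserve one slot at the tail and copy the new states in.
    start = key_cache.reserve_one_token()
    key_cache.storage[start:start + token_elems] = key_states
    return key_cache
```

Calling `vtensor_reserve_one_token` on a plain list performs the one-time migration; calling it again on the returned object appends in place and returns the same object. Because the layout is seq_len-major, each new token occupies one contiguous slot at the tail, which is why existing data never has to move.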

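Known potential issues:

Only space for one token can be appended at a time.
The virtual address space size and the block size are currently fixed (1 GiB and 32 MiB); it might be better to expose an API so that users can tune them?
The input and output key_cache share the same memory, which may cause conflicts in some cases.
The method relies on the KV cache being the same Tensor at every step. If some other operation changes the KV cache Tensor (for example, cloning it into another Tensor), the optimization stops working, so it also has to be combined with the optimization in this PR (the assign_out_ operation causes a copy).
GPU memory allocated through this op cannot be managed uniformly by the existing Allocator.
HIP support has not been tested.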
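
To get a feel for what the fixed 1 GiB / 32 MiB sizes imply, the arithmetic can be sketched as follows. The model dimensions used in the usage note (batch 1, 32 heads, head_dim 128, 2-byte elements) are illustrative assumptions and do not come from the PR; the helper names are hypothetical, and the sketch assumes the reservation applies to a single cache tensor.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def token_bytes(batch, num_head, head_dim, dtype_bytes):
    """Bytes appended to one cache tensor per decoding step."""
    return batch * num_head * head_dim * dtype_bytes


def max_tokens(per_token, virtual_bytes=1 * GIB):
    """Tokens that fit in the reserved virtual range."""
    return virtual_bytes // per_token


def blocks_needed(num_tokens, per_token, block_bytes=32 * MIB):
    """Physical blocks that must be mapped to hold num_tokens (ceiling)."""
    return -(-(num_tokens * per_token) // block_bytes)


# Assumed dimensions, for illustration only.
per_token = token_bytes(batch=1, num_head=32, head_dim=128, dtype_bytes=2)
limit = max_tokens(per_token)
blocks_for_4096 = blocks_needed(3072 + 1024, per_token)
```

Under these assumed dimensions each token takes 8 KiB, so a 3072 + 1024 token sequence fits in a single 32 MiB block and the 1 GiB range caps one cache tensor at 131072 tokens. Larger batches or heads shrink both numbers proportionally, which is one reason a tunable size could be useful.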

Moved to #9320

paddle-bot · 2024-09-11T08:50:02Z

Thanks for your contribution!

codecov · 2024-09-16T05:35:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.78%. Comparing base (81f5ab5) to head (2004d21).

❗ Current head 2004d21 differs from pull request most recent head a6cf882

Please upload reports for the commit a6cf882 to get more accurate results.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #9126   +/-   ##
========================================
  Coverage    52.78%   52.78%           
========================================
  Files          661      661           
  Lines       106945   106945           
========================================
  Hits         56452    56452           
  Misses       50493    50493


CLAassistant · 2024-10-28T03:35:01Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
31 out of 33 committers have signed the CLA.

✅ a31413510
✅ lszxb
✅ fightfat
✅ DrownFish19
✅ gzy19990617
✅ lvdongyi
✅ greycooker
✅ lugimzzz
✅ JZ-LIANG
✅ ckl117
✅ zhangbo9674
✅ Wennie396
✅ yuanlehome
✅ DesmonDay
✅ zhangyuqin1998
✅ Hanyonggong
✅ AndSonder
✅ sneaxiy
✅ vivienfanghuagood
✅ tizhou86
✅ liym27
✅ Mangodadada
✅ dynamicheart
✅ Crystal-X-111
✅ zhiqiu
✅ FeixLiu
✅ GuoxiaWang
✅ cqulilujia
✅ wawltor
✅ LiYuRio
✅ gongel
❌ ZHUI
❌ bo-ke

lszxb mentioned this pull request Sep 15, 2024

[WeeklyReports] 2024.08.26~2024.09.08 Weekly Report Summary PFCCLab/Camp#373

Open

21 tasks

yuanlehome marked this pull request as ready for review October 28, 2024 03:38

yuanlehome closed this Oct 28, 2024

yuanlehome force-pushed the add_vtensor_op branch from 2004d21 to a6cf882 Compare October 28, 2024 03:46
