
Request for Head-Specific KV Cache Compression Feature #7

Open
FFY0 opened this issue Nov 21, 2024 · 7 comments
Labels: feature request, good first issue

FFY0 commented Nov 21, 2024

🚀 Feature

Add support for head-specific KV cache compression, which applies a different compression rate to each attention head.

Motivation

Ada-KV[1] has demonstrated that employing different compression rates across attention heads can significantly enhance cache compression methods. Recently, numerous head-specific approaches, such as DuoAttention[2], RazorAttention[3], and HeadKV[4], have emerged, each introducing unique techniques to improve compression quality through head-specific methods. However, these methods involve handling variable-length cache entries across different heads, a feature that KVPress currently does not support. We believe supporting this feature will significantly enhance the flexibility of KVPress and align it with emerging head-specific compression strategies.

[1] Feng, Y., Lv, J., Cao, Y., Xie, X., & Zhou, S. K. (2024). Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550.
[2] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., ... & Han, S. (2024). DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv preprint arXiv:2410.10819.
[3] Tang, H., Lin, Y., Lin, J., Han, Q., Hong, S., Yao, Y., & Wang, G. (2024). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. arXiv preprint arXiv:2407.15891.
[4] Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., & Xiao, W. (2024). Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. arXiv preprint arXiv:2410.19258.

SimJeg added the "good first issue" label on Nov 21, 2024

SimJeg (Collaborator) commented Nov 21, 2024

Hi @FFY0,

Definitely a good issue, that's a key feature for several compression techniques. However, it requires implementing a new kernel to be efficient, so it's a significant effort (unless we find a trick... I do have some ideas ^^)

FFY0 (Author) commented Nov 21, 2024

Thanks, @SimJeg!
Looking forward to the head-specific KV cache compression feature. This will effectively drive progress in the field of head-wise adaptive compression! 🚀

SimJeg added the "feature request" label on Nov 26, 2024

FFY0 (Author) commented Nov 30, 2024

Hi, @SimJeg.

Recently, I tried to implement a Head-Specific KV Cache compression solution within the current project architecture and developed the Ada-SnapKV compression method as described in the AdaKV paper. This solution introduces several new components while minimizing intrusive changes to the existing architecture. The main modifications include:

  1. To support head-specific cache management, I created a new cache class, DynamicCacheSplitHeadFlatten, along with the corresponding CUDA kernel, update_flatten_klenN_view, to manage and update a flattened KV cache layout.
  2. For efficient attention computation in head-specific methods, I extended the LlamaAttention class with a new AdaLlamaFlashAttention class. It manages the metadata of the flattened KV cache layout and uses FlashAttention to perform the attention computation directly on that layout.
  3. Introduced a new press base class for Head-Specific KV Cache compression methods, AdaBasePress, by inheriting from the existing BasePress. This class is responsible for performing compression on the flattened KV Cache layout and updating the corresponding metadata after compression.
  4. Developed a specific subclass, AdaSnapKVPress, based on AdaBasePress, which implements the Ada-SnapKV method proposed in the AdaKV paper.

Once a press built on AdaBasePress (e.g. AdaSnapKVPress) is used, AdaLlamaFlashAttention, DynamicCacheSplitHeadFlatten, and the associated CUDA kernel are integrated automatically to support head-specific KV cache compression (a small sketch of the flattened layout follows below).
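
To make the flattened layout more concrete, here is a minimal illustrative sketch (not the actual DynamicCacheSplitHeadFlatten code; the helper name and shapes are just assumptions for the example) of how variable-length per-head entries can be packed together with cumulative-length metadata, the kind of layout a varlen FlashAttention kernel consumes:

```python
import torch

def flatten_headwise_cache(per_head_keys, per_head_values):
    """Hypothetical sketch: pack variable-length per-head KV entries.

    per_head_keys / per_head_values: lists of length num_heads, where the
    h-th element is a (kept_len_h, head_dim) tensor remaining after
    head-specific compression.
    Returns flattened keys/values plus cumulative lengths (cu_seqlens),
    the metadata a variable-length attention kernel typically needs.
    """
    lengths = torch.tensor([k.shape[0] for k in per_head_keys], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(per_head_keys) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    flat_keys = torch.cat(per_head_keys, dim=0)      # (sum_h kept_len_h, head_dim)
    flat_values = torch.cat(per_head_values, dim=0)  # (sum_h kept_len_h, head_dim)
    return flat_keys, flat_values, cu_seqlens
```

With this kind of layout, each head can keep a different number of entries while attention is still computed in a single batched kernel call.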

So far, I have obtained some preliminary results for Ada-SnapKV on the ruler benchmark, and the performance looks promising. Moving forward, I plan to conduct some tests on corner cases. The code is currently available in a branch of my forked repository. I would appreciate your feedback or suggestions. If progress aligns with expectations, I would be happy to continue working on this and eventually attempt to merge the changes into the main branch.

Commit Details

SimJeg (Collaborator) commented Dec 4, 2024

Thanks @FFY0 for the hard work! We need to decide internally if we want to host kernels in this repository. Is the kernel you propose here already available via pip install somewhere else?

SimJeg mentioned this issue on Dec 4, 2024

FFY0 (Author) commented Dec 5, 2024

Hi @SimJeg,

This kernel is a modified version of the original AdaKV kernel. It is currently compiled within the adakvpress/kvpress/csrc folder and is not hosted elsewhere. If hosting this kernel in the repository is preferred, I am happy to follow your decision.

I will also make further adjustments to the code and merge the branch you mentioned into mine; it looks like the two could be integrated easily.

SimJeg (Collaborator) commented Dec 18, 2024

I created a branch introducing another way to do head-wise compression here. It does not contain AdaKVPress but a simple RandomHeadPress and a related notebook to show how to use it.

How it works:

  • it introduces a new DynamicHeadCache which takes an indices argument pointing to the indices of the KV pairs that should be masked during decoding. This cache does not reduce peak memory usage, but the attention outputs are the same as if these KV pairs had been removed.
  • during decoding, the DynamicHeadCache overwrites the keys and values at those indices with fake keys k and null values v = 0, where k is picked so that $e^{q \cdot k} \approx 0$ for the new input queries q (k is computed using least squares; see the sketch below)
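
For illustration, the masking trick could look roughly like this minimal sketch (not the actual DynamicHeadCache code; the function name and the target value are assumptions):

```python
import torch

def compute_fake_key(queries: torch.Tensor, big: float = 1e4) -> torch.Tensor:
    """Pick a key k so that q @ k is very negative for the observed queries q,
    hence exp(q @ k) ~ 0 and the masked KV pair gets an almost-zero attention weight.

    queries: (n_queries, head_dim) query states of one head.
    """
    target = torch.full((queries.shape[0], 1), -big, dtype=torch.float32, device=queries.device)
    # Least-squares solution of queries @ k = target
    k = torch.linalg.lstsq(queries.float(), target).solution  # (head_dim, 1)
    return k.squeeze(-1).to(queries.dtype)
```

Paired with v = 0 at the masked positions, those entries contribute (almost) nothing to the attention output, which is what makes the outputs match actual eviction.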

This implementation is very short (~ 60 loc) and fits nicely with the kvpress package, however:

  • it requires adding the query_states to the cache_kwargs, and the only concise way to do that is to use exec, which is not safe (see __init__.py)
  • it does not effectively reduce peak memory usage (somewhat similar to ThinKPress, it fakes compression)

@FFY0 the reason I investigated it is that your current PR implies many changes to the current code:

  1. adds a kernel
  2. adds a cache
  3. adds new attention classes that are not model agnostic
  4. updates the pipeline

Anyway, I don't think I will merge this branch as is, because the exec workaround for point 3 might not be safe and goes against the repeat-yourself philosophy of transformers.

SimJeg (Collaborator) commented Dec 18, 2024

I just pushed an updated version (commit) without exec. It's a bit cleaner, but the downside is that it adds a lot of lines of code (i.e. thousands!).

How it works:

  • For each model identified by a name (e.g. llama), a modeling_{name}.py is automatically generated from a given version of transformers, modified so that query_states is added to cache_kwargs. This is done once using kvpress.models.utils.rewrite_modeling_scripts
  • During kvpress initialization, the transformers {NAME}_ATTENTION_CLASSES dictionary is updated with the same dictionary coming from kvpress.models.modeling_{name} (see the sketch below)
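
The init-time swap would then be roughly as follows (a hedged sketch for the llama case only; the exact module and dictionary names are assumptions based on the description above, not the actual commit):

```python
# Hypothetical sketch of the init-time patching described above (llama only).
from transformers.models.llama import modeling_llama

# Assumed: kvpress ships a generated modeling_llama that adds query_states
# to cache_kwargs and exposes the same attention-classes dictionary.
from kvpress.models import modeling_llama as kvpress_modeling_llama

# Make transformers instantiate the patched attention classes for llama models.
modeling_llama.LLAMA_ATTENTION_CLASSES.update(
    kvpress_modeling_llama.LLAMA_ATTENTION_CLASSES
)
```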
