Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25
base: main
Conversation
Commits:
- add_AdaKV_initial_version: AdaKV. Signed-off-by: FFY0 <ffyfengyuan@gmail.com>
- Add ThinKPress (NVIDIA#20). Signed-off-by: FFY0 <ffyfengyuan@gmail.com>
I also have some confusion regarding batch support in the current repository. Much of the code seems to assume a batch size of 1, because the compression logic does not appear to account for the padding tokens introduced when sequence lengths vary across samples; meanwhile, the current unit tests use dummy inputs with a batch size greater than 1. To stay aligned with the other methods, the current implementation of Ada-SnapKV is limited to batch size 1. If necessary, I will explore support for larger batch sizes in the future.
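As a concrete illustration of that constraint, here is a minimal sketch of the guard it implies. The entry point, its signature, and the error message are hypothetical and only mirror the general shape of a press hook, not the actual Ada-SnapKV code:

```python
import torch


def compress(hidden_states: torch.Tensor, keys: torch.Tensor, values: torch.Tensor):
    """Hypothetical press entry point; names and signature are illustrative."""
    # The compression scores assume no padding tokens, which only holds when
    # all samples share one sequence length, so restrict to batch size 1.
    if hidden_states.shape[0] != 1:
        raise ValueError(
            "Ada-SnapKV currently supports batch_size=1 only: compression "
            "scores do not account for padding tokens introduced by varying "
            "sequence lengths across samples."
        )
    # ... head-specific compression would proceed here ...
```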
Hi @SimJeg, I have added unit tests. It seems the current CI Action workflows require approval before they can run; I'm not very familiar with this process, so please let me know if there's anything else I can contribute. Additionally, the CI Action might fail due to the build requirements of the new kernel. How to manage the kernel going forward may need further discussion before we can identify a solution.
Hi @FFY0, thanks for your hard work on this PR; the results you shared look really promising. One of the goals of the recent refactoring was to welcome more complex presses such as yours. We have started looking at your PR and will come back with feedback.
Add the head-specific KV cache compression feature discussed in the linked issue.
I have obtained some results for Ada-SnapKV on the 4K RULER benchmark, and they look promising. I have placed them in a new notebook, which also includes a brief explanation of the flattened KV cache layout employed by head-specific KV cache compression during computation, and a tutorial on how to customize new head-specific methods based on the latest AdaBasePress. Additionally, the head-specific KV cache compression feature may require a custom unit test workflow, such as instantiating new attention classes before loading models; as a result, simply adding Ada-SnapKV to the current unit tests may cause failures. I will attempt to resolve this in the future. Feel free to let me know if there's anything else you'd like me to refine or if you need additional details!
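To make the customization path concrete, here is a rough sketch of what a new head-specific method might look like on top of AdaBasePress. The import path, the score method name, and its signature are assumptions modeled on the usual kvpress press pattern rather than the exact interface from this PR, and the scoring rule is a toy placeholder:

```python
import torch
from torch import nn

from kvpress import AdaBasePress  # class added in this PR; import path assumed


class MyHeadSpecificPress(AdaBasePress):
    def score(self, module: nn.Module, keys: torch.Tensor, values: torch.Tensor, **kwargs) -> torch.Tensor:
        # Return one importance score per cached token and per KV head,
        # shape (batch, num_kv_heads, seq_len). Under a shared budget, the
        # base class can then keep a different number of entries per head
        # and store the survivors in a flattened, padding-free KV layout.
        return keys.norm(dim=-1)  # toy scoring: L2 norm of each key
```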