demo : per-layer KV / partial offloading of KV cache #3457
Conversation
Regardless of the performance effects, this is a good change since it makes the KV cache addressing more intuitive.
Definitely worth it to offload fewer layers but get higher prompt processing speed out of it.
Could this PR, when combined with the performance gains in #3776, allow 70B models in q4_K_M / q4_K_S precision to run on a 3090 at more than 1-2 tokens/second?
I will try to update this PR to the latest master.
Ok, some notes:
Will leave this PR intact for reference. Opened a new PR: #4309. @oobabooga and anyone else who is interested - it would be nice to run some tests with #4309 to make sure it works as expected.
Currently, the entire KV cache is allocated as a single tensor for all the layers. As a consequence, the KV cache is either fully on the CPU, or fully offloaded to the GPU.
With this change, the KV cache is allocated as a separate tensor per layer. The result is more granular control over which parts of the KV cache are offloaded to the GPU.
In this demo, when partially offloading a model, the KV cache corresponding to the offloaded layers is also offloaded. This increases performance at the expense of more VRAM.
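To make the idea concrete, here is a minimal sketch of a per-layer KV cache where each layer can independently be marked for GPU offloading. This is not the actual llama.cpp implementation; the struct names, the `first_gpu_layer` parameter, and the plain host-memory buffers are assumptions for illustration only.

```cpp
// Hedged sketch: per-layer KV cache, each layer independently placeable on GPU or CPU.
// Not the actual llama.cpp code; types and names are illustrative only.
#include <cstddef>
#include <cstdint>
#include <vector>

struct kv_layer {
    std::vector<uint16_t> k;  // K cache for this layer (fp16 values stored as raw u16 here)
    std::vector<uint16_t> v;  // V cache for this layer
    bool on_gpu = false;      // whether this layer's KV data lives in VRAM
};

struct kv_cache_per_layer {
    std::vector<kv_layer> layers;
};

// Allocate one K/V buffer pair per layer. Layers with index >= first_gpu_layer
// are marked for GPU offloading, mirroring how partial model offloading assigns
// the last n_gpu_layers layers to the GPU.
kv_cache_per_layer kv_cache_init(int n_layer, int n_ctx, int n_embd, int first_gpu_layer) {
    kv_cache_per_layer cache;
    cache.layers.resize(n_layer);
    const size_t elems_per_layer = (size_t) n_ctx * n_embd;
    for (int il = 0; il < n_layer; ++il) {
        cache.layers[il].k.resize(elems_per_layer);
        cache.layers[il].v.resize(elems_per_layer);
        cache.layers[il].on_gpu = (il >= first_gpu_layer);
        // In the real implementation the on_gpu layers would be allocated in VRAM
        // (via the CUDA backend) rather than in host memory.
    }
    return cache;
}
```

In practice the user does not configure this separately: as described above, the KV cache of a layer simply follows that layer, so running with e.g. `-ngl 40` would also place the KV data of those 40 offloaded layers in VRAM, at the cost of extra VRAM usage.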
Is it worth it compared to just offloading more layers? I am not sure, but it probably wouldn't hurt to have more flexibility.
Note: only implemented for llama models. CUDA only.
Edit: removed a few unnecessary copies that caused performance to degrade.
Llama 2 70B on a single 24 GB GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
v1