[Question]: Question about KV-cache storage #20
Comments
Hi @DerrickYLJ, thanks for your interest in and support of MInference.
We have made some system optimizations that allow 1M-token pre-filling to run on a single A100; details are given in Appendix C.3. In our demo video, to perform 1M-token inference on a single A100, we load the KV cache to the CPU, as shown in this code. Additionally, several studies focus on KV cache compression (e.g., H2O, SnapKV) and KV cache quantization (e.g., KIVI); you might consider using these solutions (a rough sketch of the general idea follows this reply).
Thanks again for your interest and support!
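For reference, the common idea behind these compression methods is to keep only the K/V entries of the most important tokens instead of the full cache. Below is a minimal, hypothetical PyTorch sketch of that selection step; the function name, shapes, and scoring rule are illustrative assumptions, not the actual H2O or SnapKV algorithms.

```python
import torch


def compress_kv_cache(key, value, attn_scores, keep: int):
    """Keep only the `keep` cached tokens with the largest accumulated attention.

    key, value:  [batch, heads, seq_len, head_dim]
    attn_scores: [batch, heads, q_len, seq_len] attention weights from recent queries

    Illustration of the general idea behind methods like H2O / SnapKV only;
    the real algorithms differ in how scores are accumulated and which tokens
    (e.g. the most recent ones) are always retained.
    """
    # Accumulate how much attention each cached token received.
    importance = attn_scores.sum(dim=2)            # [batch, heads, seq_len]
    topk = importance.topk(keep, dim=-1).indices   # [batch, heads, keep]
    topk = topk.sort(dim=-1).values                # preserve original token order

    idx = topk.unsqueeze(-1).expand(-1, -1, -1, key.size(-1))
    return key.gather(2, idx), value.gather(2, idx)


if __name__ == "__main__":
    b, h, s, d = 1, 8, 4096, 128
    key, value = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
    attn = torch.softmax(torch.randn(b, h, 16, s), dim=-1)  # scores from 16 recent queries
    k_small, v_small = compress_kv_cache(key, value, attn, keep=512)
    print(k_small.shape, v_small.shape)  # torch.Size([1, 8, 512, 128]) for both
```

The memory saving comes from this top-k selection; the methods mentioned above mainly differ in how token importance is estimated.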
Thank you very much for your reply! As for 1., I read through the "minference_kv_cache_cpu_forward" function but am still unsure how exactly MInference loads the KV cache to the CPU implementation-wise. As for 2., I still encounter the problem of building pycuda.
Hi @DerrickYLJ, thank you for the information. It appears that the issue is related to PyCUDA. We will remove the dependency on PyCUDA in the next version.
Could you please answer my first question by briefly explaining the logic of offloading the KV cache to the CPU?
Sure, the logic of "kv_cache_cpu" is very simple: when you enable "kv_cache_cpu," the KV cache is kept in CPU memory, and during the decoding phase the required KV cache is transferred to GPU memory. This is just a preliminary implementation. Since our current solution only optimizes the pre-filling stage and existing KV cache compression methods generally perform poorly, we implemented this version of offloading for experimental and demonstration purposes. Although it has higher latency, it is still faster than recomputation.
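To make that concrete, here is a minimal, hypothetical PyTorch sketch of this style of offloading; the class and method names are made up for illustration and are not MInference's actual "kv_cache_cpu" implementation. The KV cache produced during pre-filling is kept in CPU memory and is copied back to the GPU when it is needed at decoding time.

```python
import torch


class CPUOffloadKVCache:
    """Toy per-layer KV cache that lives in CPU memory.

    Hypothetical sketch only: the real code differs in layout, chunking,
    and how transfers are overlapped with computation.
    """

    def __init__(self):
        self.key_cpu = None    # [batch, heads, seq_len, head_dim], on CPU
        self.value_cpu = None

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Move K/V produced during pre-filling (or decoding) off the GPU."""
        key, value = key.detach().to("cpu"), value.detach().to("cpu")
        if self.key_cpu is None:
            self.key_cpu, self.value_cpu = key, value
        else:
            self.key_cpu = torch.cat([self.key_cpu, key], dim=2)
            self.value_cpu = torch.cat([self.value_cpu, value], dim=2)

    def fetch(self, device: str):
        """Copy the cached K/V back to the GPU for the current decoding step."""
        return self.key_cpu.to(device), self.value_cpu.to(device)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cache = CPUOffloadKVCache()

    # Pretend these came from the pre-filling pass of one attention layer.
    k = torch.randn(1, 8, 1024, 128, device=device)
    v = torch.randn(1, 8, 1024, 128, device=device)
    cache.append(k, v)

    # At each decoding step, bring the KV back to the GPU and run attention
    # against it: slower than keeping it resident, but cheaper than recomputing.
    k_gpu, v_gpu = cache.fetch(device)
    print(k_gpu.shape, v_gpu.shape)
```

In practice one would typically keep the CPU buffers in pinned memory and overlap the host-to-device copies with computation to hide part of the transfer latency.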
Describe the issue
Thank you for the amazing work!
1. Does the model store the whole KV cache of pre-filling and generation on the device? If so, how can the device hold the memory of 1M KV values? If not, how did you reduce the overhead of loading KV values from host to device, and vice versa?
2. What exactly does it mean that "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried "pip install minference" without having FlashAttention-2 and Triton == 2.1.0 installed, and it output "ERROR: Failed building wheel for pycuda".