[Question]: Question about KV-cache storage #20
Comments
Hi @DerrickYLJ, thanks for your interest in and support of MInference.
We have made some system optimizations that allow 1M-token pre-filling to run on a single A100; details are given in Appendix C.3. In our demo video, to perform 1M-token inference on a single A100, we load the KV cache to the CPU, as shown in this code. Additionally, several studies focus on KV cache compression (e.g., H2O, SnapKV) and KV cache quantization (e.g., KIVI); you might consider using these solutions (a rough sketch of the general idea follows this reply).
Thanks again for your interest and support!
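For reference, the common idea behind these compression methods is to keep only the K/V entries of the most important tokens instead of the full cache. Below is a minimal, hypothetical PyTorch sketch of that selection step; the function name, shapes, and scoring rule are illustrative assumptions, not the actual H2O or SnapKV algorithms.

```python
import torch


def compress_kv_cache(key, value, attn_scores, keep: int):
    """Keep only the `keep` cached tokens with the largest accumulated attention.

    key, value:  [batch, heads, seq_len, head_dim]
    attn_scores: [batch, heads, q_len, seq_len] attention weights from recent queries

    Illustration of the general idea behind methods like H2O / SnapKV only;
    the real algorithms differ in how scores are accumulated and which tokens
    (e.g. the most recent ones) are always retained.
    """
    # Accumulate how much attention each cached token received.
    importance = attn_scores.sum(dim=2)            # [batch, heads, seq_len]
    topk = importance.topk(keep, dim=-1).indices   # [batch, heads, keep]
    topk = topk.sort(dim=-1).values                # preserve original token order

    idx = topk.unsqueeze(-1).expand(-1, -1, -1, key.size(-1))
    return key.gather(2, idx), value.gather(2, idx)


if __name__ == "__main__":
    b, h, s, d = 1, 8, 4096, 128
    key, value = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
    attn = torch.softmax(torch.randn(b, h, 16, s), dim=-1)  # scores from 16 recent queries
    k_small, v_small = compress_kv_cache(key, value, attn, keep=512)
    print(k_small.shape, v_small.shape)  # torch.Size([1, 8, 512, 128]) for both
```

The memory saving comes from this top-k selection; the methods mentioned above mainly differ in how token importance is estimated.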
Thank you very much for your reply! As for 1., I read through the "minference_kv_cache_cpu_forward" function but am still unsure how exactly MInference loads the KV cache to the CPU implementation-wise. As for 2., I still encounter the problem of building pycuda.
Hi @DerrickYLJ, thank you for the information. It appears that the issue is related to PyCUDA. We will remove the dependency on PyCUDA in the next version.
Could you please answer my first question by briefly explaining the logic of offloading the KV cache to the CPU?
Sure, the logic of "kv_cache_cpu" is very simple: when you enable "kv_cache_cpu," the KV cache is kept in CPU memory, and during the decoding phase the required KV cache is transferred to GPU memory. This is just a preliminary implementation. Since our current solution only optimizes the pre-filling stage and existing KV cache compression methods generally perform poorly, we implemented this version of offloading for experimental and demonstration purposes. Although it has higher latency, it is still faster than recomputation.
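To make that concrete, here is a minimal, hypothetical PyTorch sketch of this style of offloading; the class and method names are made up for illustration and are not MInference's actual "kv_cache_cpu" implementation. The KV cache produced during pre-filling is kept in CPU memory and is copied back to the GPU when it is needed at decoding time.

```python
import torch


class CPUOffloadKVCache:
    """Toy per-layer KV cache that lives in CPU memory.

    Hypothetical sketch only: the real code differs in layout, chunking,
    and how transfers are overlapped with computation.
    """

    def __init__(self):
        self.key_cpu = None    # [batch, heads, seq_len, head_dim], on CPU
        self.value_cpu = None

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Move K/V produced during pre-filling (or decoding) off the GPU."""
        key, value = key.detach().to("cpu"), value.detach().to("cpu")
        if self.key_cpu is None:
            self.key_cpu, self.value_cpu = key, value
        else:
            self.key_cpu = torch.cat([self.key_cpu, key], dim=2)
            self.value_cpu = torch.cat([self.value_cpu, value], dim=2)

    def fetch(self, device: str):
        """Copy the cached K/V back to the GPU for the current decoding step."""
        return self.key_cpu.to(device), self.value_cpu.to(device)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cache = CPUOffloadKVCache()

    # Pretend these came from the pre-filling pass of one attention layer.
    k = torch.randn(1, 8, 1024, 128, device=device)
    v = torch.randn(1, 8, 1024, 128, device=device)
    cache.append(k, v)

    # At each decoding step, bring the KV back to the GPU and run attention
    # against it: slower than keeping it resident, but cheaper than recomputing.
    k_gpu, v_gpu = cache.fetch(device)
    print(k_gpu.shape, v_gpu.shape)
```

In practice one would typically keep the CPU buffers in pinned memory and overlap the host-to-device copies with computation to hide part of the transfer latency.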
Describe the issue
Thank you for the amazing work!
1. Does the model store the whole KV cache of pre-filling and generation on the device? If so, how can the device hold the memory of 1M KV values? If not, how did you reduce the overhead of loading KV values from host to device, and vice versa?
2. What exactly does it mean that "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried "pip install minference" without having FlashAttention-2 and Triton == 2.1.0 installed, and it output "ERROR: Failed building wheel for pycuda".