[Question]: Question about KV-cache storage #20

Open
DerrickYLJ opened this issue Jul 6, 2024 · 5 comments

DerrickYLJ commented Jul 6, 2024

Describe the issue

Thank you for the amazing work!

  1. Does the model store the whole KV cache from prefilling and generation on the device? If so, how can the device hold the KV cache for 1M tokens? If not, how do you reduce the overhead of moving KV values between host and device?

  2. What exactly does "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements" mean? I ran pip install minference without FlashAttention-2 and Triton == 2.1.0 installed, and it failed with ERROR: Failed building wheel for pycuda.

DerrickYLJ added the question label Jul 6, 2024
iofu728 self-assigned this Jul 7, 2024

iofu728 (Contributor) commented Jul 7, 2024

Hi @DerrickYLJ, thanks for your support of MInference.

  1. MInference 1.0 focuses on speeding up the pre-filling stage of long-context LLM inference, reducing the time from 30 minutes to 3 minutes for 1M tokens on an A100. This work does not address the KV cache storage issue. Future work on MInference will include solutions to reduce KV cache memory overhead.

However, we have made some system optimizations that allow 1M-token pre-filling to run on a single A100; details are in Appendix C.3. In our demo video, to run 1M-token inference on a single A100, we offload the KV cache to the CPU, as shown in this code.

Additionally, several studies focus on KV cache compression (like H2O, SnapKV) and KV cache quantization (KIVI). You might consider using these solutions.

  2. Our pip package depends on flash-attn and triton. It looks like you're encountering issues related to pycuda. You can try the following steps:
    1. Check whether pycuda installed successfully.
    2. Build from source:
    git clone https://github.com/microsoft/MInference
    cd MInference
    pip install -e .
    3. If the issue persists, please provide details including OS, Python version, CUDA version, PyTorch version, and the error log.

Thanks again for your interest and support!
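
To put the storage question in numbers, here is a rough back-of-the-envelope sketch of how large an uncompressed KV cache gets at 1M tokens. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumption matching a LLaMA-3-8B-style configuration, and `kv_cache_bytes` is a hypothetical helper, not part of MInference.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size in bytes of an uncompressed KV cache for one sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed LLaMA-3-8B-style shape with fp16 storage (2 bytes per element).
size = kv_cache_bytes(seq_len=1_000_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # prints "131.1 GB"
```

Under these assumptions the cache alone is larger than the 80 GB of the biggest A100, which is why the cache has to be offloaded, compressed, or quantized to fit on a single card.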

DerrickYLJ (Author) commented Jul 9, 2024

Thank you very much for your reply!

As for 1., I read through the "minference_kv_cache_cpu_forward" function but am unsure how exactly MInference offloads the KV cache to the CPU, implementation-wise.

As for 2., running pip install -e . still fails while building pycuda.
Details:

  1. OS:
    Icon name: computer-server
    Chassis: server
    Machine ID: 2305030051f947988b5faecaf45ece43
    Boot ID: 00739920e39a457999c5ae3b99f47675
    Operating System: Springdale Open Enterprise Linux 8.6 (Modena)
    CPE OS Name: cpe:/o:springdale:enterprise_linux:8.6:GA
    Kernel: Linux 4.18.0-372.32.1.el8_6.x86_64
    Architecture: x86-64
  2. CUDA version: 12.4
  3. PyTorch version: 2.3.1
  4. Python version: 3.8.12
  5. Error Log:
 from bpl-subset/bpl_subset/boost/python/converter/arg_to_python_base.hpp:7,
                       from bpl-subset/bpl_subset/libs/python/src/converter/arg_to_python_base.cpp:6:
      bpl-subset/bpl_subset/boost/python/detail/wrap_python.hpp:50:11: fatal error: pyconfig.h: No such file or directory
       # include <pyconfig.h>
                 ^~~~~~~~~~~~
      compilation terminated.
      /tmp/pip-build-env-wusrfsd3/overlay/lib/python3.8/site-packages/setuptools/command/build_py.py:215: _Warning: Package 'pycuda.cuda' is absent from the `packages` configuration.
      !!
      
              ********************************************************************************
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'pycuda.cuda' as an importable package[^1],
              but it is absent from setuptools' `packages` configuration.
      
              This leads to an ambiguous overall configuration. If you want to distribute this
              package, please make sure that 'pycuda.cuda' is explicitly added
              to the `packages` configuration field.
      
              Alternatively, you can also rely on setuptools' discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
      
              You can read more about "package discovery" on setuptools documentation page:
      
              - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
      
              If you don't want 'pycuda.cuda' to be distributed and are
              already explicitly excluding 'pycuda.cuda' via
              `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
              you can try to use `exclude_package_data`, or `include-package-data=False` in
              combination with a more fine grained `package-data` configuration.
      
              You can read more about "package data files" on setuptools documentation page:
      
              - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
      
      
              [^1]: For Python, any directory (with suitable naming) can be imported,
                    even if it does not contain any `.py` files.
                    On the other hand, currently there is no concept of package data
                    directory, all directories are treated like packages.
              ********************************************************************************
      
      !!
        check.warn(importable)
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pycuda
Failed to build pycuda
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pycuda)

iofu728 (Contributor) commented Jul 9, 2024

Hi @DerrickYLJ, thank you for the information. It appears that the issue is related to PyCUDA. We will remove the dependency on PyCUDA in the next version.
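
For anyone hitting the same log before that release: the first fatal error is `pyconfig.h: No such file or directory`, which usually means the C headers for the Python interpreter in use are not installed. A minimal check, using only the standard library, might look like the sketch below. `find_pyconfig_header` is a hypothetical helper name, and the remedy hinted at in the message (installing a Python development package) is an assumption that depends on the distribution and on how Python was installed.

```python
import os
import sysconfig


def find_pyconfig_header():
    """Return the path to pyconfig.h if it exists for this interpreter, else None."""
    header = sysconfig.get_config_h_filename()
    return header if os.path.isfile(header) else None


if __name__ == "__main__":
    header = find_pyconfig_header()
    if header is None:
        print("pyconfig.h not found: the Python development headers appear to be missing")
    else:
        print(f"pyconfig.h found at {header}")
```

Run it with the same interpreter that pip uses for the build, since each Python installation has its own include directory.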

iofu728 added the feature request label Jul 9, 2024

DerrickYLJ (Author) commented

Could you please answer my first question by just briefly explaining the logic of offloading kv-cache to CPU?

iofu728 (Contributor) commented Jul 11, 2024

Sure, the logic of "kv_cache_cpu" is very simple. When you enable "kv_cache_cpu", the KV cache is kept in CPU memory. During the decoding phase, the part of the KV cache in use is transferred to GPU memory. This is just a preliminary implementation. Since our current solution only optimizes the prefilling stage and existing KV cache compression methods generally perform poorly, we implemented this offloading path for experimental and demonstration purposes. Although it has higher latency than keeping the cache on the GPU, it is faster than recomputation.
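
The idea described above can be sketched in a few lines of PyTorch. This is a toy illustration, not MInference's actual implementation: `CPUOffloadedKVCache` and `decode_step_attention` are hypothetical names, the cache is fetched one layer at a time, and query and key/value heads are assumed to have the same count.

```python
import torch
import torch.nn.functional as F


class CPUOffloadedKVCache:
    """Toy per-layer KV cache stored in host memory (hypothetical helper)."""

    def __init__(self, num_layers, compute_device):
        self.compute_device = compute_device
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def append(self, layer, k, v):
        # Move new keys/values to the CPU and extend the cache along the sequence axis.
        # Expected shape: (batch, heads, seq_len, head_dim).
        k, v = k.to("cpu"), v.to("cpu")
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)

    def fetch(self, layer):
        # Copy only this layer's cache to the compute device for the attention call.
        return (
            self.keys[layer].to(self.compute_device),
            self.values[layer].to(self.compute_device),
        )


def decode_step_attention(cache, layer, q, k_new, v_new):
    """Attention for one decoding step: store the new token, then attend over the cache."""
    cache.append(layer, k_new, v_new)
    k, v = cache.fetch(layer)
    # A single new query token attends to every cached position, so no mask is needed.
    return F.scaled_dot_product_attention(q, k, v)
```

Only one layer's keys and values occupy device memory at a time, at the cost of a host-to-device copy per layer on every decoding step, which is the latency trade-off mentioned above.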
