-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Unity][Contrib] Add vLLM paged attention kernel #15995
Conversation
4ca0aff
to
99f14e5
Compare
cmake/modules/CUDA.cmake
Outdated
@@ -64,6 +64,7 @@ if(USE_CUDA) | |||
message(STATUS "Build with Thrust support") | |||
cmake_minimum_required(VERSION 3.13) # to compile CUDA code | |||
enable_language(CUDA) | |||
set(CMAKE_CUDA_ARCHITECTURES "80;75") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to explicitly set the cuda arch here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is due to an odd behavior of cmake that I don't understand. If I remove this and set USE_THRUST
, I get
CMake Error in CMakeLists.txt:
CUDA_ARCHITECTURES is empty for target "tvm_runtime_objs".
. Previously I didn't encounter this problem when I was using USE_THRUST
(a couple of years ago).
This doesn't apply to vllm.cmake
for some reason. But there, if I don't have set(CMAKE_CUDA_ARCHITECTURES "80;75")
, I get
error : Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
during build.
cmake/modules/contrib/vllm.cmake
Outdated
message(STATUS "Build with vllm paged attention kernel.") | ||
include_directories(src/runtime/contrib/vllm) | ||
enable_language(CUDA) | ||
set(CMAKE_CUDA_ARCHITECTURES "80;75") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it let users set the cuda arch they want in config.cmake
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all users (including me) use config.cmake
. I think building for sm 80 and 75 should be enough in practice (newer arch can use sm80 code). We can add 70 if we still want to support v100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CMAKE_CUDA_ARCHITECTURES is designed to be provided by user at cmake config time.
- There are also values like 'auto', 'all' see CUDA_ARCHITECTURES
- Better throw a warning like: 'Using 'all' for CUDAARCHS, use CMAKE_CUDA_ARCHITECTURES to override'.
- Also it is initialized by env var CUDAARCHS and picked up from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok made it configurable, and use 80;75
by default. auto
will make the compilation time for these kernels (especially thrust) extremely slow.
@masahi Thank you very much for sharing the open source cause, your pr is very meaningful to me, can you continue to integrate your code |
c95d45f
to
45eeb8c
Compare
This PR adds vLLM paged attention and update kernels as contrib. This can be used by MLC to enable batched inference, see mlc-ai/mlc-llm#1134
The vLLM community has recently added the V2 kernel in vllm-project/vllm#1348. This PR doesn't include the V2 change. Once the V2 kernel becomes mature, we can integrate it.
@tqchen @MasterJH5574 @yzh119 @yelite @sunggg