does vllm use Flash-Decoding? #1362

As vllm depends on xformers, is vllm already using this Flash-Decoding algorithm?

Comments
It looks like PR #1348 has been merged into the 0.2.1 release. To use the V2 version, do users need to do anything when calling vLLM? Thanks
@leocnj Nothing is required for users. vLLM chooses between V1 and V2 based on a simple heuristic (see vllm/vllm/model_executor/layers/attention.py, line 159, at commit d189170).

In a nutshell, we currently use V2 only when the batch size is small. Once we further optimize its performance, we will use V2 in more cases.
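For reference, here is a minimal sketch of the kind of heuristic being described, assuming a V2 partition size of 512 tokens. The function name use_v1_kernel and the num_seqs * num_heads > 512 batch-size threshold are illustrative, not quotes of the actual source; the authoritative logic is the line of attention.py linked above.

```python
_PARTITION_SIZE = 512  # assumed size of each context partition used by the V2 kernel


def use_v1_kernel(max_context_len: int, num_seqs: int, num_heads: int) -> bool:
    """Sketch of the V1/V2 selection heuristic discussed in this thread."""
    # Number of partitions the longest context would be split into by V2.
    max_num_partitions = (max_context_len + _PARTITION_SIZE - 1) // _PARTITION_SIZE
    # Prefer V1 when splitting would not help (every context fits in a single
    # 512-token partition) or when the batch already provides enough
    # parallelism (many sequences times many attention heads).
    return max_num_partitions == 1 or num_seqs * num_heads > 512
```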
In the condition above, if either of the two conditions is met, the code follows the "v1" path. You mentioned that V2 only applies when the batch size is small. However, in practice, even with a small batch size, the other condition, the max_num_partitions calculation, also comes into play. According to that calculation, a batch whose longest sequence is at most 512 tokens produces only a single partition and therefore also follows the "v1" path. As a result, the impact will only be noticeable when the batch size is small and there is at least one sequence with a length greater than 512. Is my understanding correct?
@hongqing1986 Yes, your analysis is correct. V2 is used when the batch size is small and at least one sequence has a context length over 512.
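As a quick sanity check of that reading, using the hypothetical use_v1_kernel sketch above:

```python
# Illustrative checks against the sketch above, not output from vLLM itself.
# Small batch, longest context 512 tokens: (512 + 511) // 512 = 1 partition -> V1.
assert use_v1_kernel(max_context_len=512, num_seqs=4, num_heads=32)
# Small batch, longest context 600 tokens: 2 partitions and 4 * 32 = 128 <= 512 -> V2.
assert not use_v1_kernel(max_context_len=600, num_seqs=4, num_heads=32)
# Large batch, long context: 64 * 32 = 2048 > 512 -> back to V1.
assert use_v1_kernel(max_context_len=600, num_seqs=64, num_heads=32)
```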