does vllm use Flash-Decoding? #1362

As vllm depends on xformers, is vllm already using this Flash-Decoding algorithm?

Comments
It looks like PR #1348 has been merged into the 0.2.1 release. To use the V2 version, do users need to do anything when calling vLLM? Thanks
@leocnj Nothing is required for users. vLLM chooses between V1 and V2 based on a simple heuristic (see vllm/vllm/model_executor/layers/attention.py, line 159, at commit d189170).

In a nutshell, we currently use V2 only when the batch size is small. Once we further optimize its performance, we will use V2 in more cases.
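For reference, here is a minimal sketch of the kind of heuristic being described, assuming a V2 partition size of 512 tokens. The function name use_v1_kernel and the num_seqs * num_heads > 512 batch-size threshold are illustrative, not quotes of the actual source; the authoritative logic is the line of attention.py linked above.

```python
_PARTITION_SIZE = 512  # assumed size of each context partition used by the V2 kernel


def use_v1_kernel(max_context_len: int, num_seqs: int, num_heads: int) -> bool:
    """Sketch of the V1/V2 selection heuristic discussed in this thread."""
    # Number of partitions the longest context would be split into by V2.
    max_num_partitions = (max_context_len + _PARTITION_SIZE - 1) // _PARTITION_SIZE
    # Prefer V1 when splitting would not help (every context fits in a single
    # 512-token partition) or when the batch already provides enough
    # parallelism (many sequences times many attention heads).
    return max_num_partitions == 1 or num_seqs * num_heads > 512
```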
In the condition above, if either of the two conditions is met, the code follows the "v1" path. You mentioned that V2 only applies when the batch size is small. However, in practice, even with a small batch size, the other condition, the max_num_partitions calculation, also comes into play. According to that calculation, a batch whose longest sequence is at most 512 tokens produces only a single partition and therefore also follows the "v1" path. As a result, the impact will only be noticeable when the batch size is small and there is at least one sequence with a length greater than 512. Is my understanding correct?
@hongqing1986 Yes, your analysis is correct. V2 is used when the batch size is small and at least one sequence has a context length over 512.
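As a quick sanity check of that reading, using the hypothetical use_v1_kernel sketch above:

```python
# Illustrative checks against the sketch above, not output from vLLM itself.
# Small batch, longest context 512 tokens: (512 + 511) // 512 = 1 partition -> V1.
assert use_v1_kernel(max_context_len=512, num_seqs=4, num_heads=32)
# Small batch, longest context 600 tokens: 2 partitions and 4 * 32 = 128 <= 512 -> V2.
assert not use_v1_kernel(max_context_len=600, num_seqs=4, num_heads=32)
# Large batch, long context: 64 * 32 = 2048 > 512 -> back to V1.
assert use_v1_kernel(max_context_len=600, num_seqs=64, num_heads=32)
```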