How inference efficiency is measured #9
@FengcunLi The performance reported in the technical report is measured using TRT-LLM with the model support in NVIDIA/TensorRT-LLM#1363 as it behaves today. We use our own webserver with TRT-LLM as the backend. More details about the methodology are available here. There are a few key things to note:
I am curious what your benchmark methodology and system are. Could you share some numbers as well?
Thank you for your explanation. However, I still have a question about this description: “LLaMa-2-70B being a dense model reaches the compute bound regime earlier and afterwards doubling of concurrency just doubles the latency thus bringing down per user throughput by ~2x. DBRX/Mixtral being MoE models, reach the compute bound regime at larger concurrency. Processing of context and generation also affects how much effective batch size the system sees.”

For an H100-80G (NVL) system, the model theoretically reaches the compute bound regime only after the batch size exceeds 507 (= PeakFlops / PeakBandwidth). (For H100-80G SXM, the batch size at the roofline inflection point is 253.) Considering the latency perceived by users, this is a batch size that is difficult to reach, so we can assume the generation phase of both the dense model and the MoE model is dominated by bandwidth.

On the other hand, a 36B/132B MoE model has far more params to load than a 70B dense model. For DBRX (36B activated params, 132B total params, 16 experts, top-k = 4), when the batch size reaches 3 the expected activated params are 132B * (1 - (1 - 1/4)**3) ≈ 76B; from then on, the IO-bound problem becomes much heavier than for a 70B dense model, so the generation phase becomes slower for DBRX than for llama2-70B. Am I right about this?

I am also curious how you benchmark llama2-70B; is it tested on TRT-LLM, same as DBRX? Thank you for your time again.
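Below is a minimal sketch of the arithmetic in the comment above, written as Python for concreteness. It assumes ~2 FLOPs per parameter per token, 16-bit weights, and (as in the comment) the simplification that all 132B params live in the experts; the GPU peak numbers are left as function inputs rather than hard-coded.

```python
# Sketch of the roofline inflection point and the expected-activated-params
# argument from the comment above. Numbers and simplifications follow the
# comment; they are not official DBRX or hardware figures.

def roofline_batch_size(peak_flops: float, peak_bw_bytes_per_s: float) -> float:
    """Batch size at which a decode step stops being bandwidth-bound.

    Assumes ~2 FLOPs per parameter per token and 16-bit weights (2 bytes per
    parameter read), so the inflection point is roughly the GPU's FLOP/byte
    ratio. Plugging in the peaks assumed in the comment gives ~507 (H100 NVL)
    and ~253 (H100 SXM).
    """
    return peak_flops / peak_bw_bytes_per_s


def expected_loaded_params_b(total_params_b: float, n_experts: int,
                             top_k: int, batch_size: int) -> float:
    """Expected params (in billions) touched per decode step, treating all
    params as expert params (the comment's simplification)."""
    p_expert_active = 1.0 - (1.0 - top_k / n_experts) ** batch_size
    return total_params_b * p_expert_active


# DBRX-style config: 132B total params, 16 experts, top-4 routing.
print(expected_loaded_params_b(132, 16, 4, 1))  # ~33B, well below a 70B dense model
print(expected_loaded_params_b(132, 16, 4, 3))  # ~76B, now above a 70B dense model
```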
@dskhudia Our test results are shown below. The QPS is measured on the client side, meaning how many requests the server handles per second. The QPS range is obtained by varying the number of concurrent clients over [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32]. Each client's behavior is synchronous: it waits for the response from the server before sending a new request.
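For concreteness, here is a rough sketch of the client-side harness described above, assuming a plain HTTP inference endpoint; the URL, payload, and per-client request count are placeholders, not the actual setup used for the numbers being discussed.

```python
# Hypothetical harness: N synchronous clients, each waiting for the previous
# response before sending the next request; QPS is measured client-side as
# completed requests divided by wall-clock time.
import threading
import time

import requests  # assumes a plain HTTP inference endpoint

URL = "http://localhost:8000/generate"          # placeholder endpoint
PAYLOAD = {"prompt": "...", "max_tokens": 256}  # placeholder request body


def client_loop(n_requests: int, latencies: list) -> None:
    for _ in range(n_requests):
        start = time.time()
        requests.post(URL, json=PAYLOAD)  # block until the full response arrives
        latencies.append(time.time() - start)


def run(concurrency: int, n_requests_per_client: int = 10) -> float:
    latencies: list = []
    threads = [
        threading.Thread(target=client_loop, args=(n_requests_per_client, latencies))
        for _ in range(concurrency)
    ]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.time() - t0
    return len(latencies) / wall  # client-side QPS


for c in [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32]:
    print(c, run(c))
```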
That's what I said! When the batch size is smaller than 3, the MoE model loads fewer params than llama2-70B, so DBRX performs better than llama2-70B. When the batch size exceeds 3, DBRX becomes slower than llama2-70B because of the larger amount of params to load. This benchmark makes sense.
@JadeRay: Arithmetic intensity for matrix multiplication is not directly the batch size. Transformer inference is dominated by matrix multiplications. See an explanation of arithmetic intensity for matrix multiplications here (and the small sketch after this comment).
Not true. When overlapping context and generation, MoEs will be compute bound and will perform in proportion to their live param count.
LLaMa-2-70B is also benchmarked with trt-llm, using the same benchmarking setup and prompts.
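To make the arithmetic-intensity point above concrete, here is a small sketch assuming fp16/bf16 operands and a simple "read every operand once" memory model (no cache reuse): the intensity of a decode-style GEMM tracks the batch dimension when the batch is small relative to the hidden sizes, but it is not literally the batch size.

```python
# FLOPs per byte moved for an (m x k) @ (k x n) matrix multiplication,
# assuming fp16/bf16 (2 bytes per element) and that A, B, and C are each
# read or written exactly once.
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * n * k                                   # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C traffic
    return flops / bytes_moved


# Decode-style GEMM: (batch x hidden) @ (hidden x hidden), with hidden = 8192
# chosen purely for illustration.
for batch in (1, 4, 16, 64, 256):
    print(batch, round(gemm_arithmetic_intensity(batch, 8192, 8192), 1))
```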
I dug a bit deeper into it. In a continuous batching setting, there is extra latency in the iteration where trt-llm removes a request and adds another (context processing for the new incoming request slows the other in-flight requests). For example, I see 40ms/70ms for the DBRX model and 80ms/90ms for the LLaMa-2-70B model. This raises the overall TPOT for the other in-flight requests more for LLaMa-2-70B than for DBRX. @JadeRay: It's possible that in other controlled batching scenarios and with different input/output token counts you see a different behavior. Overall your intuition for batch sizes 4 - 16 is correct. For MoE performance in the bandwidth bound and compute bound regimes, see the excellent analysis by Dmytro here.
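A back-of-the-envelope illustration of that effect, using the per-iteration latencies quoted above; which figure corresponds to the plain decode iteration versus the mixed (decode plus new-request context) iteration, and the fraction of mixed iterations, are assumptions made here for illustration.

```python
# Average TPOT when a fraction of decode iterations also process the context
# of a newly added request. The 40/70 ms and 80/90 ms figures are the ones
# quoted above; mixed_fraction is an assumed value.
def avg_tpot_ms(t_decode_ms: float, t_mixed_ms: float, mixed_fraction: float) -> float:
    return (1.0 - mixed_fraction) * t_decode_ms + mixed_fraction * t_mixed_ms


for name, t_decode, t_mixed in [("DBRX", 40.0, 70.0), ("LLaMa-2-70B", 80.0, 90.0)]:
    print(name, round(avg_tpot_ms(t_decode, t_mixed, mixed_fraction=0.2), 1))
# ~46 ms vs ~82 ms under these assumptions.
```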
@dskhudia In other words, will this conclusion hold for static batching? Without continuous batching, what would the theoretical performance of a MoE model vs. a dense model look like?
@JadeRay: The ratio of total params to live params in DBRX is ~3.7x (132/36). So in the compute bound regime this is what you should expect wrt an equivalent dense model with 132B params. Please note that MoEs have extra compute in the router and other inefficiencies from using GroupedGEMMs for the MoE layers, so you may not reach that 3.7x.
@JadeRay: If your static batch is large enough, you should approach the theoretical compute bound limit of ~3.7x wrt an equivalent 132B dense model.
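A tiny sketch of that compute-bound ceiling, just restating the arithmetic; the router compute and GroupedGEMM inefficiencies mentioned above are not modeled.

```python
# Per-token FLOPs scale with live (activated) params, so relative to a
# hypothetical 132B dense model the ceiling is total / live params.
total_params_b = 132.0
live_params_b = 36.0
print(f"theoretical speedup ~{total_params_b / live_params_b:.2f}x")  # ~3.67x
```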
We have benchmarked DBRX and llama2-70B layer by layer, and we find that the TTFT benefit comes from the lower per-token FLOPs, while the TPOT benefit comes from communication, since DBRX has about half as many layers as llama2-70B. This is a wonderful model. Thanks for all the discussions.
The tech report described the methodology of the inference efficiency measurement, but not in detail. It compared Llama2-70B and DBRX, and we have a great interest in the comparison, so we also carried out some tests in which we spawned different numbers of synchronous clients in order to stress the service at different QPS. The performance we measured differs from the tech report: DBRX is faster than Llama2-70B when the traffic is lower than 0.35 QPS, but the latency vs. QPS curve flips after that. By the way, we use the same prompt length and output length as in the tech report.
So I wonder if you could give more details about how the performance is tested.