How inference efficiency is measured #9
@FengcunLi The performance reported in the technical report is measured using TRT-LLM with the model support in NVIDIA/TensorRT-LLM#1363 as it behaves today. We use our own webserver with TRT-LLM as the backend. More details about the methodology are available here. There are a few key things to note:
I am curious what your benchmark methodology and system are. Could you share some numbers as well?
Thank you for your explanation. However, I still have a question about this description: “LLaMa-2-70B being a dense model reaches the compute bound regime earlier and afterwards doubling of concurrency just doubles the latency thus bringing down per user throughput by ~2x. DBRX/Mixtral being MoE models, reach the compute bound regime at larger concurrency. Processing of context and generation also affects how much effective batch size the system sees.”

For an H100-80G (NVL) system, the model theoretically reaches the compute bound regime only after the batch size exceeds 507 (= PeakFlops / PeakBandwidth). (For H100-80G SXM, the batch size at the roofline inflection point is 253.) Considering the latency perceived by users, this is a batch size that is difficult to reach, so we can assume the generation phase of both the dense model and the MoE model is dominated by bandwidth.

On the other hand, a 36B/132B MoE model has far more params to load than a 70B dense model. For DBRX (36B activated params, 132B total params, 16 experts, top-k = 4), when the batch size reaches 3 the expected activated params are 132B * (1 - (1 - 1/4)**3) ≈ 76B; from then on, the IO-bound problem becomes much heavier than for a 70B dense model, so the generation phase becomes slower for DBRX than for llama2-70B. Am I right about this?

I am also curious how you benchmark llama2-70B; is it tested on TRT-LLM, same as DBRX? Thank you for your time again.
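Below is a minimal sketch of the arithmetic in the comment above, written as Python for concreteness. It assumes ~2 FLOPs per parameter per token, 16-bit weights, and (as in the comment) the simplification that all 132B params live in the experts; the GPU peak numbers are left as function inputs rather than hard-coded.

```python
# Sketch of the roofline inflection point and the expected-activated-params
# argument from the comment above. Numbers and simplifications follow the
# comment; they are not official DBRX or hardware figures.

def roofline_batch_size(peak_flops: float, peak_bw_bytes_per_s: float) -> float:
    """Batch size at which a decode step stops being bandwidth-bound.

    Assumes ~2 FLOPs per parameter per token and 16-bit weights (2 bytes per
    parameter read), so the inflection point is roughly the GPU's FLOP/byte
    ratio. Plugging in the peaks assumed in the comment gives ~507 (H100 NVL)
    and ~253 (H100 SXM).
    """
    return peak_flops / peak_bw_bytes_per_s


def expected_loaded_params_b(total_params_b: float, n_experts: int,
                             top_k: int, batch_size: int) -> float:
    """Expected params (in billions) touched per decode step, treating all
    params as expert params (the comment's simplification)."""
    p_expert_active = 1.0 - (1.0 - top_k / n_experts) ** batch_size
    return total_params_b * p_expert_active


# DBRX-style config: 132B total params, 16 experts, top-4 routing.
print(expected_loaded_params_b(132, 16, 4, 1))  # ~33B, well below a 70B dense model
print(expected_loaded_params_b(132, 16, 4, 3))  # ~76B, now above a 70B dense model
```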
@dskhudia Our test results are shown below. The QPS is measured on the client side, meaning how many requests the server handles per second. The QPS range is obtained by varying the number of concurrent clients over [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32]. Each client's behavior is synchronous: it waits for the response from the server before sending a new request.
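For concreteness, here is a rough sketch of the client-side harness described above, assuming a plain HTTP inference endpoint; the URL, payload, and per-client request count are placeholders, not the actual setup used for the numbers being discussed.

```python
# Hypothetical harness: N synchronous clients, each waiting for the previous
# response before sending the next request; QPS is measured client-side as
# completed requests divided by wall-clock time.
import threading
import time

import requests  # assumes a plain HTTP inference endpoint

URL = "http://localhost:8000/generate"          # placeholder endpoint
PAYLOAD = {"prompt": "...", "max_tokens": 256}  # placeholder request body


def client_loop(n_requests: int, latencies: list) -> None:
    for _ in range(n_requests):
        start = time.time()
        requests.post(URL, json=PAYLOAD)  # block until the full response arrives
        latencies.append(time.time() - start)


def run(concurrency: int, n_requests_per_client: int = 10) -> float:
    latencies: list = []
    threads = [
        threading.Thread(target=client_loop, args=(n_requests_per_client, latencies))
        for _ in range(concurrency)
    ]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.time() - t0
    return len(latencies) / wall  # client-side QPS


for c in [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32]:
    print(c, run(c))
```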
That's what I said! When the batch size is smaller than 3, the MoE model loads fewer params than llama2-70B, so DBRX performs better than llama2-70B. When the batch size exceeds 3, DBRX becomes slower than llama2-70B because of the larger amount of params to load. This benchmark makes sense.
@JadeRay: Arithmetic intensity for matrix multiplication is not directly the batch size. Transformer inference is dominated by matrix multiplications. See an explanation of arithmetic intensity for matrix multiplications here (and the small sketch after this comment).
Not true. When overlapping context and generation, MoEs will be compute bound and will perform in proportion to their live param count.
LLaMa-2-70B is also benchmarked with trt-llm, using the same benchmarking setup and prompts.
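To make the arithmetic-intensity point above concrete, here is a small sketch assuming fp16/bf16 operands and a simple "read every operand once" memory model (no cache reuse): the intensity of a decode-style GEMM tracks the batch dimension when the batch is small relative to the hidden sizes, but it is not literally the batch size.

```python
# FLOPs per byte moved for an (m x k) @ (k x n) matrix multiplication,
# assuming fp16/bf16 (2 bytes per element) and that A, B, and C are each
# read or written exactly once.
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * n * k                                   # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C traffic
    return flops / bytes_moved


# Decode-style GEMM: (batch x hidden) @ (hidden x hidden), with hidden = 8192
# chosen purely for illustration.
for batch in (1, 4, 16, 64, 256):
    print(batch, round(gemm_arithmetic_intensity(batch, 8192, 8192), 1))
```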
I dug a bit deeper into it. In a continuous batching setting, there is extra latency in the iteration where trt-llm removes a request and adds another (context processing for the new incoming request slows the other in-flight requests). For example, I see 40ms/70ms for the DBRX model and 80ms/90ms for the LLaMa-2-70B model. This raises the overall TPOT for the other in-flight requests more for LLaMa-2-70B than for DBRX. @JadeRay: It's possible that in other controlled batching scenarios and with different input/output token counts you see a different behavior. Overall your intuition for batch sizes 4 - 16 is correct. For MoE performance in the bandwidth bound and compute bound regimes, see the excellent analysis by Dmytro here.
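A back-of-the-envelope illustration of that effect, using the per-iteration latencies quoted above; which figure corresponds to the plain decode iteration versus the mixed (decode plus new-request context) iteration, and the fraction of mixed iterations, are assumptions made here for illustration.

```python
# Average TPOT when a fraction of decode iterations also process the context
# of a newly added request. The 40/70 ms and 80/90 ms figures are the ones
# quoted above; mixed_fraction is an assumed value.
def avg_tpot_ms(t_decode_ms: float, t_mixed_ms: float, mixed_fraction: float) -> float:
    return (1.0 - mixed_fraction) * t_decode_ms + mixed_fraction * t_mixed_ms


for name, t_decode, t_mixed in [("DBRX", 40.0, 70.0), ("LLaMa-2-70B", 80.0, 90.0)]:
    print(name, round(avg_tpot_ms(t_decode, t_mixed, mixed_fraction=0.2), 1))
# ~46 ms vs ~82 ms under these assumptions.
```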
@dskhudia In other words, will this conclusion hold for static batching? Without continuous batching, what would the theoretical performance of a MoE model vs. a dense model look like?
@JadeRay: The ratio of total params to live params in DBRX is ~3.7x (132/36). So in the compute bound regime this is what you should expect wrt an equivalent dense model with 132B params. Please note that MoEs have extra compute in the router and other inefficiencies from using GroupedGEMMs for the MoE layers, so you may not reach that 3.7x.
@JadeRay: If your static batch is large enough, you should approach the theoretical compute bound limit of ~3.7x wrt an equivalent 132B dense model.
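A tiny sketch of that compute-bound ceiling, just restating the arithmetic; the router compute and GroupedGEMM inefficiencies mentioned above are not modeled.

```python
# Per-token FLOPs scale with live (activated) params, so relative to a
# hypothetical 132B dense model the ceiling is total / live params.
total_params_b = 132.0
live_params_b = 36.0
print(f"theoretical speedup ~{total_params_b / live_params_b:.2f}x")  # ~3.67x
```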
We have benchmarked DBRX and llama2-70B layer by layer, and we find that the TTFT benefit comes from the lower per-token FLOPs, while the TPOT benefit comes from communication, since DBRX has about half as many layers as llama2-70B. This is a wonderful model. Thanks for all the discussions.
The tech report described the methodology of the inference efficiency measurement, but not in detail. It compared Llama2-70B and DBRX, and we have a great interest in the comparison, so we also carried out some tests in which we spawned different numbers of synchronous clients in order to stress the service at different QPS. The performance we measured differs from the tech report: DBRX is faster than Llama2-70B when the traffic is lower than 0.35 QPS, but the latency vs. QPS curve flips after that. By the way, we use the same prompt length and output length as in the tech report.
So I wonder if you could give more details about how the performance is tested.