Llama.cpp RPC over Ethernet strangely slow #9136
-
Could you provide some more information about the commands and models that you are using? During inference, only the hidden state is transferred across the network after each layer. That state is very small (a few kB), so it's normal not to see huge network traffic. Whether the 3 t/s you are seeing is expected is hard to say without additional information. There is some work pending to optimize the network overhead (#8032), but it is not yet ready for testing.
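To put "very small" in perspective, here is a rough back-of-envelope estimate; the hidden size and fp16 precision below are illustrative assumptions, not values taken from your setup:

```sh
# Rough estimate of the steady-state RPC traffic between two nodes.
# n_embd and the fp16 element size are assumptions for illustration only.
awk 'BEGIN {
    n_embd = 8192                 # assumed hidden size of a large model
    state  = n_embd * 2           # fp16 -> 2 bytes per element
    tps    = 3                    # tokens per second reported in this thread
    printf "hidden state per token : %.1f KiB\n", state / 1024
    printf "traffic at %d t/s       : %.3f Mbit/s per hop\n", tps, state * tps * 8 / 1e6
}'
```

With traffic that low, a faster link by itself is unlikely to change much; the per-token cost comes mostly from request round trips rather than bytes on the wire.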
-
I just did a lot of tests on RPC between two identical Xeon servers, and I think that Reddit post drew the wrong conclusion. I also had to add thread and NUMA configuration to the rpc-server, because it was completely missing from the code.

Update: Regarding the slow speed between the RPC nodes, I noticed it as well. The speed fluctuates between 10-100 MB/s over a stable 1G connection, but as I mentioned, this is a secondary issue.
-
Hey everyone,
I was hoping to get some help with the RPC service in Llama.cpp. I'm running a pair of systems with the latest Llama.cpp, which I compiled myself (with '-DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON' passed to cmake). Each system has two GPUs, all recent discrete GeForce cards (a 4090 and 4060 Tis).
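For context, the build and launch follow the standard RPC recipe from the llama.cpp docs, roughly like this; the model path, IP addresses and ports are placeholders, and the option spellings are from memory, so check rpc-server --help and llama-cli --help:

```sh
# Build with CUDA and the RPC backend enabled (same flags as above).
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release -j

# On the remote machine: one rpc-server per GPU, bound to the LAN interface.
CUDA_VISIBLE_DEVICES=0 ./build/bin/rpc-server -H 0.0.0.0 -p 50052 &
CUDA_VISIBLE_DEVICES=1 ./build/bin/rpc-server -H 0.0.0.0 -p 50053 &

# On the main machine: offload layers to the local GPUs plus the remote servers.
./build/bin/llama-cli -m ./model.gguf -ngl 99 -p "Hello" \
  --rpc 192.168.1.50:50052,192.168.1.50:50053
```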
The trouble is that I recently upgraded that segment of my network to 2.5 Gb Ethernet, because from reading Reddit posts I understood the network would be the limiting factor for Llama.cpp's inference over RPC. Someone in one Reddit thread was even talking about using USB4/Thunderbolt 4 to reach a theoretical 40 Gb/s. The strange thing is that I'm only seeing about 30-50 Mb/s (as in megabit, not gigabit) on the Ethernet link while an inference task is running. This is nowhere near the maximum speed I've seen on the same link for other tasks, such as loading the model (about 1.9-2.1 Gb/s). As a result, the tokens per second are much lower than I would have expected, around 3 t/s.
If I remove one of the cards from the RPC cluster, inference speeds up a little, but the network still doesn't transfer any faster than around 30-50 Mb/s. It makes sense that inference is a bit faster that way, but with less total VRAM the models can't be as large.
(I've also tried several models, with no notable difference in network speed or tokens per second.)
Given that the GPUs' own memory is far faster (the 4060 Ti's GDDR6 runs at 18 Gb/s per pin, roughly 288 GB/s across its 128-bit bus) and PCIe is far faster too (7.877 GB/s for PCIe 4.0 at x4), I don't understand why the network isn't being saturated and why inference runs so much slower. I can't see what's causing this bottleneck.
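Putting those numbers into one unit shows how big the gap is (rough figures; the GDDR6 total assumes the 4060 Ti's 128-bit bus at 18 Gb/s per pin):

```sh
# Quick comparison of the link speeds mentioned above, all in GB/s (approximate).
awk 'BEGIN {
    printf "2.5GbE Ethernet         : %7.2f GB/s\n", 2.5 / 8
    printf "PCIe 4.0 x4             : %7.2f GB/s\n", 7.877
    printf "4060 Ti GDDR6 (128-bit) : %7.2f GB/s\n", 18 * 128 / 8
}'
```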
Are there any obvious things I can check to try to fix this please? Any help is greatly, greatly appreciated, thank you!