Llama.cpp RPC over Ethernet strangely slow #9136
-
Could you provide some more information about the commands and models that you are using? During inference, only the hidden state is transferred across the network after each layer. That state is very small (a few kB), so it's normal not to see huge network traffic. Whether the 3 t/s you are seeing is expected is hard to say without additional information. There is some work pending to optimize the network overhead (#8032), but it is not yet ready for testing.
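To put "very small" in perspective, here is a rough back-of-envelope estimate; the hidden size and fp16 precision below are illustrative assumptions, not values taken from your setup:

```sh
# Rough estimate of the steady-state RPC traffic between two nodes.
# n_embd and the fp16 element size are assumptions for illustration only.
awk 'BEGIN {
    n_embd = 8192                 # assumed hidden size of a large model
    state  = n_embd * 2           # fp16 -> 2 bytes per element
    tps    = 3                    # tokens per second reported in this thread
    printf "hidden state per token : %.1f KiB\n", state / 1024
    printf "traffic at %d t/s       : %.3f Mbit/s per hop\n", tps, state * tps * 8 / 1e6
}'
```

With traffic that low, a faster link by itself is unlikely to change much; the per-token cost comes mostly from request round trips rather than bytes on the wire.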
-
I just did a lot of tests on RPC between two identical Xeon servers, and I think that Reddit post drew the wrong conclusion. I also had to add thread and NUMA configuration to the rpc-server, because it was completely missing from the code.

Update: Regarding the slow speed between the RPC nodes, I noticed it as well. The speed fluctuates between 10-100 MB/s over a stable 1G connection, but as I mentioned, this is a secondary issue.
-
Hey everyone,
I was hoping to get some help with the RPC service in Llama.cpp. I'm running a pair of systems with the latest Llama.cpp, which I compiled myself (with '-DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON' passed to cmake). Each system has two GPUs, all recent discrete GeForce cards (a 4090 and 4060 Tis).
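For context, the build and launch follow the standard RPC recipe from the llama.cpp docs, roughly like this; the model path, IP addresses and ports are placeholders, and the option spellings are from memory, so check rpc-server --help and llama-cli --help:

```sh
# Build with CUDA and the RPC backend enabled (same flags as above).
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release -j

# On the remote machine: one rpc-server per GPU, bound to the LAN interface.
CUDA_VISIBLE_DEVICES=0 ./build/bin/rpc-server -H 0.0.0.0 -p 50052 &
CUDA_VISIBLE_DEVICES=1 ./build/bin/rpc-server -H 0.0.0.0 -p 50053 &

# On the main machine: offload layers to the local GPUs plus the remote servers.
./build/bin/llama-cli -m ./model.gguf -ngl 99 -p "Hello" \
  --rpc 192.168.1.50:50052,192.168.1.50:50053
```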
The trouble is that I recently upgraded that segment of my network to 2.5 Gb Ethernet, because from reading Reddit posts I understood the network would be the limiting factor for Llama.cpp's inference over RPC. Someone in one Reddit thread was even talking about using USB4/Thunderbolt 4 to reach a theoretical 40 Gb/s. The strange thing is that I'm only seeing about 30-50 Mb/s (as in megabit, not gigabit) on the Ethernet link while an inference task is running. This is nowhere near the maximum speed I've seen on the same link for other tasks, such as loading the model (about 1.9-2.1 Gb/s). As a result, the tokens per second are much lower than I would have expected, around 3 t/s.
If I remove one of the cards from the RPC cluster, inference speeds up a little, but the network still doesn't transfer any faster than around 30-50 Mb/s. It makes sense that inference is a bit faster that way, but with less total VRAM the models can't be as large.
(I've also tried several models, with no notable difference in network speed or tokens per second.)
Given that the GPUs' own memory is far faster (the 4060 Ti's GDDR6 runs at 18 Gb/s per pin, roughly 288 GB/s across its 128-bit bus) and PCIe is far faster too (7.877 GB/s for PCIe 4.0 at x4), I don't understand why the network isn't being saturated and why inference runs so much slower. I can't see what's causing this bottleneck.
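Putting those numbers into one unit shows how big the gap is (rough figures; the GDDR6 total assumes the 4060 Ti's 128-bit bus at 18 Gb/s per pin):

```sh
# Quick comparison of the link speeds mentioned above, all in GB/s (approximate).
awk 'BEGIN {
    printf "2.5GbE Ethernet         : %7.2f GB/s\n", 2.5 / 8
    printf "PCIe 4.0 x4             : %7.2f GB/s\n", 7.877
    printf "4060 Ti GDDR6 (128-bit) : %7.2f GB/s\n", 18 * 128 / 8
}'
```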
Are there any obvious things I can check to try to fix this please? Any help is greatly, greatly appreciated, thank you!