Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #287

zamazan4ik · 2023-09-14T01:02:46Z

zamazan4ik
Sep 14, 2023

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. According to the tests (not only mine), PGO can help with achieving better performance. That's why I think trying to optimize Rathole with PGO can be a good idea. However, I didn't expect huge improvements since most of the code will be IO-bound and rely on an internal OS network stack. So I did some benchmarks on Rathole and want to share my results here.

Test environment

Fedora 38
Linux kernel 6.4.13-200.fc38.x86_64
AMD Ryzen 9 5900x
48 Gib RAM
SSD Samsung 980 Pro 2 Tib
Rust rustc 1.72.0 (5680fa18f 2023-08-23)
Rathole version: latest commit (d2fe586f7b3caceda542ef5030be0257d6d1401c commit) in the main branch

Also, here is mine iperf3 -v:

iperf 3.14 (cJSON 1.7.15)
Linux fedora 6.4.13-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 30 17:07:31 UTC 2023 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, SCTP, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication, bind to device, support IPv4 don't fragment

Benchmark method

I use the methodology from https://github.com/rapiz1/rathole/blob/main/docs/benchmark.md with iperf3 variant. Measurements are done for TCP mode. Turbo boost is disabled. All binaries (rathole and iperf) are running on different core via taskset like taskset -c 0 rathole_release rathole_server.toml and taskset -c 3 iperf3 -c 127.0.0.1 -p 5202 -t 60 - it's done to reduce CPU scheduler noise.

Release rathole is built with cargo build --release, Release + PGO is built with cargo-pgo (see the link in the end) with cargo pgo build + collect benchmarks + cargo pgo optimize build.

As a training set, I used the same benchmark load with iperf3. I built two PGO rathole versions: client-optimized and server-optimized. For each version, I collected the corresponding workload (for the server I ran Instrumented Rathole on the server side, for the client I ran Instrumented Rathole on the client side, respectively). I didn't test merging profiles into one but expect the results would be the same.

Results

The results are presented in the iperf3 format (partially cut). All measurements are done on the same hardware/software, with the same background noise (as much as I can guarantee of course).

Server mode

Rathole Release in the server mode (4 measurements):

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  43.9 GBytes  6.29 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  43.9 GBytes  6.29 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  44.2 GBytes  6.33 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  44.2 GBytes  6.33 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  44.0 GBytes  6.30 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  44.0 GBytes  6.30 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  43.9 GBytes  6.29 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  43.9 GBytes  6.28 Gbits/sec                  receiver

Rathole Release + PGO-optimized in the server mode:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  44.7 GBytes  6.40 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  44.7 GBytes  6.39 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  45.2 GBytes  6.46 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  45.2 GBytes  6.46 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  45.3 GBytes  6.48 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  45.3 GBytes  6.48 Gbits/sec                  receiver

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  44.9 GBytes  6.43 Gbits/sec    0             sender
[  5]   0.00-60.05  sec  44.9 GBytes  6.42 Gbits/sec                  receiver

I've rechecked the results above multiple times (running binaries in different order, different times, etc.) - PGO-optimized binary is consistently faster than the usual Release build. In both modes, the rathole server was capped by CPU at 100% at one core.

Rathole in the client mode

Well, here is another story. I didn't find a way to cap Rathole in client mode by CPU in my setup. Probably, it could be done via downclocking one CPU core or via some Cgroup magic - I just didn't dig a lot of time here. Instead, I measured CPU consumption by different binaries in the same benchmark as above.

According to my multiple runs, Release + PGO-optimized consumes 0.5-1.5% less CPU time than usual Release build.

Possible future steps

I can suggest the following action points:

Perform the benchmarks in other modes (like HTTP workload). Here I expect could be some a bit higher performance improvements but we need to check it.
Add a note about improvements in Rathole's performance with PGO. Even a few percent can be important to someone (especially if Rathole cares about performance).
Providing an easier way (e.g. a build option) to build scripts with PGO can be useful for the end-users and maintainers since they will be able to optimize Rathole according to their own workloads.
Optimize pre-built Rathole binaries with PGO (if you think the results above are worth it).

Maybe testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too but I recommend starting from the usual PGO.

For the Rust projects, I suggest PGO optimizing with cargo-pgo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #287

{{title}}

Replies: 0 comments

Select a reply

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #287

zamazan4ik Sep 14, 2023

Test environment

Benchmark method

Results

Server mode

Rathole in the client mode

Possible future steps

Replies: 0 comments

zamazan4ik
Sep 14, 2023