Should support profile-guided optimization #1334
Comments
When we used PGO in conjunction with LTO, the gains were more significant. GET RANGE with a range of 10 keys: 5.55% (was 3.71% without LTO)
Is the gain above the best we can expect from combining PGO and LTO?
I was unaware of the existence of https://llvm.org/docs/CommandGuide/llvm-profdata.html, which means that for whatever benchmark we end up using, we can run it against an ssd and a memory config (or others, to make sure our code is actually exercised), and then combine the resulting profiles before feeding them back into the compiler. Was your profile generated from an everything-in-one-process run? Simulation? I do agree that deciding which workload to use to generate the profile is a potentially difficult question in and of itself.
We don't know whether those numbers are the best we can get, but they are probably not far off, because we used the same benchmark as the instrumentation workload for PGO. We used our standalone fdb_c binding benchmark against an everything-in-one-process fdbserver (not simulation).
While PGO is definitely an interesting optimization option to explore, I want to point out that, based on the perf gain shown in the preliminary tests above, it might not be enough to justify the effort, given the following (quoted from the Clang user's manual):
I thought PGO is something like
If code is never exercised, or was never instrumented, the compiler should apply the same optimizations as if PGO were not used. What we need to be careful about is when the instrumentation run does execute a code path, but not in a representative way. For example, if there's an if statement like this:
If the instrumentation workload takes the error handling path, then PGO will favor the error handling path, and the normal execution path will be penalized by unnecessary branch misses. By the way, 50% from PGO is a little too unrealistic. 10-15% is pretty good, in my opinion, where the baseline build was built with
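To make the concern about an unrepresentative instrumentation workload concrete, here is a minimal C++17 sketch. It is not the snippet from the original comment, and `Store`, `get`, and `countErrorPath` are hypothetical names rather than FoundationDB APIs. It only shows a branch with an error-handling path and a helper that counts how often a given workload takes that path.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical store; these names are illustrative and not FoundationDB APIs.
struct Store {
    std::unordered_map<std::string, std::string> data;

    std::optional<std::string> get(const std::string& key) const {
        auto it = data.find(key);
        if (it == data.end()) {
            // Error-handling path: assumed to be rare in production.
            return std::nullopt;
        }
        // Normal path: the branch a representative profile should mark as hot.
        return it->second;
    }
};

// Counts how many lookups in a workload take the error-handling path.
std::size_t countErrorPath(const Store& store, const std::vector<std::string>& keys) {
    std::size_t misses = 0;
    for (const auto& key : keys) {
        if (!store.get(key).has_value()) {
            ++misses;
        }
    }
    return misses;
}
```

If a candidate instrumentation workload makes `countErrorPath` return a large share of the total lookups, a profile collected from it would likely record the error branch as the common case, which is the situation described above.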
Could anyone clarify the current status of PGO in FoundationDB? According to results from many other projects (including databases like PostgreSQL, Redis, MongoDB, and ClickHouse), PGO helps a lot with achieving better performance. If we are not ready right now to integrate PGO into the build process, can we at least add a note about PGO to the FoundationDB documentation? That way, users and maintainers will know about an additional way to achieve better performance with FDB. Here are examples of such documentation in other projects:
As an additional idea, I suggest testing LLVM BOLT as a post-PGO optimization step. More materials about PGO, BOLT, and related topics can be found at https://github.com/zamazan4ik/awesome-pgo. Friendly pinging @kaomakino (as a TS), and @jzhou77 @xis19 @kakaiu as active FDB contributors.
FDO is supported for Clang builds. We have also evaluated BOLT with gcc (by passing
Great! Did you measure the performance improvements from this on FoundationDB? If yes, could you please share the results? Are they the same as in the starting post of this issue? Also, it would be great if you added information about building FoundationDB with FDO to the documentation. That way, users and maintainers will know about an additional way to optimize FDB performance. Here are some examples:
Did you test BOLT in addition to FDO (that is, running BOLT on a binary already optimized with FDO)? According to my tests with YDB (ydb-platform/ydb#140 (comment)) and the Rustc results, it helps (Rustc is already optimized with FDO + BOLT on the Linux platform). Did you test BOLT after FDO on a Clang build? Are the FDB binaries provided here optimized with FDO or not?
We should have a mechanism to build FDB with profile-guided optimization (PGO). Our preliminary benchmark results showed a 4-12% performance improvement, depending on the workload.
GET RANGE with the range of 10 keys: 3.71%
GET RANGE with the range of 50 keys: 12.32%
GET and SET on the same key: 9.62%
SET a new unique key: 6.08%
Mix of 8 GETs, 1 GET & SET, 1 SET: 4.17%
Selecting the most effective instrumentation workload should be discussed separately.