gh-90536: Add support for the BOLT post-link binary optimizer #95908
Conversation
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, and adds some non-PIE flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to be a bit more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%). The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

This change uses the existing PGO profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing PGO task.
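For reference, a sketch of the build flow this change enables. The `configure` flag is the one added by this PR; the manual steps below are an abbreviated outline of BOLT's instrumentation workflow (file names are illustrative, and the full optimization flags live in the Makefile rule discussed later in this thread):

```shell
# Configure with BOLT support and build as usual. Requires llvm-bolt and
# merge-fdata from an LLVM toolchain on PATH.
./configure --enable-bolt
make

# Roughly what the new make rule does under the hood:
# 1. produce an instrumented binary,
llvm-bolt python -instrument -o python.bolt_inst
# 2. run the profile task to collect .fdata profile files,
./python.bolt_inst -m test --pgo || true
# 3. merge the per-run profiles and rewrite the original binary.
merge-fdata python.*.fdata > python.fdata
llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections
```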
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
Unfortunately, changes to the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go into 3.12.
Nice work! I will take a look at this PR by this weekend.
Two things need to be checked.

1. I failed to build the binary with this patch. This could be due to a BOLT bug, but I would like to know which BOLT version you used. -> solved
```
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
 #0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
 #4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
 #5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
 #6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
 #7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
 #8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
 #9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
#10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
#12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
```
2. While profiling, I met a test failure. Would you like to check that the optimized binary passes all the standard Python tests (e.g. `python -m test`)? I met a related issue with my last attempt, and it was solved by profiling through `python -m test`. -> solved
```
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c
Total duration: 1 min 21 sec
Tests result: FAILURE
```
I will share further investigation into this patch.
FYI, this is my environment.
- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Hmm, I will try to build BOLT from the LLVM 14.0.6 sources.

I found why BOLT failed; I will downgrade gcc to version 10.
Thanks for the work! The whole pipeline works correctly.
Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible, https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst too; I will update the What's New entry if you are too busy.)
But please emphasize that this feature is experimental optimization support.
I am going to measure the performance enhancement soon through pyperformance, and also the L1 i-cache miss ratio.
It looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.11/3.12 yet, right? Please let me know if I'm wrong.
Plus, please add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
@gvanrossum @kmod cc @markshannon Interesting result!
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0). But let's measure the benchmark on the Faster CPython machine after the PR is merged.
Makefile.pre.in (outdated)

```makefile
@@ -640,6 +640,15 @@ profile-opt: profile-run-stamp
	-rm -f profile-clean-stamp
	$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

bolt-opt: @PREBOLT_RULE@
	rm -f *.fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Suggested change:

```diff
-@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Makefile.pre.in (outdated)

```makefile
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
Suggested change:

```diff
-@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
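Because `-instrumentation-file-append-pid` writes one profile per process, the `@MERGE_FDATA@` step combines them by summing the sample counts of identical records. A toy Python sketch of that merging idea (a simplified line format, not the real `.fdata` format):

```python
from collections import Counter

def merge_profiles(texts):
    """Sum sample counts for identical branch keys across profile dumps.

    Each dump has one "key... count" line per record; records repeated
    across dumps (e.g. from different PIDs) have their counts added.
    """
    totals = Counter()
    for text in texts:
        for line in text.splitlines():
            if not line.strip():
                continue
            key, count = line.rsplit(" ", 1)
            totals[key] += int(count)
    return totals

# Two per-PID dumps that both recorded the same hot branch.
pid_1 = "func_a func_b 10\nfunc_a func_c 3"
pid_2 = "func_a func_b 5"
merged = merge_profiles([pid_1, pid_2])
print(merged["func_a func_b"])  # → 15
```

This is only the aggregation concept; the real tool also understands BOLT's record types and header lines.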
I succeeded in getting cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report.

- Environment
- Binary size
- I-cache miss
- Benchmark (1.01x faster): https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
@aaupov I would like to recommend creating an issue for your suggestions on https://github.com/faster-cpython/ideas or https://github.com/python/cpython/issues. I think the faster-cpython repo is more appropriate :)
Using the patch adapted from python/cpython#95908
Another question, and an important consideration for performance tuning: GCC PGO and Clang PGO are different, and GCC's PGO instrumentation (`-fprofile-generate`) can collect more detailed data than Clang's. So it would be nice to add new flags such as `--enable-lto-gcc` and `--enable-pgo-gcc`, while considering the reordering flags that BOLT needs at the Clang level,
and one compile chain completely in Clang. Thank you very much!
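For context on the comparison above: the two compilers use different PGO pipelines and flag spellings. A rough side-by-side sketch (file names are illustrative):

```shell
# GCC PGO: instrument, run a training workload, rebuild using the profile.
gcc -O2 -fprofile-generate -o app app.c
./app                    # writes .gcda profile files next to the objects
gcc -O2 -fprofile-use -o app app.c

# Clang front-end PGO: profiles must be merged explicitly with llvm-profdata.
clang -O2 -fprofile-instr-generate -o app app.c
LLVM_PROFILE_FILE=app.profraw ./app
llvm-profdata merge -output=app.profdata app.profraw
clang -O2 -fprofile-instr-use=app.profdata -o app app.c
```

CPython's build system picks the right spelling per compiler; whether GCC's profile data is actually richer than Clang's is the commenter's claim, not something this sketch demonstrates.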