C++ Benchmarking Tutorial using Google Benchmark
This repository is a practical example of common pitfalls in benchmarking high-performance applications. Its extensively commented source is also available in a more digestible article form.
Clone the repository and execute the following commands to build and run the tutorial:
cmake -B ./build_release
cmake --build ./build_release --config Release
./build_release/tutorial
# For JSON output
./build_release/tutorial --benchmark_format=json
# For output to a file
./build_release/tutorial --benchmark_out=results.json
# To match a specific benchmark
./build_release/tutorial --benchmark_filter=i32_addition
While primarily designed for the GNU Compiler Collection (GCC), this tutorial is also compatible with Clang.
Note that certain features may not work with LLVM, MSVC, ICC, NVCC, and other compilers.
It includes practical demonstrations of Parallel STL in GCC, focusing on different std::execution policies in the std::sort algorithm.
For advanced parallel algorithm benchmarks, see ashvardanian/ParallelReductionsBenchmark.
There are more articles on benchmarking in the "Less Slow" blog:
- Optimizing C++ & CUDA for High-Speed Parallel Reductions
- Challenges in Maximizing DDR4 Bandwidth
- Comparing GCC Compiler and Manual Assembly Performance
- Enhancing SciPy Performance with AVX-512 & SVE
To enhance stability and reproducibility, use the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmarks as described in the Google Benchmark documentation.
./build_release/tutorial --benchmark_enable_random_interleaving=true
Use Google Benchmark's compare.py tool for CLI-based comparison of benchmarking results from different JSON files.
The repository contains screenshots of the comparison of the following benchmarks:
- AMD ThreadRipper PRO 3995WX against Dual AMD EPYC 7302 16-Core CPUs: screenshot
- AMD ThreadRipper PRO 3995WX with -O3 vs -O1 optimization levels: screenshot
Google Benchmark supports User-Requested Performance Counters through libpfm.
Note that collecting these may require sudo privileges.
sudo ./build_release/tutorial --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"
Alternatively, use the Linux perf tool for performance counter collection:
sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF ./build_release/tutorial --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
Example perf stat output on AMD ThreadRipper PRO 3995WX:
Performance counter stats for 'taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF ./build_release/tutorial --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort':
23048674.55 msec task-clock # 35.901 CPUs utilized
6627669 context-switches # 0.288 K/sec
75843 cpu-migrations # 0.003 K/sec
119085703 page-faults # 0.005 M/sec
91429892293048 cycles # 3.967 GHz (83.33%)
13895432483288 stalled-cycles-frontend # 15.20% frontend cycles idle (83.33%)
3277370121317 stalled-cycles-backend # 3.58% backend cycles idle (83.33%)
16689799241313 instructions # 0.18 insn per cycle
# 0.83 stalled cycles per insn (83.33%)
3413731599819 branches # 148.110 M/sec (83.33%)
11861890556 branch-misses # 0.35% of all branches (83.34%)
642.008618457 seconds time elapsed
21779.611381000 seconds user
1244.984080000 seconds sys