**NOTE: THIS PACKAGE IS DEPRECATED. PLEASE SWITCH OVER TO USING THE BenchmarkTools.jl PACKAGE**
A package to make Julia benchmarking both easy and rigorous. See the design section for details about what we mean by rigor.
For basic benchmarking, you can use the `@benchmark` macro, which takes in a Julia function call expression and returns the results of benchmarking that expression:
```julia
using Benchmarks

@benchmark sin(2.0)
@benchmark sum(rand(1000))
@benchmark rand(1:100)
@benchmark (3.0+5im)^3.2
@benchmark svd(rand(10, 10))
```
The `@benchmark` macro will repeatedly call the function many times in order to rigorously estimate its actual execution time (see below for more details). It is designed to test the performance of a single function as a complementary tool alongside `@code_warntype` and `@code_llvm`. To this end, all arguments to the benchmarked function are evaluated before benchmarking it; they are considered setup. This means that, unlike the built-in `@time` macro, `@benchmark sum(rand(1000))` will not include the time or allocations involved in generating the 1000-element random vector. It also means that it will repeatedly sum the same vector many times, so if there is additional variance in the timings due to data dependence, a single `@benchmark` call will not uncover it.
For more advanced usage (and for all automated usage), you need to use a lower-level interface than the `@benchmark` macro. Here we give a minimal example of that lower-level interface:
```julia
import Benchmarks

Benchmarks.@benchmarkable(
    sin_benchmark!, # benchmark name
    nothing,        # setup expression
    sin(2.0),       # benchmark "core"
    nothing         # teardown expression
)

r = Benchmarks.execute(sin_benchmark!)
```
In the design section, we'll describe the way that the lower-level interface works.
Benchmarking is hard. To do it well, you have to care about the details of how code is executed and the statistical challenges that make it hard to generalize correctly from benchmark data. To explain the rationale for the design of this package, we discuss the problems it is designed to solve.
The first problem the Benchmarks package tries to solve is measurement error: benchmarking any expression that evaluates faster than the system clock can resolve risks measuring the clock's behavior rather than the expression's performance. Naive estimates, like those generated by the `@time` macro, are often totally inaccurate. To convince yourself, consider the following timings:
```julia
@time rand(100);
@time sum(rand(100));
@time sum(rand(100));
```
On my system, the results of this code often suggest that `sum(rand(100))` can be evaluated faster than `rand(100)`. That's obviously absurd -- and the timings returned by `@time` are unusably inaccurate for small functions like these.
The reason for that inaccuracy is that summing 100 elements can occur much faster than the system clock's finest resolution. As such, you're almost exclusively measuring variability in the system clock when you evaluate `@time sum(rand(100))`.
To deal with this, the Benchmarks package exploits a simple linear relationship: evaluating an expression N times in a row should take approximately N times as long as evaluating the same expression exactly once. This linear relationship holds almost perfectly as N grows. Thus, we solve the problem of measurement error by measuring the time it takes to evaluate `sum` or `rand` a very large number of times. Then we apply linear regression to estimate the time it takes to evaluate each function exactly once.
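To make the idea concrete, here is a minimal sketch of that regression, written independently of the package's internals (the function name and the evaluation counts are arbitrary choices for illustration):

```julia
# Minimal sketch of the OLS idea described above, independent of the
# package's internals: time the expression for several evaluation
# counts, then take the OLS slope of total time against count, which
# estimates the time for a single evaluation.
function estimate_ns_per_eval(f, counts = 10_000:10_000:200_000)
    xs = Float64[]
    ys = Float64[]
    for n in counts
        t0 = time_ns()
        for _ in 1:n
            f()
        end
        push!(xs, n)                 # number of evaluations in this sample
        push!(ys, time_ns() - t0)    # total elapsed time in nanoseconds
    end
    mx = sum(xs) / length(xs)
    my = sum(ys) / length(ys)
    # OLS slope: sum of cross-products over sum of squared deviations
    return sum((xs .- mx) .* (ys .- my)) / sum((xs .- mx) .^ 2)
end

estimate_ns_per_eval(() -> sum(rand(100)))
```

A naive sketch like this is still vulnerable to the compiler optimizing the timed call away; the package guards against that, as discussed in the section on unwanted optimizations below.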
Since the `@benchmark` macro treats its arguments as setup, it disentangles the two operations. You can immediately see that summing is an order of magnitude faster than the array allocation and random number generation. If you'd like to measure the two operations together, it's quite simple to create and benchmark a temporary function:
```julia
f() = sum(rand(1000))
@benchmark f()
```
When you repeatedly evaluate the same expression, you find that the timings you measure are not all the same. Instead, there is often substantial variability across measurements.
Traditional statistical methods are often employed to resolve this problem. The Benchmarks package tries to estimate the average time that it would take to evaluate an expression. Because the average is estimated from a small sample of measurements, we have to acknowledge that our estimate is uncertain. We do this by reporting 95% confidence intervals (hereafter called CI's) for the average time per evaluation.
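As a generic illustration (the package's own estimator is more careful than this simple sketch), a 95% CI for the mean can be formed from the sample mean and its standard error:

```julia
# Generic illustration of a 95% confidence interval for a mean using the
# normal approximation; the package's actual estimator is more careful.
function mean_ci95(samples)
    n = length(samples)
    m = sum(samples) / n                            # sample mean
    s = sqrt(sum((samples .- m) .^ 2) / (n - 1))    # sample standard deviation
    half = 1.96 * s / sqrt(n)                       # half-width of the interval
    return (m - half, m + half)
end

mean_ci95([3.1, 3.0, 3.2, 3.05, 2.95])   # e.g. per-evaluation timings in ns
```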
Getting the best estimate of the average time requires gathering as many samples as possible. Most people want their benchmarks to run in a finite amount of time, so we need to impose some budget. The Benchmarks package exposes both a sample budget and a time budget. For very long computations, the time budget prevents the benchmarking process from taking more than 10s. This can be configured.
For mid-range computations, we also allow users to insist on acquiring at most 100 samples. This can be effective if the time budget seems excessive for the purpose of estimating CI's accurately.
For very fast computations that require OLS modeling, we ignore the sample budget, although we respect the user's specified time budget. This is because users cannot reasonably be expected to know how many samples they need to gather to estimate the average time accurately.
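Concretely, a sampling loop that respects both budgets might look like the following sketch. The 100-sample and 10-second defaults come from the text above, but the function and keyword names here are illustrative, not the package's API:

```julia
# Sketch of a sampling loop bounded by both a sample budget and a time
# budget. Names and defaults are illustrative, not the package's API.
function collect_samples(f; max_samples = 100, time_budget = 10.0)
    samples = Float64[]
    start = time()
    while length(samples) < max_samples && (time() - start) < time_budget
        t0 = time_ns()
        f()
        push!(samples, time_ns() - t0)   # one sample: elapsed nanoseconds
    end
    return samples
end

collect_samples(() -> sum(rand(1000)))
```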
Sometimes we want to benchmark functions that depend upon external resources. For example, we may want to time how long it takes to parse a file that we must first download.
To handle these scenarios, the `@benchmarkable` macro takes in three expressions:

- `setup`: This expression will be evaluated once before the benchmark starts executing.
- `core`: This is the core expression we want to benchmark.
- `teardown`: This expression will be evaluated once after the core expression stops executing.
Note that it's not possible to affect the core's function arguments through the setup or teardown expressions.
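As a concrete (hypothetical) illustration, the following benchmark stages a small file in its setup expression, parses it in the core, and removes it in the teardown. The file name and contents are placeholders; only the `@benchmarkable` signature comes from the earlier example. In the download scenario above, the `download` call would similarly live in the setup expression.

```julia
import Benchmarks

# Hypothetical example: stage a file in setup, parse it in the core,
# and delete it in teardown. The file name and contents are placeholders.
Benchmarks.@benchmarkable(
    parse_file_benchmark!,                                      # benchmark name
    open(io -> write(io, "1,2,3\n4,5,6\n"), "data.csv", "w"),   # setup expression
    open(readlines, "data.csv"),                                # benchmark "core"
    rm("data.csv")                                              # teardown expression
)

r = Benchmarks.execute(parse_file_benchmark!)
```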
A guiding design principle for the Benchmarks package is that it should measure the function's performance as it would typically perform in real-world usage, and do so as consistently as possible. Repeatedly calling a function many thousands of times in a benchmarking loop, however, is not very typical. If we are not careful in the design, LLVM may be able to perform optimizations that interact with the benchmarking loop or remove it entirely.
In order to prevent unwanted optimizations, the `@benchmark` macro wraps the function call in an inner `@noinline` function. By preventing it from inlining into the benchmarking loop, we force it to always make a function call. This places a lower bound on the measurable time: one function call, even if the benchmarked function does nothing and would otherwise take no time at all:
```
julia> noop() = nothing
@benchmark noop()

================ Benchmark Results ========================
Time per evaluation: 3.04 ns [3.04 ns, 3.05 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
Memory allocated: 0.00 bytes
Number of allocations: 0 allocations
Number of samples: 11701
Number of evaluations: 139311001
R² of OLS model: 0.997
Time spent benchmarking: 0.53 s
```
Note that the `noop()` function call still inlines as it normally would into the inner function; the above just represents the time it takes to execute one iteration of a loop with just one function call.
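The shape of that wrapping trick can be sketched as follows; this is only an illustration of the idea, not the code the macro actually generates:

```julia
# Illustration of the idea, not the macro's actual generated code: the
# measured expression sits behind a @noinline boundary so that it does
# not inline into the timing loop, while whatever is inside the inner
# function can still inline as usual.
@noinline run_once() = sin(2.0)

function timing_loop(n)
    for _ in 1:n
        run_once()   # one non-inlined call per loop iteration
    end
end

timing_loop(1_000_000)
```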
Another difficulty is that Julia can sometimes perform constant propagation from a function's arguments to simplify the algorithms in its interior. As a simple example, let's look at `1+2`. The `@code_llvm` macro shows that it adds its arguments... but if we create a function where the arguments to `+` are constant, it will inline and remove the addition entirely!
```
julia> @code_llvm 1+2

define i64 @"julia_+_23345"(i64, i64) {
top:
  %2 = add i64 %1, %0
  ret i64 %2
}

julia> f() = 1+2
@code_llvm f()

define i64 @julia_f_23346() {
top:
  ret i64 3
}
```
In general, we don't know whether a user wants to consider their arguments as constants or not. In order to ensure that the behavior is always consistent, we prevent this constant propagation through the function boundary. This typically makes `@benchmark` behave as though it is executing the code displayed by `@code_llvm`/`@code_native`.
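One common way to see the difference such a boundary makes, sketched here purely for illustration (it is a general Julia trick, not a description of this package's internals), is to route the inputs through `Ref` cells so the compiler sees memory loads rather than literal constants:

```julia
# General Julia trick, shown only for illustration: because the contents
# of a Ref can change at any time, the compiler cannot fold x[] + y[]
# down to a literal 3 the way it folds 1 + 2 inside f() above.
const x = Ref(1)
const y = Ref(2)

g() = x[] + y[]

# @code_llvm g() now shows two loads and an add rather than `ret i64 3`.
```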