
Benchmarking harness for Julia #9456

Closed
eschnett opened this issue Dec 24, 2014 · 13 comments

Comments

@eschnett
Contributor

Julia as a language is not only about correctness; it is also about performance. Consequently, there should be a benchmarking harness in Base, equivalent to the testing harness Base.Test. This would allow several interesting things:

  • Performance regressions are noticed earlier
  • People can look at the benchmarks, and learn efficient coding styles
  • The benefit of certain changes (@inbounds, @fastmath) is immediately clear
  • Dangerous changes "for the sake of performance" (e.g. using @inbounds in Base) can be vetted; they would only be allowed if they actually show a performance benefit

Since performance varies by architecture, it is probably necessary to set up a few dedicated testing machines where the benchmarks can be run regularly. These machines would need to keep a history of benchmark results for comparison. I've seen http://speed.julialang.org/ which looks interesting -- I wonder if it could be set up to look at potentially thousands of small benchmark results.

I imagine that many Julia packages will in the future contain @bench statements in addition to @test statements.
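As a purely illustrative sketch of that vision (the @bench macro is the proposed, hypothetical interface from this issue, not anything that exists in Base), a package's test file might mix the two kinds of statements:

    using Base.Test

    # correctness: check the result once
    @test sort([3, 1, 2]) == [1, 2, 3]

    # performance: time the same kernel repeatedly and report statistics
    @bench sort(rand(1000))    # hypothetical macro proposed in this issue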

@johnmyleswhite
Member

+1 for this

@vtjnash
Member

vtjnash commented Dec 24, 2014

FWIW, make perf runs the benchmarks shown on the homepage (http://julialang.org/) and is often used to vet improvements. @time and @elapsed already exist in Base. I've seen others use (perhaps from a package?) a macro that computes statistics over timings. There are numerous julia-dev and julia-user posts that explain the pitfalls and methods of benchmarking.
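As a concrete illustration of that kind of timing-statistics macro, here is a minimal sketch built only on @elapsed; the helper name time_stats and the statistics it reports are made up for this example, and it is written in current Julia syntax:

    using Statistics   # mean/std lived in Base in 2014-era Julia; they are a stdlib today

    # Hypothetical helper: collect repeated @elapsed timings of a
    # zero-argument function and summarize them.
    function time_stats(f; samples::Int = 100)
        f()                                        # warm-up call so compilation is not timed
        times = [@elapsed f() for _ in 1:samples]
        (min = minimum(times), mean = mean(times), std = std(times))
    end

    time_stats(() -> sum(rand(1_000)))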

Without a concrete action, I'm going to recommend that this issue be closed as non-implementable / not an issue. Most functionality like this is probably best developed in a package anyway.

@IainNZ
Member

IainNZ commented Dec 24, 2014

@ViralBShah
Member

I prefer this being in a package, like the one above, where it can be actively developed.

@eschnett
Contributor Author

Functionality such as @time and time_ns can be used to implement a benchmark harness, similar to the way == and println can be used to implement a testing harness. However, there's quite a bit of distance between == and @test, where @test is so easy to use that writing tests is a breeze. I want the same for benchmarks.

I have a concrete suggestion that I was going to put either into a gist or into a package, but given that Benchmark already exists and is close to what I have in mind, I should maybe just try to implement it there. However, the end goal is not a package, but functionality in Base, so that it can be used by Base.

Here is an example of what I currently have in mind. This would benchmark integer addition.

@bench i->i+1

The output would look like:

Benchmark: i->i+1   time: 3.104e-08 s   variation: 1.473%

where "time" is the time to run each kernel, and "variation" is a measure of the timing inaccuracy.

I'm aware of the usual pitfalls of benchmarking, i.e. ensuring that the benchmark is not optimized away, not being drowned by looping overhead or timing noise, etc. I'm also aware of (some of?) the issues regarding defining such a variation -- I have an idea for how I would do it, but the community may decide differently.
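To make the proposal concrete, here is a rough, self-contained sketch of the kind of measurement loop that could sit behind such a macro (not a proposed Base implementation; the names bench, batch, and reps, and the particular definition of "variation" as the max-min spread relative to the mean, are made up for this example). It assumes the kernel takes an index argument, as in the example above:

    function bench(f; batch::Int = 10_000, reps::Int = 25)
        f(1)                                   # warm up so compilation is not measured
        per_call = Float64[]
        acc = 0                                # accumulate results so the loop is not optimized away
        for _ in 1:reps
            t = @elapsed for i in 1:batch
                acc += f(i)
            end
            push!(per_call, t / batch)
        end
        avg = sum(per_call) / length(per_call)
        spread = (maximum(per_call) - minimum(per_call)) / avg
        println("Benchmark   time: ", avg, " s   variation: ",
                round(100 * spread, digits = 2), "%   (checksum: ", acc, ")")
    end

    bench(i -> i + 1)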

@sjkelly
Contributor

sjkelly commented Dec 25, 2014

How would this define CSV output? I like how Benchmarks.jl lets me commit performance logs using DataFrames: https://github.com/twadleigh/Meshes.jl/blob/38fa5fd48d083fc832c9c121fd70e419d9115330/perf/data/binary_stl_load_topology.csv.
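For what it's worth, a minimal sketch of how a harness could append such a log with plain file I/O (the path, column names, and values below are made up; DataFrames, as in the linked file, would work just as well):

    function log_result(path::AbstractString, commit::AbstractString,
                        name::AbstractString, time_s::Float64, variation::Float64)
        write_header = !isfile(path)
        open(path, "a") do io
            write_header && println(io, "commit,benchmark,time_s,variation")
            println(io, commit, ",", name, ",", time_s, ",", variation)
        end
    end

    log_result("bench_log.csv", "abc1234", "binary_stl_load_topology", 3.104e-8, 0.01473)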

@vtjnash
Member

vtjnash commented Dec 25, 2014

Those aren't really the issues I was concerned about when I said "the pitfalls of benchmarking". Much of that has already been covered on the Julia mailing lists, however, so I won't repeat it here.

However, it's significant to note that while the example benchmark you gave above does measure many things (for example: function calls, compiler optimizations, language frameworks), the one thing it does not do is benchmark integer addition. There is far too much processor variability during execution of those statements to consider it a benchmark of "integer addition"; I would argue that such a measure cannot actually exist in the context of a CPU. Memory fetches, function call overhead, pipelining, and context switches are just a few of the issues with defining such a measurement. And even if you have the best-optimized code in the world, if you are using the wrong algorithm (or it doesn't pass a test suite), you lose (the tragedy of premature optimization).

@tkelman
Contributor

tkelman commented Dec 26, 2014

I feel benchmarks are only as valuable as the decisions/actions that people make based on them. We have a pretty extensive set of performance tests for Base already, but they don't get run or looked at very often lately, aside from the main numbers that are on the website. It would be incredibly valuable to bring back some continuous tracking like speed.julialang.org if it helps identify performance regressions or improvements more systematically. The new buildbot infrastructure has been working very well for the purposes of building binaries, but there might be too much variability due to the VM environment to get good performance data there.

@vtjnash vtjnash closed this as completed Dec 29, 2014
@mbauman
Member

mbauman commented Dec 29, 2014

I think there's definitely room for some cleanup in the perf scripts/makefiles. Simply having a script that takes two git commits/tree-ishes, automatically builds and runs the speed suite on each, and then outputs the differences in order of significance would be pretty straightforward and very useful. Heck, it could even plot them if a plotting package is installed. Seems like a good up-for-grabs/undergraduate project.
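A back-of-the-envelope sketch of what such a script might look like in Julia; the make invocations and the assumed "name,time" output format are illustrative guesses, not the actual layout of the perf suite:

    function compare_commits(repo::AbstractString, a::AbstractString, b::AbstractString)
        results = Dict{String, Dict{String, Float64}}()
        for commit in (a, b)
            run(`git -C $repo checkout $commit`)
            run(`make -C $repo`)                             # rebuild julia at this commit
            out = read(`make -C $repo/test/perf`, String)    # assumed to emit "name,time" lines
            results[commit] = Dict(String(split(l, ",")[1]) => parse(Float64, split(l, ",")[2])
                                   for l in split(out, '\n') if occursin(",", l))
        end
        common = intersect(keys(results[a]), keys(results[b]))
        ratios = [(name, results[b][name] / results[a][name]) for name in common]
        sort!(ratios, by = x -> abs(log(x[2])), rev = true)  # biggest changes first
        for (name, ratio) in ratios
            println(rpad(name, 40), round(ratio, digits = 3), "x")
        end
    end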

@eschnett
Contributor Author

eschnett commented Jan 6, 2015

@vtjnash Above, I was describing how the interface to a benchmarking harness should look like, not how it should be implemented. Of course one needs an optimizing compiler to avoid spurious overhead, needs to perform many operations to mitigate timing overhead, needs to ensure inlining to avoid function call overhead, etc. But that isn't the point here -- I chose "integer addition" only because that led to a small example. Replace it by "insert values into a dictionary", or "multiply two matrices", or "task creation overhead" if you like; those can definitely be benchmarked.
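To pick the dictionary example, here is a minimal, harness-free sketch of timing dictionary insertion with plain @elapsed; the point of the proposed @bench is precisely to wrap this kind of boilerplate:

    function fill_dict!(d::Dict{Int,Int}, n::Int)
        for i in 1:n
            d[i] = i
        end
        return d
    end

    fill_dict!(Dict{Int,Int}(), 10)                     # warm up / compile
    t = @elapsed fill_dict!(Dict{Int,Int}(), 1_000_000)
    println("inserting 1e6 Int keys: ", t, " s")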

I've seen some positive responses here, as well as pretty concrete suggestions about making benchmark results more easily available, e.g. "Seems like a good up-for-grabs/undergraduate project." Could you re-open this issue so that this doesn't get forgotten?

@jakebolewski jakebolewski reopened this Jan 6, 2015
@jakebolewski jakebolewski added the help wanted label Jan 6, 2015
@vtjnash vtjnash removed the help wanted label Jan 7, 2015
@vtjnash
Member

vtjnash commented Apr 29, 2015

I came across this paper on measurement bias and thought the various individuals interested in benchmarking in this thread might find it a good read:
http://www.inf.usi.ch/faculty/hauswirth/publications/asplos09.pdf
Also, I think the next time I hear someone claim that bounds checking caused an unacceptable 10% slowdown in their program, I'll ask if they've remembered to optimize their linker order.

Also, one of the other publications by the same group offers some evidence that we should be doing statistical profiling with a non-uniform timeout:
http://sape.inf.usi.ch/publications/pldi10
That might be a really good "up-for-grabs" project: experiment with Julia's sampling profiler (in profile.c) and determine whether it gives improved results with a non-uniform sampling interval.
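Not a patch to profile.c, but as a sketch of the idea: each sampling delay could be drawn from an exponential distribution around the nominal period, so samples cannot stay phase-locked with periodic behavior in the profiled program (0.001 s below is roughly the profiler's default period):

    # inverse-transform sampling: -mean * log(rand()) is exponentially
    # distributed with the given mean
    nonuniform_delays(nominal_s::Float64, n::Int) = [-nominal_s * log(rand()) for _ in 1:n]

    delays = nonuniform_delays(0.001, 5)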

@timholy
Member

timholy commented Apr 29, 2015

Nice set of links, @vtjnash!

@ScottPJones
Contributor

@vtjnash, yes, great link... I had to deal with that a lot on early Alpha chips (direct-mapped cache instead of an n-way associative cache); link order could make a huge difference in performance. Until we figured out what it was, we were going crazy trying to get consistent benchmark results.
