
Benchmarking harness for Julia #9456

Closed
eschnett opened this issue Dec 24, 2014 · 13 comments

Comments

@eschnett
Contributor

Julia as a language is not only about correctness; it is also about performance. Consequently, there should be a benchmarking harness in Base, equivalent to the testing harness Base.Test. This would allow several interesting things:

  • Performance regressions are noticed earlier
  • People can look at the benchmarks, and learn efficient coding styles
  • The benefit of certain changes (@inbounds, @fastmath) is immediately clear
  • Dangerous changes "for the sake of performance" (e.g. using @inbounds in Base) can be vetted; they would only be allowed if they actually show a performance benefit

Since performance varies by architecture, it is probably necessary to set up a few dedicated testing machines where the benchmarks can be run regularly. These machines would need to keep a history of benchmark results for comparison. I've seen http://speed.julialang.org/ which looks interesting -- I wonder if it could be set up to look at potentially thousands of small benchmark results.

I imagine that many Julia packages will in the future contain @bench statements in addition to @test statements.
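As a purely illustrative sketch of that vision (the @bench macro is the proposed, hypothetical interface from this issue, not anything that exists in Base), a package's test file might mix the two kinds of statements:

    using Base.Test

    # correctness: check the result once
    @test sort([3, 1, 2]) == [1, 2, 3]

    # performance: time the same kernel repeatedly and report statistics
    @bench sort(rand(1000))    # hypothetical macro proposed in this issue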

@johnmyleswhite
Member

+1 for this

@vtjnash
Member

vtjnash commented Dec 24, 2014

FWIW, make perf runs the benchmarks shown on the homepage (http://julialang.org/) and is often used to vet improvements. @time and @elapsed already exist in Base. I've seen others use (perhaps from a package?) a macro that computes statistics over timings. There are numerous julia-dev and julia-user posts that explain the pitfalls and methods of benchmarking.
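As a concrete illustration of that kind of timing-statistics macro, here is a minimal sketch built only on @elapsed; the helper name time_stats and the statistics it reports are made up for this example, and it is written in current Julia syntax:

    using Statistics   # mean/std lived in Base in 2014-era Julia; they are a stdlib today

    # Hypothetical helper: collect repeated @elapsed timings of a
    # zero-argument function and summarize them.
    function time_stats(f; samples::Int = 100)
        f()                                        # warm-up call so compilation is not timed
        times = [@elapsed f() for _ in 1:samples]
        (min = minimum(times), mean = mean(times), std = std(times))
    end

    time_stats(() -> sum(rand(1_000)))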

Without a concrete action, I'm going to recommend that this issue be closed as non-implementable / not an issue. Most functionality like this is probably best developed in a package anyway.

@IainNZ
Member

IainNZ commented Dec 24, 2014

@ViralBShah
Member

I prefer this being in a package, like the one above, where it can be actively developed.

@eschnett
Contributor Author

Functionality such as @time and time_ns can be used to implement a benchmark harness, similar to the way == and println can be used to implement a testing harness. However, there's quite a bit of distance between == and @test, where @test is so easy to use that writing tests is a breeze. I want the same for benchmarks.

I have a concrete suggestion that I was going to put either into a gist or into a package, but given that Benchmark already exists and is close to what I have in mind, I should maybe just try to implement it there. However, the end goal is not a package, but functionality in Base, so that it can be used by Base.

Here is an example of what I currently have in mind. This would benchmark integer addition.

@bench i->i+1

The output would look like:

Benchmark: i->i+1   time: 3.104e-08 s   variation: 1.473%

where "time" is the time to run each kernel, and "variation" is a measure of the timing inaccuracy.

I'm aware of the usual pitfalls of benchmarking, i.e. ensuring that the benchmark is not optimized away, not being drowned by looping overhead or timing noise, etc. I'm also aware of (some of?) the issues regarding defining such a variation -- I have an idea for how I would do it, but the community may decide differently.
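To make the proposal concrete, here is a rough, self-contained sketch of the kind of measurement loop that could sit behind such a macro (not a proposed Base implementation; the names bench, batch, and reps, and the particular definition of "variation" as the max-min spread relative to the mean, are made up for this example). It assumes the kernel takes an index argument, as in the example above:

    function bench(f; batch::Int = 10_000, reps::Int = 25)
        f(1)                                   # warm up so compilation is not measured
        per_call = Float64[]
        acc = 0                                # accumulate results so the loop is not optimized away
        for _ in 1:reps
            t = @elapsed for i in 1:batch
                acc += f(i)
            end
            push!(per_call, t / batch)
        end
        avg = sum(per_call) / length(per_call)
        spread = (maximum(per_call) - minimum(per_call)) / avg
        println("Benchmark   time: ", avg, " s   variation: ",
                round(100 * spread, digits = 2), "%   (checksum: ", acc, ")")
    end

    bench(i -> i + 1)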

@sjkelly
Contributor

sjkelly commented Dec 25, 2014

How would this define CSV output? I like how Benchmarks.jl lets me commit performance logs using DataFrames: https://github.com/twadleigh/Meshes.jl/blob/38fa5fd48d083fc832c9c121fd70e419d9115330/perf/data/binary_stl_load_topology.csv.
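For what it's worth, a minimal sketch of how a harness could append such a log with plain file I/O (the path, column names, and values below are made up; DataFrames, as in the linked file, would work just as well):

    function log_result(path::AbstractString, commit::AbstractString,
                        name::AbstractString, time_s::Float64, variation::Float64)
        write_header = !isfile(path)
        open(path, "a") do io
            write_header && println(io, "commit,benchmark,time_s,variation")
            println(io, commit, ",", name, ",", time_s, ",", variation)
        end
    end

    log_result("bench_log.csv", "abc1234", "binary_stl_load_topology", 3.104e-8, 0.01473)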

@vtjnash
Member

vtjnash commented Dec 25, 2014

Those aren't really the issues I was concerned about when I said "the pitfalls of benchmarking". Much of that has already been covered on the Julia mailing lists, however, so I won't repeat it here.

However, it's significant to note that while the example benchmark you gave above does measure many things (for example: function calls, compiler optimizations, language frameworks), the one thing it does not do is benchmark integer addition. There is far too much processor variability during execution of those statements to consider it a benchmark of "integer addition"; I would argue that such a measure cannot actually exist in the context of a CPU. Memory fetches, function call overhead, pipelining, and context switches are just a few of the issues with defining such a measurement. And even if you have the best-optimized code in the world, if you are using the wrong algorithm (or it doesn't pass a test suite), you lose (the tragedy of premature optimization).

@tkelman
Contributor

tkelman commented Dec 26, 2014

I feel benchmarks are only as valuable as the decisions/actions that people make based on them. We have a pretty extensive set of performance tests for Base already, but they don't get run or looked at very often lately, aside from the main numbers that are on the website. It would be incredibly valuable to bring back some continuous tracking like speed.julialang.org if it helps identify performance regressions or improvements more systematically. The new buildbot infrastructure has been working very well for the purposes of building binaries, but there might be too much variability due to the VM environment to get good performance data there.

@vtjnash vtjnash closed this as completed Dec 29, 2014
@mbauman
Member

mbauman commented Dec 29, 2014

I think there's definitely room for some cleanup in the perf scripts/makefiles. Simply having a script that takes two git commits/tree-ishes, automatically builds and runs the speed suite on each, and then outputs the differences in order of significance would be pretty straightforward and very useful. Heck, it could even plot them if a plotting package is installed. Seems like a good up-for-grabs/undergraduate project.
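A back-of-the-envelope sketch of what such a script might look like in Julia; the make invocations and the assumed "name,time" output format are illustrative guesses, not the actual layout of the perf suite:

    function compare_commits(repo::AbstractString, a::AbstractString, b::AbstractString)
        results = Dict{String, Dict{String, Float64}}()
        for commit in (a, b)
            run(`git -C $repo checkout $commit`)
            run(`make -C $repo`)                             # rebuild julia at this commit
            out = read(`make -C $repo/test/perf`, String)    # assumed to emit "name,time" lines
            results[commit] = Dict(String(split(l, ",")[1]) => parse(Float64, split(l, ",")[2])
                                   for l in split(out, '\n') if occursin(",", l))
        end
        common = intersect(keys(results[a]), keys(results[b]))
        ratios = [(name, results[b][name] / results[a][name]) for name in common]
        sort!(ratios, by = x -> abs(log(x[2])), rev = true)  # biggest changes first
        for (name, ratio) in ratios
            println(rpad(name, 40), round(ratio, digits = 3), "x")
        end
    end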

@eschnett
Contributor Author

eschnett commented Jan 6, 2015

@vtjnash Above, I was describing how the interface to a benchmarking harness should look like, not how it should be implemented. Of course one needs an optimizing compiler to avoid spurious overhead, needs to perform many operations to mitigate timing overhead, needs to ensure inlining to avoid function call overhead, etc. But that isn't the point here -- I chose "integer addition" only because that led to a small example. Replace it by "insert values into a dictionary", or "multiply two matrices", or "task creation overhead" if you like; those can definitely be benchmarked.
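To pick the dictionary example, here is a minimal, harness-free sketch of timing dictionary insertion with plain @elapsed; the point of the proposed @bench is precisely to wrap this kind of boilerplate:

    function fill_dict!(d::Dict{Int,Int}, n::Int)
        for i in 1:n
            d[i] = i
        end
        return d
    end

    fill_dict!(Dict{Int,Int}(), 10)                     # warm up / compile
    t = @elapsed fill_dict!(Dict{Int,Int}(), 1_000_000)
    println("inserting 1e6 Int keys: ", t, " s")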

I've seen some positive responses here, as well as pretty concrete suggestions about making benchmark results more easily available, e.g. "Seems like a good up-for-grabs/undergraduate project." Could you re-open this issue so that this doesn't get forgotten?

@jakebolewski jakebolewski reopened this Jan 6, 2015
@jakebolewski jakebolewski added the help wanted label Jan 6, 2015
@vtjnash vtjnash removed the help wanted label Jan 7, 2015
@vtjnash
Member

vtjnash commented Apr 29, 2015

I came across this paper on measurement bias and thought the various individuals interested in benchmarking in this thread might find it a good read:
http://www.inf.usi.ch/faculty/hauswirth/publications/asplos09.pdf
Also, I think the next time I hear someone claim that bounds checking caused an unacceptable 10% slowdown in their program, I'll ask if they've remembered to optimize their linker order.

Also, one of the other publications by the same group offers some evidence that we should be doing statistical profiling with a non-uniform timeout:
http://sape.inf.usi.ch/publications/pldi10
That might be a really good "up-for-grabs" project: experiment with Julia's sampling profiler (in profile.c) and determine whether it gives improved results with a non-uniform sampling interval.
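Not a patch to profile.c, but as a sketch of the idea: each sampling delay could be drawn from an exponential distribution around the nominal period, so samples cannot stay phase-locked with periodic behavior in the profiled program (0.001 s below is roughly the profiler's default period):

    # inverse-transform sampling: -mean * log(rand()) is exponentially
    # distributed with the given mean
    nonuniform_delays(nominal_s::Float64, n::Int) = [-nominal_s * log(rand()) for _ in 1:n]

    delays = nonuniform_delays(0.001, 5)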

@timholy
Member

timholy commented Apr 29, 2015

Nice set of links, @vtjnash!

@ScottPJones
Contributor

@vtjnash, yes, great link... I had to deal with that a lot on early Alpha chips (direct-mapped cache instead of an n-way associative cache); link order could make a huge difference in performance. Until we figured out what it was, we were going crazy trying to get consistent benchmark results.
