CI for benchmarks online #10

lukego · 2016-10-09T14:40:15Z

This repo is cool! I am really happy to have a test suite. This seems great for people who want to maintain their own branches and keep track of how they compare with everybody else's. Like, have I broken something? Have my optimizations worked? Has somebody else made some optimizations that I should merge? etc. Just now I would like to maintain a branch called lowlevel to soak up things like intrinsics and DynASM Lua-mode so this is right on target for me.

I whipped up a Continuous Integration job to help. The CI downloads the latest code for some well-known branches, runs the benchmark suite 100 times for each branch, and reports the results. This updates automatically when any of the branches change (including the benchmark definitions).

The reason I run the benchmarks 100 times is to support tests that use randomness to exercise non-determinism in the JIT, like roulette (#9). Repeated tests mean that we can quantify how consistent the benchmark results are between runs, and once we have a metric for consistency then it is more straightforward to optimize (see LuaJIT/LuaJIT#218).
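As a rough illustration of a consistency metric, here is a minimal Python sketch of relative standard deviation (RSD) per branch and benchmark. It assumes the repeated runs are collected as CSV rows with hypothetical columns `branch`, `benchmark` and `seconds`. It is not the report's actual code. The hypothetical function name is `rsd_by_benchmark`.

```python
import csv
import io
import statistics
from collections import defaultdict

def rsd_by_benchmark(csv_text):
    """Return {(branch, benchmark): RSD in percent} for groups with 2+ runs.

    Assumes hypothetical CSV columns 'branch', 'benchmark' and 'seconds'.
    """
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        runs[(row["branch"], row["benchmark"])].append(float(row["seconds"]))
    return {
        key: 100.0 * statistics.stdev(times) / statistics.mean(times)
        for key, times in runs.items()
        if len(times) >= 2  # sample stdev needs at least two runs
    }
```

A lower RSD means more consistent run times. A benchmark like roulette, which deliberately injects randomness, should show up with a visibly higher RSD than deterministic ones.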

The branches I am testing now are master, v2.1, agentzh-v2.1, corsix/x64, and lukego/lowlevel. If anybody would like a branch added (or removed) just drop me a comment here. Currently the benchmark definitions are coming from my fork because I wanted to include roulette to check that variation is measured correctly.

[Screenshot of the first graph]

Links:

Current results at the time of writing.
Permalink to the latest results.
CI Jobset page where all builds can be found (and also related files like the raw CSV).
Definition of the test runner written in Nix + shell.
Rmarkdown source for the report.

Hope somebody else finds this useful, too! Feedback & pull requests welcome. I plan to keep this operational.


corsix · 2016-10-09T15:28:45Z

corsix/x64 was effectively merged into v2.1, so I don't expect to be making any more commits to it. corsix/newgc on the other hand...

lukego · 2016-10-09T17:51:24Z

@corsix Roger. I updated the config to test newgc instead of x64. The results will automatically go up on the permalink above.

lukego · 2016-10-09T19:40:24Z

Is it hopelessly naive to simply run the benchmarks by evaluating them with no arguments? https://github.com/lukego/LuaJIT-branch-tests/blob/5043523d6cb59d35e7ecf5ee51f2253ab75d8675/default.nix#L57. I suppose that I should at least save the output to check if they are really working. Some execute very quickly.

@corsix do you need any special build options for newgc?

MikePall · 2016-10-12T05:19:36Z

@lukego Maybe you missed those bench/PARAM* files that contain the N arguments to each benchmark? Scale as appropriate to give a run time of a couple seconds each. No point in running these more than a dozen times.

Consider verifying the checksum of the benchmark output against known-good checksums for each N. E.g. generated with plain Lua or the C equivalents of the tests (you really need this for larger N).
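A minimal Python sketch of the output check suggested above. The `known` table of reference checksums keyed by (benchmark, N) is hypothetical: it would be filled in from runs with plain Lua or the C equivalents. SHA-256 is an arbitrary choice here, since no particular hash was specified.

```python
import hashlib
import subprocess

def output_checksum(output: bytes) -> str:
    """SHA-256 hex digest of a benchmark's stdout."""
    return hashlib.sha256(output).hexdigest()

def capture_output(cmd):
    """Run a benchmark command and return its stdout as bytes."""
    return subprocess.run(cmd, check=True, capture_output=True).stdout

def verify(name, n, output, known):
    """True/False if a reference checksum exists for (name, n), else None."""
    expected = known.get((name, n))
    if expected is None:
        return None  # no reference: cannot say whether the output is correct
    return output_checksum(output) == expected
```

Returning `None` rather than `False` keeps "no reference yet" distinct from "wrong output" in the report.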

Note that mandelbrot suffers from numerical instability and may give different results, depending on fused vs. unfused FP arithmetic on some platforms (JIT-compiled, i.e. fused, is actually more accurate). And partialsums depends on the accuracy of a couple of math library functions, which isn't very good on some platforms.

lukego · 2016-10-12T07:30:01Z

@MikePall Aha! Thanks for pointing out bench/PARAM*. Just the thing.

For me it is important to run tests 100+ times and to seed them with entropy. While we have issues like LuaJIT/LuaJIT#218 to contend with I think that benchmark results need to be interpreted as probability distributions rather than scalar values.

(The non-determinism is perhaps more important to me than to others. In the Snabb context we absolutely cannot have a situation where you deploy 100 routers and expect 5 of them to have half the capacity of the others. People are currently using lousy workarounds like detecting system overload and calling jit.flush() to roll the dice on a new trace. I need to find a proper solution to this & the CI has to show me improvements and regressions in how dependable performance is in the presence of workload entropy.)
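One way to turn "5 out of 100 routers at half capacity" into a number that a CI can track is sketched below in Python. It reports the fraction of runs whose throughput falls below some ratio of the median run. This is an assumed metric chosen for illustration, not one the report actually computes. The name `slow_fraction` and the default ratio of 0.5 are hypothetical.

```python
import statistics

def slow_fraction(throughputs, ratio=0.5):
    """Fraction of runs whose throughput is below `ratio` times the median run."""
    if not throughputs:
        raise ValueError("no runs")
    threshold = ratio * statistics.median(throughputs)
    return sum(1 for t in throughputs if t < threshold) / len(throughputs)
```

Using the median as the reference keeps a handful of unlucky traces from dragging the baseline down. A regression in dependability would show up as this fraction rising between commits, even when the mean barely moves.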

lukego · 2016-10-17T09:32:33Z

I have updated the CI to run from PARAM_x86_CI.txt from my branchmarks branch. This is closely based on PARAM_x86_CI.txt but I removed a couple that seemed to fail or hang.

The results permalink is the same. Hopefully the report is beginning to be meaningful. Now each benchmark takes between 0.1s and 10s, which is hopefully a reasonable range for getting stable and meaningful results.

I have pulled the iteration count down to 12 from 100. The Relative Standard Deviation graph probably needs to be taken with a grain of salt. I will revisit this when time permits. (Just now I am running all the iterations in a bash loop which ties up a test server continuously. I should make each run into a separate Nix derivation so that the CI will schedule them intelligently e.g. parallelize across more servers and interleave with other CI tasks instead of blocking them.)
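To make "a grain of salt" concrete, the Python sketch below estimates how uncertain an RSD computed from only 12 runs is, using a percentile bootstrap. The bootstrap is a swapped-in technique chosen for illustration, not something the report is known to do. The function names, the resample count and the fixed seed are assumptions.

```python
import random
import statistics

def rsd(times):
    """Relative standard deviation in percent."""
    return 100.0 * statistics.stdev(times) / statistics.mean(times)

def bootstrap_rsd_interval(times, resamples=2000, confidence=0.95, seed=1):
    """Percentile-bootstrap interval for the RSD of `times`."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    estimates = sorted(
        rsd([rng.choice(times) for _ in times]) for _ in range(resamples)
    )
    lo = round((1 - confidence) / 2 * (resamples - 1))
    hi = round((1 + confidence) / 2 * (resamples - 1))
    return estimates[lo], estimates[hi]
```

With a dozen samples the interval is typically wide, which is one way to decide whether a change in the RSD graph is worth investigating.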

One notable difference, judging by eye, is that the report is no longer flagging corsix/newgc as slower on the binary-trees benchmark. Previously this benchmark was only running for around 0.001 seconds, so the difference may well have been due to some tiny constant factor.

SameeraDes · 2018-12-24T12:20:26Z

I am trying to run the benchmarks in a continuous integration job for the AArch64 port, which is in v2.1. Is there a central CI system to which the AArch64 tests could be added, or do I need to set up a completely new CI job?

nico-abram · 2019-01-20T05:56:09Z

@lukego
https://hydra.snabb.co/build/3807227 errors with "Aborted: cannot connect to ‘root@murren-1.snabb.co’: ssh: connect to host murren-1.snabb.co port 22: Connection timed out (propagated from build 3807225)".
This build (https://hydra.snabb.co/build/3803719) seems to be the most recent passing one.

lukego · 2019-01-22T13:16:30Z

@nico-abram ah yes! The compute hosts running these LuaJIT benchmarks have recently been retired. I didn't think of this job because I haven't seen much activity here over the past few years and don't know how much interest there is.

If you want to run the benchmarks locally and generate the report you can use the instructions in the RaptorJIT README that I hope will work with standard LuaJIT too. I'm happy to advise if someone wants to troubleshoot a local setup or run a new CI.

If someone wants to sponsor running and updating a benchmark CI for LuaJIT then I'm also happy to help with that in my professional capacity at Snabb Solutions.

P.S. Here are some of the other ways that I put these tests to use while exploring the contribution of individual optimizations to overall performance:

Validating the HOTCOUNT table: raptorjit/raptorjit#56
Validating LuaJIT optimizations: raptorjit/raptorjit#46
Validating LuaJIT micro-optimizations: raptorjit/raptorjit#48

That last one turned up a potentially important micro-optimization:

md5 benchmark: 15% speedup by removing "slow LEA" (Validating LuaJIT micro-optimizations, raptorjit/raptorjit#48)

Surprisingly interesting to take simple benchmarks and use them to make systematic experiments!

lukego · 2019-01-22T13:33:00Z

@SameeraDes Good question. This CI is based on Nix and Nix seems to support ARM these days. So it should be possible to add an ARM server onto the backend, but I don't know how much hassle to expect. The sticky-tape solution could also be for random machines to post results to Git repos in plain text, and for this CI to download those and build/publish the reports.
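The "sticky-tape" option could look roughly like this Python sketch, which merges plain-text result files collected from several machines into one CSV for the report. The per-line format `branch benchmark seconds`, the host names and the function name `merge_results` are all hypothetical. Fetching the files from the Git repos is left out.

```python
import csv
import io

def merge_results(files):
    """Merge per-host plain-text results into one CSV string.

    `files` maps a host name to the text of its results file; each
    non-blank line is assumed to be 'branch benchmark seconds'.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["host", "branch", "benchmark", "seconds"])
    for host in sorted(files):  # stable order keeps diffs of the CSV small
        for line in files[host].splitlines():
            fields = line.split()
            if not fields:
                continue
            branch, benchmark, seconds = fields
            writer.writerow([host, branch, benchmark, float(seconds)])
    return out.getvalue()
```

Keeping the host as a column lets the report compare architectures side by side without changing how each machine produces its results.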

I am meaning to migrate over to https://www.hercules-ci.com/ but haven't made time for that yet.

SameeraDes · 2019-02-04T08:46:45Z

Thanks for your response, @lukego.
I have added a Jenkins-based CI for ARM64 for now. It would be great to have a central CI for all LuaJIT performance runs; I am willing to contribute on the ARM64 port side.

siddhesh · 2019-03-05T06:17:23Z

@lukego we have set up a CI loop for LuaJIT on the Linaro CI to run tests on commits to v2.1 on arm64:

https://ci.linaro.org/job/luajit-aarch64-perf/

We'll be happy to add an x86_64 node to it if you have one, or add an x86_64 node ourselves.

As for other architectures, please feel free to ping me either on this issue or personally to have more nodes added to the trigger. At some point we also need to figure out a place to report the results.

lukego · 2019-03-05T09:23:32Z

@siddhesh Cool!

I am running a CI for RaptorJIT and related projects that sometimes covers LuaJIT too. I don't have spare machines to contribute to other CIs like yours though so please go ahead with your own.

lukego mentioned this issue on Oct 9, 2016: Added bench/roulette.lua #9

This was referenced on Oct 11, 2016: Testing framework #3; Add Lua 5.1/5.2 test suites #8
