CI for benchmarks online #10
Is it hopelessly naive to simply run the benchmarks by evaluating them with no arguments? https://github.com/lukego/LuaJIT-branch-tests/blob/5043523d6cb59d35e7ecf5ee51f2253ab75d8675/default.nix#L57. I suppose that I should at least save the output to check if they are really working. Some execute very quickly. @corsix do you need any special build options for newgc?
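For what it's worth, a rough Lua sketch of the "evaluate with no arguments and save the output" approach; this is only an illustration of the idea, not the Nix-based CI itself, and the benchmark file names are placeholders:

```lua
-- Rough sketch (not the actual CI): run each benchmark with no arguments,
-- save its output, and record whether it exited cleanly.
-- The benchmark file names below are placeholders.
local benchmarks = { "array3d.lua", "euler14-bit.lua", "md5.lua" }

for _, file in ipairs(benchmarks) do
  local pipe = assert(io.popen("luajit " .. file .. " 2>&1"))
  local output = pipe:read("*a")
  local ok = pipe:close()   -- return value differs between Lua versions; truthy on success in LuaJIT
  local log = assert(io.open(file .. ".out", "w"))
  log:write(output)
  log:close()
  print(string.format("%-20s ok=%s output=%d bytes", file, tostring(ok), #output))
end
```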
@lukego Maybe you missed the PARAM_*.txt files? Consider verifying the checksum of the benchmark output against known good checksums for each N, e.g. generated with plain Lua or the C equivalents of the tests (you really need this for larger N).
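To make the checksum idea concrete, here is a small Lua sketch. The hash function is a simple stand-in and the reference values are made up, so treat it as an illustration of the shape rather than the real verification scheme:

```lua
-- Sketch of checksum verification: hash the benchmark's textual output for a
-- given N and compare it against a known-good value recorded in advance
-- (e.g. produced by plain Lua or a C equivalent of the test).
-- The hash function and the expected values here are placeholders.
local function checksum(s)
  local h = 0
  for i = 1, #s do
    h = (h * 31 + s:byte(i)) % 2^32   -- simple stand-in for a real hash
  end
  return h
end

-- Hypothetical reference table: benchmark name -> N -> known-good checksum.
local expected = {
  array3d = { [300] = 0x1a2b3c4d },   -- made-up value for illustration
}

local function verify(name, n, output)
  local want = expected[name] and expected[name][n]
  if not want then
    return true, "no reference checksum recorded"
  end
  return checksum(output) == want
end
```

In practice the reference values would be recorded once from a trusted run (plain Lua or the C version) and a mismatch would fail the CI job.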
@MikePall Aha! Thanks for pointing out the PARAM files. For me it is important to run the tests 100+ times and to seed them with entropy. While we have issues like LuaJIT/LuaJIT#218 to contend with, I think that benchmark results need to be interpreted as probability distributions rather than scalar values. (The non-determinism is perhaps more important to me than to others. In the Snabb context we absolutely cannot have a situation where you deploy 100 routers and expect 5 of them to have half the capacity of the others. People are currently using lousy workarounds like detecting system overload.)
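For illustration, a minimal Lua sketch of "seed with entropy and report a distribution"; the entropy source and the choice of summary statistics are assumptions for the example, not what the CI actually does:

```lua
-- Sketch only: seed the PRNG from /dev/urandom so that repeated runs explore
-- different random inputs, and summarise many timings as a distribution
-- (min/median/max) rather than a single scalar. The timing collection itself
-- is left out; `times` is assumed to be an array of seconds, one per run.
local function seed_from_urandom()
  local f = io.open("/dev/urandom", "rb")
  if f then
    local b1, b2, b3, b4 = f:read(4):byte(1, 4)
    f:close()
    math.randomseed(((b1 * 256 + b2) * 256 + b3) * 256 + b4)
  else
    math.randomseed(os.time())   -- coarse fallback if /dev/urandom is unavailable
  end
end

local function summarize(times)
  table.sort(times)
  local n = #times
  return times[1], times[math.ceil(n / 2)], times[n]   -- min, median, max
end

seed_from_urandom()
-- Usage sketch: local lo, med, hi = summarize(times)
```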
I have updated the CI to run with the parameters from PARAM_x86_CI.txt from my fork. The results permalink is the same. Hopefully the report is beginning to be meaningful. Now each benchmark takes between 0.1s and 10s, which is hopefully a reasonable range for getting stable and meaningful results. I have pulled the iteration count down to 12 from 100. The Relative Standard Deviation graph probably needs to be taken with a grain of salt; I will revisit this when time permits. (Just now I am running all the iterations in a bash loop which ties up a test server continuously. I should make each run into a separate Nix derivation so that the CI will schedule them intelligently, e.g. parallelize across more servers and interleave with other CI tasks instead of blocking them.) A notable difference by eyeball is that one benchmark is no longer being flagged by the report.
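For readers following along, a sketch of what "run from PARAM_x86_CI.txt" could look like in a driver. I have not verified the file's exact format, so the "one name/parameter pair per line" parsing below is an assumption, as is the `<name>.lua` naming convention:

```lua
-- Illustrative only: read "name parameter" pairs from a PARAM file and run
-- each benchmark with that argument. The line format and the <name>.lua
-- convention are assumptions about PARAM_x86_CI.txt, not checked facts.
local function read_params(path)
  local params = {}
  for line in io.lines(path) do
    local name, n = line:match("^%s*(%S+)%s+(%S+)")
    if name and not name:match("^#") then
      params[#params + 1] = { name = name, n = n }
    end
  end
  return params
end

for _, p in ipairs(read_params("PARAM_x86_CI.txt")) do
  os.execute(string.format("luajit %s.lua %s > %s.out 2>&1", p.name, p.n, p.name))
end
```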
I am trying to run the benchmarks in a continuous integration job for the AArch64 port, which is in v2.1. Is there a central CI system to which the AArch64 tests can be added, or do I need to set up a completely new CI job?
@lukego
@nico-abram ah yes! The compute hosts running these LuaJIT benchmarks have recently been retired. I didn't think of this job because I haven't seen much activity here over the past few years and don't know how much interest there is. If you want to run the benchmarks locally and generate the report, you can use the instructions in the RaptorJIT README, which I hope will work with standard LuaJIT too. I'm happy to advise if someone wants to troubleshoot a local setup or run a new CI. If someone wants to sponsor running and updating a benchmark CI for LuaJIT, then I'm also happy to help with that in my professional capacity at Snabb Solutions. P.S. Here are some of the other ways that I put these tests to use while exploring the contribution of individual optimizations to overall performance:
That last one turned up a potentially important micro-optimization:
Surprisingly interesting to take simple benchmarks and use them to make systematic experiments!
@SameeraDes Good question. This CI is based on Nix, and Nix seems to support ARM these days, so it should be possible to add an ARM server to the backend, but I don't know how much hassle to expect. The sticky-tape solution could also be for random machines to post results to Git repos in plain text and for this CI to download those and build/publish the reports. I am meaning to migrate over to https://www.hercules-ci.com/ but haven't made time for that yet.
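To make the "sticky-tape" idea a bit more concrete, here is a rough Lua sketch of a plain-text result format and a tiny aggregator. The format, field order, and paths are invented for illustration and are not an existing convention of this CI:

```lua
-- Sketch of a plain-text result format that any machine could append to and
-- push to a Git repo, plus a tiny aggregator the CI could run over the
-- collected files. The format and paths here are invented, not a real spec.

-- Writer side: one line per measurement, e.g.
--   <hostname> <branch> <benchmark> <seconds>
local function append_result(path, host, branch, bench, seconds)
  local f = assert(io.open(path, "a"))
  f:write(string.format("%s %s %s %.6f\n", host, branch, bench, seconds))
  f:close()
end

-- Reader side: group timings by branch/benchmark for reporting.
local function read_results(path)
  local grouped = {}
  for line in io.lines(path) do
    local host, branch, bench, secs = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+(%S+)$")
    if secs then
      local key = branch .. "/" .. bench
      grouped[key] = grouped[key] or {}
      table.insert(grouped[key], tonumber(secs))
    end
  end
  return grouped
end
```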
Thanks for your response, @lukego
@lukego we have set up a CI loop for LuaJIT on the Linaro CI to run tests on commits to v2.1 on arm64: https://ci.linaro.org/job/luajit-aarch64-perf/. We'll be happy to add an x86_64 node to it if you have one, or add an x86_64 node ourselves. As for other architectures, please feel free to ping me either on this issue or personally to have more nodes added to the trigger. At some point we also need to figure out a place to report the results.
@siddhesh Cool! I am running a CI for RaptorJIT and related projects that sometimes covers LuaJIT too. I don't have spare machines to contribute to other CIs like yours, though, so please go ahead with your own.
This repo is cool! I am really happy to have a test suite. This seems great for people who want to maintain their own branches and keep track of how they compare with everybody else's. Like, have I broken something? Have my optimizations worked? Has somebody else made some optimizations that I should merge? etc. Just now I would like to maintain a branch called `lowlevel` to soak up things like intrinsics and DynASM Lua-mode, so this is right on target for me.

I whipped up a Continuous Integration job to help. The CI downloads the latest code for some well-known branches, runs the benchmark suite 100 times for each branch, and reports the results. This updates automatically when any of the branches change (including the benchmark definitions).
The reason I run the benchmarks 100 times is to support tests that use randomness to exercise non-determinism in the JIT, like `roulette` (#9). Repeated tests mean that we can quantify how consistent the benchmark results are between runs, and once we have a metric for consistency it is more straightforward to optimize (see LuaJIT/LuaJIT#218).
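One possible consistency metric is the relative standard deviation of the repeated timings. A minimal sketch, assuming the per-run timings for one benchmark on one branch are already collected in a Lua array:

```lua
-- Sketch of a consistency metric: relative standard deviation (stddev/mean)
-- of repeated timings. `times` is assumed to be an array of seconds with at
-- least two entries (e.g. the 100 runs mentioned above).
local function relative_stddev(times)
  local n, sum = #times, 0
  for _, t in ipairs(times) do sum = sum + t end
  local mean = sum / n
  local var = 0
  for _, t in ipairs(times) do var = var + (t - mean) ^ 2 end
  var = var / (n - 1)                 -- sample variance
  return math.sqrt(var) / mean        -- e.g. 0.05 means roughly 5% spread
end
```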
The branches I am testing now are `master`, `v2.1`, `agentzh-v2.1`, `corsix/x64`, and `lukego/lowlevel`. If anybody would like a branch added (or removed), just drop me a comment here. Currently the benchmark definitions are coming from my fork because I wanted to include `roulette` to check that variation is measured correctly.

Screenshot of the first graph (click to zoom):
and links:
Hope somebody else finds this useful, too! Feedback & pull requests welcome. I plan to keep this operational.