Benchmarking and mimalloc #67

Open
mjp41 opened this issue Jul 9, 2019 · 8 comments
Comments

mjp41 (Member) commented Jul 9, 2019

Based on mimalloc's design, we have managed to dramatically improve the performance of snmalloc.

We should

  • Rerun our benchmarking infrastructure to compare
    • snmalloc submitted to ISMM
    • snmalloc current master
    • mimalloc
  • Fully document the ideas we adopted from the mimalloc paper, and how they impacted performance.

@daanx, @davidchisnall, @plietar, @sylvanc

mjp41 (Member, Author) commented Jul 9, 2019

Although the designs of mimalloc and snmalloc were very similar, I think we had a different motivation for the per-slab free lists. In snmalloc it was primarily about fragmentation rather than performance. The mimalloc paper demonstrated that this could also lead to amazing performance.

Things we have taken from the mimalloc paper/design so far (rough sketches of some of these ideas follow the list):

  • Use a local free list per allocator per size class. Optimise the fast path for alloc and dealloc #62
    We moved the per-slab free list into the allocator state, so that it allocates from this first.
    Several bits were improved to make this fast path better:
    • no longer using 16-bit slab-relative addressing
    • putting the free-list structure at the start of the allocator to improve code generation
    • lots of refactoring to make the code sequence small, with studious inline/noinline hints.
  • Bump allocate a whole 4KiB worth of allocations into a free list, so we hit the free-list path more regularly. mimalloc does a whole 64KiB in one go, and this strategy comes directly from that. Optimise the fast path for alloc and dealloc #62
  • Table lookup for sizeclass: although the code for calculating the sizeclass was already efficient, the codegen for a table lookup is much better. Optimise the fast path for alloc and dealloc #62
  • TLS init: mimalloc has a great trick of having a default allocator that will always go to the slow path; on that slow path, we can check whether we are using the default allocator and initialise a real one if we are. We adopted that in Lazy tls #64.
  • Use a queue of slabs with free lists. In snmalloc, we initially used a stack of slabs, which was aimed at reducing fragmentation. However, for some benchmarks/applications this meant we would leave the fast path very regularly. mimalloc aims to pick slabs that it hasn't looked at recently, to grab larger free lists and hence stay on the fast path. This change did increase snmalloc's memory usage a little, but not considerably. Use a queue of slabs for free lists #65
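
For anyone reading along, here is a minimal sketch of how the per-sizeclass free list, the sizeclass table lookup, and the 4KiB bump-fill fit together on the fast path. This is not snmalloc's actual code: the type names, the sizeclass table, and the constants are all made up for illustration.

```cpp
// Illustrative sketch only: per-sizeclass free-list fast path with a 4 KiB bump-fill.
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct FreeObject
{
  FreeObject* next;
};

constexpr size_t NUM_SIZECLASSES = 8;
constexpr size_t REFILL_BYTES = 4096; // refill a whole 4 KiB worth at once

// Hypothetical table mapping sizeclass index -> object size.
constexpr size_t sizeclass_to_size[NUM_SIZECLASSES] = {
  16, 32, 48, 64, 96, 128, 192, 256};

// Hypothetical table lookup for sizeclass (sizes assumed to be 1..256):
// one load on the hot path instead of a chain of shifts and branches.
constexpr uint8_t size_to_sizeclass_table[16] = {
  0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7};

inline size_t size_to_sizeclass(size_t size)
{
  return size_to_sizeclass_table[(size - 1) / 16];
}

// Per-allocator (here: per-thread) free list for each sizeclass.
thread_local FreeObject* free_list[NUM_SIZECLASSES];

// Slow path: grab a fresh 4 KiB region and thread every object in it onto the
// free list, so subsequent allocations of this sizeclass stay on the fast path.
__attribute__((noinline)) void* refill_and_alloc(size_t sc)
{
  size_t size = sizeclass_to_size[sc];
  char* region = static_cast<char*>(std::aligned_alloc(REFILL_BYTES, REFILL_BYTES));
  if (region == nullptr)
    return nullptr;
  FreeObject* head = nullptr;
  for (size_t offset = size; offset + size <= REFILL_BYTES; offset += size)
  {
    auto* obj = reinterpret_cast<FreeObject*>(region + offset);
    obj->next = head;
    head = obj;
  }
  free_list[sc] = head;
  return region; // the first object of the region is returned directly
}

// Fast path for allocation: a table lookup and a single pointer pop.
inline void* alloc_small(size_t size)
{
  size_t sc = size_to_sizeclass(size);
  FreeObject* head = free_list[sc];
  if (head != nullptr)
  {
    free_list[sc] = head->next;
    return head;
  }
  return refill_and_alloc(sc);
}

// Fast path for deallocation: push back onto the local list.
inline void dealloc_small(void* p, size_t size)
{
  auto* obj = static_cast<FreeObject*>(p);
  size_t sc = size_to_sizeclass(size);
  obj->next = free_list[sc];
  free_list[sc] = obj;
}
```

The noinline hint on the refill path mirrors the "studious inline/noinline hints" mentioned above: it keeps the cold refill code out of the hot allocation sequence.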
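
And a sketch of the lazy TLS initialisation trick from Lazy tls #64 (again illustrative, not the real code): the thread-local allocator pointer starts out pointing at a global default allocator whose free lists are permanently empty, so the first allocation on a thread naturally falls through to the slow path, where the check and the real initialisation happen.

```cpp
// Illustrative sketch of lazy TLS initialisation: the fast path has no
// "is this thread initialised?" branch, because the default allocator's free
// list is always empty and therefore always falls through to the slow path.
#include <cstddef>
#include <cstdlib>

struct Allocator
{
  void* free_list = nullptr; // permanently empty in the default allocator

  void* alloc_slow(size_t size);

  void* alloc(size_t size)
  {
    // Fast path: no initialisation check at all.
    if (free_list != nullptr)
    {
      void* result = free_list;
      free_list = *static_cast<void**>(result);
      return result;
    }
    return alloc_slow(size);
  }
};

// Shared default allocator; safe to use on any thread because its fast path
// can never succeed.
Allocator default_allocator;
thread_local Allocator* current_allocator = &default_allocator;

void* Allocator::alloc_slow(size_t size)
{
  if (this == &default_allocator)
  {
    // First allocation on this thread: build a real allocator and retry.
    current_allocator = new Allocator(); // placeholder for real initialisation
    return current_allocator->alloc(size);
  }
  // Real slow path: refill free lists, fetch memory from the OS, etc.
  return std::malloc(size); // placeholder refill
}
```

A call site just does `current_allocator->alloc(size)`; only the very first allocation on each thread pays the initialisation cost.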
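
Finally, a rough sketch of the queue-of-slabs policy from #65 (`std::deque` and the member names are placeholders, not snmalloc's real data structures): slabs that regain free space go onto the back of a per-sizeclass queue, and refills take from the front, i.e. the slab the allocator has not looked at for the longest, which has had the most time to accumulate a large free list.

```cpp
// Rough sketch of a queue of slabs with free lists (placeholder types).
#include <cstddef>
#include <deque>

struct Slab
{
  void* free_list = nullptr; // objects that have been freed back to this slab
  size_t free_count = 0;
};

struct SizeclassSlabs
{
  std::deque<Slab*> pending; // slabs with free space, oldest at the front

  // Called when a previously-full slab gains a free object.
  void on_slab_has_space(Slab* slab)
  {
    pending.push_back(slab);
  }

  // Refill policy: take the least recently seen slab. A stack would hand back
  // the slab we just touched (often with only one or two free objects), while
  // the queue tends to hand back a slab with a large free list, keeping the
  // allocator on the fast path for longer.
  Slab* next_slab_for_refill()
  {
    if (pending.empty())
      return nullptr;
    Slab* slab = pending.front();
    pending.pop_front();
    return slab;
  }
};
```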

SchrodingerZhu (Contributor) commented:

I am thinking about creating a benchmark website and updating the results periodically (like some of the performance-analysis websites for web frameworks).
The mimalloc-bench toolset may be a good starting point; I can add some visualization tools for it and then create a Docker image that contains all the allocators. Hopefully, I can also add the data-visualization module directly into a page generator so that I can generate the report easily.

daanx commented Apr 2, 2020

Hi @SchrodingerZhu, if you are planning to do that, I do think mimalloc-bench is a great starting point (and I can give you some JavaScript + LaTeX that I use to generate nice graphs; perhaps a good initial project would be to take those scripts and extend them to automate the generation of the graphs?).

ps:
I don't want to be discouraging, but I would also say that I myself would be quite hesitant about such a live allocator benchmark -- it may well be detrimental in the end to allocator performance in general.

Let me expand: the problem with allocators is that they need to deal with a wide range of situations and applications, but most benchmarks as they are now test only particular aspects, and the results must be interpreted with much care; you can't just say one allocator is better because it is faster on some benchmarks. For example, for mimalloc the reduced latency (worst-case allocation times) is very important, but none of the benchmarks test that (now). Similarly, hoard has many hard theoretical bounds on edge cases, which makes it sometimes a bit slower in the general case but great if you happen to be in that edge case; mesh does heap compacting for long-running processes, which is not reflected; snmalloc cares a lot about moving objects between threads; etc. There is a real risk that putting out an "easy" set of benchmarks makes people over-optimize for the specific mini benchmarks to the detriment of "real" applications, whose behavior is generally more difficult to benchmark. Again, for mimalloc, for example, some companies use it with huge-page memory for large services with 100's of gigs of memory, and I saw some other allocators stumble on that, but it is very hard to benchmark these kinds of situations more generally; just even running such services is very difficult.

(Now, of course, I am already guilty myself by putting mimalloc-bench out there, but I tend to see it more as a handy toolkit and starting point for allocator evaluation. That's why I always try to put a lot of discussion and warnings about interpreting the results around the graphs.)

Anyway, just my 2c.

mjp41 (Member, Author) commented Apr 2, 2020

@SchrodingerZhu I think it would be great to have something like this. But I also share Daan's (@daanx) worry.

Personally, I use the micro-benchmarks for finding problems rather than justifying superiority. I have been using mimalloc-bench for the last two weeks in preparation for a release of snmalloc. I found four places where performance was significantly different from mimalloc's. When I studied these, I found issues in the code or the benchmark, leading to several of the recent PRs.

I would never want to do a parameter sweep to micro-optimise against this benchmark suite, but if you can justify a change independently of the benchmark, then I think it is great to let the benchmarks find issues. You should also check why you are faster, not just the cases where you are slower.

Perhaps, to help address Daan's concerns, the areas an allocator targets should have specific benchmarks built for them once we have a live site doing this. Working out how to demonstrate latency in a new benchmark would be really interesting. My change #158 that improves xmalloc-testN may harm latency, but perhaps not enough to notice.
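
For what it's worth, a first cut at a latency-focused micro-benchmark could be as simple as recording per-operation times and reporting the tail rather than the mean. The sketch below is only an illustration; the sizes, iteration counts, and workload are arbitrary placeholders, not a proposed benchmark design.

```cpp
// Rough sketch: record per-allocation latency and report tail percentiles.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
  constexpr size_t iterations = 1'000'000;
  std::vector<double> samples;
  samples.reserve(iterations);

  for (size_t i = 0; i < iterations; i++)
  {
    size_t size = 16 + (i % 64) * 16; // placeholder size mix
    auto start = std::chrono::steady_clock::now();
    void* p = std::malloc(size);
    std::free(p);
    auto end = std::chrono::steady_clock::now();
    samples.push_back(std::chrono::duration<double, std::nano>(end - start).count());
  }

  // Sort once and read off the median, a tail percentile, and the worst case.
  std::sort(samples.begin(), samples.end());
  std::printf("p50    %8.0f ns\n", samples[iterations / 2]);
  std::printf("p99.9  %8.0f ns\n", samples[static_cast<size_t>(iterations * 0.999)]);
  std::printf("worst  %8.0f ns\n", samples.back());
}
```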

Also, allocators can at least be switched in real applications, so that can then address some of the larger-scale benchmarking.

I think something like the TechEmpower benchmarks has been hugely influential in making web servers faster.

SchrodingerZhu (Contributor) commented Apr 5, 2020

Work in progress at: https://github.com/SchrodingerZhu/bench_suite
I haven't found a good example for Agda yet (though checking IO takes several seconds).

Though a ranking of general performance can be useful, I agree that it is unfair for some allocators. For instance, hardened_malloc performs slower than others on my machine, but performance is not its goal; in the Rust ecosystem there is something similar: wee_alloc is designed to reduce the wasm target size, but the performance cost is obvious. Even so, having a benchmark has some benefits: it can at least help users check performance and gives them a reference for making their choices.

SchrodingerZhu (Contributor) commented:

Current output can be found at: https://github.com/SchrodingerZhu/bench_suite/blob/master/output/index.md.
Notice that these results were generated on my laptop and hence may be inaccurate.

@mjp41 It seems that snmalloc has a higher peak memory usage in some cases.

mjp41 (Member, Author) commented Apr 7, 2020

Does the machine you run on have transparent huge pages enabled?

You can check with

cat /sys/kernel/mm/transparent_hugepage/enabled

If it is set to always, I have found snmalloc can have very high memory usage, as the way we request memory makes the system decide to use huge pages, and then we may not use much of them.

You can disable it with:

echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled 

Not sure what the correct setting is to have.

SchrodingerZhu (Contributor) commented:

Some simple Rust-related benchmarks have been added. snmalloc and mimalloc perform quite well.
