Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve slow path performance for allocation #143

Merged
merged 39 commits into from
Mar 31, 2020

Conversation

mjp41
Copy link
Member

@mjp41 mjp41 commented Mar 16, 2020

  • Make separate bump allocator ptr per thread and size class.

    • Now a slab in the per-class list is guaranteed to have at least one free allocation, not including
      bump allocation.
    • This enables more separation between how we are trying to find memory
  • Refactor code to

    • Separate checks and slow paths to favour tail calls
    • Generally make all slow paths tail calls to improve codegen

The general form of allocation is now

  1. Local free list for this allocator
  2. Grab next free list for this allocator
  3. Bump allocate from local bump ptr
  4. Grab new slab to bump allocate from

There are special cases interleaved into this

  • Handle message queue
  • If the thread local allocator needs initialising.

The first occurs between 1 and 2, and the second occurs between, 3 and 4.

mjp41 added 26 commits March 16, 2020 11:06
Change remote to count down 0, so fast path does not need a constant.

Use signed value so that branch does not depend on addition.
The fast path of remote_dealloc is sufficiently compact that it can be
inlined.
Turn the internal structure into tail calls, to improve fast path.
Should be no algorithmic changes.
Break lazy initialisation into two functions, so it is easier to codegen
fast paths.
Make the backup path a bit faster.  Only algorithmic change is to delay
checking for first allocation. Otherwise, should be unchanged.
The fisrt operation a new thread takes is special.  It results in
allocating an allocator, and swinging it into the TLS.  This makes
this a very special path, that is rarely tested.  This test generates
a lot of threads to cover the first alloc and dealloc operations.
Large alloc stats aren't necessarily balanced on a thread, this changes
to tracking individual pushs and pops, rather than the net effect
(with an unsigned value).
Each allocator has a bump ptr for each size class.  This is no longer
slab local.

Slabs that haven't been fully allocated no longer need to be in the DLL
for this sizeclass.
This change reduces the branching in the case of finding a new free
list. Using a non-empty cyclic list enables branch free add, and a
single branch in remove to detect the empty case.
Use needs initialisation as makes more sense for other scenarios.
Copy link
Collaborator

@davidchisnall davidchisnall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine, comments need some small cleanups.

src/mem/alloc.h Outdated Show resolved Hide resolved
src/mem/alloc.h Outdated Show resolved Hide resolved
src/mem/threadalloc.h Outdated Show resolved Hide resolved
src/mem/alloc.h Outdated Show resolved Hide resolved
src/mem/alloc.h Outdated Show resolved Hide resolved
src/mem/alloc.h Outdated Show resolved Hide resolved
@mjp41
Copy link
Member Author

mjp41 commented Mar 16, 2020

Any idea why the pipeline "Microsoft.snmalloc" is not firing?

src/ds/cdllist.h Outdated Show resolved Hide resolved
src/ds/cdllist.h Outdated Show resolved Hide resolved
@mjp41
Copy link
Member Author

mjp41 commented Mar 16, 2020

This PR addresses #66

@SchrodingerZhu
Copy link
Contributor

A segfault during linking?

@mjp41
Copy link
Member Author

mjp41 commented Mar 20, 2020

@SchrodingerZhu this is in the self-host test. So we are LD_PRELOADing the snmalloc we just built, and it is crashing. This is one of the harder tests to debug, but also provides us a good piece of coverage of the code.

This change is pretty big, so I wanted to go above the standard CI level, hence why it has been hanging around for a week or so. I was going to run a lot of benchmarks through it, but haven't had the time yet. With the changes to CI I guess we have hit an issue independently.

The GlobalPlaceholder allocator is now a zero init block of memory.
This removes various issues for when things are initialised. It is made read-only
to we detect write to it on some platforms.
@mjp41 mjp41 merged commit d900e29 into microsoft:master Mar 31, 2020
@mjp41 mjp41 deleted the alloc_slow_optimise branch March 31, 2020 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants