Improve slow path performance for allocation #143

mjp41 · 2020-03-16T11:09:04Z

Make a separate bump allocator pointer per thread and size class.
- A slab in the per-class list is now guaranteed to have at least one free allocation, not counting
  bump allocation.
- This gives more separation between the different ways we try to find memory.

Refactor the code to
- Separate checks and slow paths to favour tail calls
- Generally make all slow paths tail calls to improve codegen

The general form of allocation is now

1. Local free list for this allocator
2. Grab next free list for this allocator
3. Bump allocate from local bump ptr
4. Grab new slab to bump allocate from

There are special cases interleaved into this

- Handle message queue
- If the thread local allocator needs initialising.

The first occurs between 1 and 2, and the second occurs between 3 and 4.
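
To make the tiering concrete, here is a rough sketch of the four steps as a chain of tail calls. It is not the code in this PR: `SketchAlloc`, `BumpRange`, the size class table and the slab size are all hypothetical, step 2 is a placeholder because the sketch keeps no per-slab free lists, and the message queue and initialisation special cases are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical constants, chosen only for illustration.
constexpr std::size_t NUM_SIZECLASSES = 4;
constexpr std::size_t SLAB_SIZE = 1 << 16;

struct FreeObject
{
  FreeObject* next;
};

struct BumpRange
{
  char* cur = nullptr;
  char* end = nullptr;
};

struct SketchAlloc
{
  FreeObject* free_list[NUM_SIZECLASSES] = {};
  BumpRange bump[NUM_SIZECLASSES] = {}; // one bump ptr per size class
  std::vector<std::unique_ptr<char[]>> slabs; // owns slab memory in this sketch

  static std::size_t size_of(std::size_t sc)
  {
    return std::size_t(16) << sc;
  }

  // Step 1: pop from the local free list; otherwise tail call the next tier.
  void* alloc(std::size_t sc)
  {
    assert(sc < NUM_SIZECLASSES);
    FreeObject* head = free_list[sc];
    if (head != nullptr)
    {
      free_list[sc] = head->next;
      return head;
    }
    return alloc_next_free_list(sc);
  }

  // Step 2: a real allocator would take a free list from a slab in the
  // per-class list here. This sketch has none, so it falls through.
  void* alloc_next_free_list(std::size_t sc)
  {
    return alloc_bump(sc);
  }

  // Step 3: bump allocate from the per-size-class bump range.
  void* alloc_bump(std::size_t sc)
  {
    BumpRange& b = bump[sc];
    std::size_t size = size_of(sc);
    if (static_cast<std::size_t>(b.end - b.cur) >= size)
    {
      void* p = b.cur;
      b.cur += size;
      return p;
    }
    return alloc_new_slab(sc);
  }

  // Step 4: get a fresh slab, then retry the bump step as a tail call.
  void* alloc_new_slab(std::size_t sc)
  {
    slabs.push_back(std::unique_ptr<char[]>(new char[SLAB_SIZE]));
    bump[sc].cur = slabs.back().get();
    bump[sc].end = bump[sc].cur + SLAB_SIZE;
    return alloc_bump(sc);
  }

  // Freed objects go onto the local free list for their size class.
  void dealloc(void* p, std::size_t sc)
  {
    free_list[sc] = new (p) FreeObject{free_list[sc]};
  }
};
```

Each tier does one cheap check and otherwise returns the result of the next tier, so the compiler is free to emit a jump rather than a call on the slow paths.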

Change remote to count down to 0, so the fast path does not need a constant. Use a signed value so that the branch does not depend on the addition.
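
A minimal sketch of the count-down idea, under assumptions: `RemoteCacheSketch`, `REMOTE_BUDGET` and the `posts` counter are made-up names, and the real remote cache also queues the freed object, which is omitted here. The fast path only tests the sign of the current value, so the comparison is against zero rather than a cache-size constant and does not wait on the arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical budget of bytes that may be cached before posting.
constexpr std::int64_t REMOTE_BUDGET = 1024;

struct RemoteCacheSketch
{
  std::int64_t capacity = REMOTE_BUDGET; // signed, counts down towards 0
  std::size_t posts = 0; // how often the slow path ran (illustration only)

  // Fast path: branch on the sign of the current value, then subtract.
  void dealloc(std::size_t size)
  {
    if (capacity > 0)
    {
      capacity -= static_cast<std::int64_t>(size);
      return;
    }
    dealloc_slow(size);
  }

  // Slow path: stand-in for posting the cached frees, then reset the budget
  // and account for the object that triggered the post.
  void dealloc_slow(std::size_t size)
  {
    assert(capacity <= 0);
    ++posts;
    capacity = REMOTE_BUDGET - static_cast<std::int64_t>(size);
  }
};
```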

The fast path of remote_dealloc is sufficiently compact that it can be inlined.

Turn the internal structure into tail calls, to improve fast path. Should be no algorithmic changes.

Break lazy initialisation into two functions, so it is easier to codegen fast paths.
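
Roughly, the split looks like the sketch below: a small inlinable check plus an out-of-line initialiser reached by a tail call. All names (`ThreadAllocSketch`, `tls_alloc`, `SKETCH_NOINLINE`) are hypothetical, and the null check is a simplification; this PR uses a placeholder allocator rather than a null pointer.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define SKETCH_NOINLINE __declspec(noinline)
#else
#  define SKETCH_NOINLINE
#endif

struct ThreadAllocSketch
{
  std::size_t allocs = 0;
};

// Per-thread pointer; null until the thread's first use.
thread_local ThreadAllocSketch* tls_alloc = nullptr;

// Slow half: runs once per thread and is kept out of line so it does not
// bloat the callers of the fast half.
SKETCH_NOINLINE ThreadAllocSketch* init_thread_alloc()
{
  assert(tls_alloc == nullptr);
  thread_local ThreadAllocSketch storage;
  tls_alloc = &storage;
  return tls_alloc;
}

// Fast half: one load and one branch, small enough to inline everywhere.
inline ThreadAllocSketch* get_thread_alloc()
{
  ThreadAllocSketch* a = tls_alloc;
  if (a != nullptr)
    return a;
  return init_thread_alloc();
}
```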

Make the backup path a bit faster. Only algorithmic change is to delay checking for first allocation. Otherwise, should be unchanged.

The first operation a new thread performs is special: it allocates an allocator and swings it into the TLS. This makes it a very special path that is rarely tested. This test spawns a lot of threads to cover the first alloc and dealloc operations.

Large alloc stats aren't necessarily balanced on a thread. This changes to tracking individual pushes and pops, rather than the net effect (with an unsigned value).
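
As an illustration of why the net value is a problem: if one thread pops a large chunk and a different thread later pushes it back, the second thread's net count would go below zero, which wraps with an unsigned type. Keeping two monotonic counters avoids that. The names below are hypothetical, not the real stats structures.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread counters; both only ever increase.
struct LargeStatsSketch
{
  std::size_t pushed = 0;
  std::size_t popped = 0;

  void on_push() { ++pushed; }
  void on_pop() { ++popped; }
};

// Sum the counters of several threads into one total.
inline LargeStatsSketch total(const LargeStatsSketch* threads, std::size_t n)
{
  LargeStatsSketch sum;
  for (std::size_t i = 0; i < n; i++)
  {
    sum.pushed += threads[i].pushed;
    sum.popped += threads[i].popped;
  }
  return sum;
}
```

Neither thread is balanced on its own, but the totals still line up once the per-thread counters are summed.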

Each allocator has a bump ptr for each size class. This is no longer slab local. Slabs that haven't been fully allocated no longer need to be in the DLL for this sizeclass.

This change reduces the branching in the case of finding a new free list. Using a non-empty cyclic list enables branch free add, and a single branch in remove to detect the empty case.
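
A sketch of the kind of list this describes, assuming a sentinel node that is always present and points at itself when the list is empty. `CDLLNodeSketch` and its methods are illustrative and may not match `src/ds/cdllist.h`.

```cpp
#include <cassert>

struct CDLLNodeSketch
{
  CDLLNodeSketch* next;
  CDLLNodeSketch* prev;

  // A fresh node is a one-element cycle, so the list is never "null".
  CDLLNodeSketch() : next(this), prev(this) {}

  bool is_empty() const
  {
    return next == this;
  }

  // Branch-free add: there is always a valid next and prev to link to.
  void insert(CDLLNodeSketch* n)
  {
    n->next = next;
    n->prev = this;
    next->prev = n;
    next = n;
  }

  // Unlinking a node is also branch free.
  void unlink()
  {
    next->prev = prev;
    prev->next = next;
  }

  // Removing the first element needs a single branch for the empty case.
  CDLLNodeSketch* pop()
  {
    CDLLNodeSketch* n = next;
    if (n == this)
      return nullptr;
    n->unlink();
    return n;
  }
};
```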

Use "needs initialisation", as it makes more sense for other scenarios.

davidchisnall

Code looks fine, comments need some small cleanups.

src/mem/alloc.h

src/mem/threadalloc.h

src/mem/alloc.h

mjp41 · 2020-03-16T13:10:34Z

Any idea why the pipeline "Microsoft.snmalloc" is not firing?

src/ds/cdllist.h


mjp41 · 2020-03-16T14:19:47Z

This PR addresses #66

SchrodingerZhu · 2020-03-20T04:08:24Z

A segfault during linking?

mjp41 · 2020-03-20T08:12:32Z

@SchrodingerZhu this is in the self-host test. So we are LD_PRELOADing the snmalloc we just built, and it is crashing. This is one of the harder tests to debug, but also provides us a good piece of coverage of the code.

This change is pretty big, so I wanted to go above the standard CI level, hence why it has been hanging around for a week or so. I was going to run a lot of benchmarks through it, but haven't had the time yet. With the changes to CI I guess we have hit an issue independently.


mjp41 added 26 commits March 16, 2020 11:06

Remote dealloc refactor.

1e454cc

Clang format

91b7e08

Clang format again.

ffb7b82

Improve remote dealloc

78f40d4

Change remote to count down to 0, so the fast path does not need a constant. Use a signed value so that the branch does not depend on the addition.

CR feedback.

a22e438

Clang format.

0274b71

Inline remote_dealloc

d080654

The fast path of remote_dealloc is sufficiently compact that it can be inlined.

Improve fast path in Slab::alloc

bac6336

Turn the internal structure into tail calls, to improve fast path. Should be no algorithmic changes.

Refactor initialisation to help fast path.

a52aca6

Break lazy initialisation into two functions, so it is easier to codegen fast paths.

Fixup

bd8c443

Minor tidy to statically sized dealloc.

267a726

Refactor semi-slow path for alloc

52c0ff0

Make the backup path a bit faster. Only algorithmic change is to delay checking for first allocation. Otherwise, should be unchanged.

Correctly handle reusing get_noncachable

a1d139c

Fix large alloc stats

f9e0f64

Large alloc stats aren't necessarily balanced on a thread. This changes to tracking individual pushes and pops, rather than the net effect (with an unsigned value).

Fix TLS init on large alloc path

37d7e15

Fixup slab refactor

075874e

Minor refactor.

68b49df

Minor refactor

f8b77a8

Add Bump ptrs to allocator

d656232

Each allocator has a bump ptr for each size class. This is no longer slab local. Slabs that haven't been fully allocated no longer need to be in the DLL for this sizeclass.

Bug fix

54dcb20

Change to a cycle non-empty list

bd19484

This change reduces the branching in the case of finding a new free list. Using a non-empty cyclic list enables branch free add, and a single branch in remove to detect the empty case.

Comments.

ed69bbb

Update differences

06032f2

Rename first allocation

941e28a

Use "needs initialisation", as it makes more sense for other scenarios.

Fixup for thread alloc.

8a8a2f6

mjp41 force-pushed the alloc_slow_optimise branch from f098f39 to 8a8a2f6 Compare March 16, 2020 11:11

davidchisnall approved these changes Mar 16, 2020

src/mem/alloc.h

src/mem/alloc.h

src/mem/threadalloc.h

Clangformat + CR feedback

2215815

davidchisnall approved these changes Mar 16, 2020

src/mem/alloc.h

src/mem/alloc.h

src/mem/alloc.h

More CR

a857b92

davidchisnall reviewed Mar 16, 2020

src/ds/cdllist.h

src/ds/cdllist.h

mjp41 added 5 commits March 16, 2020 13:13

Revert "More CR"

841314e

This reverts commit a857b92.

CR attempt two.

baff3ef

Fix assert

6bf7115

Bug fix found by CI.

193e27a

Clang tidy.

6d60feb

Merge branch 'master' into alloc_slow_optimise

04e74c4

mjp41 added 2 commits March 25, 2020 10:05

Use a ptrdiff to help with zero init.

0c40c84

Make GlobalPlaceholder zero init

65bb8c1

The GlobalPlaceholder allocator is now a zero-init block of memory. This removes various issues around when things are initialised. It is made read-only so that we detect writes to it on some platforms.

mjp41 force-pushed the alloc_slow_optimise branch from f2da4ff to 65bb8c1 Compare March 25, 2020 11:38

mjp41 added 3 commits March 25, 2020 15:34

Comment.

50486c0

Merge remote-tracking branch 'origin/master' into alloc_slow_optimise

4fd24db

Clang format.

4b19611

mjp41 merged commit d900e29 into microsoft:master Mar 31, 2020

mjp41 deleted the alloc_slow_optimise branch March 31, 2020 08:18
