
memset consumes a large amount of time during startup for applications that use mmap #21620

Open · kg opened this issue Mar 26, 2024 · 25 comments

@kg commented Mar 26, 2024

During application startup, operations like mmap will perform a memset to properly zero pages before returning them to the calling application:

memset(ptr, 0, alloc_len);

If bulk memory is enabled, this is fine since that should (if the JS/wasm runtime is doing the right thing) just invoke a native implementation of the bulk fill operation. EDITED: While bulk memory memset is faster, memset is still a massive bottleneck in this scenario.

If bulk memory isn't enabled, what appears to happen is that, since we're starting up, it's quite likely that memset and memcpy operations will run in an interpreter instead of in fully jitted WASM code, as in this profile (note: the elapsed time looks worse than it is, because I profiled multiple app starts in a loop):

[profiler screenshot: memset/memcpy hotspots in interpreted wasm]
Presumably if other parts of your startup take long enough, tiered compilation in your browser of choice will have completed by the point you start calling mmap, and this won't happen. For us, it consistently all happens before that point.

I know in some cases wasm will just always run in an interpreter - e.g. lockdown modes, or iOS when its jitcode memory block is full. In those cases this could be pretty impactful for the entirety of an app's runtime, but I would expect bulk memory to fix it there too.

sbc100 mentioned that bulk memory should be the default soon, which might make this issue no longer relevant. Just figured I'd bring it up in case it seemed worthwhile to make a 1-2 line change to the emscripten libc to, for example, always use emscripten_memset_js (which doesn't exist right now, I guess) in operations like mmap where it could matter.

I'll also note that during startup lots of this memory is already pre-zeroed, since it came from sbrk at the bottom of the stack, and it looks like in some cases it also comes from mi_heap_malloc_zero under the covers. So in those scenarios there's no point in doing the memset at all - but flowing that information all the way up the call stack into mmap isn't an easy ask, so I'm not surprised that it's not happening.
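To illustrate the observation above, here is a host-side toy (not emscripten's actual code; all names and sizes are invented) of the idea that memory newly obtained from sbrk / memory.grow is already zero, so an allocator tracking a "never handed out" watermark can skip the memset for fresh regions:

```c
#include <stddef.h>
#include <string.h>

/* The static arena stands in for the zero-initialized wasm heap. */
#define ARENA_SIZE 65536
static unsigned char arena[ARENA_SIZE];  /* zero-filled, like fresh wasm pages */
static size_t brk_pos = 0;               /* current "program break" */
static size_t watermark = 0;             /* everything at/above this is untouched zeros */

static void *fake_sbrk(size_t len) {
    if (brk_pos + len > ARENA_SIZE) return NULL;
    void *p = &arena[brk_pos];
    brk_pos += len;
    return p;
}

/* Hand back zeroed memory, memsetting only when the region may be dirty. */
void *alloc_zeroed(size_t len, int *did_memset) {
    size_t start = brk_pos;
    void *p = fake_sbrk(len);
    if (!p) return NULL;
    if (start >= watermark) {
        *did_memset = 0;                 /* fresh region: already zero */
    } else {
        memset(p, 0, len);               /* possibly recycled region: must clear */
        *did_memset = 1;
    }
    if (brk_pos > watermark) watermark = brk_pos;
    return p;
}

/* Simulate recycling: rewind the break without forgetting that the
 * region below the watermark may now contain garbage. */
void recycle_to(size_t pos) { if (pos < brk_pos) brk_pos = pos; }
```

The point of the sketch is only that the "is this region fresh?" check is cheap, while the memset it avoids is not.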

In general (in part due to the fact that memset is running in the interpreter instead of native code) memset and memcpy are a surprisingly large % time slice of our application startup :)

@sbc100 (Collaborator) commented Mar 26, 2024

First of all I will just note that for mmap of a file we already end up using the mmapAlloc JS function, which calls zeroMemory, which in turn calls HEAP8.fill - so I guess we are only talking about MAP_ANONYMOUS mappings here?

It's true that bulk memory should be enabled by default soon, but you can also enable it today. Is it possible to enable it for all your users, or do you not have control over the flags they use?

Having said that, I don't see why we wouldn't have an emscripten_memset_js just like we have an emscripten_memcpy_js. Would you be able to send a PR to add that?

Finally, if you are seeing a lot of calls to mmap(MAP_ANONYMOUS), it might be worth replacing them with a simple malloc instead, since in emscripten MAP_ANONYMOUS is simply fake and is strictly worse than just calling malloc + memset. Does your codebase have a fallback for when MAP_ANONYMOUS is not available?
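A sketch of what that replacement could look like in a codebase that already funnels anonymous mappings through its own wrapper (the wrapper names here are invented, not from the thread):

```c
#include <stdlib.h>
#include <string.h>

/* On emscripten, mmap(MAP_ANONYMOUS) is emulated on top of malloc
 * anyway, so a wasm-specific wrapper can call malloc directly and
 * skip the mmap bookkeeping entirely. */

void *my_valloc(size_t len) {
    void *p = malloc(len);
    if (p) memset(p, 0, len);   /* preserve mmap's zero-fill guarantee */
    return p;
}

void my_vfree(void *p, size_t len) {
    (void)len;                  /* munmap needs a length; free does not */
    free(p);
}
```

Note this keeps the explicit memset, so it only removes the mapping-list overhead; letting the allocator skip the clear is a separate step discussed later in the thread.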

@kg (Author) commented Mar 26, 2024

> First of all I will just note that for mmap of a file we already end up using the mmapAlloc JS function, which calls zeroMemory, which in turn calls HEAP8.fill - so I guess we are only talking about MAP_ANONYMOUS mappings here?

Sounds like it's just MAP_ANONYMOUS, yeah.

> It's true that bulk memory should be enabled by default soon, but you can also enable it today. Is it possible to enable it for all your users, or do you not have control over the flags they use?

I am hoping we'll be able to turn it on for everyone, but we've had issues turning on uncontroversial (to me) flags so far, so I don't know when we can make it the default. We still support disabling both WASM SIMD and WASM EH for users who have old Android or iOS devices, so bulk memory potentially adds a third configuration flag once we turn it on by default. When a developer flips one or both of those off, we re-link our whole runtime using emscripten on the developer's machine to apply the appropriate flags and link in the appropriate bits.

> Having said that, I don't see why we wouldn't have an emscripten_memset_js just like we have an emscripten_memcpy_js. Would you be able to send a PR to add that?

Yeah, I can put a PR together if it's wanted. I'll try to remember to get to that this week.

> Finally, if you are seeing a lot of calls to mmap(MAP_ANONYMOUS), it might be worth replacing them with a simple malloc instead, since in emscripten MAP_ANONYMOUS is simply fake and is strictly worse than just calling malloc + memset. Does your codebase have a fallback for when MAP_ANONYMOUS is not available?

Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset.)

@sbc100 (Collaborator) commented Mar 26, 2024

> Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset.)

So in your case the caller of mmap was not relying on the resulting pages actually being zero, and when you replace with malloc you don't need to also add memset?

@kg (Author) commented Mar 26, 2024

> Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset.)

> So in your case the caller of mmap was not relying on the resulting pages actually being zero, and when you replace with malloc you don't need to also add memset?

In these cases, the caller is relying on them being zero, but they're new "pages" that were allocated by sbrk/pre-allocated with initial heap size, and are already zero.

@sbc100 (Collaborator) commented Mar 26, 2024

> Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset.)

> So in your case the caller of mmap was not relying on the resulting pages actually being zero, and when you replace with malloc you don't need to also add memset?

> In these cases, the caller is relying on them being zero, but they're new "pages" that were allocated by sbrk/pre-allocated with initial heap size, and are already zero.

In that case it sounds pretty risky to skip the memset after the malloc, since malloc makes no guarantees.

I wonder if we could improve our fake mmap(MAP_ANONYMOUS) such that it could know through some internal mechanism that it can sometimes skip the memset? Maybe unnecessary once we have emscripten_memset_js.

@sbc100 (Collaborator) commented Mar 26, 2024

BTW I am adding an underscore prefix to the internal memset/memcpy helpers: #21622

@kg (Author) commented Apr 12, 2024

Incidentally, I did some local measurements of memset specifically and it looks like -Oz -mbulk-memory in latest emcc is about 33x faster than regular -Oz.

@kg (Author) commented Apr 30, 2024

> Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset.)

> So in your case the caller of mmap was not relying on the resulting pages actually being zero, and when you replace with malloc you don't need to also add memset?

> In these cases, the caller is relying on them being zero, but they're new "pages" that were allocated by sbrk/pre-allocated with initial heap size, and are already zero.

> In that case it sounds pretty risky to skip the memset after the malloc, since malloc makes no guarantees.

> I wonder if we could improve our fake mmap(MAP_ANONYMOUS) such that it could know through some internal mechanism that it can sometimes skip the memset? Maybe unnecessary once we have emscripten_memset_js.

Just to provide an update: I was able to build our stack using 3.1.56 with -mbulk-memory, and memset performance is definitely improved. However, mmap's memset is still the top application startup hotspot - bulk memory moves it from 31% of wasm CPU samples to 28%, and reduces the number of wasm CPU samples during my startup benchmark by 15206.

I think bulk memory is more than satisfactory for making these memory operations as fast as they can be, and it would probably pay off for many applications to find a way to either make the fake mmap able to bypass memset of new pages, or provide a good alternative to fake mmap. I'm not sure what the latter would look like, but I'm open to doing surgery on our own code instead of doing surgery on emscripten - whatever is going to deliver improvements for the wider community.

When I looked through the relevant code in emscripten it seemed theoretically feasible to flow 'these are pre-zeroed pages' up through the stack to mmap, but it didn't seem trivial to do without making some really ugly changes to musl.

[profiler screenshot: mmap's memset still the top startup hotspot with bulk memory enabled]

@kg (Author) commented May 1, 2024

I wrote a simple prototype page allocator to test this out, and almost all of the memset time during startup went away; wall-clock time and sample counts went down too. Doing this in user space doesn't feel straightforward, though, because it's not possible to request whole aligned pages from sbrk - and you would need to replace malloc/free with your own allocator in order to make it use your custom page allocator instead of emscripten libc's.

So I feel like emscripten libc might be the place to do this, but it would require some extensive surgery on musl unless I'm missing something. Will start trying to figure out what that diff looks like.
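For illustration, here is a toy version of such a page allocator (not the actual prototype; names and sizes are invented): free pages stay tagged as zeroed or dirty, and only dirty pages pay for a memset on allocation.

```c
#include <string.h>

/* Toy model of a page allocator that remembers which free pages are
 * still pristine (zeroed) so it can skip memset on those. */

#define PAGE_SIZE 4096
#define NUM_PAGES 8

static unsigned char pool[NUM_PAGES][PAGE_SIZE];  /* zero-initialized at startup */
enum page_state { PAGE_ZEROED, PAGE_DIRTY, PAGE_IN_USE };
static enum page_state state[NUM_PAGES];          /* all PAGE_ZEROED initially */

/* Allocate one zeroed page, preferring pages that need no memset. */
void *page_alloc(void) {
    int dirty = -1;
    for (int i = 0; i < NUM_PAGES; i++) {
        if (state[i] == PAGE_ZEROED) {            /* best case: free and clean */
            state[i] = PAGE_IN_USE;
            return pool[i];
        }
        if (state[i] == PAGE_DIRTY && dirty < 0) dirty = i;
    }
    if (dirty < 0) return NULL;                   /* out of pages */
    memset(pool[dirty], 0, PAGE_SIZE);            /* only dirty pages pay for memset */
    state[dirty] = PAGE_IN_USE;
    return pool[dirty];
}

/* Freed pages are assumed dirty; a background sweep could re-zero them. */
void page_free(void *p) {
    int i = (int)(((unsigned char (*)[PAGE_SIZE])p) - pool);
    state[i] = PAGE_DIRTY;
}
```

During startup, when most pages are still in their initial zeroed state, almost every allocation takes the memset-free path.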

@kg kg changed the title It could help to always use JS for memsets during startup / that are always large memset consumes a large amount of time during startup for applications that use mmap May 1, 2024
@sbc100 (Collaborator) commented May 1, 2024

If you know that memory doesn't need to be released and recycled then I think you can use sbrk() in parallel with malloc's use of sbrk(). You may need to over-allocate by PAGE_SIZE-1, but if the allocations are large that should be OK. Then you know you have zeroed pages.
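The over-allocate-and-align arithmetic can be sketched like this (using malloc as a stand-in for sbrk; the helper name is invented):

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 65536u   /* wasm page size */

/* Request len + PAGE_SIZE - 1 bytes and round the result up to the
 * next page boundary; the aligned region is guaranteed to fit inside
 * the over-allocated block. */
void *alloc_pages_aligned(size_t len, void **raw_out) {
    void *raw = malloc(len + PAGE_SIZE - 1);   /* sbrk() in the real scenario */
    if (!raw) return NULL;
    uintptr_t aligned = ((uintptr_t)raw + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
    *raw_out = raw;                            /* keep the raw pointer around for freeing */
    return (void *)aligned;
}
```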

BTW, does the higher level system (i.e. the caller of mono_valloc) depend on zeroed pages? Is mini using valloc_aligned to then run its own allocator on top? If so then maybe it's already doing zeroing of the smaller chunks it's allocating within these pages?

@kg (Author) commented May 1, 2024

> If you know that memory doesn't need to be released and recycled then I think you can use sbrk() in parallel with malloc's use of sbrk(). You may need to over-allocate by PAGE_SIZE-1, but if the allocations are large that should be OK. Then you know you have zeroed pages.

> BTW, does the higher level system (i.e. the caller of mono_valloc) depend on zeroed pages? Is mini using valloc_aligned to then run its own allocator on top? If so then maybe it's already doing zeroing of the smaller chunks it's allocating within these pages?

The consumers sadly do want zeroed pages. My prototype just tracks whether pages came from sbrk (and are already zeroed), in which case it knows they are 'zeroed free pages' instead of 'free pages' and skips performing memset() on the caller's behalf. Obviously if a caller doesn't want zeroed pages you can skip memset then too, but I haven't even bothered to wire that up yet, since I just wanted to measure the impact of getting rid of the memset in the mmap path.

One specific example here is that during startup our GC has to allocate memory for its card table. That's a big block of RAM that has to be zeroed, and can be serviced safely by sbrk without memset.

@sbc100 (Collaborator) commented May 1, 2024

> If you know that memory doesn't need to be released and recycled then I think you can use sbrk() in parallel with malloc's use of sbrk(). You may need to over-allocate by PAGE_SIZE-1, but if the allocations are large that should be OK. Then you know you have zeroed pages.
> BTW, does the higher level system (i.e. the caller of mono_valloc) depend on zeroed pages? Is mini using valloc_aligned to then run its own allocator on top? If so then maybe it's already doing zeroing of the smaller chunks it's allocating within these pages?

> The consumers sadly do want zeroed pages. My prototype just tracks whether pages came from sbrk (and are already zeroed), in which case it knows they are 'zeroed free pages' instead of 'free pages' and skips performing memset() on the caller's behalf. Obviously if a caller doesn't want zeroed pages you can skip memset then too, but I haven't even bothered to wire that up yet, since I just wanted to measure the impact of getting rid of the memset in the mmap path.

> One specific example here is that during startup our GC has to allocate memory for its card table. That's a big block of RAM that has to be zeroed, and can be serviced safely by sbrk without memset.

I wonder if you could do even better and just make that region static?

@kg (Author) commented May 1, 2024

> If you know that memory doesn't need to be released and recycled then I think you can use sbrk() in parallel with malloc's use of sbrk(). You may need to over-allocate by PAGE_SIZE-1, but if the allocations are large that should be OK. Then you know you have zeroed pages.
> BTW, does the higher level system (i.e. the caller of mono_valloc) depend on zeroed pages? Is mini using valloc_aligned to then run its own allocator on top? If so then maybe it's already doing zeroing of the smaller chunks it's allocating within these pages?

> The consumers sadly do want zeroed pages. My prototype just tracks whether pages came from sbrk (and are already zeroed), in which case it knows they are 'zeroed free pages' instead of 'free pages' and skips performing memset() on the caller's behalf. Obviously if a caller doesn't want zeroed pages you can skip memset then too, but I haven't even bothered to wire that up yet, since I just wanted to measure the impact of getting rid of the memset in the mmap path.
> One specific example here is that during startup our GC has to allocate memory for its card table. That's a big block of RAM that has to be zeroed, and can be serviced safely by sbrk without memset.

> I wonder if you could do even better and just make that region static?

For the specific scenario of optimizing our startup we can do things like pre-allocate fixed regions in our address space by setting up the wasm binary just so, etc. But the card table example feels somewhat relevant since if your heap grows, you end up potentially needing to allocate more card tables. Or maybe you have per-thread arenas, so you need to allocate one per thread. (We do also allocate our per-thread stacks via wrapped mmap right now, so that's a spot where our startup gets dinged if we have multithreading active - and those don't actually need to be zeroed to begin with.)

I think there are a lot of cases where userspace can hand-optimize this stuff, and maybe that makes it not worthwhile to do anything here. But there might be value in some sort of 'give me N (zeroed | don't care) 64KB pages, please' primitive that can be used by application code, or an optimization for the existing mmap to skip the memset.

We do make use of munmap, so I had to implement a page table and such for the proof of concept to work. My instinct is that anonymous mmap is probably out there in the wild and people unmap those allocations, but maybe most of those cases are served fine by malloc/free, which don't zero? The "you can safely unmap any address without crashing, and you can unmap a subset of your mapping" parts of the mmap/munmap promise are hard to provide with malloc/free, though :(

@kg (Author) commented May 1, 2024

Looking at dlmalloc's implementation of calloc, it might be able to skip memset in this scenario, so we could switch mmap to use calloc with an element size of 1 and then try to make sure the optimization works:

[screenshot: dlmalloc's calloc implementation, including the calloc_must_clear check]
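The proposed switch is small; a sketch (the helper name is invented, and this assumes the anonymous-mmap path can route its allocation through one place):

```c
#include <stdlib.h>

/* Route the anonymous-mmap allocation through calloc so the allocator
 * itself decides whether clearing is needed, instead of the mmap shim
 * doing an unconditional memset. */
void *anon_map_alloc(size_t len) {
    /* calloc(1, len) == zeroed allocation of len bytes; an allocator
     * that can prove the chunk came from fresh zero-filled memory may
     * skip the clear entirely. */
    return calloc(1, len);
}
```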

@sbc100 (Collaborator) commented May 1, 2024

> Looking at dlmalloc's implementation of calloc, it might be able to skip memset in this scenario, so we could switch mmap to use calloc with an element size of 1 and then try to make sure the optimization works:

Sounds great!

@kg (Author) commented May 1, 2024

> Looking at dlmalloc's implementation of calloc, it might be able to skip memset in this scenario, so we could switch mmap to use calloc with an element size of 1 and then try to make sure the optimization works:

> Sounds great!

It looks like, because emscripten configures dlmalloc to use MORECORE instead of mmap, calloc_must_clear returns a constant 1. If we customize dlmalloc to know that sbrk returns zeroed memory, it might be possible to enable the optimization.

@kg (Author) commented May 2, 2024

mimalloc also does not have this optimization enabled. The necessary code for it is there, but page->is_zero_init is never set to TRUE anywhere in the emscripten version of mimalloc (is this a bug in the original allocator or just the emscripten version? I'm not sure), so it doesn't engage.

@sbc100 (Collaborator) commented May 2, 2024

Well, I suppose step 1 is to start using calloc, and step 2 would then be to try to improve the calloc of the various allocators where possible.

@kg (Author) commented May 2, 2024

It looks like emscripten's implementation of the mimalloc primitives is not setting the is_zero flag, because it services page allocations via emmalloc:
[screenshot: emscripten's mimalloc page allocation code servicing requests via emmalloc]

@sbc100 (Collaborator) commented May 2, 2024

> It looks like emscripten's implementation of the mimalloc primitives is not setting the is_zero flag, because it services page allocations via emmalloc:

I imagine @kripken could suggest a way to pipe that information through.. its fine for mimalloc to use and internal/special version emmalloc API I think.

@kripken (Member) commented May 6, 2024

Yes, I think an internal API would be fine here. Something like emmalloc_memalign but that also informs the caller if the output happens to be zeroed already "for free". And then if so we can set is_zero in those cases.
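One possible shape for such an API (the name and signature here are purely hypothetical, not emscripten's; this host-side sketch conservatively zeroes because, unlike a real emmalloc variant, it cannot know the block's provenance):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* An aligned allocator that also reports whether the returned block is
 * already zeroed, so a consumer like mimalloc could set
 * page->is_zero_init instead of clearing the block again. */
void *alloc_aligned_report_zero(size_t align, size_t size, int *out_is_zero) {
    /* aligned_alloc (C11) requires size to be a multiple of align */
    void *p = aligned_alloc(align, size);
    if (!p) return NULL;
    /* A real implementation would report 1 only when the block came
     * from fresh sbrk / memory.grow pages, with no memset at all;
     * this sketch zeroes explicitly and reports the result. */
    memset(p, 0, size);
    *out_is_zero = 1;
    return p;
}
```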

@kg (Author) commented May 6, 2024

The option I've been exploring on my end is a real mmap implementation, which would allow turning on dlmalloc and mimalloc's mmap support and automatically getting the zeroing optimizations. Would you be interested in that, or would you prefer to keep the current simpler imitation mmap?

@sbc100 (Collaborator) commented May 7, 2024

> The option I've been exploring on my end is a real mmap implementation, which would allow turning on dlmalloc and mimalloc's mmap support and automatically getting the zeroing optimizations. Would you be interested in that, or would you prefer to keep the current simpler imitation mmap?

That sounds like a much bigger change.. I wonder what the downsides might be? I guess for one thing it would mean that the underlying page allocator could only work on 64k chunks. Also, if it supports munmap too, then we could get fragmentation at that level.. but I don't think any of our malloc implementations are likely to be calling munmap (at least not under normal operations).

@kg (Author) commented May 7, 2024

> The option I've been exploring on my end is a real mmap implementation, which would allow turning on dlmalloc and mimalloc's mmap support and automatically getting the zeroing optimizations. Would you be interested in that, or would you prefer to keep the current simpler imitation mmap?

> That sounds like a much bigger change.. I wonder what the downsides might be? I guess for one thing it would mean that the underlying page allocator could only work on 64k chunks. Also, if it supports munmap too, then we could get fragmentation at that level.. but I don't think any of our malloc implementations are likely to be calling munmap (at least not under normal operations).

Theoretically there's no need to require 64k "pages" for a custom mmap; I've experimented with smaller pages. For our workload smaller pages aren't an improvement, but it would be possible to use a smaller size if it was beneficial for applications. You'd have to change the reported _SC_PAGESIZE though.
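The page size has to be changed consistently because portable code sizes its mappings by rounding up to whatever sysconf reports, so the reported value and the allocator's real page size must agree. A minimal sketch:

```c
#include <stddef.h>
#include <unistd.h>

/* Round a requested length up to a whole number of pages, the way
 * portable mmap callers typically do. */
size_t round_to_pages(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}
```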

Current emscripten mmap still tries to align the result addresses even though it's malloc, right? So it's almost equivalent to real 64KB pages.

The current mmap implementation also appears to have to linearly scan through the full linked list of mappings to perform munmap, which isn't great, but I imagine many applications never call munmap?

A custom mmap could also implement FIXED allocation support, but I don't know if that's actually useful for anyone.
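The page-table approach mentioned above could look roughly like this toy sketch (sizes and names invented): index mappings by page number so munmap can resolve, or safely reject, any address in O(1) instead of scanning a linked list.

```c
#include <stddef.h>

#define PAGE_SHIFT 16                 /* 64 KiB pages */
#define MAX_PAGES 1024                /* covers a 64 MiB toy address space */

static int mapping_id[MAX_PAGES];     /* 0 = unmapped, else owning mapping's id */

void table_map(size_t addr, size_t len, int id) {
    for (size_t p = addr >> PAGE_SHIFT; p <= (addr + len - 1) >> PAGE_SHIFT; p++)
        if (p < MAX_PAGES) mapping_id[p] = id;
}

/* Looking up an arbitrary (possibly unmapped) address is safe: it just
 * returns 0 instead of scanning or crashing. */
int table_lookup(size_t addr) {
    size_t p = addr >> PAGE_SHIFT;
    return p < MAX_PAGES ? mapping_id[p] : 0;
}

/* Partial unmaps fall out naturally: clear only the covered pages. */
void table_unmap(size_t addr, size_t len) {
    for (size_t p = addr >> PAGE_SHIFT; p <= (addr + len - 1) >> PAGE_SHIFT; p++)
        if (p < MAX_PAGES) mapping_id[p] = 0;
}
```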

@sbc100 (Collaborator) commented May 7, 2024

> is a real mmap implementation

Obviously you can't make a "real" mmap implementation without virtual memory, right? One would still basically need to fake mmap on top of sbrk, right? But that faking would happen underneath malloc instead of on top of it, I guess? It's not a terrible idea.. and it might make our mimalloc support nicer, since currently mimalloc sits on top of emmalloc, which is a little odd. In this future, mimalloc would sit on top of this new mmap_page_allocator thing. Would that be useful alone?
