memset consumes a large amount of time during startup for applications that use mmap #21620
First of all I will just note that it's true that bulk memory should be enabled by default soon, but you can also enable it today. Is it possible to enable it for all your users, or do you not have control over the flags they use? Having said that, I don't see why we wouldn't have a JS fallback for this. Finally, if you are seeing a lot of calls to mmap, are these anonymous mappings? If so, perhaps plain malloc would serve just as well, since it doesn't zero.
Sounds like it's just MAP_ANONYMOUS, yeah.
I am hoping we'll be able to turn it on for everyone, but we've had issues turning on uncontroversial (to me) flags so far, so I don't know when we can make it the default. We still support disabling both WASM SIMD and WASM EH for users who have old Android or iOS devices, so bulk memory potentially adds a third configuration flag once we turn it on by default. When a developer flips one or both of those off, we re-link our whole runtime using emscripten on the developer's machine to apply the appropriate flags and link in the appropriate bits.
Yeah, I can put a PR together if it's wanted. I'll try to remember to get to that this week.
Thanks for the suggestion, it hadn't occurred to me that malloc could be meaningfully better here. Thankfully we don't have bare mmap calls in our codebase - they're wrapped, and the wrapper is wasm-specific - so it's possible I can turn all our mmap/munmap calls into malloc and free. I expect it will pay off, since my profiles don't show memset as a hotspot under malloc calls (malloc itself is something of a hotspot, but not as bad as memset).
So in your case the caller of mmap is relying on the returned memory being zeroed?
In these cases, the caller is relying on them being zero, but they're new "pages" that were allocated by sbrk/pre-allocated with initial heap size, and are already zero.
In that case it sounds pretty risky to skip the memset after the malloc, since malloc makes no guarantees. I wonder if we could improve our fake mmap to skip the zeroing when the underlying memory is known to be fresh, and therefore already zero?
BTW I am adding an underscore prefix to the internal memset/memcpy helpers: #21622
Incidentally, I did some local measurements of memset specifically, and it looks like the bulk memory version is indeed meaningfully faster than the plain wasm loop.
Just to provide an update, I was able to build our stack using 3.1.56 with bulk memory enabled.

When I looked through the relevant code in emscripten, it seemed theoretically feasible to flow 'these are pre-zeroed pages' up through the stack to mmap, but it didn't seem trivial to do without making some really ugly changes to musl.
I wrote a simple prototype page allocator to test this out, and almost all of the memset time during startup went away; wall-clock time and sample counts went down too. Doing this in user space doesn't feel straightforward, though, because it's not possible to request whole aligned pages from sbrk - and you would need to replace malloc/free with your own allocator in order to make it use your custom page allocator instead of emscripten libc's. So I feel like emscripten libc might be the place to do this, but it would require some extensive surgery on musl unless I'm missing something. Will start trying to figure out what that diff looks like.
If you know that memory doesn't need to be released and recycled, then I think you can use sbrk directly. BTW, does the higher level system (i.e. the caller of mmap) actually require zeroed pages?
The consumers sadly do want zeroed pages. My prototype just tracks whether pages came from sbrk (and are already zeroed), in which case it knows they are 'zeroed free pages' instead of plain 'free pages' and skips performing memset() on the caller's behalf. Obviously if a caller doesn't want zeroed pages you can skip the memset then too, but I haven't bothered to wire that up yet, since I just wanted to measure the impact of getting rid of the memset in the mmap path. One specific example here: during startup our GC has to allocate memory for its card table. That's a big block of RAM that has to be zeroed, and it can be serviced safely by sbrk without a memset.
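A minimal sketch of that tracking, assuming a simple intrusive free-list page allocator (all names are illustrative, and it glosses over the sbrk alignment problem mentioned above):

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>  // sbrk

#define PAGE_SIZE 65536

typedef struct Page { struct Page *next; } Page;

static Page *zeroed_free_pages = NULL;  // fresh from sbrk, still all-zero
static Page *dirty_free_pages  = NULL;  // recycled, contents unknown

// Hand back one zero-filled page, skipping the memset when the page is
// known to still hold the zeros that wasm memory growth gave it.
static void *alloc_zeroed_page(void) {
  if (zeroed_free_pages) {
    Page *p = zeroed_free_pages;
    zeroed_free_pages = p->next;
    p->next = NULL;  // the free-list link was the only word we dirtied
    return p;
  }
  if (dirty_free_pages) {
    Page *p = dirty_free_pages;
    dirty_free_pages = p->next;
    memset(p, 0, PAGE_SIZE);  // recycled page: must be cleared
    return p;
  }
  void *p = sbrk(PAGE_SIZE);  // newly grown wasm memory arrives zeroed
  return p == (void *)-1 ? NULL : p;
}

static void free_page(void *ptr) {
  Page *p = (Page *)ptr;
  p->next = dirty_free_pages;  // conservatively treat recycled pages as dirty
  dirty_free_pages = p;
}
```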
I wonder if you could do even better and just make that region static?
For the specific scenario of optimizing our startup, we can do things like pre-allocate fixed regions in our address space by setting up the wasm binary just so, etc. But the card table example feels relevant because if your heap grows, you potentially need to allocate more card tables. Or maybe you have per-thread arenas, so you need to allocate one per thread. (We also allocate our per-thread stacks via wrapped mmap right now, so that's a spot where our startup gets dinged when multithreading is active - and those don't actually need to be zeroed to begin with.)

I think there are a lot of cases where userspace can hand-optimize this stuff, and maybe that makes it not worthwhile to do anything here. But there might be value in some sort of 'give me N (zeroed | don't-care) 64KB pages, please' primitive that application code can use, or an optimization for the existing mmap to skip the memset; a sketch of such a primitive follows below.

We do make use of munmap, so I had to implement a page table and such for the proof of concept to work. My instinct is that anonymous mmap is out there in the wild and people do unmap those allocations, but maybe most of those cases are served fine by malloc/free, which don't zero? The "you can safely unmap any address without crashing, and you can unmap a subset of your mapping" parts of the mmap/munmap contract are hard to provide with malloc/free, though :(
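To make that concrete, such a primitive might look something like this (a hypothetical sketch; nothing like it exists in emscripten today, and every name here is invented):

```c
#include <stddef.h>

// Does the caller need the pages pre-zeroed, or are the contents irrelevant?
typedef enum {
  EM_PAGES_ZEROED,
  EM_PAGES_UNINITIALIZED,
} em_page_contents_t;

// Hand out `count` contiguous 64KB pages, or NULL on failure. When the
// caller passes EM_PAGES_UNINITIALIZED, or when the allocator can satisfy
// the request from never-recycled sbrk memory, no memset is needed.
void *em_alloc_pages(size_t count, em_page_contents_t contents);

// Release pages back to the allocator; any subset of a prior allocation
// may be released, matching munmap's contract.
void em_free_pages(void *ptr, size_t count);
```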
Sounds great! |
It looks like, because emscripten configures dlmalloc to use MORECORE instead of mmap, calloc_must_clear returns a constant 1. If we customized dlmalloc to know that sbrk returns zeroed memory, it might be possible to turn the optimization on.
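For reference, this is roughly how upstream dlmalloc gates the clearing (paraphrased from malloc.c; exact details may vary between versions), plus a hypothetical tweak along the lines described above:

```c
/* Paraphrased from upstream dlmalloc (malloc.c): calloc may skip its
   memset only for chunks that came from mmap, which the OS zero-fills. */
#if MMAP_CLEARS
  #define calloc_must_clear(p) (!is_mmapped(p))
#else
  /* MORECORE/sbrk-only configuration, as emscripten builds it: in this
     branch every calloc pays for a full memset. */
  #define calloc_must_clear(p) (1)
#endif

/* Hypothetical tweak: if the allocator tracked which chunks still lie in
   never-recycled sbrk memory (is_fresh_from_sbrk is invented here, not a
   dlmalloc primitive), calloc could skip those chunks too. */
#define calloc_must_clear_sketch(p) (!is_fresh_from_sbrk(p))
```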
mimalloc also does not have this optimization enabled. The necessary code for it is there, but it isn't wired up in the emscripten port.
Well, I suppose step 1 is to start surfacing that zero-ness information from the underlying allocator.
I imagine @kripken could suggest a way to pipe that information through.. it's fine for mimalloc to use an internal/special version of the emmalloc API, I think.
Yes, I think an internal API would be fine here. Something like an emmalloc_memalign variant that also reports whether the returned memory is already zeroed.
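As a sketch of what that could look like (the name, signature, and semantics are all hypothetical, not an existing emmalloc function):

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical internal emmalloc entry point: an aligned allocation that
// additionally reports whether the returned block is fresh sbrk memory
// and therefore already zero, letting mimalloc skip its own clearing.
void *emmalloc_memalign_is_zeroed(size_t alignment, size_t size,
                                  bool *out_is_zeroed);
```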
The option I've been exploring on my end is a real mmap implementation, which would allow turning on dlmalloc and mimalloc's mmap support and automatically getting the zeroing optimizations. Would you be interested in that, or would you prefer to keep the current simpler imitation mmap? |
That sounds like a much bigger change.. I wonder what the downsides might be? I guess for one thing it would mean that the underlying page allocator could only work in 64k chunks. Also, if it supports munmap too, then we could get fragmentation at that level.. but I don't think any of our malloc implementations are likely to be calling munmap (at least not under normal operation).
Theoretically there's no need to require 64k "pages" for a custom mmap; I've experimented with smaller pages. For our workload smaller pages aren't an improvement, but a smaller size would be possible if it was beneficial for applications. You'd have to change the _SC_PAGESIZE value though. The current emscripten mmap still tries to align the result addresses even though it's malloc underneath, right? So it's almost equivalent to real 64KB pages already. The current implementation also appears to linearly scan the full linked list of mappings to perform munmap, which isn't great, but I imagine many applications never call munmap? A custom mmap could also implement MAP_FIXED support, but I don't know if that's actually useful for anyone.
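For context, applications discover the page size through sysconf, which is why a custom mmap with smaller pages would need that value to agree:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
  // Under emscripten this currently reports 64KB, matching the wasm page
  // size; a custom mmap with smaller pages would need to change it too.
  printf("page size: %ld\n", sysconf(_SC_PAGESIZE));
  return 0;
}
```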
Obviously you can't make a "real" mmap implementation without virtual memory, right? One would still basically need to fake mmap on top of sbrk.
During application startup, operations like mmap will perform a memset to properly zero pages before returning them to the calling application: emscripten/system/lib/libc/emscripten_mmap.c, line 124 (at 29be801).
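Paraphrasing the anonymous-mapping path (a simplified sketch, not the actual source; the alignment constant and function name here are illustrative):

```c
#include <errno.h>
#include <malloc.h>    // memalign
#include <string.h>
#include <sys/mman.h>

#define WASM_PAGE_SIZE 65536  // stands in for whatever alignment the real code uses

// Simplified sketch of what a malloc-backed mmap must do for MAP_ANONYMOUS:
// grab heap memory, then zero it to honor the POSIX guarantee that
// anonymous mappings are zero-filled.
static void *anonymous_mmap_sketch(size_t length) {
  void *ptr = memalign(WASM_PAGE_SIZE, length);
  if (!ptr) {
    errno = ENOMEM;
    return MAP_FAILED;
  }
  // This is the memset the profiles show: malloc may hand back recycled,
  // dirty memory, so the implementation cannot simply skip the clear.
  memset(ptr, 0, length);
  return ptr;
}
```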
If bulk memory is enabled, this is fine, since that should (if the JS/wasm runtime is doing the right thing) just invoke a native implementation of the bulk fill operation. EDITED: While bulk memory memset is faster, memset is still a massive bottleneck in this scenario.

If bulk memory isn't enabled, what appears to happen is that since we're starting up, it's quite likely that memset and memcpy operations will run in an interpreter instead of in fully jitted WASM code, like in this profile: (note: the elapsed time looks worse than it is, because I profiled multiple app starts in a loop)
Presumably if other parts of your startup take long enough, tiered compilation in your browser of choice will have completed by the time you start calling mmap, and this won't happen. For us, it consistently all happens before that point.
I know in some cases wasm will just always run in an interpreter - e.g. lockdown modes, or iOS when its JIT code memory block is full. In those cases this could be pretty impactful for the entirety of an app's runtime, but I would expect bulk memory to fix it there too.
sbc100 mentioned that bulk memory should be default soon, which might make this issue no longer relevant. I just figured I'd bring it up in case it seemed worthwhile to make a 1-2 line change to the emscripten libc to, e.g., always use emscripten_memset_js (which doesn't exist right now, I guess) in operations like mmap where it could matter.

I'll also note that during startup lots of this memory is already pre-zeroed, since it came from sbrk at the bottom of the stack, and it looks like in some cases it also comes from mi_heap_malloc_zero under the covers. So in those scenarios there's no point in doing the memset at all - but flowing that information all the way up the call stack into mmap isn't an easy ask, so I'm not surprised that it's not happening.

In general (in part because memset is running in the interpreter instead of native code), memset and memcpy are a surprisingly large slice of our application startup time :)
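For what it's worth, one way such a helper could look is an EM_JS shim that delegates the fill to JavaScript, where TypedArray.prototype.fill runs as native code regardless of wasm tier-up state (the function name follows the suggestion above; it does not exist in emscripten):

```c
#include <emscripten/em_js.h>
#include <stddef.h>

// Hypothetical helper: fill memory from the JS side. Pointer arguments
// arrive in JS as byte offsets into the wasm heap, so HEAPU8.fill covers
// exactly the requested range.
EM_JS(void, emscripten_memset_js, (void *ptr, int value, size_t num), {
  HEAPU8.fill(value, ptr, ptr + num);
});
```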