[wasm] Add custom mmap/munmap implementation for anonymous mappings #101871
Tagging subscribers to 'arch-wasm': @lewing
Reformat header
There's more to be done to improve on this, but it feels mergeable so it would be great to at least get initial feedback.
looking good
Should we turn this off for threaded wasm? Is it already off for threaded wasm? Is it off for WASI?
Our current implementation is actually subtly broken for threaded wasm. I'm open to disabling this for MT anyway, but it should be thread-safe thanks to the mutex. It should be enabled for WASI I think.
After thinking some more about the skip structure I realized it has a failure mode if you perform partial unmappings. For a full unmapping, it returns the page table to a fully correct state, since all of the skip pages created by the original mapping get reset. But if you unmap the middle of an allocation you can end up with left-over skip pages that jump past free space, like so:
Similarly, if you have a big section of free blocks with skip data, you need to allocate from the front of it for the skip data to stay valid; otherwise you could end up skipping over free space, because the preceding skip data hasn't been updated. I'm not sure how to fix this yet. Maybe, since skip values only go up to 64, I can just recalculate the skip values for the preceding 64 pages any time I modify pages? EDIT: That is what I did.
Address PR feedback
After freeing pages, repair the skip values for the preceding 64 pages so they don't erroneously skip over newly free pages
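The repair step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from the PR: the separate `skips` array, the two-state page table, and every `sketch_`-prefixed name are hypothetical. The only detail taken from the discussion is that skip values are capped at 64, so a change to a range of pages can only invalidate the skip values of the 64 pages before it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_SKIP 64u  /* cap on skip values, per the discussion */

enum { SKETCH_FREE = 0, SKETCH_IN_USE = 1 };  /* hypothetical page states */

/* skips[i] is 0 for a free page; otherwise it is the number of consecutive
 * non-free pages starting at i, capped at SKETCH_MAX_SKIP. After pages
 * [first, first + count) change state, recompute the skip values for that
 * range and for the SKETCH_MAX_SKIP pages before it, walking backwards so
 * each entry can be derived from its successor. */
static void
sketch_repair_skips(const uint8_t *states, uint8_t *skips, uint32_t len,
                    uint32_t first, uint32_t count)
{
    assert(first + count <= len);
    uint32_t lo = first > SKETCH_MAX_SKIP ? first - SKETCH_MAX_SKIP : 0;
    for (uint32_t i = first + count; i-- > lo; ) {
        if (states[i] == SKETCH_FREE) {
            skips[i] = 0;
            continue;
        }
        uint32_t next = (i + 1 < len) ? skips[i + 1] : 0;
        skips[i] = (uint8_t)(next < SKETCH_MAX_SKIP ? next + 1 : SKETCH_MAX_SKIP);
    }
}

/* Follow skip values from `from` until a free page is found.
 * Returns len if there is no free page at or after `from`. */
static uint32_t
sketch_next_free(const uint8_t *skips, uint32_t len, uint32_t from)
{
    uint32_t i = from;
    while (i < len && skips[i] != 0)
        i += skips[i];
    return i;
}
```

Pages more than 64 entries before the modified range never need repair in this layout: their capped skip value cannot carry a scan into or past the modified pages.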
LGTM, although we should make a follow-up work item to figure out why MT was unhappy.
…otnet#101871)
* emscripten libc implements mmap/munmap as a broken adapter on top of malloc (not calloc), which means it has no choice but to invoke memset on every allocation in order to provide properly zeroed bytes for our allocation requests. this commit adds a custom mmap/munmap implementation that can skip zeroing already-zeroed pages
* re-enable freeing of pages in sgen on wasm if custom mmap is active
* add runtime option for custom mmap
* add warning switches to fix build on debian
Introduction
The biggest hotspot in many of my startup measurements is memset. Out of memset's various callers in the runtime, the main culprits here are sgen_alloc_os_memory and sgen_alloc_os_memory_aligned, and memset is being performed on their behalf by the mmap implementation in emscripten's libc:
emscripten libc implements mmap/munmap as a broken adapter on top of malloc (not calloc), which means it has no choice but to invoke memset on every allocation in order to provide properly zeroed bytes for our allocation requests. This is theoretically fixable on the emscripten end (I'm still looking into it, but it's not trivial), but in the process of testing my theory that the memset isn't necessary, I noticed some other problems and decided to put together a complete implementation of mmap/munmap for our needs that can skip calling memset. The implementation is controlled by a runtime option, so you can turn it off to get emscripten mmap instead.
Some more details in no particular order:
Design overview
mono_wasm_page_manager (mwpm) is an extremely simple mmap implementation that has a statically-allocated page table for the whole 32-bit wasm address space. It manages fixed-size pages that match the native WASM page size (64KB), though it can be configured for smaller pages (my testing shows no real advantage from doing this).
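To make the sizing concrete: a 32-bit address space divided into 64KB pages gives 2^32 / 2^16 = 65536 page table entries. The following is a minimal sketch of such a table, assuming one byte of state per page; the one-byte entry and all `sketch_`-prefixed names are assumptions for illustration, not the PR's actual layout.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  65536u  /* native WASM page size: 64KB */
#define SKETCH_PAGE_COUNT 65536u  /* 2^32 bytes / 2^16 bytes per page */

/* One entry per page of the 32-bit address space (assumed: 1 byte each,
 * so the whole table costs 64KB of static storage). */
static uint8_t sketch_page_table[SKETCH_PAGE_COUNT];

/* Map a 32-bit address to the index of the page that contains it. */
static uint32_t
sketch_page_index(uint32_t address)
{
    return address / SKETCH_PAGE_SIZE;
}

/* Map a page index back to the address of the first byte of that page. */
static uint32_t
sketch_page_address(uint32_t index)
{
    assert(index < SKETCH_PAGE_COUNT);
    return index * SKETCH_PAGE_SIZE;
}

/* Look up the state byte for whichever page contains `address`. */
static uint8_t
sketch_page_state_at(uint32_t address)
{
    return sketch_page_table[sketch_page_index(address)];
}
```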
To service allocations, mwpm uses sbrk to grab large chunks of pre-zeroed memory, and updates its page table to record that those pages are currently free and zeroed. As a result, when a caller asks for zeroed memory, there is no need to call memset to zero it. Pages controlled by other callers of sbrk remain as dead zones in the page table and will not be used. In order to get aligned pages out of sbrk, it allocates an extra page's worth of memory and discards the excess on either side of the allocated pages. If mwpm discovers that it has called sbrk twice in a row without another allocator (i.e. dlmalloc) getting to it, it welds the two allocations together, recovering most of the alignment waste bytes that were lost due to sbrk's lack of alignment.
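The alignment arithmetic can be sketched as below. This is an illustration under assumptions rather than the PR's code: it works on plain addresses instead of calling sbrk, and `sketch_span`, `sketch_align_span` and `sketch_try_weld` are hypothetical names. It shows that an unaligned range of N+1 pages yields N aligned pages with one page's worth of waste split between head and tail, and that treating two back-to-back ranges as one recovers a page that would otherwise be lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 65536u

typedef struct {
    uintptr_t start;       /* first page-aligned address inside the raw range */
    size_t    pages;       /* whole pages available starting at `start` */
    size_t    head_waste;  /* bytes discarded before `start` */
    size_t    tail_waste;  /* bytes discarded after the last whole page */
} sketch_span;

/* Carve page-aligned pages out of an arbitrary [raw_base, raw_base + raw_size). */
static sketch_span
sketch_align_span(uintptr_t raw_base, size_t raw_size)
{
    const uintptr_t mask = (uintptr_t)SKETCH_PAGE_SIZE - 1;
    assert(raw_size >= SKETCH_PAGE_SIZE);
    uintptr_t raw_end = raw_base + raw_size;
    uintptr_t start = (raw_base + mask) & ~mask;  /* round up */
    uintptr_t end = raw_end & ~mask;              /* round down */
    sketch_span s;
    s.start = start;
    s.pages = (size_t)(end - start) / SKETCH_PAGE_SIZE;
    s.head_waste = (size_t)(start - raw_base);
    s.tail_waste = (size_t)(raw_end - end);
    return s;
}

/* If the next raw range begins exactly where the previous one ended, treat
 * both as a single range so the waste at the seam becomes usable.
 * Returns 1 and fills *out on success, 0 if the ranges are not contiguous. */
static int
sketch_try_weld(uintptr_t prev_base, size_t prev_size,
                uintptr_t next_base, size_t next_size, sketch_span *out)
{
    if (prev_base + prev_size != next_base)
        return 0;
    *out = sketch_align_span(prev_base, prev_size + next_size);
    return 1;
}
```

For example, two separate three-page requests starting at address 0x12345 each yield two aligned pages, while the welded six-page range yields five.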
Finding available pages to service allocations is done with a naive linear scan - I tested more complex strategies and they produced worse fragmentation, though it's possible I missed the perfect solution. The first suitable spot for an allocation is the one that gets used.
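A naive first-fit scan of this kind might look like the sketch below. The state names, the `SKETCH_NOT_FOUND` sentinel and the function name are hypothetical, and the choice to let a run mix zeroed and dirty free pages is an assumption based on the description of dirty pages being reused.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical page states; FOREIGN marks pages owned by other sbrk callers. */
enum {
    SKETCH_FOREIGN = 0,
    SKETCH_FREE_ZEROED,
    SKETCH_FREE_DIRTY,
    SKETCH_IN_USE
};

#define SKETCH_NOT_FOUND UINT32_MAX

/* Return the index of the first run of `wanted` consecutive free pages,
 * or SKETCH_NOT_FOUND if no such run exists. */
static uint32_t
sketch_find_first_fit(const uint8_t *table, uint32_t table_len, uint32_t wanted)
{
    uint32_t run_start = 0, run_len = 0;
    assert(wanted > 0);
    for (uint32_t i = 0; i < table_len; i++) {
        int is_free = table[i] == SKETCH_FREE_ZEROED ||
                      table[i] == SKETCH_FREE_DIRTY;
        if (!is_free) {
            run_len = 0;  /* any non-free page breaks the current run */
            continue;
        }
        if (run_len == 0)
            run_start = i;
        if (++run_len == wanted)
            return run_start;
    }
    return SKETCH_NOT_FOUND;
}
```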
When pages are unmapped, they are returned to the page table in a 'free and dirty' state. Requests to map zeroed memory will eagerly use this memory but call memset to prepare it before returning it; if the caller doesn't request zeroed memory, we skip the memset.
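The zeroed/dirty bookkeeping could be sketched like this. It is an illustrative model with hypothetical names, using a caller-supplied buffer and page size so it can run outside wasm. It clears the whole allocation whenever any page in it is dirty, which mirrors the behavior the TODO list proposes to refine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical page states. */
enum { SKETCH_FREE_ZEROED = 0, SKETCH_FREE_DIRTY, SKETCH_IN_USE };

/* Mark pages [first, first + count) as in use. If the caller wants zeroed
 * memory and any claimed page was dirty, clear the whole allocation.
 * Returns 1 if memset was needed, 0 if it was skipped. */
static int
sketch_claim_pages(uint8_t *states, uint8_t *memory, size_t page_size,
                   uint32_t first, uint32_t count, int want_zeroed)
{
    int any_dirty = 0;
    for (uint32_t i = first; i < first + count; i++) {
        assert(states[i] != SKETCH_IN_USE);
        if (states[i] == SKETCH_FREE_DIRTY)
            any_dirty = 1;
        states[i] = SKETCH_IN_USE;
    }
    if (want_zeroed && any_dirty) {
        memset(memory + (size_t)first * page_size, 0, (size_t)count * page_size);
        return 1;
    }
    return 0;
}

/* Unmapped pages go back to the table as free but dirty. */
static void
sketch_release_pages(uint8_t *states, uint32_t first, uint32_t count)
{
    for (uint32_t i = first; i < first + count; i++)
        states[i] = SKETCH_FREE_DIRTY;
}
```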
Performance
From an algorithmic perspective this is a bad implementation, but in testing the performance is on par with emscripten's mmap.
S.T.J.Tests finishes with a memory size of 1.409GB using emscripten libc, and a memory size of 1.410GB using mwpm. mwpm finishes the test run with 76.6% of its allocated pages in use, and enough space to allocate 461 sequential pages, so fragmentation is mostly under control. The mwpm version of the test run completed in 3:09, vs emscripten at 2:57, so it's not catastrophically slower (if it's slower at all; I still have to figure out a good way to do realistic benchmarking). As one would expect, the memset startup bottleneck isn't present with mwpm active.
We could probably optimize page table searches and modifications using WASM SIMD once it's enabled for the runtime.
TODOs
* Locking for multithreaded builds
* Instead of memsetting the entire allocation, memset only the pages that need to be zeroed
* Use a more complex page table representation so we don't ever have to scan pages we don't control

Related emscripten-core/emscripten#21620