Alignment guarantees for mapped buffers #3508
Is there a strong reason to avoid using the minimum alignment allowed by […]?
Right, it should be OK though; at least it's the size of a 64-bit type.
The spec requiring users to make […]. The motivations I mentioned are:
8 bytes is too small IMHO. Things start being pretty safe and useful around 16 bytes, and I don't think it's worth ensuring anything larger than 64 bytes.
I was thinking in terms of getting a sub-region of a slice (in Rust terms), where the alignment of the sub-region would have to be at most the offset alignment, but I guess that doesn't apply here since the offset is passed down to the backend to do the copy.
I'd be slightly hesitant to add alignment guarantees that don't currently seem to exist in WebGPU, although maybe there would be interest in adding these guarantees upstream. I completely agree that it would be great to avoid the panic when casting slices, but I'm not sure how tricky it might be to guarantee extra alignment. Thinking about it some more, I think the minimum guaranteed alignment on the size right now is 4, and the expectation is the backing […].
There could also be some subtle interactions with mapped subregions. I can't remember the resolution on overlapping mapped subregions in WebGPU, but there could be a subtle interaction there if we try to guarantee larger alignments vs. the lengths of the subregions (currently 4?). We might also want to consider how this works with buffers with unaligned lengths (e.g., too small or just not aligned). In places where we don't require temporary allocations for memory mapping, this might mean we need oversized buffers to guarantee that a later aligned map won't fail.
Yes, that's what I would like to figure out. I suspect that we can guarantee comfortable alignment almost everywhere without much implementation effort. At least it would be useful to understand and document what alignment one can rely on (per backend if need be). I'm interested in the details if you have any in the case of wasm. On the web (and I suspect everywhere else), the runtime creates the typed array provided via mapAsync separately from the wasm heap, and the generated Rust bindings have to make copies between the wasm heap and the real WebGPU typed array. In this context, if an alignment is to be guaranteed, it would come from where the bindings choose to allocate this copy, so maybe there's something we can do there.
We can also survey the situation for sub-regions. I expect that there are two situations: […]
…documenting a guaranteed alignment of […]
What would you think about starting with documenting the conservative minimum alignment of 4 bytes? This is the minimum required for the mapping size in WebGPU anyway, so anything lower would be an implementation bug. We could also provide better documentation that recommends creating oversized buffers if alignment with existing types is a concern, or maybe even some kind of helper for unaligned slices that slices the byte slice at the right length (e.g., working nicely with […]). Later we could reconsider raising the minimum guaranteed alignment if it still turns out to be a significant pain point. At least anecdotally I haven't heard of many people running into this, so I wonder if it's worth extra implementation complexity in each backend (at least at this point).
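For illustration, a hedged sketch of what such a helper might look like. The function is hypothetical, not an existing wgpu API; it leans on `bytemuck::pod_align_to_mut` to sidestep both misaligned starts and odd lengths:

```rust
use bytemuck::Pod;

/// Hypothetical helper: expose the largest aligned, correctly sized
/// middle of a mapped byte slice as `&mut [T]`, instead of panicking
/// when the start or length of the mapping doesn't fit `T`.
fn aligned_middle_mut<T: Pod>(bytes: &mut [u8]) -> &mut [T] {
    // Splits into (unaligned prefix, aligned typed middle, short suffix).
    let (_prefix, middle, _suffix) = bytemuck::pod_align_to_mut(bytes);
    middle
}
```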
We technically already have an extra temporary copy right now so we could pad that, but I think we'll be able to improve this eventually and write into the […].
Well, that's not what I am after, but I'm all for someone adding this to the docs while the bigger picture is being figured out.
There is no need to stall. For `map_async`, I want 1) to gather information about the guarantees we get for free with each backend, 2) to document what can safely be relied on and where, and 3) to see if there are improvements that can be made without cost or heroics.
While researching this, I've seen the SIMD use case pop up in various places; it was also what prompted this discussion. Most likely this isn't a very common pain point because people unknowingly rely on alignments they don't know they need but happen to get in practice.
Alright, that sounds good. I do think we should try to upstream any guarantees into WebGPU and webgpu-headers if we do end up raising the alignment guarantee, but I agree that it makes sense to investigate it here first.
Definitely possible; it would probably be caused by SIMD types like you mentioned.
Isn't this exactly the type of subtle UB that `presser` was made to protect people from?
`presser` protects you from more than that (padding within structures, for example), but that's beside the point. I'm convinced that wgpu already mostly provides comfortable alignment guarantees for mapped buffers, and that bridging the gaps will require no heroics nor cost, so there is no reason for wgpu not to provide the comfort of, say, 16-byte alignment and remove a source of mistakes in the process. I'm unfortunately unable to spend much time in front of a computer for a few weeks. Where I'm at regarding this is that I'm 99.9% certain that the alignment we get in D3D12 and Metal is basically the alignment of the GPU buffer itself. I'd like to find spec wording that explicitly confirms that. At the very least, D3D11 at some point explicitly guaranteed 16-byte alignment not for the needs of the driver but for user convenience, so it is pretty safe to assume they didn't take that away. That would make it trivial to guarantee 16 bytes on all native backends (assuming ARB_map_buffer_alignment in GL). I haven't had time to look closely at the case of the web.
I had a workaround in place so far to check alignment after mapping, then add padding if necessary and use the buffer with an offset. I noticed today that this can cause spurious failures: I got a 2 (two!!) byte aligned buffer back on WebGL, applied a padding of 2, and then of course got an error when doing a copy. This is a very unfortunate tying of the "gpu offset" to the "cpu offset". I also think this illustrates that any alignment guarantee other than 16 is a bit nonsensical (a buffer offset that is required to be aligned to 16 makes no sense when the underlying data pointer isn't aligned to begin with).
Turns out our eagerness to acquire aligned pointers for fast copy operations backfired and got us into an impossible situation: by offsetting staging buffers to ensure CPU pointer alignment, we sometimes choose offsets that aren't allowed for copy operations. E.g., we get back a buffer that has a pointer alignment of 2 (that happens -.-), so we offset the pointer by 14 (our min alignment is 16!). We can now copy data into the buffer quickly and safely. But when scheduling e.g. `copy_buffer_to_texture` we get a wgpu crash! wgpu requires the offset (we put 14) to be:
* a multiple of `wgpu::COPY_BUFFER_ALIGNMENT`
* a multiple of the texel block size
Neither of which is true now! You might be asking why wgpu gives out such oddly aligned buffers to begin with, and the answer is sadly that the WebGL impl has issues + the spec doesn't guarantee anything, so this is strictly speaking valid (although most other backends will give out 16-byte-aligned pointers). See gfx-rs/wgpu#3508.
Long story short, I changed (and simplified) the way we go about alignment in `CpuWriteGpuReadBelt`. The CPU pointer no longer has *any* alignment guarantees, and offsets now fulfill the above requirements. This is _ok_ since we already wrapped all accesses to the CPU pointer and can do byte writes to them. The huge drawback is of course that `copy_from_slice` now has to do the heavy lifting of checking for alignment and then using the right instructions for everything that is worthwhile (that is, the things `memcpy` does when it deals with raw byte pointers).
Testing: confirmed the fix with a crashing repro on the web, then ran `just py-run-all` for native, and the renderer samples locally and on the web. I have not checked whether this has any practical perf impact. Luckily our interface makes this very much an "optimize later" problem (copy operations within `CpuWriteGpuReadBuffer` can be made more clever in the future if need be; unlikely to be necessary, to be fair).
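The conflict reduces to simple modular arithmetic; here is a worked version of the numbers from the message above (the base address is made up for illustration):

```rust
fn main() {
    const MIN_CPU_ALIGN: u64 = 16;
    // A mapped pointer that is only 2-byte aligned, as observed on WebGL.
    let ptr_addr: u64 = 0x1000 + 2;
    // Padding needed to round the pointer up to 16-byte CPU alignment.
    let pad = (MIN_CPU_ALIGN - ptr_addr % MIN_CPU_ALIGN) % MIN_CPU_ALIGN;
    assert_eq!(pad, 14);
    // But 14 is not a multiple of wgpu's copy alignment (4), so using it
    // as the buffer offset of a `copy_buffer_to_texture` is rejected.
    assert_ne!(pad % wgpu::COPY_BUFFER_ALIGNMENT, 0);
}
```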
FYI I did a bit of investigation on this over in webgpu.h: webgpu-native/webgpu-headers#180 (comment)
For wasm, we completely control the memory allocation: we can align it however we see fit. Both manual copy-to-pointer and copy-to-slice exist on Uint8Array, so we can do whatever we want. https://docs.rs/js-sys/latest/js_sys/struct.Uint8Array.html#method.copy_to
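A sketch of that idea (the function itself is hypothetical; only `Uint8Array::copy_to` and `align_offset` are existing APIs): over-allocate the heap-side copy and start it at whatever alignment we want.

```rust
use js_sys::Uint8Array;

/// Hypothetical: copy the typed array WebGPU handed us into a wasm-heap
/// allocation aligned to `align`, returning the backing storage and the
/// offset of the aligned copy within it.
fn copy_from_js_aligned(src: &Uint8Array, align: usize) -> (Vec<u8>, usize) {
    let len = src.length() as usize;
    // Over-allocate so an aligned start always exists inside the buffer.
    let mut backing = vec![0u8; len + align - 1];
    let offset = backing.as_ptr().align_offset(align);
    // `copy_to` accepts any `&mut [u8]` of matching length as destination.
    src.copy_to(&mut backing[offset..offset + len]);
    (backing, offset)
}
```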
Myles@apple says that: […]
TLDR: can wgpu provide alignment guarantees for mapped buffers (`map_async` and `queue.write_buffer_with`)?
The long version
The question came up on Matrix.
When a user maps a buffer, it can be tempting to cast the byte slice into a slice of whatever type they are filling the buffer with and then copy into that typed slice. Doing that requires that the mapped slice meet the minimum alignment of the type in question. If wgpu were to guarantee a minimum alignment, it would make this pattern easier to get right.
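For illustration, a minimal sketch of the pattern being described, assuming `bytemuck` for the cast (the function is illustrative and expects an already-mapped buffer):

```rust
use wgpu::Buffer;

/// Sketch: fill a mapped buffer by viewing its bytes as `[f32; 4]` values.
/// `bytemuck::cast_slice_mut` checks size and alignment at runtime and
/// panics if the mapped pointer is not aligned for the target type, which
/// is exactly the failure a guaranteed minimum alignment would rule out.
fn write_colors(buffer: &Buffer, colors: &[[f32; 4]]) {
    let mut view = buffer.slice(..).get_mapped_range_mut();
    let typed: &mut [[f32; 4]] = bytemuck::cast_slice_mut(&mut view);
    typed[..colors.len()].copy_from_slice(colors);
}
```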
On the web I don't expect this to matter because WebGPU buffers are copied to and from typed arrays; the runtime only needs to memcpy bytes around.
I don't see much in the way of documentation or specification about the alignment guarantees of mapped buffers.
In backends that use a memory allocator, like Vulkan, we can pass the alignment as a parameter of the allocation request (example), so it would be very easy to set the minimum alignment to some value.
In other backends it might be up to the driver. The various specs I looked at are better at documenting alignment requirements that users must abide by when writing data than alignment guarantees that drivers must uphold when mapping.
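To make the allocator case concrete, here is a sketch of what raising the requested alignment could look like with the `gpu-alloc` crate; the constant and the exact field values are illustrative assumptions, not wgpu's actual code:

```rust
// Hypothetical constant: the alignment wgpu would promise for mappable
// memory; everything else comes from Vulkan's reported requirements.
const MIN_MAP_ALIGNMENT: u64 = 16;

fn request_for(reqs: ash::vk::MemoryRequirements) -> gpu_alloc::Request {
    gpu_alloc::Request {
        size: reqs.size,
        // `align_mask` is `alignment - 1`; OR-ing the two masks takes the
        // larger of Vulkan's required alignment and our promised one
        // (both are powers of two).
        align_mask: (reqs.alignment - 1) | (MIN_MAP_ALIGNMENT - 1),
        usage: gpu_alloc::UsageFlags::HOST_ACCESS,
        memory_types: reqs.memory_type_bits,
    }
}
```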
Can wgpu provide alignment guarantees in mapAsync?
Per backend: […]
If we can, what would a good minimum alignment be?
Buffers are not supposed to be small, so I would go as far as giving a whole 64 bytes of alignment. That's quite big, but not that much compared to a typical buffer size, and in some rare but real situations it's nice to know how data fits into L1 cache lines.
At a minimum, guaranteeing 16 bytes would allow people to read or write common SIMD types with peace of mind.
The case of `queue.write_buffer_with`
Since this uses a simple bump allocation, it should be easy to provide alignment guarantees. Should we? How much?
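To make the claim concrete, a minimal sketch of how a bump allocator can hand out aligned sub-allocations; this is illustrative only, not wgpu's actual staging code:

```rust
/// Round `cursor` up to `align` (a power of two) and reserve `size`
/// bytes starting there, returning the allocation's offset.
fn bump_alloc(cursor: &mut u64, size: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    let start = (*cursor + align - 1) & !(align - 1);
    *cursor = start + size;
    start
}

fn main() {
    let mut cursor = 0;
    assert_eq!(bump_alloc(&mut cursor, 3, 16), 0);  // first allocation at 0
    assert_eq!(bump_alloc(&mut cursor, 8, 16), 16); // cursor 3 rounds up to 16
}
```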