Alignment guarantees for mapped buffers #3508

Open
5 of 8 tasks
nical opened this issue Feb 20, 2023 · 17 comments
Labels
type: enhancement New feature or request

Comments

@nical
Contributor

nical commented Feb 20, 2023

TLDR

  • Can we guarantee a certain minimal alignment when mapping buffers?
  • What would that alignment guarantee be?
  • What about queue.write_buffer_with?
  • Let's at least document what alignment is guaranteed (or the absence of guarantees).

The long version

The question came up on Matrix.

When a user maps a buffer, it can be tempting to cast the byte slice into a slice of whatever it is they are filling the buffer with and then copy into that typed slice. Doing that requires that the mapped slice have the minimum alignment of the type in question. If wgpu were to guarantee a minimum alignment, it would make this pattern easier to get right.
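For illustration, a minimal sketch of that pattern (assuming the bytemuck crate with its derive feature; `Vertex` is a made-up example type):

```rust
use wgpu::Buffer;

#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
    position: [f32; 3],
    uv: [f32; 2],
}

// Assumes `buffer` has MAP_WRITE usage and is currently mapped.
fn write_vertices(buffer: &Buffer, vertices: &[Vertex]) {
    let mut view = buffer.slice(..).get_mapped_range_mut();
    // This cast panics unless the mapped pointer happens to satisfy
    // align_of::<Vertex>() and the length divides evenly, which is exactly
    // what a documented minimum alignment would make reliable.
    let typed: &mut [Vertex] = bytemuck::cast_slice_mut(&mut view[..]);
    typed[..vertices.len()].copy_from_slice(vertices);
}
```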

On the web I don't expect this to matter because WebGPU buffers are copied to and from typed arrays. The runtime only needs to memcpy bytes around.

I don't see much in the way of documentation or specification about the alignment guarantees of mapped buffers.

In backends that use a memory allocator, like Vulkan, we can pass the alignment as a parameter of the allocation request (example), so it would be very easy to set the minimum alignment to some value.

In other backends it might be up to the driver. The various specs I looked at are better at documenting alignment requirements that users must abide by when writing data than alignment guarantees that the drivers must uphold when mapping.

Can wgpu provide alignment guarantees in mapAsync?

Per backend:

  • vulkan: easy via the allocator
  • metal?
    • On unified memory architectures I'm pretty sure we get the same alignment as the GPU needs, which is much larger than what we need. It would be good to get some confirmation.
    • At the very least, on all architectures, 16-byte alignment is guaranteed.
  • dx12: 16-byte alignment guaranteed
    • The doc for Map does not say anything, unlike D3D11's equivalent doc. Connor confirmed that the 16-byte guarantee applies to dx12 (asked upstream on the dx12 discord).
    • Can easily guarantee larger alignments using the allocator.
  • GL?
  • dx11?
    • D3D_FEATURE_LEVEL_10_0 and later: 16-byte alignment guaranteed (doc)
    • earlier than D3D_FEATURE_LEVEL_10_0: only 4-byte alignment guaranteed
  • wasm?
    • No alignment guarantees currently; hopefully we can provide some.

If we can, what would a good minimum alignment be?

Buffers are not supposed to be small, so I would go as far as giving a whole 64 bytes of alignment. That's quite big, but not that much compared to a typical buffer size, and in some rare but real situations it's nice to know how data fits into L1 cache lines.
At a minimum, guaranteeing 16 bytes would allow people to read or write common SIMD types with peace of mind.

The case of queue.write_buffer_with

Since this uses a simple bump allocation, it should be easy to provide alignment guarantees. Should we? How much?
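For reference, a sketch of the pattern in question (hedged against the API shape at the time of writing; later versions may differ):

```rust
// The view returned by write_buffer_with is plain bytes, so any typed write
// into it depends on the staging allocation's alignment, which is currently
// unspecified.
fn upload(queue: &wgpu::Queue, buffer: &wgpu::Buffer, data: &[u8]) {
    let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty data");
    let mut view = queue.write_buffer_with(buffer, 0, size);
    view.copy_from_slice(data); // byte-wise copies are always fine; typed casts may not be
}
```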

@teoxoy
Member

teoxoy commented Feb 20, 2023

Regarding the min alignment value, I think 32 is appropriate for the same reason the min{Uniform,Storage}BufferOffsetAlignment limits must be at least 32 according to the WebGPU spec.


@grovesNL
Collaborator

Is there a strong reason to avoid using the minimum alignment allowed by mapAsync? I think this is currently 8.

@teoxoy
Member

teoxoy commented Feb 20, 2023

Right, mapAsync's offset must be aligned to 8; that means we can't guarantee an alignment of more than 8.

It should be ok though; at least it's the size of a 64-bit type.

@nical
Contributor Author

nical commented Feb 21, 2023

Is there a strong reason to avoid using the minimum alignment allowed by mapAsync? I think this is currently 8.

The spec requires users to make offset a multiple of 8, but it does not provide a guarantee about the alignment of the start of the mapped range. I'll ask the question in reverse: is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the offset field?

The motivations I mentioned are:

  • In general, the bigger the guaranteed alignment, the harder it is to accidentally cause undefined behavior when interacting with the mapped data.
  • A minimum alignment of 16 bytes lets you write or read common SIMD types out of the mapped buffer.
  • A minimum alignment of 64 bytes makes it safe to interact with almost any Rust type and gives predictable alignment of the data with L1 cache lines.

8 bytes is too small IMHO. Things start being pretty safe and useful around 16, and I don't think it's worth ensuring anything larger than 64 bytes.
If it turns out that it can be achieved conveniently on every backend, my suggestion is 64-byte alignment in map_async and at least 16 bytes in write_buffer_with.
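In the meantime, callers who depend on an alignment can at least probe it at runtime; a minimal sketch:

```rust
// Sketch: check the actual alignment of a mapped range at runtime, since no
// documented guarantee exists yet. `buffer` is assumed to be mapped already.
let view = buffer.slice(..).get_mapped_range();
let addr = view.as_ptr() as usize;
// The largest power of two dividing the address is its effective alignment.
let actual_alignment = 1usize << addr.trailing_zeros();
assert!(actual_alignment >= 16, "mapping is less than 16-byte aligned");
```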

@teoxoy
Member

teoxoy commented Feb 21, 2023

The spec requires users to make offset a multiple of 8, but it does not provide a guarantee about the alignment of the start of the mapped range. I'll ask the question in reverse: is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the offset field?

I was thinking in terms of getting a sub-region of a slice (in Rust terms), where the alignment of the sub-region would be at most the offset alignment, but I guess that doesn't apply here since the offset is passed down to the backend to do the copy.
Although I think the story changes if we throw unified memory into the mix.
I think this needs some more research.

@grovesNL
Collaborator

Is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the offset field?

I'd be slightly hesitant to add alignment guarantees that don't currently seem to exist in WebGPU, although maybe there would be interest in adding these guarantees upstream. I completely agree that it would be great to avoid the panic when casting slices, but I'm not sure how tricky it might be to guarantee extra alignment.

Thinking about it some more, I think the minimum guaranteed alignment on the size right now is 4, and the expectation is that the backing ArrayBuffer fails to be viewed as a TypedArray view if it's not aligned (e.g., a Float64Array can't be created out of a mapped ArrayBuffer range that's 7 bytes long). Conceptually this feels really similar to bytemuck failing to cast a slice if it can't fit into the mapped memory. Maybe we could provide a convenience function that introduces an extra copy if necessary?
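Something like this, perhaps (a hypothetical helper, not an existing wgpu API, assuming bytemuck):

```rust
use std::borrow::Cow;
use bytemuck::{Pod, Zeroable};

// Hypothetical convenience function: borrow the mapped bytes in place when
// they happen to be aligned, otherwise pay for one extra copy instead of
// panicking. Assumes T is not zero-sized; truncates a trailing partial element.
fn read_typed<T: Pod>(bytes: &[u8]) -> Cow<'_, [T]> {
    match bytemuck::try_cast_slice(bytes) {
        Ok(typed) => Cow::Borrowed(typed),
        Err(_) => {
            let count = bytes.len() / std::mem::size_of::<T>();
            let mut out = vec![T::zeroed(); count];
            // Casting &mut [T] down to &mut [u8] always succeeds.
            bytemuck::cast_slice_mut::<T, u8>(&mut out)
                .copy_from_slice(&bytes[..count * std::mem::size_of::<T>()]);
            Cow::Owned(out)
        }
    }
}
```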

On the web I don't expect this to matter because WebGPU buffers are copied to and from typed arrays. The runtime only needs to memcpy bytes around.

The backing ArrayBuffer is provided to us and we can't control its size without possibly needing to request an oversized buffer during buffer creation, so I think we run into the same problem here unless I'm misunderstanding.

There could also be some subtle interactions with mapped subregions. I can't remember the resolution on overlapping mapped subregions in WebGPU but there could be a subtle interaction there if we try to guarantee larger alignments vs. the lengths of the subregions (currently 4?).

We might also want to consider how this works for buffers with unaligned lengths (e.g., too small or just not aligned). In places where we don't require temporary allocations for memory mapping, this might mean we need oversized buffers to guarantee that a later aligned map won't fail.

@nical
Contributor Author

nical commented Feb 21, 2023

but I'm not sure how tricky it might be to guarantee extra alignment.

Yes, that's what I would like to figure out. I suspect that we can guarantee comfortable alignment almost everywhere without much implementation effort. At least it would be useful to understand and document what alignment one can rely on (per backend if need be).

I'm interested in the details if you have any in the case of wasm. On the web (and I suspect everywhere else), the typed array the runtime provides for mapAsync is separate from the wasm heap, and the generated Rust bindings have to make copies between the wasm heap and the real WebGPU typed array. In this context, if an alignment is to be guaranteed, it would come from where the bindings choose to allocate this copy; maybe there's something we can do there.

@nical
Contributor Author

nical commented Feb 21, 2023

We can also survey the situation for sub-regions. I expect that there are two situations:

  • either the sub-region has the same guarantees as the whole buffer (for example GL and D3D11),
  • or knowing the alignment at offset zero tells you how the beginning of the range is aligned: it starts offset % base_alignment bytes past an aligned address.

Documenting a guaranteed alignment derived from offset % base_alignment would cover both and be useful enough for users (for example the SIMD use case); a sketch of what that lets a user compute follows.
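```rust
// The alignment a user could rely on for a mapped sub-region, given a
// guaranteed power-of-two base alignment at offset 0. Illustrative only,
// not a proposed API.
fn subregion_alignment(base_alignment: u64, offset: u64) -> u64 {
    if offset == 0 {
        base_alignment
    } else {
        // The range starts `offset` bytes past a base-aligned address, so it
        // is aligned to the smaller of the base alignment and the largest
        // power of two dividing the offset.
        base_alignment.min(1 << offset.trailing_zeros())
    }
}

// e.g. with a 64-byte base guarantee, a range mapped at offset 40 is only
// guaranteed 8-byte alignment: subregion_alignment(64, 40) == 8
```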

@teoxoy teoxoy added the type: enhancement New feature or request label Feb 21, 2023
@grovesNL
Collaborator

What would you think about starting with documenting the conservative minimum alignment of 4 bytes? This is the minimum required for the mapping size in WebGPU anyway, so anything lower would be an implementation bug.

We could also provide better documentation that recommends creating oversized buffers if alignment with existing types is a concern, or maybe even some kind of helper for unaligned slices that slices the byte slice at the right length (e.g., working nicely with bytemuck/crevice/encase); a sizing sketch follows.
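For the oversized-buffer route, the sizing could look something like this (illustrative helper; `COPY_BUFFER_ALIGNMENT` is a real wgpu constant, the rest is a sketch):

```rust
// Over-allocate by the alignment you intend to rely on, so an aligned and
// correctly sized view of T can always be carved out of the mapping no
// matter where the mapped pointer lands; then round up, since WebGPU wants
// mapping sizes to be multiples of 4.
fn oversized_buffer_size<T>(count: u64, desired_align: u64) -> u64 {
    let bytes = count * std::mem::size_of::<T>() as u64 + desired_align;
    let a = wgpu::COPY_BUFFER_ALIGNMENT; // 4
    (bytes + a - 1) & !(a - 1)
}
```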

Later we could reconsider raising the minimum guaranteed alignment if it still turns out to be a significant pain point. At least anecdotally I haven't heard of many people running into this, so I wonder if it's worth extra implementation complexity in each backend (at least at this point).

The generated Rust bindings have to make copies between the wasm heap and the real WebGPU typed array. In this context, if an alignment is to be guaranteed, it would come from where the bindings choose to allocate this copy; maybe there's something we can do there.

We technically already have an extra temporary copy right now so we could pad that, but I think we'll be able to improve this eventually and write into the Uint8Array directly (i.e., not writing into the wasm heap).

@nical
Contributor Author

nical commented Feb 21, 2023

What would you think about starting with documenting the conservative minimum alignment of 4 bytes? This is the minimum required for the mapping size in WebGPU anyway, so anything lower would be an implementation bug.

Well, that's not what I'm after, but I'm all for someone adding this to the doc while the bigger picture is being figured out.

Later we could reconsider raising the minimum guaranteed alignment if it still turns out to be a significant pain point.

There is no need to stall. For map_async, I want (1) to gather information about the guarantees we get for free with each backend, (2) to document what can safely be relied on and where, and (3) to see if there are improvements that can be made without cost or heroics.
There should be no controversial decision to make for (1) and (2), and we'll see about (3) when we have a better lay of the land.
For write_buffer_with, adding some alignment should be pretty easy regardless of the backend. Since there is some debate, maybe we can focus on the map_async clarification here.

At least anecdotally I haven't heard of many people running into this, so I wonder if it's worth extra implementation complexity in each backend (at least at this point).

While researching this, I've seen the SIMD use case pop up in various places; it was also what prompted this discussion.
We can argue about implementation complexity when there is complexity to argue about; I expect that the guarantees are already pretty strong almost everywhere (except wasm, maybe).

Most likely this isn't a very common pain point because people unknowingly rely on alignments that they happen to get in practice without realizing they need them.

@grovesNL
Collaborator

Alright, that sounds good. I do think we should try to upstream any guarantees into WebGPU and webgpu-headers if we do end up raising the alignment guarantee, but I agree that it makes sense to investigate it here first.

Most likely this isn't a very common pain point because people unknowingly rely on alignments that they happen to get in practice without realizing they need them.

Definitely possible, it would probably be caused by SIMD types like you mentioned.

@Elabajaba
Contributor

when a user maps a buffer, it can be tempting to cast the byte slice into a slice of whatever it is they are filling the buffer with and then copy into that typed slice

Isn't this exactly the type of subtle UB that presser was made to protect people from?

@nical
Contributor Author

nical commented Mar 18, 2023

Isn't this exactly the type of subtle UB that presser was made to protect people from?

Presser protects you from more than that (padding within the structures, for example), but that's beside the point. I'm convinced that wgpu already mostly provides comfortable alignment guarantees for mapped buffers, and that bridging the gaps will require neither heroics nor cost, so there is no reason for wgpu not to provide the comfort of, say, 16-byte alignment and remove a source of mistakes in the process.

I'm unfortunately unable to spend much time in front of a computer for a few weeks.

Where I'm at regarding this: I'm 99.9% certain that the alignment we get in d3d12 and metal is basically the alignment of the GPU buffer itself. I'd like to find spec wording that explicitly confirms that. At the very least, D3D11 at some point explicitly guaranteed 16-byte alignment, not for the needs of the driver but for user convenience, so it is pretty safe to assume they didn't take that away.

That would make it trivial to guarantee 16 bytes on all native backends (assuming ARB_map_buffer_alignment in GL). I haven't had time to look closely at the case of the web.

@Wumpf
Member

Wumpf commented Mar 27, 2023

I had a workaround in place so far to check alignment after mapping, then add padding if necessary and use the buffer with an offset. I noticed today that this can cause spurious failures: I got a 2 (two!!) byte aligned buffer back on WebGL, applied a padding of 2, and then of course got an error when doing a copy_buffer_to_texture operation, which, depending on the texture format, requires the offset to be aligned to somewhere between 4 and 16.

This is a very unfortunate tying of the "gpu offset" to the "cpu offset". I also think this illustrates that any alignment guarantee other than 16 is a bit nonsensical (a buffer offset that is required to be aligned to 16 makes no sense when the underlying data pointer isn't aligned to begin with).
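Roughly what the workaround looked like (an illustrative sketch, not the actual code):

```rust
// Probe the mapped pointer and skip ahead to the next 16-byte boundary.
let view = buffer.slice(..).get_mapped_range_mut();
let misalignment = (view.as_ptr() as usize) % 16;
let padding = (16 - misalignment) % 16;
// Typed writes then start at `padding`; but `padding` also becomes the
// *buffer* offset of later copy commands, where a value like 2 or 14
// violates COPY_BUFFER_ALIGNMENT and texel-block-size requirements.
// Hence the spurious failure described above.
```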

Wumpf added a commit to rerun-io/rerun that referenced this issue Mar 27, 2023
Turns out our eagerness to acquire aligned pointers for fast copy operations backfired and got us into an impossible situation: by offsetting staging buffers to ensure cpu pointer alignment, we sometimes choose offsets that aren't allowed for copy operations. E.g. we get back a buffer that has a pointer alignment of 2 (that happens -.-), so we offset the pointer by 14 (our min alignment is 16!). We can now copy data into the buffer quickly and safely. But when scheduling e.g. `copy_buffer_to_texture` we get a wgpu crash! Wgpu requires the offset (we put 14) to be:
* a multiple of wgpu::COPY_BUFFER_ALIGNMENT
* a multiple of the texel block size
Neither of which is true now! You might be asking why wgpu gives out such oddly aligned buffers to begin with, and the answer is sadly that the WebGL impl has issues + that the spec doesn't guarantee anything, so this is strictly speaking valid (although most other backends will give out 16-byte aligned pointers). See gfx-rs/wgpu#3508

Long story short, I changed (and simplified) the way we go about alignment in `CpuWriteGpuReadBelt`. The CPU pointer no longer has *any* alignment guarantees, and offsets now fulfill the above requirements. This is _ok_ since we already wrapped all accesses to the cpu pointer and can do byte writes to them. The huge drawback of this is ofc that `copy_from_slice` now has to do the heavy lifting of checking for alignment and then doing the right instructions for everything that is worthwhile doing so (that is, the things `memcpy` does when it deals with raw byte pointers)

Testing: Confirmed fix with crashing repro on the Web, then ran `just py-run-all` for native, renderer samples locally and on web.
Have not checked if this has any practical perf impact. Luckily our interface makes this very much an "optimize later" problem (copy operations within `CpuWriteGpuReadBuffer` can be made more clever in the future if they need to be; unlikely to be necessary, to be fair)
emilk pushed a commit to rerun-io/rerun that referenced this issue Mar 27, 2023
@kainino0x

FYI I did a bit of investigation on this over in webgpu.h: webgpu-native/webgpu-headers#180 (comment)
I don't think there's anything new that you hadn't already discovered, except we confirmed that D3D12 provides 16-byte alignment.

@cwfitzgerald
Member

cwfitzgerald commented Aug 5, 2023

For wasm, we completely control the memory allocation - we can align it however we see fit. Both manual copy-to-pointer and copy-to-slice exist on Uint8Array, so we can do whatever we want. https://docs.rs/js-sys/latest/js_sys/struct.Uint8Array.html#method.copy_to
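E.g. (a sketch assuming js-sys; `copy_mapping_into_wasm` is an illustrative name):

```rust
use js_sys::Uint8Array;

// Because we perform the copy between the WebGPU-provided buffer and the
// wasm heap ourselves, the wasm-side allocation (and therefore its
// alignment) is entirely under our control.
fn copy_mapping_into_wasm(mapped: &Uint8Array) -> Vec<u8> {
    // A Vec<u8> only promises 1-byte alignment; a stronger guarantee would
    // mean allocating with an explicit std::alloc::Layout instead.
    let mut staging = vec![0u8; mapped.length() as usize];
    mapped.copy_to(&mut staging); // memcpy out of the JS typed array
    staging
}
```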

@kdashg

kdashg commented Aug 23, 2023

Myles@apple says that:

MTLBuffer.contents "is guaranteed to be at least aligned with the largest alignment that any MSL type requires.”
(Which ends up being 16B)
