DLPack requires 256B alignment of data pointers #779
Replies: 20 comments
-
I have a vague memory of this alignment requirement being pointed out before and the answer being that it's unnecessary and can be ignored. Some digging turned up numpy/numpy#19013 (comment), where you (@leofang) said: "please ignore the alignment stuff. If we are talking about host-device copies, it's an implementation detail (of CUDA, HIP, or any sane memory pool implementation sitting on top). CUDA/HIP's copy APIs do not care about alignments. DLPack does not. CUDA Array Interface does not. No one does." |
Beta Was this translation helpful? Give feedback.
-
Thanks, Ralf, yes now I recall, I guess it's a rediscovery then 😅 Maybe I just need to add a note to this page for better visibility? |
Beta Was this translation helpful? Give feedback.
-
This is certainly something that worth discussion. To give background of the original rationale in DLPack. Such requirement is not necessary for host/device copies of any kind as we rightfully point out that copy APIs does not care about alignments. The alignment spec comes in handy when we build kernels that leverages those data structure, as vectorization usually have higher amount of alignment requirements. Of course there is a need to support sliced array, which do not have the proper alignment. Similar problems will happen in the context of OpenCL, where the data ptr is opaque(you cannot add an offset directly to that field). The |
Beta Was this translation helpful? Give feedback.
-
Naive question - is there any advantage of enforcing 256 byte alignment plus always checking And if there is such an advantage, I guess you're proposing to start checking it in each implementation? This may require a slow path, with emitting warnings for at least a year before starting to raise an exception. |
Beta Was this translation helpful? Give feedback.
-
Larger alignment generally have an advantage of making vectorization simple(in the presence of avx2 or 512) in the case of CPU. Similarly in the case of GPU can inform a more aligned loads in float4. The particular number was picked mainly because it was CUDA's Malloc default alignment. I agree start with a warning would be useful, given most do not start with such an assumption. |
Beta Was this translation helpful? Give feedback.
-
Yes, so why even discuss enforcing any alignment beyond the itemsize? Slicing is a central feature, so almost all libraries have to anticipate data that is at most itemsize-aligned. I don't understand how you can possibly do any more than recommend allocating data with a larger alignment (e.g. because certain compute libs may not accept the data otherwise – which is fair). Calculating alignment, and rejecting data that is not hard! You could also ask the exporter to provide the information (e.g. as a power-of-two alignment – many exporters likely already store it in some form; I would like to for NumPy) or you expect the importer to calculate it. |
Beta Was this translation helpful? Give feedback.
-
Thanks @seberg. In the specific case of DLPack, there is also a need of supporting opaque data ptr (in the case of vulkan and opencl), in those cases the Just trying to bring up context here and background for both cases, not necessarily arguing one versus another |
Beta Was this translation helpful? Give feedback.
-
The use of How is the current header comment/standard meant? Are we supposed to use |
Beta Was this translation helpful? Give feedback.
-
Based on the plaintext interpretation of the header comment, if the data is not aligned, a usage of In reality, i think a lot of the current exchange impls ignores that constraint, so it becomes an open question whether or not we encourage such a convention, or allow people to pass in their own alignment preference. |
Beta Was this translation helpful? Give feedback.
-
Yes this would be really helpful. Knowing the base allocation pointer, offset, and alignment often allows for much more efficient code. |
Beta Was this translation helpful? Give feedback.
-
@kkraus14 OK, that makes sense to me. I am still not sure if you would want to expect the importer to calculate both base and non-base pointer alignment? Or should DLPack grow a field to indicate alignment (e.g. for opaque pointers). If it is padding and alignment, would that mean that the exporter must indicate it anyway (i.e. for padding)? I suppose alignment could be device specific, maybe some devices Would you suggest changing the dlpack header to explicitly say:
|
Beta Was this translation helpful? Give feedback.
-
I would assume that allocations are aligned, which means the base pointer is aligned, but the non-base pointer is not necessarily aligned and is determined by the offset.
I would love to specify both padding and alignment, but beggars can't be choosers 😄.
Memory pools are super common that don't necessarily have to follow the alignment / padding guarantees of the base
I don't know if we want to go to the level of device alignment given the number of memory pools out there. Requiring the alignment to be a multiple of the itemsize makes sense to me though.
Yes
Yes, adding both would be ideal and would allow more optimized code |
Beta Was this translation helpful? Give feedback.
-
Ah, so |
Beta Was this translation helpful? Give feedback.
-
Ideally, yes. I imagine the read side is more important, but generally being able to leverage vector loads / stores is typically important for optimizations.
Maybe there's an upper limit to where alignment doesn't matter relevant to itemsize anymore? I.E. anything above 64 bits or 128 bits only needs 64 bit or 128 bit alignment? |
Beta Was this translation helpful? Give feedback.
-
At this point, all I care about is to have a definite answer which exports NumPy is supposed to reject due to insufficient alignment (or readonly data for that matter). And that this answer is added to pytorch has this:
And it supports slicing, right? So, torch can't possibly have export guarantees beyond the Now, I don't know what their allocation strategy is, so I don't know whether they even guarantee itemsize-alignment for
which guarantees 8-bytes while complex-double has 16. So even itemsize-aligned is probably broken (albeit only for complex doubles). As for their As far as I am aware there is no versioning for To be painfully clear: I very much still think that DLPack in its current form needs both extension and better extensibility to be the good implicit protocol that data-apis should in my opinion aim for. With that: Sorry, I am going to try and leave you all in peace now... If you want to consider extending DLPack liberally and with versioning, I will be interested (but also happy to stay away to leave you in peace – seriously, just tell me). Until then, I will simply try to just stop caring it. |
Beta Was this translation helpful? Give feedback.
-
Actually this is not my first time hearing the very same comment. I do believe we should get this and other DLPack issues (tracked in the meta-issue dmlc/dlpack#74) addressed for the v2021 milestone. |
Beta Was this translation helpful? Give feedback.
-
PyTorch doesn't support 32-bit platforms AFAIK. It's not impossible that there's a custom armv7 mobile build or some such thing floating around, but certainly there's no wheels, conda packages or CI for 32-bit platforms. EDIT: confirmed, trying to build PyTorch for a 32-bit platform will raise an exception immediately. Only
This may be true for other libraries as well. It seems clear from the above this discussion that there's a desire to have alignment better specified and in the future enforced. It looks to me like this will need gradual evolution:
|
Beta Was this translation helpful? Give feedback.
-
It looks like the answer changed here (or at least my reading of it): (xref dmlc/dlpack#83 (comment)). Let me try to summarize:
These are the options:
The current status is that the fine print in So here's a new proposal:
|
Beta Was this translation helpful? Give feedback.
-
If we're making longer term changes surrounding optional alignment in A3, it would be nice to have optional padding as well so that we know how many bytes past |
Beta Was this translation helpful? Give feedback.
-
@leofang Does it make sense to keep this issue open? I'm not sure the current status or whether things/perspectives have changed since 2021. |
Beta Was this translation helpful? Give feedback.
-
Recently we found that DLPack has this requirement noted in the header:
https://github.com/dmlc/dlpack/blob/a02bce20e2dfa0044b9b2ef438e8eaa7b0f95e96/include/dlpack/dlpack.h#L139-L141. Would this be an issue for all adopting libraries? As far as I know, CuPy doesn't do any alignment check and take the pointer (
DLTensor.data
) as is, and IIRC quite a few other libraries are also doing this.cc: @seberg @rgommers @tqchen @kmaehashi @oleksandr-pavlyk @jakirkham @edloper @szha
Beta Was this translation helpful? Give feedback.
All reactions