`align_offset` seems to generate worse code than is desirable (unless `size_of::<T>() == 1`) #98809
Saw, got interested. Some drive-by notes:
fn align_offset_aligned<T>(ptr: &[T; 0], align: usize) -> usize
where
    T: Sized,
{
    if !align.is_power_of_two() {
        panic!("align_offset: align is not a power-of-two");
    }
    let byte_align_offset = ptr.as_ptr().cast::<u8>().align_offset(align);
    byte_align_offset / core::mem::align_of::<T>()
}
This optimizes perfectly, even at -O1, with no changes to (EDIT: on mobile and can't edit on godbolt, but: the linked code implements I figured this out while trying and failing to add a fast path for Even inside
Thanks for the report. I think I have an idea how this can be improved, but I need to think about it a little further to see if it really is correct in all instances, and to evaluate whether the tradeoffs in code quality are going to be positive enough to warrant the change I'm thinking of. @rustbot claim
…ark-Simulacrum
Add a special case for align_offset /w stride != 1

This generalizes the previous `stride == 1` special case to apply to any situation where the requested alignment is divisible by the stride. This in turn allows the test case from rust-lang#98809 to produce ideal assembly, along the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the alignment of the input pointer isn't known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }
    // =>
    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with a branch-less bitwise-OR approach, where the mask is constructed using the neg and sbb instructions. This appears to work across the various architectures I've tried.

This change ends up introducing more branches and code in situations where there is less knowledge of the arguments, for example when the requested alignment is entirely unknown. This use-case was never really a focus of this function, so I'm not particularly worried, especially since llvm-mca is saying that the new code is still appreciably faster, despite all the new branching.

Fixes rust-lang#98809. Sadly, this does not help with rust-lang#72356.
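For readers following along, the arithmetic that commit message describes can be sketched on plain integer addresses. This is only an illustration under stated assumptions, not the actual library code: `align_offset_sketch` is a hypothetical name, it takes the address and stride as `usize` values, and it only handles the case where `align` is a power of two divisible by `stride`.

```rust
/// Hypothetical sketch (not the real `core` implementation): element offset
/// needed to bring `addr` up to `align`, assuming `align` is a power of two
/// and `align % stride == 0`.
pub fn align_offset_sketch(addr: usize, stride: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    assert!(stride != 0 && align % stride == 0);

    // If the address is not already a multiple of the stride, stepping by
    // whole elements can never land on a multiple of `align`.
    if addr % stride != 0 {
        return usize::MAX;
    }

    // Round up to the next multiple of `align` (the lea + and pair above),
    // then convert the byte distance into elements (the sub + shr pair).
    let aligned = addr.wrapping_add(align - 1) & !(align - 1);
    aligned.wrapping_sub(addr) / stride
}
```

With `stride = 4` and `align = 16` this mirrors the `ptr_u32` listing: the `addr % stride` check corresponds to the `andl $3` test, which LLVM folds into the neg/sbb mask instead of a branch.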
There are many examples where `align_offset` produces worse code than it should (worse than you'd write by hand, at the very least, and in the case of slices, worse than I think it needs to). You can often work around this by going through a `*const u8` pointer, although this is surprising, and not always an option (for example, it is not an option for users of `align_to`).

Here's a minimal example extracted from some real code but massaged somewhat: https://rust.godbolt.org/z/aq8Yo8oYj. For posterity, the generated code here currently is:
(And `simd_align1` here is actually better than I was getting in my real code, where it ended up as a branch. That said, this is bad enough to demonstrate the problem.)

I guess this is because `align_offset` gets passed a `*const T` which itself may not be aligned to `T` (for example, `(1 as *const f32).align_offset(16)` can't be satisfied). That problem is avoided by going through u8 (as is done with `simd_align2`), since that should always succeed (for non-const, which isn't relevant here), and hits the `stride == 1` special case added to fix #75579.

However, in this case the compiler should know that the pointer is aligned to 4, since it comes from a slice (and the pointer does end up with `align(4)` in the LLVM). I think this means there's no actual reason that `simd_align1` must be less efficient than `simd_align2` (unless I'm missing something), and the issue is just that our `align_offset` impl doesn't have a fast path for this situation. That said, maybe I'm wrong, and the generated code is bad for some other reason.

Either way, it would be very nice for this to do better for the slice use case. As I mentioned, going through `*const u8` is not viable when `align_offset` is invoked from an API like `align_to` (I suspect if we can't fix it for `align_offset`, we probably could work around it in `align_to` more directly, but it would be better to do it in `align_offset` if possible. That said, if I'm wrong about why this is happening, all bets are certainly off).

See also #72356, which seems related (and my hunch is that it will be fixed if we fix this issue, at least in this case).
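To make the workaround concrete without relying on the godbolt link, here is a rough sketch of the two shapes being compared. The function names `offset_direct` and `offset_via_bytes` are made up for illustration and are not the `simd_align1` / `simd_align2` bodies from the link; the element type `f32` and the alignment of 16 are assumptions taken from the example above.

```rust
use core::mem::size_of;

/// Direct form: ask `align_offset` for an element count on a `*const f32`.
pub fn offset_direct(data: &[f32]) -> usize {
    data.as_ptr().align_offset(16)
}

/// Workaround: compute the offset in bytes through `*const u8`, then convert
/// to elements. The slice pointer is already 4-byte aligned, so the byte
/// offset to a 16-byte boundary is a multiple of 4 and the division is exact.
pub fn offset_via_bytes(data: &[f32]) -> usize {
    let byte_offset = data.as_ptr().cast::<u8>().align_offset(16);
    byte_offset / size_of::<f32>()
}
```

Both should return the same element count for any slice of `f32` at runtime; the difference discussed here is only in the code generated for each. Note that `align_offset` is documented as being allowed to return `usize::MAX`, so real callers still need to handle that value.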
CC @nagisa, since you asked.
P.S. Sorry that this is a bit of a rambling issue.