-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] New implementation of the slice iterators using NonZero #46223
Conversation
I haven't run so many benchmarks yet, but the ones I do are unchanged. These ones below are of very simple loops. from ndarray "cargo bench --bench higher-order": name xx ns/iter yy ns/iter diff ns/iter diff %
fold_axis 681 676 -5 -0.73%
map_axis_0 451 452 1 0.22%
map_axis_1 452 453 1 0.22%
map_regular 172 169 -3 -1.74%
map_stride_double_f64 504 506 2 0.40%
map_stride_f64 652 666 14 2.15%
map_stride_u32 690 681 -9 -1.30% |
These differences seem small enough to just be noise; is the generated assembly actually identical? |
cc @rust-lang/libs |
@gankro Sure, I intended to present them as exactly that, noise. However I'll have to find better benchmarks. I removed the zip benchmarks -- those are entirely in a specialization and don't go through Iterator::next(). The PR posted here so that the full test suite is ran and for comments. |
Having looked through some benchmarks, nothing stood out. (Some examples a few pct better and apparently some a few pct worse, but it needs to be reduced to very small examples to tell us something.) I found this code gen example that autovectorizes on nightly but not using a compiler with this PR: #[no_mangle]
pub fn zip_and(a: &[i32], b: &[i32]) -> i32 {
let mut result = 0;
let mut iter_a = a.iter();
let mut iter_b = b.iter();
loop {
if let Some(&elt_a) = iter_a.next() {
if let Some(&elt_b) = iter_b.next() {
result |= elt_a | elt_b;
continue;
}
}
break;
}
result
} A "small" example of 450 lines of unoptimized IR without debug info. This PR removes the assume(!end.is_null()) or equivalent, and maybe that causes this regression. Otherwise I am open to suggestions. (Edit: I checked, it doesn't seem to help to use a |
Yeah, LLVM is inconsistent with its metadata, keeping the |
Reimplement the slice iterators using `NonZero` where possible. 1. Iter, IterMut now have a non-nullable field a. The usual enum layout optimizations apply b. Automatic and more principled nonnull metadata for that pointer 2. Big change for zero sized elements. The end field is a counter. The produced pointer value is equivalent to `slice.as_ptr()` and thus `&slice[i]` for each index `i` in this case. * Implementation Notes ** Regular case, T is not zero-sized (Nothing changes in this case, but this is the documentation in the code.) The iterator represents a contiguous range of elements. `ptr` is a pointer to the start of the range. `end` is a pointer to one past the end of the range. Initial values are ptr = slice.as_ptr() and end = ptr.offset(slice.len()) The iterator is empty when ptr == end. The iterator is stepped at the front by incrementing ptr, the yielded element is the old value of ptr ("post increment" semantic). The iterator is stepped at the back by decrementing end, the yielded element is the new value of end ("pre decrement" semantic). ** When T is a Zero-sized type (ZST) `ptr` is a pointer to the start of the range. `end` is used as an integer: the count of elements in the range. Initial values are ptr = slice.as_ptr() and end = self.len() as *const T The iterator is empty when end == 0 The iterator is stepped at the front by decrementing end. The iterator is stepped at the back by decrementing end. For both front and back, the yielded element is `ptr`, which does not change during the iteration. ** NonZero `ptr` is derived from `slice.as_ptr()` and just like it is is non-null. `end` is allowed to be null (used by the ZST implementation); this is required since for iterating a [(); usize::MAX] all values of usize (and equivalently the raw pointer) are used, including zero.
7aa749b
to
48c2497
Compare
(Rebased & updated with some cleanups, no major changes). I've officially given up trying to fix the regression in #46223 (comment) I prototyped this out of tree, which was a bit quicker, and that "regression" does not care about assumes or other changes I thought were natural. The change to That's the macro that changes from: macro_rules! make_ref {
($ptr:expr) => {{
let ptr = $ptr;
if size_from_ptr(ptr) == 0 {
// Use a non-null pointer value
&*(1 as *mut _)
} else {
&*ptr
}
}};
} to this: macro_rules! make_ref {
($ptr:expr) => { &*$ptr };
} Even funnier, the macro can even be this (which is nonsense but correct in the new impl) and the vectorization does not regress: macro_rules! make_ref {
($ptr:expr) => {{
let ptr = $ptr;
if false && size_from_ptr(ptr) == 0 {
// Use a non-null pointer value
&*(1 as *mut _)
} else {
&*ptr
}
}};
} but we must refuse to insert nonsense of that degree; then I think there's no principle but some strange accident that is causing the original #46223 (comment) code to compile well. |
@bluss Oh, have you tried having a |
@eddyb oh I think so, I tried a lot of things. I have resolved one bounds check regression in the nom testsuite (related to the unrolling / try_fold code ping @arthurprs ; unregressing it involved just tweaking how the ptrdistance was computed in that loop. This WIP PR is not entirely up to date since I'm still trying to fix regressions locally.) |
You mean this #45501? That's an awesome side effect 😃 |
Oh just maybe, it was actually just regress (due to this PR) and unregress (due to WIP on this PR). Just thinking of it since it's the same issue. |
Hi @bluss, is there anything supposed to happen for this PR to proceed? |
@kennytm I've been trying to work on a no-performance-regressions version with no success. I'll have to come back to this with a different goal, probably demonstrating some code savings in some places and code gen regressions in other places. I was using nom benchmarks for the no-performance-regressions goal. cc @Geal |
Visiting for triage: just to be clear, it sounds like you're going to continue work in this PR rather than create a new one when ready? |
Yes. I just haven't had time, sorry. Someone else can also pick it up. |
Reimplement the slice iterators using
NonZero
where possible.a. The usual enum layout optimizations apply
b. Automatic and more principled nonnull metadata for that pointer
The produced pointer value is equivalent to
slice.as_ptr()
andthus
&slice[i]
for each indexi
in this case.Implementation Notes
Regular case, T is not zero-sized
(Nothing changes in this case, but this is the documentation in the code.)
The iterator represents a contiguous range of elements.
ptr
is a pointer to the start of the range.end
is a pointer to one past the end of the range.Initial values are ptr = slice.as_ptr() and end = ptr.offset(slice.len())
The iterator is empty when ptr == end.
The iterator is stepped at the front by incrementing ptr, the
yielded element is the old value of ptr ("post increment" semantic).
The iterator is stepped at the back by decrementing end, the
yielded element is the new value of end ("pre decrement" semantic).
When T is a Zero-sized type (ZST)
ptr
is a pointer to the start of the range.end
is used as an integer: the count of elements in the range.Initial values are ptr = slice.as_ptr() and end = self.len() as *const T
The iterator is empty when end == 0
The iterator is stepped at the front by decrementing end.
The iterator is stepped at the back by decrementing end.
For both front and back, the yielded element is
ptr
, which does not changeduring the iteration.
NonZero
ptr
is derived fromslice.as_ptr()
and just like it is is non-null.end
is allowed to be null (used by the ZST implementation); this isrequired since for iterating a [(); usize::MAX] all values of usize (and
equivalently the raw pointer) are used, including zero.