-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the array::map
codegen
#107634
Improve the array::map
codegen
#107634
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
/// | ||
/// It currently requires `TrustedLen` because it's unclear whether it's | ||
/// reasonably possible to depend on the `size_hint` of anything else. | ||
pub(crate) trait UncheckedIterator: TrustedLen { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like another take on TrustedRandomAccess, except that the source can drop its elements. I tried something similar in #93243 except by augmenting TRA directly, but it lead to compile time regressions due to the added traits and methods.
Ideally we would only have to maintain one unsafe iteration trait. But this is a lot simpler and sometimes the single-induction-variable that TRA provides isn't necessary, so maybe it's not so bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what's the best resource to read up on TrustedRandomAccessNoCoerce
? TBH I've always ignored it 😬 Is https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/iter/trait.TrustedRandomAccess.html#safety the go-to place?
I could try changing try_from_trusted_iterator
to essentially try_from_fn(Iterator::__iterator_get_unchecked)
, rather needing next_unchecked
at all...
(But next_unchecked
has a much simpler safety condition than TRANC
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/iter/trait.TrustedRandomAccess.html#safety the go-to place?
Yes. But nothing actionable for this PR. The trait as it is currently isn't fully applicable for array::IntoIter
anyway because it is an owning iterator which needs to drop items when it's not exhausted. So it can only be implemented for T: Copy
(as we do for vec::IntoIter
) as approximation for needs_drop() == false
. To generalize it further would need some changes, as tried in the mentioned PR.
I was just pointing out that we now have two ways for unchecked access. TrustedLen is slightly more general since it can also apply to Chain
or Flatten
because it supports len > usize::MAX. TRA on the other hand is a bit more powerful when it applies because it allows external iteration with a single induction variable (this really helps Zip
) and doesn't need to update internal fields on which the next loop iteration would depend.
(But next_unchecked has a much simpler safety condition than TRANC.)
Agreed, there's some simplicity to it. I have tried using next().unwrap_unchecked()
before in conjunction with TrustedLen
while it brought some improvements it couldn't match TRA.
I hope that maybe UncheckedIterator
could get us closer (if we turn it into a specialization trait, which I thought it was 😅), perhaps even to parity. If we could do away with the TRA machinery that would be awesome because it's chernobyl-levels of unsafe.
But it doesn't cover all the same cases then we'd be left with two unsafe iteration traits, at least one which we might never be able to stabilize and that seems kind of ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy to play with it in a future PR :)
What's the best way to make it a specialization trait? Make the method __iterator_next_unchecked
instead, like __iterator_get_unchecked
, so it's on Iterator
instead and just bounded by the specialization trait TrustedLen
? (TBH I'm not sure why the unsafe specialization trait keeps things from having methods, but presumably there's a good reason.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specialization traits can have methods (TrustedRandomAccessNoCoerce
has size()
). __iterator_next_unchecked
was moved to the iterator trait during the specialization -> min_specialization refactor for reasons I have forgotten (edit, reason was #73565 (comment)).
Slapping #[rustc_specialization_trait]
on it might be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see -- #[rustc_specialization_trait]
can have methods, but TrustedLen
is #[rustc_unsafe_specialization_marker]
, which can't. That's the distinction I'd missed. Thanks!
Since this doesn't impact @bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit 2e20acd169c4fd9c1ee69c089b9c7439ef48a02f with merge 4d8ea29940107409a02030c3f43a0fd2bb7c567c... |
Ah, I wasn't thinking that this would have perf impacts. I was more thinking about whether it was a good idea to add more specialization traits for unchecked iteration. And the compiler isn't array-iteration-heavy so it it should affect things much. |
Did you run the std benches? There are a few that cover array mapping and collecting arrays for next_chunk. |
I'm not sure whether this is the direction you meant, but I'll note that this PR doesn't actually do any specialization at all. The methods are internal, so just only accept (I first tried to add |
Ah, I saw |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (4d8ea29940107409a02030c3f43a0fd2bb7c567c): comparison URL. Overall result: ❌✅ regressions and improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|
Benches are essentially unchanged: Before:
After
(I couldn't find any for array |
move |_| { | ||
// SAFETY: We know that `from_fn` will call this at most N times, | ||
// and we checked to ensure that we have at least that many items. | ||
unsafe { iter.next_unchecked() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To your point about unwrap_uncheck
not working as well as you'd hoped, @the8472, that's what I saw here too that led me to the new trait. I've lost the intermediate version I'd originally tried, but was able to get back to something similar by swapping this iter.next_unchecked
for iter.next().unwrap_unchecked()
.
That ends up leaving this in the IR for the long_integer_map
test:
"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1": ; preds = %bb5.i.i.i.i.i.i.i
%generator.sroa.4.07.i.i.i.i.ptr.i.i.i.1 = getelementptr inbounds i32, ptr %array1.i.i.i, i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i
%generator.sroa.4.07.i.i.i.i.add.i.i.i.1 = add nsw i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i, 1
%tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1 = load i32, ptr %generator.sroa.4.07.i.i.i.i.ptr.i.i.i.1, align 4, !noalias !31
%7 = insertvalue { i32, i32 } { i32 1, i32 undef }, i32 %tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1, 1
%_3.i.i.i.i.i.i.i.i.i.i.i.i.i.1 = mul i32 %tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1, 13
%8 = add i32 %_3.i.i.i.i.i.i.i.i.i.i.i.i.i.1, 7
br label %bb5.i.i.i.i.i.i.i.1
bb5.i.i.i.i.i.i.i.1: ; preds = %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1", %bb5.i.i.i.i.i.i.i
%generator.sroa.4.1.i.i.i.i.idx.i.i.i.1 = phi i64 [ 64, %bb5.i.i.i.i.i.i.i ], [ %generator.sroa.4.07.i.i.i.i.add.i.i.i.1, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
%9 = phi { i32, i32 } [ { i32 0, i32 undef }, %bb5.i.i.i.i.i.i.i ], [ %7, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
%10 = phi i32 [ undef, %bb5.i.i.i.i.i.i.i ], [ %8, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
%_3.0.i.i.i.i.i.i.i.i.1 = extractvalue { i32, i32 } %9, 0
%switch.i.i.i.i.i.i.i.i.i.1 = icmp ne i32 %_3.0.i.i.i.i.i.i.i.i.1, 0
tail call void @llvm.assume(i1 %switch.i.i.i.i.i.i.i.i.i.1)
%11 = getelementptr inbounds i32, ptr %array.i.i.i.i.i.i, i64 %6
store i32 %10, ptr %11, align 4, !alias.scope !28, !noalias !38
%12 = or i64 %guard.sroa.7.08.i.i.i.i.i.i.i, 2
%_10.i.i.i.i.i.i.i.i.i.i.i.2 = icmp eq i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i.1, 64
br i1 %_10.i.i.i.i.i.i.i.i.i.i.i.2, label %bb5.i.i.i.i.i.i.i.2, label %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.2"
Those variable names are a horrible mess, but from the instructions you can see that the Option
s just don't optimize away: it does insertvalue { i32, i32 }
to make an option, gets phi
'd to the next block, then there's the extractvalue { i32, i32 }
to pull things out again, and assume
s the discriminant.
Arguably the next_unchecked
shouldn't be doing something different, but somehow it makes a huge difference.
There's way too much code here to file a bug, but I'll see whether I can at least find something simpler that I can file on the rust side to see if an expert can tell why LLVM can't eliminate stuff here.
EDIT: Oh, it was easier to find a simple example than I thought, #107681
{ | ||
let mut map = iter.map(NeverShortCircuit); | ||
try_collect_into_array(&mut map).map(|NeverShortCircuit(arr)| arr) | ||
pub(crate) fn iter_next_chunk<T, const N: usize>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about what motivated your reasoning to change the type signature here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emilHof It's essentially the same; I didn't have to change it.
This is just a bit more consistent with how I did the other things in this file. (Like how try_from_fn_erased
takes impl FnMut
, not F
… where F: FnMut
.) I tend to think of APIT like this as "generic elision" -- for things like iterators and closures, I tend not to actually care about the specific type, so having a used-once generic for it where I have to read the bounds to follow the signature is kinda meh.
I like thinking of it as "this function takes a mutable reference to an iterator of T
s, and returns an array of T
s". Like how you might write a function taking Iterator<T>
in Java, or an IEnumerable<T>
in C#, just without the dyn
indirection that that results in those languages. Thinking of it as "This takes an I
, and returns an array of the item type of I
, and BTW I
needs to be an iterator" is just less direct in my brain. The item type is the important one, not the iterator's type, for me.
TL/DR: Entirely preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scottmcm That is a really explanation!
Your reasoning definitely makes sense. In a way, the T
is maybe more relevant to the purpose of these methods, so having the T
and not some "elusive" I
in the signature definitely seems very sensible! At least, this is how I am interpreting what you are saying.
Thank you for the great response! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's a good phrasing of it!
This comment was marked as resolved.
This comment was marked as resolved.
@thomcc There's been lots of conversation here, so in case it wasn't clear I think this is still ready to go as-is. (I'll experiment with the various other future possibilities later, since I don't want to mix those in with this improvement.) |
☀️ Test successful - checks-actions |
Finished benchmarking commit (2d91939): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDNext Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesThis benchmark run did not return any relevant results for this metric. |
Perf is a wash - going to mark as triaged. @rustbot label: perf-regression-triaged |
The
map
method on arrays is documented as sometimes performing poorly, and after a question on URLO prompted me to take another look at the coretry_collect_into_array
function, I had some ideas that ended up working better than I'd expected.There's three main ideas in here, split over three commits:
array::IntoIter
when we can avoid it, since that seems to not get SRoA'd, meaning that every step writes things like loop counters into the stack unnecessarilyResult
s unnecessarily, as that doesn't seem to optimize away even withunwrap_unchecked
(perhaps because it needs to get moved into a new LLVM type to account for the discriminant)Option
dances when we know for sure we have enough items (like inmap
andzip
). This one's a larger commit as to do it I ended up adding a newpub(crate)
trait, but hopefully those changes are still straight-forward.(No libs-api changes; everything should be completely implementation-detail-internal.)
It's still not completely fixed -- I think it needs pcwalton's
memcpy
optimizations still (#103830) to get further -- but this seems to go much better than before. And the remainingmemcpy
s are justtransmute
-equivalent ([T; N] -> ManuallyDrop<[T; N]>
and[MaybeUninit<T>; N] -> [T; N]
), so hopefully those will be easier to remove with LLVM16 than the previous subobject copies 🤞r? @thomcc
As a simple example, this test
On nightly https://rust.godbolt.org/z/xK7548TGj takes
sub rsp, 808
(and yes, that's a 65-element array
alloca
despite 64-element input and output)But with this PR it's only
sub rsp, 520
Similarly, the loop it emits on nightly is scalar-only and horrifying
whereas with this PR it's unrolled and vectorized
(though sadly still stack-to-stack)