-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize inplace collection of Vec #123878
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
optimize inplace collection of Vec This PR has the following changes: 1. Using `usize::unchecked_mul` in https://github.com/rust-lang/rust/blob/79424056b05eaa9563d16dfab9b9a0c8f033f220/library/alloc/src/vec/in_place_collect.rs#L262 as LLVM, does not know that the operation can't wrap, since that's the size of the original allocation. Given the following: ```rust pub struct Foo([usize; 3]); pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> { v.into_iter().map(|f| f.0).collect() } ``` <details> <summary>Before this commit:</summary> ```llvm define void `@unwrap_copy(ptr` noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 ; Unnecessary calculation %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24 %dst_cap.i.i = udiv i64 %_16.i.i, 24 store i64 %dst_cap.i.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8 ret void } ``` </details> <details> <summary>After:</summary> ```llvm define void `@unwrap_copy(ptr` noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14 ret void } ``` </details> Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`. 2. `Iterator::try_fold` gets called on the underlying Iterator in `SpecInPlaceCollect::collect_in_place` whenever it does not implement `TrustedRandomAccess`. For types that impl `Drop`, LLVM currently can't tell that the drop can never occur, when using the default `Iterator::try_fold` implementation. For example, given the following code from rust-lang#120493 ```rust #[repr(transparent)] struct WrappedClone { inner: String } #[no_mangle] pub fn unwrap_clone(list: Vec<WrappedClone>) -> Vec<String> { list.into_iter().map(|s| s.inner).collect() } ``` <details> <summary>The asm for the `unwrap_clone` method is currently:</summary> ```asm unwrap_clone: push rbp push r15 push r14 push r13 push r12 push rbx push rax mov rbx, rdi mov r12, qword ptr [rsi] mov rdi, qword ptr [rsi + 8] mov rax, qword ptr [rsi + 16] movabs rsi, -6148914691236517205 mov r14, r12 test rax, rax je .LBB0_10 lea rcx, [rax + 2*rax] lea r14, [r12 + 8*rcx] shl rax, 3 lea rax, [rax + 2*rax] xor ecx, ecx .LBB0_2: cmp qword ptr [r12 + rcx], 0 je .LBB0_4 add rcx, 24 cmp rax, rcx jne .LBB0_2 jmp .LBB0_10 .LBB0_4: lea rdx, [rax - 24] lea r14, [r12 + rcx] cmp rdx, rcx je .LBB0_10 mov qword ptr [rsp], rdi sub rax, rcx add rax, -24 mul rsi mov r15, rdx lea rbp, [r12 + rcx] add rbp, 32 shr r15, 4 mov r13, qword ptr [rip + __rust_dealloc@GOTPCREL] jmp .LBB0_6 .LBB0_8: add rbp, 24 dec r15 je .LBB0_9 .LBB0_6: mov rsi, qword ptr [rbp] test rsi, rsi je .LBB0_8 mov rdi, qword ptr [rbp - 8] mov edx, 1 call r13 jmp .LBB0_8 .LBB0_9: mov rdi, qword ptr [rsp] movabs rsi, -6148914691236517205 .LBB0_10: sub r14, r12 mov rax, r14 mul rsi shr rdx, 4 mov qword ptr [rbx], r12 mov qword ptr [rbx + 8], rdi mov qword ptr [rbx + 16], rdx mov rax, rbx add rsp, 8 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret ``` </details> <details> <summary>After this PR:</summary> ```asm unwrap_clone: mov rax, rdi movups xmm0, xmmword ptr [rsi] mov rcx, qword ptr [rsi + 16] movups xmmword ptr [rdi], xmm0 mov qword ptr [rdi + 16], rcx ret ``` </details> Fixes rust-lang#120493
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (1256640): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 677.972s -> 675.313s (-0.39%) |
The perf results are generally fine. The big improvements are from an incremental change that happens to contain a lot of Vec code, so that's not too surprising. |
Yeah, I'll take a look at the |
LLVM does not know that the multiplication never overflows, which causes it to generate unnecessary instructions. Use `usize::unchecked_mul`, so that it can fold the `dst_cap` calculation when `size_of::<I::SRC>() == size_of::<T>()`. Running: ``` rustc -C llvm-args=-x86-asm-syntax=intel -O src/lib.rs --emit asm` ``` ```rust pub struct Foo([usize; 3]); pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> { v.into_iter().map(|f| f.0).collect() } ``` Before this commit: ``` define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24 %dst_cap.i.i = udiv i64 %_16.i.i, 24 store i64 %dst_cap.i.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8 ret void } ``` After: ``` define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14 ret void } ``` Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`.
`Iterator::try_fold` gets called on the underlying Iterator in `SpecInPlaceCollect::collect_in_place` whenever it does not implement `TrustedRandomAccess`. For types that impl `Drop`, LLVM currently can't tell that the drop can never occur, when using the default `Iterator::try_fold` implementation. For example, the asm from the `unwrap_clone` method is currently: ``` unwrap_clone: push rbp push r15 push r14 push r13 push r12 push rbx push rax mov rbx, rdi mov r12, qword ptr [rsi] mov rdi, qword ptr [rsi + 8] mov rax, qword ptr [rsi + 16] movabs rsi, -6148914691236517205 mov r14, r12 test rax, rax je .LBB0_10 lea rcx, [rax + 2*rax] lea r14, [r12 + 8*rcx] shl rax, 3 lea rax, [rax + 2*rax] xor ecx, ecx .LBB0_2: cmp qword ptr [r12 + rcx], 0 je .LBB0_4 add rcx, 24 cmp rax, rcx jne .LBB0_2 jmp .LBB0_10 .LBB0_4: lea rdx, [rax - 24] lea r14, [r12 + rcx] cmp rdx, rcx je .LBB0_10 mov qword ptr [rsp], rdi sub rax, rcx add rax, -24 mul rsi mov r15, rdx lea rbp, [r12 + rcx] add rbp, 32 shr r15, 4 mov r13, qword ptr [rip + __rust_dealloc@GOTPCREL] jmp .LBB0_6 .LBB0_8: add rbp, 24 dec r15 je .LBB0_9 .LBB0_6: mov rsi, qword ptr [rbp] test rsi, rsi je .LBB0_8 mov rdi, qword ptr [rbp - 8] mov edx, 1 call r13 jmp .LBB0_8 .LBB0_9: mov rdi, qword ptr [rsp] movabs rsi, -6148914691236517205 .LBB0_10: sub r14, r12 mov rax, r14 mul rsi shr rdx, 4 mov qword ptr [rbx], r12 mov qword ptr [rbx + 8], rdi mov qword ptr [rbx + 16], rdx mov rax, rbx add rsp, 8 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret ``` After this PR: ``` unwrap_clone: mov rax, rdi movups xmm0, xmmword ptr [rsi] mov rcx, qword ptr [rsi + 16] movups xmmword ptr [rdi], xmm0 mov qword ptr [rdi + 16], rcx ret ``` Fixes rust-lang#120493
LLVM currently adds a redundant check for the returned option, in addition to the `self.ptr != self.end` check when using the default `Iterator::fold` method that calls `vec::IntoIter::next` in a loop.
Sorry about the long wait, I got caught up with some other things and hadn't worked on this for awhile. However, I rebased against eb1a5c9 and the binary size for That said, could I get another perf run? The regressions should be gone now, if my testing was correct. |
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
optimize inplace collection of Vec This PR has the following changes: 1. Using `usize::unchecked_mul` in https://github.com/rust-lang/rust/blob/79424056b05eaa9563d16dfab9b9a0c8f033f220/library/alloc/src/vec/in_place_collect.rs#L262 as LLVM, does not know that the operation can't wrap, since that's the size of the original allocation. Given the following: ```rust pub struct Foo([usize; 3]); pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> { v.into_iter().map(|f| f.0).collect() } ``` <details> <summary>Before this commit:</summary> ```llvm define void `@unwrap_copy(ptr` noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 ; Unnecessary calculation %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24 %dst_cap.i.i = udiv i64 %_16.i.i, 24 store i64 %dst_cap.i.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8 ret void } ``` </details> <details> <summary>After:</summary> ```llvm define void `@unwrap_copy(ptr` noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14 ret void } ``` </details> Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`. 2. `Iterator::try_fold` gets called on the underlying Iterator in `SpecInPlaceCollect::collect_in_place` whenever it does not implement `TrustedRandomAccess`. For types that impl `Drop`, LLVM currently can't tell that the drop can never occur, when using the default `Iterator::try_fold` implementation. For example, given the following code from rust-lang#120493 ```rust #[repr(transparent)] struct WrappedClone { inner: String } #[no_mangle] pub fn unwrap_clone(list: Vec<WrappedClone>) -> Vec<String> { list.into_iter().map(|s| s.inner).collect() } ``` <details> <summary>The asm for the `unwrap_clone` method is currently:</summary> ```asm unwrap_clone: push rbp push r15 push r14 push r13 push r12 push rbx push rax mov rbx, rdi mov r12, qword ptr [rsi] mov rdi, qword ptr [rsi + 8] mov rax, qword ptr [rsi + 16] movabs rsi, -6148914691236517205 mov r14, r12 test rax, rax je .LBB0_10 lea rcx, [rax + 2*rax] lea r14, [r12 + 8*rcx] shl rax, 3 lea rax, [rax + 2*rax] xor ecx, ecx .LBB0_2: cmp qword ptr [r12 + rcx], 0 je .LBB0_4 add rcx, 24 cmp rax, rcx jne .LBB0_2 jmp .LBB0_10 .LBB0_4: lea rdx, [rax - 24] lea r14, [r12 + rcx] cmp rdx, rcx je .LBB0_10 mov qword ptr [rsp], rdi sub rax, rcx add rax, -24 mul rsi mov r15, rdx lea rbp, [r12 + rcx] add rbp, 32 shr r15, 4 mov r13, qword ptr [rip + __rust_dealloc@GOTPCREL] jmp .LBB0_6 .LBB0_8: add rbp, 24 dec r15 je .LBB0_9 .LBB0_6: mov rsi, qword ptr [rbp] test rsi, rsi je .LBB0_8 mov rdi, qword ptr [rbp - 8] mov edx, 1 call r13 jmp .LBB0_8 .LBB0_9: mov rdi, qword ptr [rsp] movabs rsi, -6148914691236517205 .LBB0_10: sub r14, r12 mov rax, r14 mul rsi shr rdx, 4 mov qword ptr [rbx], r12 mov qword ptr [rbx + 8], rdi mov qword ptr [rbx + 16], rdx mov rax, rbx add rsp, 8 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret ``` </details> <details> <summary>After this PR:</summary> ```asm unwrap_clone: mov rax, rdi movups xmm0, xmmword ptr [rsi] mov rcx, qword ptr [rsi + 16] movups xmmword ptr [rdi], xmm0 mov qword ptr [rdi + 16], rcx ret ``` </details> Fixes rust-lang#120493
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (af838b4): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)Results (primary -4.8%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary -28.4%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary 0.0%, secondary -0.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 671.223s -> 671.46s (0.04%) |
@bors r+ rollup=never |
☀️ Test successful - checks-actions |
Finished benchmarking commit (12075f0): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDNext Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)Results (primary -4.2%, secondary -2.6%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary -21.9%, secondary 3.7%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary 0.1%, secondary -0.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 669.127s -> 670.375s (0.19%) |
…e8472 add some codegen tests for issue 120493 I forgot to add these in rust-lang#123878.
Rollup merge of rust-lang#125305 - jwong101:120493-codegen-test, r=the8472 add some codegen tests for issue 120493 I forgot to add these in rust-lang#123878.
Improvements vastly outweigh the regressions which seemed to have returned to baseline after this PR. I think it's safe to move on from this. @rustbot label: +perf-regression-triaged |
add some codegen tests for issue 120493 I forgot to add these in rust-lang/rust#123878.
This PR has the following changes:
usize::unchecked_mul
inrust/library/alloc/src/vec/in_place_collect.rs
Line 262 in 7942405
Given the following:
Before this commit:
After:
Note that there is still one more
mul,udiv
pair that I couldn't getrid of. The root cause is the same issue as #121239, the
nuw
getsstripped off of
ptr::sub_ptr
.Iterator::try_fold
gets called on the underlying Iterator inSpecInPlaceCollect::collect_in_place
whenever it does not implementTrustedRandomAccess
. For types that implDrop
, LLVM currently can'ttell that the drop can never occur, when using the default
Iterator::try_fold
implementation.For example, given the following code from #120493
The asm for the `unwrap_clone` method is currently:
After this PR:
Fixes #120493