Simplify logical operations CFG #83663
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @estebank (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
I would like to benchmark these changes. This change is likely to fix the issue with failed SIMD vectorization (#83623).
Code:
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}
impl PartialEq for Blueprint {
    fn eq(&self, other: &Self) -> bool {
        (self.fuel_tank_size == other.fuel_tank_size)
            && (self.payload == other.payload)
            && (self.wheel_diameter == other.wheel_diameter)
            && (self.wheel_width == other.wheel_width)
            && (self.storage == other.storage)
    }
}
LLVM IR:
; ModuleID = 'eq_test.3a1fbbbh-cgu.0'
source_filename = "eq_test.3a1fbbbh-cgu.0"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc"
%Blueprint = type { [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32] }
; <eq_test::Blueprint as core::cmp::PartialEq>::eq
; Function Attrs: norecurse nounwind readonly uwtable willreturn
define zeroext i1 @"_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E"(%Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %self, %Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %other) unnamed_addr #0 {
start:
%0 = bitcast %Blueprint* %self to <4 x i32>*
%1 = load <4 x i32>, <4 x i32>* %0, align 4
%2 = bitcast %Blueprint* %other to <4 x i32>*
%3 = load <4 x i32>, <4 x i32>* %2, align 4
%4 = icmp eq <4 x i32> %1, %3
%5 = getelementptr inbounds %Blueprint, %Blueprint* %self, i64 0, i32 9
%_19 = load i32, i32* %5, align 4
%6 = getelementptr inbounds %Blueprint, %Blueprint* %other, i64 0, i32 9
%_20 = load i32, i32* %6, align 4
%_18 = icmp eq i32 %_19, %_20
%7 = call i1 @llvm.vector.reduce.and.v4i1(<4 x i1> %4)
%8 = and i1 %7, %_18
ret i1 %8
}
; Function Attrs: nofree nosync nounwind readnone willreturn
declare i1 @llvm.vector.reduce.and.v4i1(<4 x i1>) #1
attributes #0 = { norecurse nounwind readonly uwtable willreturn "target-cpu"="znver1" }
attributes #1 = { nofree nosync nounwind readnone willreturn }
!llvm.module.flags = !{!0}
!0 = !{i32 7, !"PIC Level", i32 2}
And ASM:
_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E:
vmovdqu (%rcx), %xmm0
movl 16(%rcx), %eax
vpcmpeqd (%rdx), %xmm0, %xmm0
cmpl 16(%rdx), %eax
vmovmskps %xmm0, %eax
sete %cl
cmpb $15, %al
sete %al
andb %cl, %al
retq |
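The IR above compares the first four fields with one vector compare and folds the fifth in with a plain `and`. That is only valid because the field comparisons have no side effects, so evaluating all of them gives the same answer as short-circuiting. A minimal sketch of that equivalence follows; the helper names `eq_short_circuit` and `eq_branchless` are hypothetical and are not part of the PR, which changes how the compiler lowers `&&` rather than the source code.

```rust
#[derive(Clone, Copy, Debug)]
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}

/// Short-circuiting form, as written in the comment above.
pub fn eq_short_circuit(a: &Blueprint, b: &Blueprint) -> bool {
    (a.fuel_tank_size == b.fuel_tank_size)
        && (a.payload == b.payload)
        && (a.wheel_diameter == b.wheel_diameter)
        && (a.wheel_width == b.wheel_width)
        && (a.storage == b.storage)
}

/// Hypothetical eager form: `&` on `bool` always evaluates both operands,
/// so there is no early exit between the field comparisons.
pub fn eq_branchless(a: &Blueprint, b: &Blueprint) -> bool {
    (a.fuel_tank_size == b.fuel_tank_size)
        & (a.payload == b.payload)
        & (a.wheel_diameter == b.wheel_diameter)
        & (a.wheel_width == b.wheel_width)
        & (a.storage == b.storage)
}
```

Both functions return the same value for every pair of inputs; they differ only in whether later fields are read after an earlier mismatch.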
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit c343862c31c77abdb223df0c6de9a09f17c1557c with merge 6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5... |
Before this lands an addition of a codegen test will be necessary so that we don't regress this case again. |
☀️ Try build successful - checks-actions |
Queued 6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5 with parent 2917eda, future comparison URL. |
Finished benchmarking try commit (6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never |
The benchmark results show a slight improvement in compilation time for the MIR and LLVM passes. |
I don't feel comfortable reviewing this. @nagisa would you mind taking this? |
r? @nagisa |
Should we merge this and should I continue to work on this then? |
Code from this PR handles vectorization of slices of such structs fine:
#[derive(PartialEq)]
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}
#[no_mangle]
pub fn compare_two_arrays(a: &[Blueprint], b: &[Blueprint]) -> bool {
    a == b
}
compare_two_arrays:
cmpq %r9, %rdx
jne .LBB0_1
incq %rdx
movl $16, %r9d
.p2align 4, 0x90
.LBB0_3:
decq %rdx
sete %al
je .LBB0_6
vmovdqu -16(%rcx,%r9), %xmm0
vpcmpeqd -16(%r8,%r9), %xmm0, %xmm0
vmovmskps %xmm0, %r10d
cmpb $15, %r10b
jne .LBB0_6
movl (%r8,%r9), %r10d
leaq 20(%r9), %r11
cmpl %r10d, (%rcx,%r9)
movq %r11, %r9
je .LBB0_3
.LBB0_6:
retq
.LBB0_1:
xorl %eax, %eax
retq |
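The assembly above first compares the two slice lengths and returns false if they differ, then walks the elements and stops at the first mismatch. A small usage sketch of the same function shows that behaviour from the Rust side. The `sample` helper is hypothetical and only builds test data, and `#[no_mangle]` is left out because it is not needed to call the function from Rust.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}

/// Same body as the function in the comment above.
pub fn compare_two_arrays(a: &[Blueprint], b: &[Blueprint]) -> bool {
    a == b
}

/// Hypothetical helper that builds `n` distinct blueprints.
pub fn sample(n: u32) -> Vec<Blueprint> {
    (0..n)
        .map(|i| Blueprint {
            fuel_tank_size: i,
            payload: i + 1,
            wheel_diameter: i + 2,
            wheel_width: i + 3,
            storage: i + 4,
        })
        .collect()
}
```

Comparing `sample(8)` with a clone of itself yields true, while a shorter slice or a single changed field yields false.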
I think it is worthwhile to pursue this based on the assembly & IR that gets produced once this is applied. I would be surprised if the compiler exercised the kinds of code paths we're seeing here to produce good motivation for or against merging this. Microbenchmarks would be a significantly better tool in this situation, I suspect. I'm somewhat surprised that #62993 was sufficient motivation to revert this in the first place, but from what I can tell it might have happened because there was no information suggesting that the code derived by |
I wrote a benchmark for more cases. So, results: Legend:
HEAD~1
With PR commit
So it looks like the code generated by the trunk version is faster when its execution finishes on the first branch and the branch predictor predicts this correctly. When the branch predictor fails to do so, it is 7 times slower. I think it is a clear win. For anyone who wants to look at the benchmark code or run it on their own machine, I have attached a zipped project. @nagisa I will add codegen tests now. Do I need to squash my commits myself, or will it be done automatically? |
bors merges commits as is, so you do need to clean up the history in your PR
manually.
…On Wed, 31 Mar 2021, 00:39 AngelicosPhosphoros, ***@***.***> wrote:
I wrote a benchmark for more cases.
I ran it on toolchains built from HEAD~1 and HEAD of my working branch, and
used target-cpu=znver1.
So, results:
Legend:
u32s struct:
A struct with 5 u32 fields.
cmp/u32s struct/Self - comparison with the same cloned Vec
cmp/u32s struct/Random field - comparison with elements that differ only in one random field
cmp/u32s struct/First field - comparison with elements that differ only in the first field
cmp/u32s struct/Last field - comparison with elements that differ only in the last field
cmp/u32s struct/Every Second - comparison with a Vec in which every even element is equal and every odd element is unequal
String struct:
A struct with 4 u32 fields and one String field. String fields are generated as a random char from A to Z.
cmp/String struct/Self - comparison with the same cloned Vec
cmp/String struct/u32 field - comparison with elements that differ in a random u32 field
cmp/String struct/String field - comparison with elements that differ in the String field
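The u32s cases in the legend can be sketched as input generators. This is a rough illustration only: the real benchmark lives in the attached zip and uses Criterion, while the helper names below (`base_vec`, `differ_in`, `every_second`, `count_equal`) and the element-by-element comparison are assumptions made for this sketch.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}

impl Blueprint {
    /// Mutable access to a field by position (0 = first, 4 = last).
    pub fn field_mut(&mut self, idx: usize) -> &mut u32 {
        match idx % 5 {
            0 => &mut self.fuel_tank_size,
            1 => &mut self.payload,
            2 => &mut self.wheel_diameter,
            3 => &mut self.wheel_width,
            _ => &mut self.storage,
        }
    }
}

/// Hypothetical baseline input.
pub fn base_vec(len: usize) -> Vec<Blueprint> {
    (0..len as u32)
        .map(|i| Blueprint {
            fuel_tank_size: i,
            payload: i,
            wheel_diameter: i,
            wheel_width: i,
            storage: i,
        })
        .collect()
}

/// Every element differs from `base` only in the field chosen by `pick(i)`.
/// `|_| 0` gives "First field", `|_| 4` gives "Last field"; any
/// pseudo-random index function approximates "Random field".
pub fn differ_in(base: &[Blueprint], pick: impl Fn(usize) -> usize) -> Vec<Blueprint> {
    base.iter()
        .cloned()
        .enumerate()
        .map(|(i, mut bp)| {
            let field = bp.field_mut(pick(i));
            *field = field.wrapping_add(1);
            bp
        })
        .collect()
}

/// "Every Second": even elements stay equal, odd elements differ.
pub fn every_second(base: &[Blueprint]) -> Vec<Blueprint> {
    base.iter()
        .cloned()
        .enumerate()
        .map(|(i, mut bp)| {
            if i % 2 == 1 {
                bp.storage = bp.storage.wrapping_add(1);
            }
            bp
        })
        .collect()
}

/// Compares the inputs pairwise and counts the equal pairs (assumed workload).
pub fn count_equal(a: &[Blueprint], b: &[Blueprint]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x == y).count()
}
```

The point of separating the cases is that they stress different exits of the comparison: "First field" lets a short-circuiting `eq` leave immediately, while "Random field" makes the exit point unpredictable.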
HEAD~1
Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self time: [6.7752 us 6.7880 us 6.8033 us]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
cmp/u32s struct/Random field
time: [15.176 us 15.204 us 15.234 us]
Found 10 outliers among 100 measurements (10.00%)
9 (9.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/First field
time: [2.6016 us 2.6084 us 2.6149 us]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/Last field
time: [6.7269 us 6.7357 us 6.7459 us]
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
cmp/u32s struct/Every Second
time: [12.925 us 12.940 us 12.957 us]
Found 18 outliers among 100 measurements (18.00%)
18 (18.00%) high severe
Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)
Gnuplot not found, using plotters backend
cmp/String struct/Self time: [65.674 us 65.750 us 65.823 us]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
cmp/String struct/u32 field
time: [17.712 us 17.733 us 17.756 us]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
cmp/String struct/String field
time: [10.087 us 10.102 us 10.116 us]
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) low mild
5 (5.00%) high mild
With PR commit
Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self time: [7.1867 us 7.2374 us 7.2920 us]
change: [+6.1698% +7.0979% +7.9865%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
cmp/u32s struct/Random field
time: [6.7970 us 6.8126 us 6.8319 us]
change: [-54.664% -53.917% -53.142%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/First field
time: [6.7780 us 6.7868 us 6.7966 us]
change: [+158.65% +159.32% +160.07%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe
cmp/u32s struct/Last field
time: [6.7810 us 6.7919 us 6.8034 us]
change: [+0.3407% +0.6019% +0.9169%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/Every Second
time: [6.7663 us 6.7758 us 6.7861 us]
change: [-48.465% -48.193% -47.936%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)
Gnuplot not found, using plotters backend
cmp/String struct/Self time: [59.658 us 59.743 us 59.833 us]
change: [-9.1402% -8.9131% -8.6605%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
3 (3.00%) low mild
5 (5.00%) high mild
1 (1.00%) high severe
cmp/String struct/u32 field
time: [3.9043 us 3.9140 us 3.9253 us]
change: [-77.864% -77.797% -77.726%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
cmp/String struct/String field
time: [6.9662 us 6.9746 us 6.9833 us]
change: [-31.134% -30.958% -30.788%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
So it looks like the code generated by the trunk version is faster when its
execution finishes on the first branch and the branch predictor predicts this
correctly. When the branch predictor fails to do so, it is 7 times slower.
The SIMD version shows much less variance: it is significantly slower in the
*First field* case and faster in the worst case, *Random field*.
The *First field* case slowed down by 4.1784 microseconds and the *Random field*
case sped up by 8.3914 microseconds.
I think it is a clear win.
For anyone who wants to look at the benchmark code or run it on their own machine, I
have attached a zipped project:
bench_changes.zip
<https://github.com/rust-lang/rust/files/6232364/bench_changes.zip>
@nagisa I will add codegen tests now. Do I need to squash my commits myself,
or will it be done automatically?
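The two deltas quoted in the summary follow directly from the median times in the Criterion output. A small sketch of the arithmetic, with the medians copied from the logs above and a hypothetical `delta_us` helper:

```rust
/// Median times in microseconds, copied from the Criterion output quoted above.
pub const FIRST_FIELD_BEFORE: f64 = 2.6084; // HEAD~1
pub const FIRST_FIELD_AFTER: f64 = 6.7868; // with PR commit
pub const RANDOM_FIELD_BEFORE: f64 = 15.204; // HEAD~1
pub const RANDOM_FIELD_AFTER: f64 = 6.8126; // with PR commit

/// Signed change in microseconds; a positive value means the PR build is slower.
pub fn delta_us(before: f64, after: f64) -> f64 {
    after - before
}
```

With these inputs, *First field* changes by 6.7868 - 2.6084 = +4.1784 us and *Random field* by 6.8126 - 15.204 = -8.3914 us, so the gain in the unpredictable case is roughly twice the loss in the best case.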
@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away
I don't know how to write codegen tests, so I wrote this one by following an example.
I may have missed something.
@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3
-Copt-level=3
or perhaps just -O
should work.
// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3
// only-64bit
This test might fail for some of the 64-bit targets and may need to be limited to just x86_64, but let's see if this works out first.
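The review comments above concern the header of the new codegen test. A rough sketch of what such a test could look like is given below, assuming the FileCheck-style `// CHECK` directives used by rustc codegen tests. The function name `all_equal` and the specific checks are hypothetical and are not taken from the PR's test file.

```rust
// compile-flags: -O
// only-x86_64
//
// Assumption: a real codegen test would also begin with
// `#![crate_type = "lib"]` and mark the function `#[no_mangle]` so that
// FileCheck can find its symbol; both are omitted here so the snippet
// compiles unchanged as part of any crate.

pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}

// CHECK-LABEL: @all_equal
pub fn all_equal(a: &Blueprint, b: &Blueprint) -> bool {
    // The intent described in the test's first line: no conditional
    // jumps should remain for the chained `&&` operators.
    // CHECK-NOT: br
    // CHECK: ret i1
    a.fuel_tank_size == b.fuel_tank_size
        && a.payload == b.payload
        && a.wheel_diameter == b.wheel_diameter
        && a.wheel_width == b.wheel_width
        && a.storage == b.storage
}
```

The directives are plain comments to the Rust compiler, so the function itself behaves like any ordinary field-by-field comparison.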
@bors r+ rollup=never |
You may need to set Sorry. I wish that was a default. |
I recommend Ubuntu for generating the new test results. Not Windows. And if it keeps ignoring the tests, you may need to remove the test results in your build directory and run again. |
Just thinking, you may need to build LLVM, and it has to be built with `optimize = false`. These are constraints I only recently realized. Thanks. |
Can you help me with building LLVM please? Missed headers
|
I don't normally have to do anything special, but best practice (for me) in this situation would be:
[llvm]
...
optimize = false # required for `llvm-cov show --debug`
...
[build]
...
profiler = true
I also generally enable
Then:
$ ./x.py test src/test/run-make-fulldeps/coverage --bless
should work. |
I finally managed to make it compile.
FYI, to make it work I changed this:
At first I failed to build LLVM because it used the distribution-provided C/C++ compiler (gcc). It was also very uncomfortable to build LLVM on my machine: I had to limit parallelism because my build was constantly killed by the OOM killer at 55 GB of used memory (especially during linking). In addition, rustc takes a very long time to compile (4 hours) when it uses LLVM built without optimizations. Are there any plans to move the generation of such changes to build agents? I think we cut off some possible contributors because they simply don't have such powerful machines (I have 64 GB of RAM in my PC, but I don't know anyone else who does). |
Right. Linking will take a ton of memory because debug info is just that huge. There's definitely some improvements to be had around the coverage tests – having to build an unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?) |
@bors r+ |
📌 Commit 87264fa002169ac242f0c74f69a66032274cd88d has been approved by |
This is basically the same commit as e38e954, which was later reverted in 676953f.
In both cases, the changes weren't benchmarked.
e38e954 leads to the missed optimization described in [this issue](#62993).
676953f leads to the missed optimization described in [this issue](#83623).
It also changes some src/test/run-make-fulldeps/coverage-spanview/expected_mir_dump* files automatically.
@nagisa
|
I agree. I only just made the connection in my head. I'm going to push a PR later this week that removes spanview files and removes the debug output. The spanview files are no longer as valuable as they used to be (in tests) so that doesn't hurt much. It's frustrating that I can't call But that's not your problem. I think it's worth removing. cc: @wesleywiser @tmandry |
I also made a thread in forum for this: https://internals.rust-lang.org/t/blessing-tests-for-rustc-too-hard-for-average-pc/14396 |
@bors r+
In my experience |
📌 Commit 4464cc2 has been approved by |
I assume #83663 will land before #83755, in which case, I'll need to rebase my changes. But if for some reason #83755 lands first, you can just wipe out the spanview directory: src/test/run-make-fulldeps/coverage-spanview/ You shouldn't need to re-bless since (AFAICT) only spanview files were changed in this PR. But if you ever do need to rebless, you no longer need to build your own LLVM (once my PR lands). The LLVM debug options are no longer required. Thanks! |
FYI usually debug flags in LLVM are controlled by |
☀️ Test successful - checks-actions |