Voultapher opened this issue on Oct 24, 2023 · 0 comments
Labels: A-LLVM, C-bug, I-slow
A common idiom in branchless (jumpless) code is to increment a pointer by either 0 or 1, where the increment is derived from a bool. The obvious way to do this is `ptr = ptr.add(bool_val as usize)`, however doing so results in slightly sub-optimal code-gen on both x86_64 and Arm.
A minimal example:
```rust
pub unsafe fn inc_direct(v: &[u64], mut side_effect_out: *mut *const u64) {
    for elem in v {
        let cond = *elem < 30;
        *side_effect_out = elem;
        side_effect_out = side_effect_out.add(cond as usize);
    }
}
```
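For context (not part of the original report), a hypothetical caller sketch showing how such a function is typically used: branchlessly collecting pointers to all elements that match a predicate. The name `collect_small` and the buffer handling are assumptions for illustration only.

```rust
// Hypothetical usage sketch: gather pointers to every element < 30 without
// data-dependent branches. `inc_direct` writes each element's address to the
// current cursor slot and only advances the cursor when the predicate holds,
// so only the first `count` slots end up holding pointers to matching elements.
fn collect_small(v: &[u64]) -> Vec<*const u64> {
    let mut out: Vec<*const u64> = vec![std::ptr::null(); v.len()];
    // SAFETY: `out` has room for `v.len()` pointers, which is the furthest
    // the cursor inside `inc_direct` can ever advance.
    unsafe { inc_direct(v, out.as_mut_ptr()) };
    let count = v.iter().filter(|&&x| x < 30).count();
    out.truncate(count);
    out
}
```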
Generated machine-code for the `inc_direct` loop (the x86_64 and Arm listings are not reproduced here): notably, on x86_64 the use of xor + setb + lea, and cset on Arm. Compared to a version using an additional counter with `counter += cond as usize;`, the generated loop instead uses adc on x86_64 and cinc on Arm. Using uiCA and llvm-mca to simulate the performance of the respective loops we get:
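The source of the counter-based variant is not included in this capture; below is a minimal sketch of what an `inc_via_counter` function could look like, assuming the same signature and predicate as `inc_direct` (the name matches the tables below, the body is an assumption).

```rust
// Sketch only: assumed shape of the counter-based variant referred to as
// inc_via_counter in the measurements below. The output cursor is expressed
// as base pointer + running counter instead of being bumped in place.
pub unsafe fn inc_via_counter(v: &[u64], side_effect_out: *mut *const u64) {
    let mut counter = 0usize;
    for elem in v {
        let cond = *elem < 30;
        *side_effect_out.add(counter) = elem;
        counter += cond as usize;
    }
}
```

The `counter += cond as usize;` update is the line quoted above, and it is this formulation that the adc/cinc observation refers to.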
inc_direct cycles per loop iteration:

| Tool | Skylake | Sunny Cove |
| --- | --- | --- |
| uiCA | 2.01 | 1.87 |
| llvm-mca | 1.59 | 1.59 |

inc_via_counter cycles per loop iteration:

| Tool | Skylake | Sunny Cove |
| --- | --- | --- |
| uiCA | 1.31 | 1.56 |
| llvm-mca | 1.33 | 1.33 |
Simulations as part of some larger code structure are less conclusive, but one thing is certain: for both architectures it is fewer instructions and, at worst, the same performance. Ideally rustc could generate the more efficient code for the obvious direct-increment version.
nikic added the A-LLVM and I-slow labels and removed needs-triage on Oct 24, 2023
Meta
`rustc --version --verbose`: