Seemingly inefficient code generated to forward a parameter to a function #22891

rprichard · 2015-02-28T10:50:52Z

The generated code for passing arguments larger than a machine word looks inefficient.

Test case:

#[inline(never)]
pub fn bar(x: &str) { println!("{}", x) }
pub fn foo(x: &str) { bar(x); bar(x); }

On x86_64-unknown-linux-gnu, compiling with rustc test.rs -O -C no-stack-check --crate-type dylib --emit asm, I see this code for foo:

    .section    .text._ZN3foo20hb6f131ac36a30532PaaE,"ax",@progbits
    .globl  _ZN3foo20hb6f131ac36a30532PaaE
    .align  16, 0x90
    .type   _ZN3foo20hb6f131ac36a30532PaaE,@function
_ZN3foo20hb6f131ac36a30532PaaE:
    .cfi_startproc
    pushq   %rbx
.Ltmp4:
    .cfi_def_cfa_offset 16
    subq    $16, %rsp
.Ltmp5:
    .cfi_def_cfa_offset 32
.Ltmp6:
    .cfi_offset %rbx, -16
    movq    %rdi, %rbx
    movups  (%rbx), %xmm0
    movaps  %xmm0, (%rsp)
    leaq    (%rsp), %rdi
    callq   _ZN3bar20hf21270c370b3427feaaE@PLT
    movups  (%rbx), %xmm0
    movaps  %xmm0, (%rsp)
    leaq    (%rsp), %rdi
    callq   _ZN3bar20hf21270c370b3427feaaE@PLT
    addq    $16, %rsp
    popq    %rbx
    retq
.Ltmp7:
    .size   _ZN3foo20hb6f131ac36a30532PaaE, .Ltmp7-_ZN3foo20hb6f131ac36a30532PaaE
    .cfi_endproc

foo receives the address of the &str in %rdi. It copies it into a new stack location for each call, then passes the address of that location to bar.

Could foo forward the address of the &str along without making stack copies?

If I remove one of the bar calls from foo, then the function also ought to become a tail call, but it doesn't. Tail call optimization does occur if I replace the &str types with &&str.

The calling convention for passing &str (and other arguments larger than a machine word?) seems to be:

Make a copy of the argument on the stack.
Pass the address of the copy in the conventional manner (in a register or on the stack).
The callee may modify the copy.

i.e. We seem to be passing values both by-value and by-reference.

With the current convention, I think we could get smaller code by eliding some of the copies. If the copies were instead immutable, I think we could elide more copies.

Compiler version:

rustc 1.0.0-nightly (b47aebe3f 2015-02-26) (built 2015-02-27)
binary: rustc
commit-hash: b47aebe3fc2da06c760fd8ea19f84cbc41d34831
commit-date: 2015-02-26
build-date: 2015-02-27
host: x86_64-unknown-linux-gnu
release: 1.0.0-nightly

The text was updated successfully, but these errors were encountered:

dotdash · 2015-02-28T11:19:04Z

I'm currently working on a PR that makes use of LLVM's byval attribute. That would at least make the code a bit shorter because the address of the copy is implicitly passed because it's a fixed offset from the stack pointer. With that, foo is translated to:

    subq    $24, %rsp
.Ltmp4:
    .cfi_def_cfa_offset 32
    movaps  32(%rsp), %xmm0
    movups  %xmm0, (%rsp)
    callq   _ZN3bar20h6b5879ee29f32fdeeaaE@PLT
    movaps  32(%rsp), %xmm0
    movups  %xmm0, (%rsp)
    callq   _ZN3bar20h6b5879ee29f32fdeeaaE@PLT
    addq    $24, %rsp
    retq

And some other improvements also happen, I'll detail them with the PR, once I get the last few bugs fixed.

In a later step, we could definitely look into additional improvements like just dropping the byval flag for functions that are known not to modify their argument.

rprichard · 2015-02-28T11:44:12Z

Good to know. Does dropping byval mean that the value is passed by-reference?

dotdash · 2015-02-28T11:46:18Z

Since the byval attribute replaces the calls to memcpy that we currently manually emit, yes, that would then completely remove the copy and make it a pass-by-reference.

rprichard · 2015-03-01T11:36:21Z

@dotdash Does your PR also fix #22924?

dotdash · 2015-03-04T14:30:45Z

The byval stuff is blocked on #5016

comex · 2015-03-12T05:10:13Z

As mentioned in a recent Discourse thread - why does it have to go through a double indirection (pointer to fat pointer to string data) at all, whether or not that can be optimized in forwarding situations? Normally a small structure passed by value would be placed in registers.

dotdash · 2015-03-12T09:40:29Z

I'm working on a patch that does that (pass fat pointers in registers), and a first iteration shows good results, but I'm still passing the fat pointers as a single struct { i8*, i64 } instead of splitting it into two arguments which probably isn't a good idea. Unfortunately, to split a single rust argument into two llvm arguments needs some refactorings, so it might take some time to be ready.

comex · 2015-03-12T19:34:18Z

Why isn't it a good idea? I think ABIs are normally good at dealing with structs passed by value, although ARM has a hazard in the case of small structs returned by value (it's slower than returning an equivalently sized scalar).

dotdash · 2015-03-12T20:13:13Z

The rust ABI isn't good at it, yet. That's why I'm working on the patch to change it.

The thing with the first iteration of my patch is that it translates fn foo(s: &str) to declare void @foo({ i8*, i64 }) instead of declare void @foo(i8*, i64). It works to some degree, because LLVM happens to pass each struct field as if it was a distinct argument, so at the asm level you get the desired results (at least on x86 and x86_64), but the LLVM function passes get almost 40% slower, which I suspect to happen due to LLVM not handling FCA arguments that well. We might also miss some optimizations that way. We dropped FCA loads/stores a while ago because of that.

comex · 2015-03-13T00:08:52Z

LLVM 'happening to pass' (i.e. must pass using standard ABIs on most architectures) structs like multiple values is what I meant by ABIs being good at it. Quite interesting that it slows down LLVM, though.

dotdash · 2015-03-13T00:19:10Z

Oh, now I see. No, LLVM doesn't automatically handle ABI compliance. It only handles calling conventions. For example, we used to pass a struct consisting of 4 u8 values (think RGBA) directly as the struct type. LLVM did pass the 4 values as 4 individual arguments. But the x86_64 SysV ABI (and by now the rust ABI, too) wants that struct to be passed as a single u32 value. And it's the frontend's (i.e. our) job to handle that.

comex · 2015-03-13T00:33:35Z

Oh... wow, I'm really surprised by that. I'd love to know what motivated that decision, considering how close LLVM IR is semantically to C.

This has a number of advantages compared to creating a copy in memory and passing a pointer. The obvious one is that we don't have to put the data into memory but can keep it in registers. Since we're currently passing a pointer anyway (instead of using e.g. a known offset on the stack, which is what the `byval` attribute would achieve), we only use a single additional register for each fat pointer, but save at least two pointers worth of stack in exchange (sometimes more because more than one copy gets eliminated). On archs that pass arguments on the stack, we save a pointer worth of stack even without considering the omitted copies. Additionally, LLVM can optimize the code a lot better, to a large degree due to the fact that lots of copies are gone or can be optimized away. Additionally, we can now emit attributes like nonnull on the data and/or vtable pointers contained in the fat pointer, potentially allowing for even more optimizations. This results in LLVM passes being about 3-7% faster (depending on the crate), and the resulting code is also a few percent smaller, for example: text data filename 5671479 3941461 before/librustc-d8ace771.so 5447663 3905745 after/librustc-d8ace771.so 1944425 2394024 before/libstd-d8ace771.so 1896769 2387610 after/libstd-d8ace771.so I had to remove a call in the backtrace-debuginfo test, because LLVM can now merge the tails of some blocks when optimizations are turned on, which can't correctly preserve line info. Fixes rust-lang#22924 Cc rust-lang#22891 (at least for fat pointers the code is good now)

This has a number of advantages compared to creating a copy in memory and passing a pointer. The obvious one is that we don't have to put the data into memory but can keep it in registers. Since we're currently passing a pointer anyway (instead of using e.g. a known offset on the stack, which is what the `byval` attribute would achieve), we only use a single additional register for each fat pointer, but save at least two pointers worth of stack in exchange (sometimes more because more than one copy gets eliminated). On archs that pass arguments on the stack, we save a pointer worth of stack even without considering the omitted copies. Additionally, LLVM can optimize the code a lot better, to a large degree due to the fact that lots of copies are gone or can be optimized away. Additionally, we can now emit attributes like nonnull on the data and/or vtable pointers contained in the fat pointer, potentially allowing for even more optimizations. This results in LLVM passes being about 3-7% faster (depending on the crate), and the resulting code is also a few percent smaller, for example: |text|data|filename| |----|----|--------| |5671479|3941461|before/librustc-d8ace771.so| |5447663|3905745|after/librustc-d8ace771.so| | | | | |1944425|2394024|before/libstd-d8ace771.so| |1896769|2387610|after/libstd-d8ace771.so| I had to remove a call in the backtrace-debuginfo test, because LLVM can now merge the tails of some blocks when optimizations are turned on, which can't correctly preserve line info. Fixes #22924 Cc #22891 (at least for fat pointers the code is good now)

steveklabnik · 2016-11-15T19:53:25Z

Triage: it's not totally clear to me if this ticket is still valid. A lot has changed between now and then, and I get very different code, but I'm not sure if that's from all the other stuff that's gone on since pre-1.0.

Mark-Simulacrum · 2017-05-16T04:23:58Z

Yeah as per the above it looks like this is probably not valid anymore, or at least I can't say either way -- closing. If that's not the case, please reopen; I apologize ahead of time!

kmcallister added the A-codegen Area: Code generation label Mar 2, 2015

rprichard mentioned this issue Mar 24, 2015

Try not to copy outgoing arguments #23657

Closed

dotdash mentioned this issue Jun 18, 2015

Pass fat pointers in two immediate arguments #26411

Merged

Mark-Simulacrum closed this as completed May 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seemingly inefficient code generated to forward a parameter to a function #22891

Seemingly inefficient code generated to forward a parameter to a function #22891

rprichard commented Feb 28, 2015

dotdash commented Feb 28, 2015

rprichard commented Feb 28, 2015

dotdash commented Feb 28, 2015

rprichard commented Mar 1, 2015

dotdash commented Mar 4, 2015

comex commented Mar 12, 2015

dotdash commented Mar 12, 2015

comex commented Mar 12, 2015

dotdash commented Mar 12, 2015

comex commented Mar 13, 2015

dotdash commented Mar 13, 2015

comex commented Mar 13, 2015

steveklabnik commented Nov 15, 2016

Mark-Simulacrum commented May 16, 2017

Seemingly inefficient code generated to forward a parameter to a function #22891

Seemingly inefficient code generated to forward a parameter to a function #22891

Comments

rprichard commented Feb 28, 2015

dotdash commented Feb 28, 2015

rprichard commented Feb 28, 2015

dotdash commented Feb 28, 2015

rprichard commented Mar 1, 2015

dotdash commented Mar 4, 2015

comex commented Mar 12, 2015

dotdash commented Mar 12, 2015

comex commented Mar 12, 2015

dotdash commented Mar 12, 2015

comex commented Mar 13, 2015

dotdash commented Mar 13, 2015

comex commented Mar 13, 2015

steveklabnik commented Nov 15, 2016

Mark-Simulacrum commented May 16, 2017