Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some DCE issues with ThinLTO leaving in construction of unused parameters #47409

Closed
gamozolabs opened this issue Jan 13, 2018 · 12 comments
Closed
Labels
P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone

Comments

@gamozolabs
Copy link

gamozolabs commented Jan 13, 2018

Since ThinLTO became default in Rust my bootloader went from 17 KiB to about 42 KiB in size, which puts it over the 32 KiB limit. To get my bootloader to fit in this limit I've been able to strip out all of core::fmt by not having any parameters to my panic_fmt routine and this allows LTO to effectively delete all construction of the core::fmt parameters to panic_fmt and thus save huge amounts of code.

Since ThinLTO became default it seems that certain functions using volatile memory or inline assembly struggle to be optimized fully, however this is all I've found that replicates the issue, perhaps more common use of Rust could also trigger it.

For the tiny test case (included at the end of this issue) historically (from at least 2016 to November 2017) produced a .text section of ~100 bytes, and a total virtual program size of ~500 bytes. Since ThinLTO became default this tiny program grows to ~2400 bytes of .text and ~3200 bytes overall.

While this 2 KiB increase seems small, keep in mind it is for an application that is trivial and only has one thing using core::fmt (the bounds check panic on the array index). When a program grows and more panic come about these changes to be a much larger increase, in my case over 20 KiB.

While I'm writing an RFC to address this issue more permanently by proposing an optional (developer opt-in) panic routine which does not rely on core::fmt (probably just a &'static str for the filename and u32 for the line for bare bones debugging where nothing more can be afforded), I wanted to open this issue just in case it's a more fundamental optimization issue that could be fixed.

The core::intrinsics::abort() is critical in the panic_fmt for the code size increase to occur. Other things that work here as well are statements like asm!("nop") or really any assembly statement. Further, assembly statements that are included in functions that are called from panic_fmt also cause the optimization problem to occur. This makes workarounds impossible in my case as my panic dumps to a serial port which ultimately requires inline assembly to access. So for my bootloader I have resorted to using an old version of Rust.

What I find particularly strange is that while I've defined my panic_fmt to not take any parameters this doesn't seem to be a big enough hint to Rust/LLVM that they are not used and thus all construction of these parameters can be DCE'd.

For a size listing of this test program using all nightly builds of Rust in the past 1.5 years see: https://gist.github.com/gamozolabs/d72c05e2b200f02397f1495cf33c2674

We can see from this that the size increase occured between rustc 1.24.0-nightly (f9b0897c5 2017-12-02) and rustc 1.24.0-nightly (1956d5535 2017-12-03) which happens to be when ThinLTO went default.

I've tried a few different tests using various panic_fmt declarations, with parameters, without, as extern, as pub, etc, and they seem to all be affected by this issue.

Here's an example basic piece of code to be built with nightly-x86_64-pc-windows-msvc using args -C lto -C panic=abort -O:

Here is the disassembly difference between this same code but one with ThinLTO and one with full LTO: https://gist.github.com/gamozolabs/41b2059a55d5e0b60d7251dd8a31c055

#![no_std]
#![feature(lang_items, start, core_intrinsics)]

#[lang = "panic_fmt"]
#[no_mangle]
pub extern fn rust_begin_panic(_msg: core::fmt::Arguments,
                               _file: &'static str,
                               _line: u32,
                               _column: u32) -> ! {
    unsafe { core::intrinsics::abort() }
}

// These functions are used by the compiler, but not
// for a bare-bones hello world. These are normally
// provided by libstd.
#[lang = "eh_personality"]
#[no_mangle]
pub extern fn rust_eh_personality() {
}

// This function may be needed based on the compilation target.
#[lang = "eh_unwind_resume"]
#[no_mangle]
pub extern fn rust_eh_unwind_resume() {
}

#[start]
#[no_mangle]
#[allow(non_snake_case)]
pub fn mainCRTStartup(argc: isize, _argv: *const *const u8) -> isize
{
    let foo = [1, 2, 3, 4];
    foo[argc as usize]
}
@gamozolabs
Copy link
Author

gamozolabs commented Jan 14, 2018

Further testing shows that building with -Z thinlto=off reverts to using old full LTO which on recent builds does behave as expected (deletes construction of core::fmt args to panic_fmt). This is now my current workaround and I'm no longer using an old version of Rust for my bootloader. My concern is in the future this may be unsupported, but I have no idea.

However I still think it's worth looking into. I dumped the LLVM IR --emit llvm-ir and with thinlto=on and thinlto=off it produces identical IR. This leads me to believe that the issue lies more in LLVM than Rust, however I'm not familiar enough with the internals of Rust to conclude this. Further is this something we could optimize at MIR level as a Rust-specific optimization pass?

Edit: Disregard above statement --emit llvm-ir seems to disable ThinLTO so I cannot use it to accurately determine if the issue is with Rust or LLVM.

-B

@alexcrichton alexcrichton added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Jan 14, 2018
@alexcrichton
Copy link
Member

Thanks for the report @gamozolabs!

This sounds like it does indeed get blamed to ThinLTO! The previous form of LTO threw everything into one giant module which allows LLVM to see as much as possible and optimize as aggressively as possible. Typically this isn't necessary for applications, but as you're seeing there's still some that require the old form of LTO!

I've tagged this as a regression from stable to beta for now to ensure we can deal with this before the next release. There's a few things going on here of note.

  • The compiler, when using -C lto, has switched to using ThinLTO by default rather than the previous "throw everything in a module" (what I'll call fat LTO) by default. This I believe is the source of your regression if you were previously compiling with LTO.
  • There is an option to disable ThinLTO via -Z thinlto=no, but as the name implies it's not a stable option. Additionally there's no Carog.toml option for specifying ThinLTO on or off.

So given that I think there's two primary questions to consider when evaluating this:

  • Should we turn "fat LTO" back on by default when LTO is requested? We'd still use ThinLTO if -C lto isn't passed (just amongst the codegen units we're producing) however though.
  • Should we stabilize -Z thinlto and add corresponding options to Cargo.toml?

Curious what others from @rust-lang/compiler might think.

@Mark-Simulacrum
Copy link
Member

We could make -C lto take an additional parameter, so something like -C lto=auto (current behavior, default), -C lto=fat (old, "fat" LTO), -C lto=thin (current behavior again). We could document auto as not having a stable meaning, and expose these same options in Cargo's lto option.

@eddyb
Copy link
Member

eddyb commented Jan 15, 2018

I don't feel like the distinction should matter in general, but rather this is more of a LLVM limitation.
OTOH, we already have LLVM-specific options in -C.

@hanna-kruppe
Copy link
Contributor

Aside: It may be possible to improve ThinLTO here. My theory is that panic_fmt is not internalized because there are multiple modules, and thus deadargelim doesn't fire. Other interprocedural optimizations have similar problem and in those cases adding data to the module summary and teaching those optimizations about that has helped. For deadargelim, you probably can't eliminate the arguments (because you can't update all the callers unilaterally), but you could store the fact that an argument is unused in the module summary and callers who are aware of this can pass undef instead.

@michaelwoerister
Copy link
Member

I think we should provide a way to get full LTO on stable/beta as it still provides better runtime performance overall.

@Mark-Simulacrum Mark-Simulacrum added this to the 1.24 milestone Jan 15, 2018
@alexcrichton
Copy link
Member

@Mark-Simulacrum your suggestion sounds great to me! We could do what we did with debuginfo and accept lto = 'fat' in Cargo.toml as well as lto = true.

@nikomatsakis
Copy link
Contributor

I like @Mark-Simulacrum's suggestion. I guess an open question is whether the default should be LTO or not -- but I think we can probably change the default to LTO if we think that's better. I'd be curious to hear from @gamozolabs if they agree.

@alexcrichton
Copy link
Member

Ok I've opened #47521 to implement @Mark-Simulacrum's suggestion, and if we decide to merge I can follow up with a Cargo.toml option as well soon after merging.

I'm sort of ambivalent about which form of LTO should be the default when specifying only -C lto (either fat or thin LTO), although I'd slightly prefer ThinLTO as it should have all the same the-artifact's-runtime-is-faster benefits with a huge benefit on compile times. The downside of ThinLTO, as seen here though, is that it doesn't have the same code-size benefits.

@michaelwoerister
Copy link
Member

I'm fine with defaulting to ThinLTO as long as full LTO is available on stable.

@nikomatsakis
Copy link
Contributor

triage: P-high

This should get into beta, but there is a pending PR #47521

@rust-highfive rust-highfive added the P-high High priority label Jan 18, 2018
alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 18, 2018
This commit reverts a small portion of the switch to ThinLTO by default which
changed the meaning of `-C lto` from "put the whole crate graph into one codegen
unit" to "perform ThinLTO over the whole crate graph". This backport has no
corresponding commit on master as rust-lang#47521 is making the same change but in a
slightly different manner. This commit is intended to be a surgical change with
low impact on beta.

Closes rust-lang#47409
bors added a commit that referenced this issue Jan 21, 2018
[beta] Turn back on "fat" LTO by default

This commit reverts a small portion of the switch to ThinLTO by default which
changed the meaning of `-C lto` from "put the whole crate graph into one codegen
unit" to "perform ThinLTO over the whole crate graph". This backport has no
corresponding commit on master as #47521 is making the same change but in a
slightly different manner. This commit is intended to be a surgical change with
low impact on beta.

Closes #47409
@alexcrichton
Copy link
Member

I think this is closed now with #47548 merged

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 23, 2018
This commit primarily adds the ability to control what kind of LTO happens when
rustc performs LTO, namely allowing values to be specified to the `-C lto`
option, such as `-C lto=thin` and `-C lto=fat`. (where "fat" is the previous
kind of LTO, throw everything in one giant module)

Along the way this also refactors a number of fields which store information
about whether LTO/ThinLTO are enabled to unify them all into one field through
which everything is dispatched, hopefully removing a number of special cases
throughout.

This is intended to help mitigate rust-lang#47409 but will require a backport as well,
and this would unfortunately need to be an otherwise insta-stable option.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants