
Segfault on latest nightly with codegen-units > 1 #47364

Closed
jonhoo opened this issue Jan 11, 2018 · 31 comments
Labels
C-bug Category: This is a bug.
I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics.
P-high High priority
regression-from-stable-to-nightly Performance or correctness regression from stable to nightly.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone
1.25

Comments

@jonhoo
Contributor

jonhoo commented Jan 11, 2018

On the latest nightly, the default is now to run release builds with codegen-units > 1 (see #46910). For some reason, this causes https://github.com/mit-pdos/distributary to segfault at runtime. Setting

[profile.release]
codegen-units = 1

makes the segfault go away. To reproduce:

$ rustc +nightly --version
rustc 1.25.0-nightly (f62f77403 2018-01-10)
$ git clone https://github.com/mit-pdos/distributary.git
$ cd distributary
$ git checkout open-loop
$ cargo +nightly b --release --manifest-path benchmarks/Cargo.toml --bin vote
$ target/release/vote -r 5 --target 100000 localsoup
Segmentation fault (core dumped)
$ # add codegen-units = 1 to Cargo.toml under [profile.release]
$ cargo +nightly b --release --manifest-path benchmarks/Cargo.toml --bin vote
$ target/release/vote -r 5 --target 100000 localsoup
# wait ~5s and notice that it completes without crashing
@kennytm kennytm added I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. C-bug Category: This is a bug. labels Jan 11, 2018
@Mark-Simulacrum Mark-Simulacrum added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 11, 2018
jonhoo added a commit to mit-pdos/noria that referenced this issue Jan 12, 2018
@nagisa
Member

nagisa commented Jan 12, 2018

Cannot reproduce on Linux with exactly the same rustc:

rustc 1.25.0-nightly (f62f774 2018-01-10)

I even went out of my way to compile with a number of different flags. What happens instead is:

thread 'main' panicked at 'no entry found for key', libcore/option.rs:891:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:68
             at libstd/sys_common/backtrace.rs:57
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::option::expect_failed
             at libcore/option.rs:891
  10: vote::clients::localsoup::graph::make
             at /checkout/src/libcore/option.rs:302
             at /checkout/src/liballoc/btree/map.rs:1753
             at benchmarks/vote/clients/../../vote-old/graph.rs:149
  11: vote::run
             at benchmarks/vote/clients/localsoup.rs:55
             at benchmarks/vote/main.rs:50
  12: vote::main
             at benchmarks/vote/main.rs:441
  13: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  14: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:479
  15: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  16: std::rt::lang_start_internal
             at libstd/panicking.rs:458
             at libstd/panic.rs:365
             at libstd/rt.rs:58
  17: main
  18: __libc_start_main
  19: _start

right after the executable is started. This makes me suspect there might be UB-invoking code.

Please provide more information, and a test case as minimal as possible.

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

That's very interesting... I was compiling with some special RUSTFLAGS (specifically with ld-gold and -Cnative), but even with env RUSTFLAGS= cargo b I observe the segfault. Are you sure you didn't accidentally also build with 8b973c8480cf0cbd51746458bbe4d426451cfbbc (which I pushed yesterday to the open-loop branch and which sets codegen-units=1 by default)? Also, I never see the crash you reported above, so clearly something weird is going on (and you may be right that it's UB :/)
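For reference, a sketch of the two builds being compared; the exact flag spellings for the gold linker and native CPU tuning are assumptions based on the description above:

$ # original build with the special flags
$ RUSTFLAGS="-C link-arg=-fuse-ld=gold -C target-cpu=native" cargo +nightly build --release
$ # rebuild with RUSTFLAGS cleared -- the segfault still reproduces
$ env RUSTFLAGS= cargo +nightly build --release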

I'd love to give a smaller replicating example, but this is a pretty large codebase, so reducing becomes tricky. Do you have a recommendation for where I should start looking/pruning?

@nagisa
Member

nagisa commented Jan 12, 2018

So, I was including your most recent commit, but I also made sure to set -Ccodegen-units=4, because the issue report claims it reproduces with codegen-units > 1. That's where the panic happens. With -Ccodegen-units=16 or -Ccodegen-units=12 I see the SIGSEGV.
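A sketch of one way those values can be passed through for a quick check (setting them in the profile works too; the exact invocation used here is an assumption):

$ RUSTFLAGS="-C codegen-units=4" cargo +nightly build --release --manifest-path benchmarks/Cargo.toml --bin vote
$ RUSTFLAGS="-C codegen-units=16" cargo +nightly build --release --manifest-path benchmarks/Cargo.toml --bin vote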

It seems to me like it would be easier to reduce the panic above, as the backtrace seems fairly local to your benchmark executable, but you could work on reducing the SIGSEGV as well. I often begin reducing by blindly removing non-critical components piecemeal. For example, one of the first components I'd remove would be argument parsing (and with it the fairly large dependency on clap), hardcoding the arguments used to reproduce instead (see the sketch below).
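A minimal sketch of that first reduction step; run and its parameter names are hypothetical stand-ins for whatever the benchmark actually uses:

// Hypothetical reduction sketch: drop clap entirely and hardcode the values
// from the failing invocation `vote -r 5 --target 100000 localsoup`.
fn run(runtime_secs: u64, target: usize, backend: &str) {
    // ... the original benchmark body would go here, unchanged ...
    println!("running {} for {}s targeting {} ops/s", backend, runtime_secs, target);
}

fn main() {
    run(5, 100_000, "localsoup");
}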

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

I'll try that. Curiously enough, I still see the SIGSEGV even with codegen-units = 4, but I'll just have to work with it.

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

Here's a start:

extern crate nom_sql;

fn main() {
    let r_txt = "SELECT Article.id, title, VoteCount.votes AS votes \
                            FROM Article \
                            LEFT JOIN (SELECT Vote.id, COUNT(user) AS votes \
                                       FROM Vote GROUP BY Vote.id) AS VoteCount \
                            ON (Article.id = VoteCount.id) WHERE Article.id = ?";
    nom_sql::parser::parse_query(r_txt);
}

This segfaults with 4 codegen units, though only some of the time. It depends only on the nom-sql crate. I'll keep digging.
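A sketch of the Cargo.toml behind this reproduction (the exact nom-sql version and source at the time are assumptions):

[package]
name = "repro"
version = "0.1.0"

[dependencies]
nom-sql = "*"   # assumption: whichever nom-sql release distributary was pulling in

[profile.release]
codegen-units = 4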

jonhoo added a commit to jonhoo/rust-issue-47364 that referenced this issue Jan 12, 2018
@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

The smallest reproducing example I've been able to come up with thus far can be found at https://github.com/jonhoo/rust-issue-47364. Just run cargo run --release with nightly. That code has no unsafe (though I suppose nom probably does?).
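For reference, the full set of steps (the segfault shows up only on affected machines):

$ git clone https://github.com/jonhoo/rust-issue-47364
$ cd rust-issue-47364
$ cargo +nightly run --release
Segmentation fault (core dumped)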

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Now down to:

#[macro_use]
extern crate nom; // nom = "1.2.4" in [dependencies]
use nom::multispace;

pub enum ConditionExpression {
    Field(String),
    Placeholder,
}

pub fn condition_expr<'a>(i: &'a [u8]) -> nom::IResult<&[u8], ConditionExpression, u32> {
    nom::IResult::Done(i, ConditionExpression::Placeholder)
}

named!(pub selection<&[u8], Option<ConditionExpression>>,
    chain!(
        select: chain!(
            tag!("x") ~
            cond: opt!(complete!(chain!(
                multispace? ~
                cond: condition_expr,
                || { cond }
            ))) ~
            || { cond }
        ) ~
        tag!(";"),
        || { select }
    )
);

fn main() {
    selection("x ".as_bytes());
}

I don't know that I can minimize much more without starting to dig into nom (and an outdated one at that...)

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

@nagisa I've now pruned a bunch of the code from nom as well. The linked repository (https://github.com/jonhoo/rust-issue-47364) has no external dependencies and no unsafe code, but still segfaults if you run cargo +nightly r --release.

@nagisa
Member

nagisa commented Jan 13, 2018

I can reproduce with #47364 (comment) and b192e82 from the repository, but not with any of the later minimisations.

That is still a good amount of progress. I will look into this.

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Yeah, from what I can tell it must be related to memory layout to some extent. Specifically, the latest commit always segfaults on my laptop, and never on another machine I have available (both x86_64, both fairly up-to-date Arch Linux installs). I observed this while fiddling with the minimization too -- seemingly unrelated changes (like removing a struct field) would cause the segfault to disappear. This suggests that there's memory corruption going on, and that the question is just whether or not it triggers a segfault. The latest commit also has the memory corruption; it just doesn't consistently (as in, across machines) trigger a segfault. valgrind may be able to help (see the sketch below).
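A sketch of such a valgrind run (the binary name is an assumption based on the repository name):

$ cargo +nightly build --release
$ # look for invalid reads/writes even on runs that happen not to segfault
$ valgrind --error-exitcode=1 target/release/rust-issue-47364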

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Actually, I can reproduce with jonhoo/rust-issue-47364@3f0e3a7 on a different machine too, and that already has a decent amount of minimization.

@nagisa
Member

nagisa commented Jan 13, 2018

Memory corruption of some sort is my conclusion as well. I ended up arriving at a fairly different minimisation. This minimisation doesn't crash, but still produces different results between the optimised and non-optimised versions at a carefully placed assertion in the code.

Sadly my case is still huge and results in a function of roughly 5k lines of IR that has nothing obviously wrong with it.

@Mark-Simulacrum Mark-Simulacrum added this to the 1.25 milestone Jan 15, 2018
@jonhoo
Contributor Author

jonhoo commented Jan 17, 2018

I'm seeing this crop up in some other cases too now, though none that are particularly easier to minimize. Any luck digging into this yet, @nagisa?

@jonhoo
Contributor Author

jonhoo commented Jan 17, 2018

Also, could this be related to #47071?

@nikomatsakis
Contributor

triage: P-high

Regression, segfault. @nagisa, let me know if you feel you don't have time to follow up here.

@rust-highfive rust-highfive added P-high High priority and removed I-nominated labels Jan 18, 2018
@nagisa
Member

nagisa commented Jan 18, 2018

I’m planning to work on this further, however I’m also fairly busy with other matters that take precedence over this, so I’m not currently actively looking at the issue.

@alexcrichton
Member

I've been trying to minimize this but I'm unfortunately not making a ton of progress.

So far I've got this Rust code which fails with:

$ rustc +nightly foo.rs -C opt-level=1 -C codegen-units=200 -Z thinlto=no
$ ./foo
$ rustc +nightly foo.rs -C opt-level=2 -C codegen-units=200 -Z thinlto=no
$ ./foo
thread 'main' panicked at 'index 5 out of range for slice of length 1', libcore/slice/mod.rs:785:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Turning on ThinLTO causes it to segfault, but either way something funky is happening here. I don't know yet whether this is an LLVM or trans bug. I've narrowed the bad IR down to one of the codegen units, and I've attempted to hand-minimize that IR as much as possible, reaching this state. Unfortunately it's still quite large and it's not at all clear where the bug is.
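For anyone following along, one way to get at the per-codegen-unit IR for that kind of narrowing (a sketch; the exact names of the emitted .ll files depend on rustc's codegen-unit naming):

$ rustc +nightly foo.rs -C opt-level=2 -C codegen-units=200 -Z thinlto=no --emit=llvm-ir
$ ls foo*.ll    # roughly one LLVM IR file per codegen unit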

@jonhoo
Contributor Author

jonhoo commented Jan 19, 2018

Ugh, yeah, that's not great. I wish I could provide any more insight, but it seems almost arbitrary whether removing something will still trigger the fault. I sometimes found that removing more code would cause the bug to reappear. Similarly, sometimes removing a particular piece would cause the bug to disappear, but if something else was removed first, then removing it would cause the crash to remain. This all makes me think that the bug may actually be there most of the time, but just doesn't manifest as a crash or panic. Running the binary through valgrind may help confirm this suspicion, and might allow minimizing significantly more...

@alexcrichton
Member

Aha interesting!

I saw #47674 come in, and bisecting with bisect-rust also points to #46739 as the PR at fault.

I'll tag this appropriately for compiler team discussion. @arielb1 you're likely interested in this though.

@alexcrichton alexcrichton added the regression-from-stable-to-beta Performance or correctness regression from stable to beta. label Jan 23, 2018
@alexcrichton alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. I-nominated and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Jan 23, 2018
@dotdash
Contributor

dotdash commented Jan 23, 2018

FWIW, it seems that the MergedLoadStoreMotionPass corrupts the memory dependence results, leaving cache entries that claim a non-local dependency which actually became a local one due to this pass. This causes GVN to perform broken replacements later on.
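If someone wants to poke at that pass interaction directly on extracted IR, a sketch using the legacy pass-manager flags of that era (the flag names are an assumption):

$ rustc +nightly foo.rs -C codegen-units=1 --emit=llvm-ir -o foo.ll
$ opt -S -mldst-motion -gvn foo.ll -o foo.opt.ll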

@nikomatsakis
Contributor

Should we roll back #46739 then?

@dotdash
Contributor

dotdash commented Jan 23, 2018

AFAICT, the changed order only makes it more likely for the bug to be exposed. MLSM followed by GVN is already present in the original pass order.

I've filed https://bugs.llvm.org//show_bug.cgi?id=36063

@dotdash
Contributor

dotdash commented Jan 23, 2018

Also, this is how far I could minimize:

fn main() {
    nom_sql::selection(b"x ");
}

pub enum Err<P>{
    Position(P),
    NodePosition(u32),
}

pub enum IResult<I,O> {
    Done(I,O),
    Error(Err<I>),
    Incomplete(u32, u64)
}

pub fn multispace<T: Copy>(input: T) -> ::IResult<i8, i8> {
    ::IResult::Done(0, 0)
}

mod nom_sql {
    fn where_clause(i: &[u8]) -> ::IResult<&[u8], Option<String>> {
        let X = match ::multispace(i) {
            ::IResult::Done(..) => ::IResult::Done(i, None::<String>),
            _ => ::IResult::Error(::Err::NodePosition(0)),
        };
        match X {
            ::IResult::Done(_, _) => ::IResult::Done(i, None),
            _ => X
        }
    }

    pub fn selection(i: &[u8]) {
        let Y = match {
            match {
                where_clause(i)
            } {
                ::IResult::Done(_, o) => ::IResult::Done(i, Some(o)),
                ::IResult::Error(_) => ::IResult::Done(i, None),
                _ => ::IResult::Incomplete(0, 0),
            }
        } {
            ::IResult::Done(z, _) => ::IResult::Done(z, None::<String>),
            _ => return ()
        };
        match Y {
            ::IResult::Done(x, _) => {
                let bytes = b";   ";
                let len = x.len();
                bytes[len];
            }
            _ => ()
        }
    }
}

@dotdash
Contributor

dotdash commented Jan 23, 2018

FWIW, if we want to keep the changed pass order, we could also mark MLSM as not preserving MemDep, which is true anyway and fixes the bug. Not sure how much of a slowdown that means.

@arielb1
Contributor

arielb1 commented Jan 23, 2018

I'm having some difficulties debugging this. @dotdash, could you try to find a fix? I think marking MLSM as not preserving MemDep is probably a good idea.

@dotdash
Contributor

dotdash commented Jan 23, 2018

JFYI, I found a reduced testcase and added that to the LLVM issue.

@nikomatsakis
Contributor

FWIW, if we want to keep the changed pass order, we could also mark MLSM as not preserving MemDep, which is true anyway and fixes the bug. Not sure how much of a slowdown that means.

If it's true anyway, seems good to me.

@nikomatsakis
Contributor

@dotdash are you up for making that change?

@dotdash
Contributor

dotdash commented Jan 25, 2018

Waiting on rust-lang/llvm#102 now

@oli-obk
Contributor

oli-obk commented Jan 27, 2018

:( I have no clue what is happening, but this is the only test in all of run-pass that is failing on stage 1 in #46882 and it's failing with "index out of bounds: the len is 4 but the index is 2"

all run-fail tests are passing, all compile-fail tests are passing, all ui tests are passing and all run-pass tests except for this one are passing. Any tips?

side note: the MIR pre-trans looks totally fine

@oli-obk
Contributor

oli-obk commented Jan 27, 2018

sorry about the noise, it was just x.py failing to build the new llvm
