
Segfault on latest nightly with codegen-units > 1 #47364

Closed
jonhoo opened this issue Jan 11, 2018 · 31 comments
Labels
C-bug Category: This is a bug.
I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics.
P-high High priority
regression-from-stable-to-nightly Performance or correctness regression from stable to nightly.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone
1.25

Comments

@jonhoo
Contributor

jonhoo commented Jan 11, 2018

On the latest nightly, the default is now to run release builds with codegen-units > 1 (see #46910). For some reason, this causes https://github.com/mit-pdos/distributary to segfault at runtime. Setting

[profile.release]
codegen-units = 1

makes the segfault go away. To reproduce:

$ rustc +nightly --version
rustc 1.25.0-nightly (f62f77403 2018-01-10)
$ git clone https://github.com/mit-pdos/distributary.git
$ cd distributary
$ git checkout open-loop
$ cargo +nightly b --release --manifest-path benchmarks/Cargo.toml --bin vote
$ target/release/vote -r 5 --target 100000 localsoup
Segmentation fault (core dumped)
$ # add codegen-units = 1 to Cargo.toml under [profile.release]
$ cargo +nightly b --release --manifest-path benchmarks/Cargo.toml --bin vote
$ target/release/vote -r 5 --target 100000 localsoup
# wait ~5s and notice that it completes without crashing
@kennytm kennytm added I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. C-bug Category: This is a bug. labels Jan 11, 2018
@Mark-Simulacrum Mark-Simulacrum added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 11, 2018
jonhoo added a commit to mit-pdos/noria that referenced this issue Jan 12, 2018
@nagisa
Member

nagisa commented Jan 12, 2018

Cannot reproduce on Linux with exactly the same rustc:

rustc 1.25.0-nightly (f62f774 2018-01-10)

I even went out of my way to compile with a number of different flags. What happens instead is:

thread 'main' panicked at 'no entry found for key', libcore/option.rs:891:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:68
             at libstd/sys_common/backtrace.rs:57
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::option::expect_failed
             at libcore/option.rs:891
  10: vote::clients::localsoup::graph::make
             at /checkout/src/libcore/option.rs:302
             at /checkout/src/liballoc/btree/map.rs:1753
             at benchmarks/vote/clients/../../vote-old/graph.rs:149
  11: vote::run
             at benchmarks/vote/clients/localsoup.rs:55
             at benchmarks/vote/main.rs:50
  12: vote::main
             at benchmarks/vote/main.rs:441
  13: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  14: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:479
  15: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  16: std::rt::lang_start_internal
             at libstd/panicking.rs:458
             at libstd/panic.rs:365
             at libstd/rt.rs:58
  17: main
  18: __libc_start_main
  19: _start

right after the executable is started. This makes me suspect there might be UB-invoking code.

Please provide more information, and a test case as minimal as possible.

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

That's very interesting... I was compiling with some special RUSTFLAGS (specifically with ld-gold and -Cnative), but even with env RUSTFLAGS= cargo b I observe the segfault. Are you sure you didn't accidentally also build with 8b973c8480cf0cbd51746458bbe4d426451cfbbc (which I pushed yesterday to the open-loop branch and which sets codegen-units=1 by default)? Also, I never see the crash you reported above, so clearly something weird is going on (and you may be right that it's UB :/)
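For reference, a sketch of the two builds being compared; the exact flag spellings for the gold linker and native CPU tuning are assumptions based on the description above:

$ # original build with the special flags
$ RUSTFLAGS="-C link-arg=-fuse-ld=gold -C target-cpu=native" cargo +nightly build --release
$ # rebuild with RUSTFLAGS cleared -- the segfault still reproduces
$ env RUSTFLAGS= cargo +nightly build --release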

I'd love to give a smaller replicating example, but this is a pretty large codebase, so reducing becomes tricky. Do you have a recommendation for where I should start looking/pruning?

@nagisa
Member

nagisa commented Jan 12, 2018

So, I was including your most recent commit, but I also made sure to set -Ccodegen-units=4, because the issue report claims it reproduces with codegen-units > 1. That's where the panic happens. With -Ccodegen-units=16 or -Ccodegen-units=12 I see the SIGSEGV.
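A sketch of one way those values can be passed through for a quick check (setting them in the profile works too; the exact invocation used here is an assumption):

$ RUSTFLAGS="-C codegen-units=4" cargo +nightly build --release --manifest-path benchmarks/Cargo.toml --bin vote
$ RUSTFLAGS="-C codegen-units=16" cargo +nightly build --release --manifest-path benchmarks/Cargo.toml --bin vote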

It seems to me like it would be easier to reduce the panic above, as the backtrace seems fairly local to your benchmark executable, but you could work on reducing the SIGSEGV as well. I often begin reducing by blindly removing non-critical components piecemeal. For example, one of the first components I'd remove would be argument parsing (and with it the fairly large dependency on clap), hardcoding the arguments used to reproduce instead (see the sketch below).
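A minimal sketch of that first reduction step; run and its parameter names are hypothetical stand-ins for whatever the benchmark actually uses:

// Hypothetical reduction sketch: drop clap entirely and hardcode the values
// from the failing invocation `vote -r 5 --target 100000 localsoup`.
fn run(runtime_secs: u64, target: usize, backend: &str) {
    // ... the original benchmark body would go here, unchanged ...
    println!("running {} for {}s targeting {} ops/s", backend, runtime_secs, target);
}

fn main() {
    run(5, 100_000, "localsoup");
}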

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

I'll try that. Curiously enough, I still see the SIGSEGV even with codegen-units = 4, but I'll just have to work with it.

@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

Here's a start:

extern crate nom_sql;

fn main() {
    let r_txt = "SELECT Article.id, title, VoteCount.votes AS votes \
                            FROM Article \
                            LEFT JOIN (SELECT Vote.id, COUNT(user) AS votes \
                                       FROM Vote GROUP BY Vote.id) AS VoteCount \
                            ON (Article.id = VoteCount.id) WHERE Article.id = ?";
    nom_sql::parser::parse_query(r_txt);
}

This segfaults with 4 codegen units, though only some of the time. It depends only on the nom-sql crate. I'll keep digging.
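A sketch of the Cargo.toml behind this reproduction (the exact nom-sql version and source at the time are assumptions):

[package]
name = "repro"
version = "0.1.0"

[dependencies]
nom-sql = "*"   # assumption: whichever nom-sql release distributary was pulling in

[profile.release]
codegen-units = 4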

jonhoo added a commit to jonhoo/rust-issue-47364 that referenced this issue Jan 12, 2018
@jonhoo
Contributor Author

jonhoo commented Jan 12, 2018

The smallest reproducing example I've been able to come up with thus far can be found at https://github.com/jonhoo/rust-issue-47364. Just run cargo run --release with nightly. That code has no unsafe (though I suppose nom probably does?).
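For reference, the full set of steps (the segfault shows up only on affected machines):

$ git clone https://github.com/jonhoo/rust-issue-47364
$ cd rust-issue-47364
$ cargo +nightly run --release
Segmentation fault (core dumped)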

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Now down to:

#[macro_use]
extern crate nom; // nom = "1.2.4" in [dependencies]
use nom::multispace;

pub enum ConditionExpression {
    Field(String),
    Placeholder,
}

pub fn condition_expr<'a>(i: &'a [u8]) -> nom::IResult<&[u8], ConditionExpression, u32> {
    nom::IResult::Done(i, ConditionExpression::Placeholder)
}

named!(pub selection<&[u8], Option<ConditionExpression>>,
    chain!(
        select: chain!(
            tag!("x") ~
            cond: opt!(complete!(chain!(
                multispace? ~
                cond: condition_expr,
                || { cond }
            ))) ~
            || { cond }
        ) ~
        tag!(";"),
        || { select }
    )
);

fn main() {
    selection("x ".as_bytes());
}

I don't know that I can minimize much more without starting to dig into nom (and an outdated one at that...)

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

@nagisa I've now pruned a bunch of the code from nom as well. The linked repository (https://github.com/jonhoo/rust-issue-47364) has no external dependencies and no unsafe code, but still segfaults if you run cargo +nightly r --release.

@nagisa
Member

nagisa commented Jan 13, 2018

I can reproduce with #47364 (comment) and b192e82 from the repository, but not with any of the later minimisations.

That is still a good amount of progress. I will look into this.

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Yeah, from what I can tell it must be related to memory layout to some extent. Specifically, the latest commit always segfaults on my laptop, and never on another machine I have available (both x86_64, both fairly up-to-date Arch Linux installs). I observed this while fiddling with the minimization too -- seemingly unrelated changes (like removing a struct field) would cause the segfault to disappear. This suggests that there's memory corruption going on, and that the question is just whether or not it triggers a segfault. The latest commit also has the memory corruption; it just doesn't consistently (as in, across machines) trigger a segfault. valgrind may be able to help (see the sketch below).
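A sketch of such a valgrind run (the binary name is an assumption based on the repository name):

$ cargo +nightly build --release
$ # look for invalid reads/writes even on runs that happen not to segfault
$ valgrind --error-exitcode=1 target/release/rust-issue-47364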

@jonhoo
Contributor Author

jonhoo commented Jan 13, 2018

Actually, I can reproduce with jonhoo/rust-issue-47364@3f0e3a7 on a different machine too, and that already has a decent amount of minimization.

@nagisa
Member

nagisa commented Jan 13, 2018

Memory corruption of some sort is my conclusion as well. I ended up arriving at a fairly different minimisation. This minimisation doesn't crash, but still produces different results between the optimised and non-optimised versions at a carefully placed assertion in the code.

Sadly my case is still huge and results in a function of roughly 5k lines of IR that has nothing obviously wrong with it.

@Mark-Simulacrum Mark-Simulacrum added this to the 1.25 milestone Jan 15, 2018
@jonhoo
Contributor Author

jonhoo commented Jan 17, 2018

I'm seeing this crop up in some other cases too now, though none that are particularly easier to minimize. Any luck digging into this yet, @nagisa?

@jonhoo
Contributor Author

jonhoo commented Jan 17, 2018

Also, could this be related to #47071?

@nikomatsakis
Contributor

triage: P-high

Regression, segfault. @nagisa, let me know if you feel you don't have time to follow up here.

@rust-highfive rust-highfive added P-high High priority and removed I-nominated labels Jan 18, 2018
@nagisa
Member

nagisa commented Jan 18, 2018

I’m planning to work on this further, however I’m also fairly busy with other matters that take precedence over this, so I’m not currently actively looking at the issue.

@alexcrichton
Member

I've been trying to minimize this but I'm unfortunately not making a ton of progress.

So far I've got this Rust code which fails with:

$ rustc +nightly foo.rs -C opt-level=1 -C codegen-units=200 -Z thinlto=no
$ ./foo
$ rustc +nightly foo.rs -C opt-level=2 -C codegen-units=200 -Z thinlto=no
$ ./foo
thread 'main' panicked at 'index 5 out of range for slice of length 1', libcore/slice/mod.rs:785:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Turning on ThinLTO causes it to segfault, but either way something funky is happening here. I don't know yet whether this is an LLVM or trans bug. I've narrowed the bad IR down to one of the codegen units, and I've attempted to hand-minimize that IR as much as possible, reaching this state. Unfortunately it's still quite large and it's not at all clear where the bug is.
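For anyone following along, one way to get at the per-codegen-unit IR for that kind of narrowing (a sketch; the exact names of the emitted .ll files depend on rustc's codegen-unit naming):

$ rustc +nightly foo.rs -C opt-level=2 -C codegen-units=200 -Z thinlto=no --emit=llvm-ir
$ ls foo*.ll    # roughly one LLVM IR file per codegen unit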

@jonhoo
Contributor Author

jonhoo commented Jan 19, 2018

Ugh, yeah, that's not great. I wish I could provide any more insight, but it seems almost arbitrary whether removing something will still trigger the fault. I sometimes found that removing more code would cause the bug to reappear. Similarly, sometimes removing a particular piece would cause the bug to disappear, but if something else was removed first, then removing it would cause the crash to remain. This all makes me think that the bug may actually be there most of the time, but just doesn't manifest as a crash or panic. Running the binary through valgrind may help confirm this suspicion, and might allow minimizing significantly more...

@alexcrichton
Member

Aha interesting!

I saw #47674 come in, and bisecting with bisect-rust also points to #46739 as the PR at fault.

I'll tag this appropriately for compiler team discussion. @arielb1 you're likely interested in this though.

@alexcrichton alexcrichton added the regression-from-stable-to-beta Performance or correctness regression from stable to beta. label Jan 23, 2018
@alexcrichton alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. I-nominated and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Jan 23, 2018
@dotdash
Contributor

dotdash commented Jan 23, 2018

FWIW, it seems that the MergedLoadStoreMotionPass corrupts the memory dependence results, leaving cache entries that claim a non-local dependency which actually became a local one due to this pass. This causes GVN to perform broken replacements later on.
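If someone wants to poke at that pass interaction directly on extracted IR, a sketch using the legacy pass-manager flags of that era (the flag names are an assumption):

$ rustc +nightly foo.rs -C codegen-units=1 --emit=llvm-ir -o foo.ll
$ opt -S -mldst-motion -gvn foo.ll -o foo.opt.ll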

@nikomatsakis
Contributor

Should we roll back #46739 then?

@dotdash
Contributor

dotdash commented Jan 23, 2018

AFAICT, the changed order only makes it more likely for the bug to be exposed. MLSM followed by GVN is already present in the original pass order.

I've filed https://bugs.llvm.org//show_bug.cgi?id=36063

@dotdash
Contributor

dotdash commented Jan 23, 2018

Also, this is how far I could minimize:

fn main() {
    nom_sql::selection(b"x ");
}

pub enum Err<P>{
    Position(P),
    NodePosition(u32),
}

pub enum IResult<I,O> {
    Done(I,O),
    Error(Err<I>),
    Incomplete(u32, u64)
}

pub fn multispace<T: Copy>(input: T) -> ::IResult<i8, i8> {
    ::IResult::Done(0, 0)
}

mod nom_sql {
    fn where_clause(i: &[u8]) -> ::IResult<&[u8], Option<String>> {
        let X = match ::multispace(i) {
            ::IResult::Done(..) => ::IResult::Done(i, None::<String>),
            _ => ::IResult::Error(::Err::NodePosition(0)),
        };
        match X {
            ::IResult::Done(_, _) => ::IResult::Done(i, None),
            _ => X
        }
    }

    pub fn selection(i: &[u8]) {
        let Y = match {
            match {
                where_clause(i)
            } {
                ::IResult::Done(_, o) => ::IResult::Done(i, Some(o)),
                ::IResult::Error(_) => ::IResult::Done(i, None),
                _ => ::IResult::Incomplete(0, 0),
            }
        } {
            ::IResult::Done(z, _) => ::IResult::Done(z, None::<String>),
            _ => return ()
        };
        match Y {
            ::IResult::Done(x, _) => {
                let bytes = b";   ";
                let len = x.len();
                bytes[len];
            }
            _ => ()
        }
    }
}

@dotdash
Contributor

dotdash commented Jan 23, 2018

FWIW, if we want to keep the changed pass order, we could also mark MLSM as not preserving MemDep, which is true anyway and fixes the bug. Not sure how much of a slowdown that means.

@arielb1
Contributor

arielb1 commented Jan 23, 2018

I'm having some difficulties debugging this. @dotdash, could you try to find a fix? I think marking MLSM as not preserving MemDep is probably a good idea.

@dotdash
Contributor

dotdash commented Jan 23, 2018

JFYI, I found a reduced testcase and added that to the LLVM issue.

@nikomatsakis
Contributor

FWIW, if we want to keep the changed pass order, we could also mark MLSM as not preserving MemDep, which is true anyway and fixes the bug. Not sure how much of a slowdown that means.

If it's true anyway, seems good to me.

@nikomatsakis
Contributor

@dotdash are you up for making that change?

@dotdash
Contributor

dotdash commented Jan 25, 2018

Waiting on rust-lang/llvm#102 now

@oli-obk
Contributor

oli-obk commented Jan 27, 2018

:( I have no clue what is happening, but this is the only test in all of run-pass that is failing on stage 1 in #46882 and it's failing with "index out of bounds: the len is 4 but the index is 2"

all run-fail tests are passing, all compile-fail tests are passing, all ui tests are passing and all run-pass tests except for this one are passing. Any tips?

side note: the MIR pre-trans looks totally fine

@oli-obk
Contributor

oli-obk commented Jan 27, 2018

sorry about the noise, it was just x.py failing to build the new llvm
