Segfault on latest nightly with codegen-units > 1 #47364
Comments
Cannot reproduce on Linux with exactly the same rustc:
I even went out of my way to compile with a number of different flags. What happens instead is:
right after the executable is started. This makes me suspect there might be UB-invoking code. Please provide more information, and a test case that is as minimal as possible. |
That's very interesting... I was compiling with some special I'd love to give a smaller replicating example, but this is a pretty large codebase, so reducing becomes tricky. Do you have a recommendation for where I should start looking/pruning? |
So, I was including your most recent commit, but I also made sure to set some It seems to me like it would be easier to reduce the panic above, as the backtrace seems fairly local to your benchmark executable, but you could work on reducing for the sigsegv as well. I often begin reducing by blindly removing non-critical components piecemeal. For example, one of the first components I’d remove would be argument parsing (and with it the fairly large dependency on clap), hardcoding the arguments used to reproduce instead. |
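A minimal sketch of that first reduction step, assuming a benchmark whose settings normally come from an argument parser. `BenchArgs`, its fields, the numbers, and `run` are all hypothetical placeholders, not names from the project being reduced; the point is only that the parser and its dependency disappear while the reproducing values stay fixed.

```rust
/// Hypothetical benchmark settings; the field names are illustrative only
/// and are not taken from the project under discussion.
#[derive(Debug, Clone, PartialEq)]
struct BenchArgs {
    articles: usize,
    runtime_secs: u64,
}

impl BenchArgs {
    /// Stand-in for the removed argument parser: the values that were used
    /// to trigger the crash are baked in (the numbers here are made up).
    fn hardcoded() -> Self {
        BenchArgs { articles: 1000, runtime_secs: 10 }
    }
}

/// Placeholder for the code path that still reproduces the problem.
fn run(args: &BenchArgs) -> u64 {
    args.articles as u64 * args.runtime_secs
}

fn main() {
    // No command-line parsing at all: one fewer dependency to rule out.
    let args = BenchArgs::hardcoded();
    println!("running with {:?} -> {}", args, run(&args));
}
```

If the crash survives this change, the parsing dependency can be dropped from the manifest and the next component removed in the same way.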
I'll try that. Curiously enough, I still see the SIGSEGV even with |
Here's a start:

extern crate nom_sql;

fn main() {
    let r_txt = "SELECT Article.id, title, VoteCount.votes AS votes \
                 FROM Article \
                 LEFT JOIN (SELECT Vote.id, COUNT(user) AS votes \
                 FROM Vote GROUP BY Vote.id) AS VoteCount \
                 ON (Article.id = VoteCount.id) WHERE Article.id = ?";
    nom_sql::parser::parse_query(r_txt);
}

This segfaults with 4 units, though only some of the time. Depends only on the |
The smallest reproducing example I've been able to come up with thus far can be found at https://github.com/jonhoo/rust-issue-47364. Just run |
Now down to:

#[macro_use]
extern crate nom; // nom = "1.2.4" in [dependencies]

use nom::multispace;

pub enum ConditionExpression {
    Field(String),
    Placeholder,
}

pub fn condition_expr<'a>(i: &'a [u8]) -> nom::IResult<&[u8], ConditionExpression, u32> {
    nom::IResult::Done(i, ConditionExpression::Placeholder)
}

named!(pub selection<&[u8], Option<ConditionExpression>>,
    chain!(
        select: chain!(
            tag!("x") ~
            cond: opt!(complete!(chain!(
                multispace? ~
                cond: condition_expr,
                || { cond }
            ))) ~
            || { cond }
        ) ~
        tag!(";"),
        || { select }
    )
);

fn main() {
    selection("x ".as_bytes());
}

I don't know that I can minimize much more without starting to dig into |
@nagisa I've now pruned a bunch of the code from |
I can reproduce with #47364 (comment) and b192e82 from the repository, but not with any of the later minimisations. That is still a good amount of progress. I will look into this. |
Yeah, from what I can tell it must be related to memory layout to some extent. Specifically, the latest commit always segfaults on my laptop, and never on another machine I have available (both x86_64, both fairly up-to-date Arch Linux installs). I observed this while fiddling with the minimization too though -- seemingly unrelated changes (like removing a struct field) would cause the segfault to disappear. This suggests that there's memory corruption going on, and that the question is just whether or not it triggers a segfault. The latest commit also has the memory corruption, it just doesn't consistently (as in, across machines) trigger a segfault. |
Actually, I can reproduce with jonhoo/rust-issue-47364@3f0e3a7 on a different machine too, and that already has a decent amount of minimization. |
Memory corruption of some sort is my conclusion as well. I ended up arriving at a fairly different minimisation. This minimisation doesn’t crash, but it still produces different results between the optimised and non-optimised versions at a carefully placed assertion in the code. Sadly my case is still huge and results in a function of roughly 5k lines of IR that has nothing obviously wrong with it. |
I'm seeing this crop up in some other cases too now, though none that are particularly much easier to minimize. Any luck digging into this yet @nagisa? |
Also, could this be related to #47071? |
triage: P-high Regression, segfault. @nagisa let me know if you feel you don't have time to follow-up here. |
I’m planning to work on this further, however I’m also fairly busy with other matters that take precedence over this, so I’m not currently actively looking at the issue. |
I've been trying to minimize this but I'm unfortunately not making a ton of progress. So far I've got this Rust code which fails with:
Turning on ThinLTO causes it to segfault but either way something funky is happening here. I don't know yet whether this is an LLVM or trans bug. I've narrowed the bad IR down to one of the codegen units, and I've attempted to hand-minimize that IR as much as possible, reaching this state. Unfortunately it's still quite large and it's not at all clear where the bug is. |
Ugh, yeah, that's not great. I wish I could provide any more insight, but it seems almost arbitrary whether removing something will still trigger the fault. I sometimes found that removing more code would cause the bug to reappear. Similarly, sometimes removing a particular piece would cause the bug to disappear, but if something else was removed first, then removing it would cause the crash to remain. This all makes me think that the bug may actually be there most of the time, but just doesn't manifest as a crash or panic. Running the binary through valgrind may help confirm this suspicion, and might allow minimizing significantly more... |
Aha interesting! I saw #47674 come in and using bisect-rust also points to #46739 as the PR-at-fault. I'll tag this appropriately for compiler team discussion. @arielb1 you're likely interested in this though. |
FWIW, it seems that the MergedLoadStoreMotionPass corrupts the memory dependence results, leaving cache entries that claim a non-local dependency which actually became a local one due to this pass. This causes GVN to perform broken replacements later on. |
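The failure mode described here, a cached analysis answer that a later transformation makes stale, can be illustrated with a toy model. This is only a sketch under loud assumptions: `Inst`, `Dep`, `DepCache`, and `sink_store` are hypothetical names invented for the illustration, and nothing below is LLVM's actual API, IR, or data structures.

```rust
use std::collections::HashMap;

/// Toy instruction set; not LLVM IR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    Nop,
    Store { slot: u8 },
    Load { slot: u8 },
}

/// Where the value seen by a load is defined, in this toy model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dep {
    /// A store at this index in the same block.
    Local(usize),
    /// No defining store in this block.
    NonLocal,
}

/// Scan backwards from the load for a store to the same slot.
fn compute_dep(block: &[Inst], load_idx: usize) -> Dep {
    let slot = match block[load_idx] {
        Inst::Load { slot } => slot,
        other => panic!("expected a load, found {:?}", other),
    };
    for idx in (0..load_idx).rev() {
        if block[idx] == (Inst::Store { slot }) {
            return Dep::Local(idx);
        }
    }
    Dep::NonLocal
}

/// Hypothetical cache of dependence queries, keyed by load index.
struct DepCache {
    entries: HashMap<usize, Dep>,
}

impl DepCache {
    fn new() -> Self {
        DepCache { entries: HashMap::new() }
    }

    /// Return the cached answer if present, otherwise compute and remember it.
    fn query(&mut self, block: &[Inst], load_idx: usize) -> Dep {
        *self
            .entries
            .entry(load_idx)
            .or_insert_with(|| compute_dep(block, load_idx))
    }

    /// Drop every cached answer so later queries are recomputed.
    fn invalidate(&mut self) {
        self.entries.clear();
    }
}

/// Toy transform: a store is moved into the block, ahead of the load.
fn sink_store(block: &mut [Inst], slot: u8) {
    block[0] = Inst::Store { slot };
}

/// Query, transform, query again; optionally invalidate in between.
fn requery_after_transform(invalidate: bool) -> Dep {
    let mut block = vec![Inst::Nop, Inst::Load { slot: 0 }];
    let mut cache = DepCache::new();
    assert_eq!(cache.query(&block, 1), Dep::NonLocal);
    sink_store(&mut block, 0);
    if invalidate {
        cache.invalidate();
    }
    cache.query(&block, 1)
}

fn main() {
    println!("without invalidation: {:?}", requery_after_transform(false));
    println!("with invalidation:    {:?}", requery_after_transform(true));
}
```

Without invalidation the second query still reports `NonLocal` even though a defining store now sits in the same block. A consumer that trusts that answer would act on outdated information, which is the shape of the problem reported for GVN above.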
Should we roll back #46739 then? |
AFAICT, the changed order only makes it more likely for the bug to be exposed. MLSM followed by GVN is already present in the original pass order. I've filed https://bugs.llvm.org//show_bug.cgi?id=36063 |
Also, this is how far I could minimize:

fn main() {
    nom_sql::selection(b"x ");
}

pub enum Err<P>{
    Position(P),
    NodePosition(u32),
}

pub enum IResult<I,O> {
    Done(I,O),
    Error(Err<I>),
    Incomplete(u32, u64)
}

pub fn multispace<T: Copy>(input: T) -> ::IResult<i8, i8> {
    ::IResult::Done(0, 0)
}

mod nom_sql {
    fn where_clause(i: &[u8]) -> ::IResult<&[u8], Option<String>> {
        let X = match ::multispace(i) {
            ::IResult::Done(..) => ::IResult::Done(i, None::<String>),
            _ => ::IResult::Error(::Err::NodePosition(0)),
        };
        match X {
            ::IResult::Done(_, _) => ::IResult::Done(i, None),
            _ => X
        }
    }

    pub fn selection(i: &[u8]) {
        let Y = match {
            match {
                where_clause(i)
            } {
                ::IResult::Done(_, o) => ::IResult::Done(i, Some(o)),
                ::IResult::Error(_) => ::IResult::Done(i, None),
                _ => ::IResult::Incomplete(0, 0),
            }
        } {
            ::IResult::Done(z, _) => ::IResult::Done(z, None::<String>),
            _ => return ()
        };
        match Y {
            ::IResult::Done(x, _) => {
                let bytes = b"; ";
                let len = x.len();
                bytes[len];
            }
            _ => ()
        }
    }
} |
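For reference, tracing the reduced program by hand suggests that a correctly compiled build should reach the final `Done` arm with `x` equal to the two-byte input, so `bytes[len]` would index position 2 of a two-byte array and fail the bounds check rather than segfault. The sketch below is a restatement under assumptions, not the original test case: `selection_index` is a hypothetical helper that returns the index instead of indexing with it, the local names are lowercased, and the leading `::` paths are dropped so the file builds on any edition.

```rust
#[allow(dead_code)]
enum Err<P> {
    Position(P),
    NodePosition(u32),
}

#[allow(dead_code)]
enum IResult<I, O> {
    Done(I, O),
    Error(Err<I>),
    Incomplete(u32, u64),
}

fn multispace<T: Copy>(_input: T) -> IResult<i8, i8> {
    IResult::Done(0, 0)
}

fn where_clause(i: &[u8]) -> IResult<&[u8], Option<String>> {
    let x = match multispace(i) {
        IResult::Done(..) => IResult::Done(i, None::<String>),
        _ => IResult::Error(Err::NodePosition(0)),
    };
    match x {
        IResult::Done(_, _) => IResult::Done(i, None),
        _ => x,
    }
}

/// Hypothetical helper: same control flow as the reduced `selection`, but it
/// returns the index that would be used, or `None` on the early-return path.
fn selection_index(i: &[u8]) -> Option<usize> {
    let y = match match where_clause(i) {
        IResult::Done(_, o) => IResult::Done(i, Some(o)),
        IResult::Error(_) => IResult::Done(i, None),
        _ => IResult::Incomplete(0, 0),
    } {
        IResult::Done(z, _) => IResult::Done(z, None::<String>),
        _ => return None,
    };
    match y {
        IResult::Done(x, _) => Some(x.len()),
        _ => None,
    }
}

fn main() {
    let bytes = b"; ";
    let idx = selection_index(b"x ").expect("reduced input takes the Done path");
    // `get` reports whether the index is in bounds without panicking.
    println!("index = {}, bytes.len() = {}", idx, bytes.len());
    println!("in bounds: {}", bytes.get(idx).is_some());
}
```

Any other index, or a crash before the bounds check, would point at a miscompilation rather than at the source program.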
FWIW, if we want to keep the changed pass order, we could also mark MLSM as not preserving MemDep, which is true anyway and fixes the bug. Not sure how much of a slowdown that means. |
I'm having some difficulties debugging this. @dotdash could you try to find a fix? I think the MLSM not preserving MemDep is probably a good idea. |
JFYI, I found a reduced testcase and added that to the LLVM issue. |
If it's true anyway, seems good to me. |
@dotdash are you up for making that change? |
Waiting on rust-lang/llvm#102 now |
:( I have no clue what is happening, but this is the only test in all of all run-fail tests are passing, all compile-fail tests are passing, all ui tests are passing and all run-pass tests except for this one are passing. Any tips? side note: the MIR pre-trans looks totally fine |
sorry about the noise, it was just x.py failing to build the new llvm |
On the latest nightly, the default is now to run release builds with codegen-units > 1 (see #46910). For some reason, this causes https://github.com/mit-pdos/distributary to segfault at runtime. Setting makes the segfault go away. To reproduce: