Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory unsafety problem in safe Rust #69225

Closed
dfyz opened this issue Feb 17, 2020 · 38 comments
Closed

Memory unsafety problem in safe Rust #69225

dfyz opened this issue Feb 17, 2020 · 38 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness ICEBreaker-LLVM Bugs identified for the LLVM ICE-breaker group P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@dfyz
Copy link
Contributor

dfyz commented Feb 17, 2020

I have a small program (a simplification of a test function from a larger project) that slices a small array and tries to access an out-of-bounds element of the slice. Running it with cargo run --release using the stable 1.41.0 release prints something like this (tested on macOS 10.15 and Ubuntu 19.10):

0 0 3 18446744073709551615
[1]    21065 segmentation fault  cargo run --release

It looks like the resulting slice somehow has length 2**64 - 1, so the bounds checking is omitted, which predictably results in a segfault. On 1.39.0 and 1.40.0 the very same program prints what I would expect:

0 0 3 0
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 16777216', src/main.rs:13:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

The problem goes away if I do any of the following:

  • remove either of the two do_test(...); calls in main();
  • remove the for _ in 0..1 { loop;
  • replace the for y in 0..x { loop with for y in 0..1 {;
  • remove the z.extend(std::iter::repeat(0).take(x)); line or replace it with z.extend(std::iter::repeat(0).take(1));;
  • replace the for arr_ref in arr { loop with let arr_ref = &arr[0];;
  • specify RUSTFLAGS="-C opt-level=2";
  • specify RUSTFLAGS="-C codegen-units=1".

My best guess is -C opt-level=3 enables a problematic optimization pass in LLVM, which results in miscompilation. This is corroborated by the fact that MIR (--emit mir) and LLVM IR before optimizations (--emit llvm-ir -C no-prepopulate-passes) is the same for both -C opt-level=2 and -C opt-level=3.

Some additional info that might be helpful:

  • I can't reproduce the problem in the Rust playground (presumably because it uses codegen-units = 1);
  • I can't reproduce the problem on Windows 10 with the same 1.41.0 release (no idea what makes it different);
  • cargo-bisect-rustc says the regression first happened in the 2019-12-12 nightly, specifically in this commit. This seems suspicious to me, given that 1.40.0, which does not exhibit the problem, was released after this date.

I'm attaching the program inline in case the GitHub repo doesn't work (if you want to compile it without Cargo, use rustc -C opt-level=3 main.rs):

fn do_test(x: usize) {
    let arr = vec![vec![0u8; 3]];

    let mut z = Vec::new();
    for arr_ref in arr {
        for y in 0..x {
            for _ in 0..1 {
                z.extend(std::iter::repeat(0).take(x));
                let a = y * x;
                let b = (y + 1) * x - 1;
                let slice = &arr_ref[a..b];
                eprintln!("{} {} {} {}", a, b, arr_ref.len(), slice.len());
                eprintln!("{:?}", slice[1 << 24]);
            }
        }
    }
}

fn main() {
    do_test(1);
    do_test(2);
}
@dfyz dfyz added the C-bug Category: This is a bug. label Feb 17, 2020
@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-nominated I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Feb 17, 2020
@pietroalbini
Copy link
Member

cc @rust-lang/compiler
@rustbot ping icebreakers-llvm

The release team is considering making a point release for Rust 1.41 (we briefly discussed it in last week's meeting), and I'd love for this to be included in it if we can get a PR up soon.

@rustbot rustbot added the ICEBreaker-LLVM Bugs identified for the LLVM ICE-breaker group label Feb 17, 2020
@rustbot
Copy link
Collaborator

rustbot commented Feb 17, 2020

Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
instructions for tackling these sorts of bugs. Maybe take a look?
Thanks! <3

cc @comex @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj

@RalfJung
Copy link
Member

RalfJung commented Feb 17, 2020

Running it with cargo run --release using the stable 1.41.0 release prints something like this (tested on macOS 10.15 and Ubuntu 19.10):

I cannot reproduce this on playground. The program works fine there on 1.41.0 in release mode.

EDIT: Ah, you already said that.
Also the program is fine in Miri, so this is likely not UB but a miscompilation.

@BurntSushi
Copy link
Member

Just to add a data point, I can reproduce this on Linux with the latest nightly:

[andrew@krusty rust-69225]$ rustc --version
rustc 1.43.0-nightly (5e7af4669 2020-02-16)

[andrew@krusty rust-69225]$ cat main.rs
fn do_test(x: usize) {
    let arr = vec![vec![0u8; 3]];

    let mut z = Vec::new();
    for arr_ref in arr {
        for y in 0..x {
            for _ in 0..1 {
                z.extend(std::iter::repeat(0).take(x));
                let a = y * x;
                let b = (y + 1) * x - 1;
                let slice = &arr_ref[a..b];
                eprintln!("{} {} {} {}", a, b, arr_ref.len(), slice.len());
                eprintln!("{:?}", slice[1 << 24]);
            }
        }
    }
}

fn main() {
    do_test(1);
    do_test(2);
}

[andrew@krusty rust-69225]$ rustc -C opt-level=3 main.rs

[andrew@krusty rust-69225]$ ./main
0 0 3 18446744073709551615
zsh: segmentation fault (core dumped)  ./main

I was able to reproduce the above with the exact same output with Rust 1.41 stable. Rust 1.40 stable does not exhibit the problem:

$ ./main
0 0 3 0
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 16777216', main.rs:13:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

I think this is all consistent with @dfyz's report, except this at least confirms that it isn't macOS specific.

@SimonSapin
Copy link
Contributor

  • cargo-bisect-rustc says the regression first happened in the 2019-12-12 nightly, specifically in this commit. This seems suspicious to me, given that 1.40.0, which does not exhibit the problem, was released after this date.

This is expected. 1.40.0 was released on 2019-12-19 based on what was then the beta branch, which was branched from master six weeks earlier (around the time of the 1.39.0 release). See https://doc.rust-lang.org/book/appendix-07-nightly-rust.html for more about release channels.

@RalfJung
Copy link
Member

If I had to guess, I'd say #67015 is the likely culprit. This fixed 3 codegen issues, so it touches codegen-critical code.
Cc @osa1 @oli-obk @wesleywiser

@jonas-schievink jonas-schievink added the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Feb 17, 2020
@osa1
Copy link
Contributor

osa1 commented Feb 17, 2020

Thanks for the ping. I'm currently making a build to investigate. It's possible that this is a bug introduced with #67015. Another possibility is that #67015 just revealed an existing bug.

@osa1
Copy link
Contributor

osa1 commented Feb 17, 2020

I managed to reproduce the segfault using rustc nightly on Linux, but not with my build generated with this config.toml:

config.toml
[llvm]

[build]

[install]

[rust]
optimize = true
debug = true
codegen-units = 0
debug-assertions = true
debuginfo-level = 2

[target.x86_64-unknown-linux-gnu]
llvm-config = "/usr/bin/llvm-config-9"

[dist]

Using rustc nightly I checked MIRs before and after ConstProp and the MIRs are identical. So if this is caused by ConstProp then it's because of a difference in generated code of a library, not of this program.

@hellow554
Copy link
Contributor

Regression in 033662d

@rustbot modify labels: -E-needs-bisection


@osa1 the debug-assertions = true is probably to blame. When I try to compile (with a vanilla nightly compiler) the program with -C debug-assertions=y the program panics instead of the segfault

@rustbot rustbot removed the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Feb 17, 2020
@shahn
Copy link
Contributor

shahn commented Feb 17, 2020

I think I solved it! Reverting a983e05 fixes the issue. It seemed like the culprit to me all day, but I couldn't test it at work :) Can someone help me how to best make a PR for this issue? I think the best way would be to add a testcase that must fail, but it seems this is very platform-specific etc, so maybe that is not a good idea after all? Any ideas here? Thanks!

@lqd
Copy link
Member

lqd commented Feb 17, 2020

(This was already bisected by the OP)

I managed to reproduce this with a local build with debug-assertions turned off (thanks for that @hellow554).

Some of the PRs in the rollup cause conflicts when reverted because we have since then rustfmt-ed everything, but I believe this issue is because of #67174.

edit: it seems we found this at the same time @shahn :)

@shahn

This comment has been minimized.

@peterhj
Copy link
Contributor

peterhj commented Feb 17, 2020

To add another data point, the problem disappears when codegen-units is set to 3 or less (assuming release profile, with incremental=false). In other words I'm able to reproduce when codegen-units is 4 or greater.

@tmiasko
Copy link
Contributor

tmiasko commented Feb 18, 2020

The call to panic_bounds_check disappears after LLVM Jump Threading pass. I can reproduce the issue with opt from LLVM 9, but not from LLVM 10.

@dfyz
Copy link
Contributor Author

dfyz commented Feb 18, 2020

So, I checked out and built a stage 1 rustc (./x.py build -i --stage 1), rebuilt libstd (./x.py build -i --stage 1 --keep-stage 0 src/libstd) without #67174, and recompiled the segfaulting program with four codegen units (rustc +stage1 -C opt-level=3 -C codegen-units=4 main.rs). As expected, this made the segfault go away. If I apply #67174 back again, the segfault returns.

This means I now have two compilers which only differ in the standard library they use. Let's call these compilers GOOD (no segfault) and BAD (segfault).

I then noticed that the 4 unoptimized *.ll files generated by GOOD (-C no-prepopulate-passes) are pretty much the same as the ones generated by BAD (the only difference I saw were the different random IDs in function names), but the optimized *.ll files (no -C no-prepopulate-passes) are wildly different. I'm not a compiler expert in any way, but since the program being compiled is exactly the same in both cases, both compilers proper are exactly the same, and the only difference is in a precompiled standard library, I think LTO might be involved.

Indeed, if I pass any of -Z thinlto=no, -C lto=no, -C lto=yes, -C lto=thin (at this point I'm really not sure, but I guess all forms of -C lto are different from ThinLTO, which is used by default) to BAD, the segfault once again disappears.

Does it look likely that LTO might be at fault here?

Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue Feb 21, 2020
This fixes a a segfault in safe code, a stable regression. Reported in
\rust-lang#69225.

This reverts commit a983e05.

Also adds a test for the expected behaviour.
Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue Feb 21, 2020
This fixes a a segfault in safe code, a stable regression. Reported in
\rust-lang#69225.

This reverts commit a983e05.

Also adds a test for the expected behaviour.
@dfyz
Copy link
Contributor Author

dfyz commented Feb 24, 2020

I think I just found a way to reproduce the issue even after reverting #67174. Here's a slightly longer, but still safe program that reliably segfaults on Windows, Linux and macOS using the latest nightly with #67174 reverted:

fn do_test(x: usize) {
    let mut arr = vec![vec![0u8; 3]];

    let mut z = vec![0];
    for arr_ref in arr.iter_mut() {
        for y in 0..x {
            for _ in 0..1 {
                z.reserve_exact(x);
                let iterator = std::iter::repeat(0).take(x);
                let mut cnt = 0;
                iterator.for_each(|_| {
                    z[0] = 0;
                    cnt += 1;
                });
                let a = y * x;
                let b = (y + 1) * x - 1;
                let slice = &mut arr_ref[a..b];
                slice[1 << 24] += 1;
            }
        }
    }
}

fn main() {
    do_test(1);
    do_test(2);
}

Windows:

PS> rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
PS> rustup run nightly cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target\release\rust-segfault.exe`
error: process didn't exit successfully: `target\release\rust-segfault.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

Linux:

$ rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
$ rustup run nightly cargo run --release
    Finished release [optimized] target(s) in 1.13s
     Running `target/release/rust-segfault`
Segmentation fault (core dumped)

macOS:

λ rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
λ rustup run nightly cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/rust-segfault`
[1]    24331 segmentation fault  rustup run nightly cargo run --release

This program doesn't depend on the number of codegen units, so it segfaults in the Playground, too (on stable, beta, and nightly). I also reproduced this by compiling rustc from master (with #67174 reverted) linked against LLVM 9.

The underlying LLVM bug is still the same, so upgrading to LLVM 10 or cherry-picking the LLVM fix makes the segfault go away.

I really wish I understood what's going on better. It does look like the bounds checks are elided because of the extra nuw, which comes from incorrectly cached SCEV values (just like in the C program from the thread @nikic linked to). But by the time the bad optimization happens, I can barely recognize my simple program through the layers of LLVM basic blocks; worse, any seemingly no-op change in the source code (e.g., removing the cnt variable) leads to a very differently looking LLVM IR and makes the problem disappear.

@dfyz
Copy link
Contributor Author

dfyz commented Feb 24, 2020

My impression is that 1.41.1 has just been finalized in #69359 (bad timing on my part), so there is not much that can be done at this point. Is it at least a good idea to update the comment in Layout::repeat() with a more detailed explanation of the LLVM issue? If so, I can send a PR.

@pietroalbini
Copy link
Member

pietroalbini commented Feb 24, 2020

My impression is that 1.41.1 has just been finalized in #69359 (bad timing on my part), so there is not much that can be done at this point.

If the patch we included in 1.41.1 doesn't actually fix the problem we should reconsider whether we want to backport the new fix and rebuild the release. There was consensus in the release team meeting not to backport the LLVM fix, but I personally think another this new PoC could warrant another discussion on the topic.

cc @Mark-Simulacrum @rust-lang/release

@pietroalbini
Copy link
Member

@dfyz we'll try to get another build of 1.41.1 with the LLVM fix backported, while we wait for consensus on actually shipping that.

@cuviper
Copy link
Member

cuviper commented Feb 24, 2020

FWIW, for me the new reproducer works as expected (index out of bounds) on stable 1.38.0 and earlier, but segfaults on 1.39.0 and later. There's not a whole lot of difference in LLVM between 1.38 and 1.39 (rust-lang/llvm-project@71fe7ec...8adf9bd), but it could be any sort of Rust-generated difference along the way too.

bors added a commit that referenced this issue Feb 24, 2020
…oalbini

[stable] Cherry-pick LLVM fix for 1.41.1

This PR cherry-picks the LLVM patch required to properly fix #69225 on 1.41.1, and adds a test to make sure the fix works.

r? @ghost
cuviper added a commit to cuviper/rust that referenced this issue Feb 25, 2020
@dfyz
Copy link
Contributor Author

dfyz commented Feb 25, 2020

the new reproducer works as expected (index out of bounds) on stable 1.38.0

I (accidentally) found out that setting -C codegen-units=1 on 1.38.0 reproduces the segfault. 1.37.0 seems safe to me (no combination of options I tried produces a segfault).

Disregard that, 1.37.0 uses LLVM 8.
Curiously, the LLVM IR diff between 1.37.0 and 1.38.0 (with -C codegen-units=1) is just one line:

- %71 = icmp eq {}* %70, null
+ %71 = icmp ule {}* %70, null

(where %70 is derived from the result of <core::slice::IterMut<T> as core::iter::traits::iterator::Iterator>::next())

This alone is enough to trick LLVM into adding the dreaded nuw to add nuw i64 %x, -1.

@cuviper
Copy link
Member

cuviper commented Feb 25, 2020

1.37.0 seems safe to me (no combination of options I tried produces a segfault).

That's using LLVM 8, so the blamed SCEV change shouldn't exist at all.

@dfyz
Copy link
Contributor Author

dfyz commented Feb 25, 2020

That's using LLVM 8

My bad, sorry for the confusion (I was so happy to reduce it to a one-line diff I didn't even bother checking the LLVM version).

@pietroalbini
Copy link
Member

We prepared new 1.41.1 artifacts with the LLVM fix cherry-picked in it. You can test them locally with:

RUSTUP_DIST_SERVER=https://dev-static.rust-lang.org rustup update stable

bors added a commit that referenced this issue Feb 25, 2020
Cherry-pick the LLVM fix for #69225

An additional reproducer was provided in #69225 -- the new testcase here -- which still crashes even after #69241 reverted #67174. Now this pull request updates LLVM with the cherry-picked reversion of its own. This is also going to stable in #69444.

I have not tried to reapply #67174 yet -- cc @kraai @shahn
@petrochenkov
Copy link
Contributor

ping in #69225 (comment)

[triagebot] The issue was successfully resolved without any involvement from the pinged compiler team.
Bretty good.

@dfyz
Copy link
Contributor Author

dfyz commented Feb 27, 2020

1.41.1 is out, I guess it's time to finally close this issue.

@dfyz dfyz closed this as completed Feb 27, 2020
Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue Feb 28, 2020
bors added a commit that referenced this issue Feb 28, 2020
[beta] backports

This backports the following PRs:

* ci: switch macOS builders to 10.15 #68863
* Backport release notes of 1.41.1 #69468
* Cherry-pick the LLVM fix for #69225 #69450
* `lit_to_const`: gracefully bubble up type errors. #69330
* [beta] bootstrap from 1.41.1 stable #69518
* bootstrap: Configure cmake when building sanitizer runtimes #69104

r? @ghost
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness ICEBreaker-LLVM Bugs identified for the LLVM ICE-breaker group P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests