Racy test failure: treeifying a Moved entry #83

Closed
jonhoo opened this issue Apr 10, 2020 · 8 comments · Fixed by #87
Labels: bug (Something isn't working)

Comments

jonhoo commented Apr 10, 2020

$ cargo test --test jdk concurrent_associate::test_concurrent_insert
thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Moved entry', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1069
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1504
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:511
  11: rust_begin_unwind
             at src/libstd/panicking.rs:419
  12: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:373
  13: flurry::map::HashMap<K,V,S>::treeify_bin
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
  14: flurry::map::HashMap<K,V,S>::put
             at ./src/map.rs:1952
  15: flurry::map::HashMap<K,V,S>::insert
             at ./src/map.rs:1625
  16: jdk::concurrent_associate::insert
             at tests/jdk/concurrent_associate.rs:24
  17: core::ops::function::Fn::call
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
  18: jdk::concurrent_associate::test_once::{{closure}}
             at tests/jdk/concurrent_associate.rs:52
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test concurrent_associate::test_concurrent_insert ... FAILED
jonhoo added the bug label on Apr 10, 2020

jonhoo commented Apr 10, 2020

The map::tree_bins::concurrent_tree_bin test also occasionally segfaults for me without a backtrace. Not sure if that traces back to the same issue, but it's something we should definitely look into! I can reproduce on current nightly on Linux by running this command for a while:

$ while cargo test --lib map::tree_bins::concurrent_tree_bin -- --test-threads=1 --nocapture; do :; done

jonhoo commented Apr 10, 2020

I think it reproduces more frequently if you are also running other things in the background to cause more scheduler noise (all the more reason for us to get #34 :p)

jonhoo commented Apr 10, 2020

I factored out the segfault into #84

domenicquirl (Collaborator) commented

So far unable to reproduce on Windows stable. The only place where treeify_bin is called is in put. I don't think the note there about the difference compared to the Java code matters here, since bin_count has to satisfy >= TREEIFY_THRESHOLD.

The Java code also checks this condition inside the case for non-MOVED nodes, but outside the synchronized section. So hitting MOVED specifically is not possible in the Java code. But I think that, technically, a transfer can still happen before we re-load the bin in treeify_bin?

jonhoo commented Apr 10, 2020

Bah, just got another crash that seems related, with #85 applied:

thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Tree', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
...
  13: flurry::map::HashMap<K,V,S>::treeify_bin
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
  14: flurry::map::HashMap<K,V,S>::put
             at ./src/map.rs:1952
  15: flurry::map::HashMap<K,V,S>::insert
             at ./src/map.rs:1625
  16: jdk::concurrent_associate::insert
             at tests/jdk/concurrent_associate.rs:24
  17: core::ops::function::Fn::call
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
  18: jdk::concurrent_associate::test_once::{{closure}}
             at tests/jdk/concurrent_associate.rs:52

domenicquirl (Collaborator) commented

I'm still on the road, so no tests. But the more I look at this, the more I think there isn't actually a problem. If you look at the original implementation, you'll see that it doesn't have the panics, as the Java code also doesn't trigger a failure there. They were then added during review so that we notice if we try to treeify something that is not a regular linked bin.

Now, due to the reasoning from my last comment, in the Java code it must also be possible to call treeifyBin with something other than a regular Node (also, I think they would have noticed if the synchronization was a problem there). Of course you could force the call to treeifyBin to happen inside the critical section by executing it and the comparison inside the synchronized block (while holding the lock). But let's think about the cases you encountered:

  • One thread inserts a new entry and passes the TREEIFY_THRESHOLD, but a different thread executes a move of that bin. Under the assumption that treeification is always forced, it is very likely that the move will split the bin in question into two smaller bins, both below the threshold, and then has to untreeify the bin again. If treeification is not forced, and due to scheduling the move happens first, it is fine for the first thread to see a MOVED upon re-check and simply not treeify the bin.
  • In the same scenario, if one insert passes the threshold and then another insert happens before treeification and (due to scheduling) also gets to treeify the bin, then the bin is already treeified by the time the first thread re-checks, and it is again fine not to treeify_bin. Of course there is a tradeoff here, where the second insert would already have happened into a tree bin if we had forced treeification under the lock, but the second thread would also have had to wait for the lock before the insert.

So I think the semantics of the Java code here are those of a rather subtle else { // pass }, and the issue is that we made this crash on Moved and TreeBin (TreeNode is fine, since it shouldn't be the head of a bin). I suggest matching on the loaded bin in treeify_bin with the existing treeify case for Node, two empty cases for TreeBin and Moved with an explanation, and keeping the panic only for TreeNode.
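
A hedged sketch of what that could look like, again with simplified stand-in types rather than flurry's actual BinEntry and treeify_bin: treeify regular Node bins, treat Moved and TreeBin heads as a no-op that mirrors the Java code's implicit else { // pass }, and keep the panic only for a TreeNode head, which should never start a bin.

```rust
// Illustrative only: simplified stand-ins, not flurry's actual types.
#[allow(dead_code)]
enum BinEntry {
    Node,     // regular linked bin
    Moved,    // a transfer got here first; the split bins are handled there
    TreeBin,  // another thread already treeified this bin
    TreeNode, // must never appear as the head of a bin
}

fn treeify_bin(head: &BinEntry) {
    match head {
        BinEntry::Node => {
            // actually build the red-black tree bin and swap it in
        }
        BinEntry::Moved | BinEntry::TreeBin => {
            // We lost the race to a transfer or to another treeification.
            // Nothing to do: this is the Java code's subtle "else { // pass }".
        }
        BinEntry::TreeNode => unreachable!("TreeNode found as the head of a bin"),
    }
}

fn main() {
    // Under the race from the backtraces above, these are now silently skipped:
    treeify_bin(&BinEntry::Moved);
    treeify_bin(&BinEntry::TreeBin);
}
```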

jonhoo commented Apr 11, 2020

Ah, that's interesting. Yeah, I think I buy it. Would you mind filing a PR that replaces those assertions with comments outlining what you wrote above about why we don't want to assert?

jonhoo commented Apr 13, 2020

I can confirm that this was fixed by #87 after having run it in a loop for a while.
