Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abandoned_in_gc not empty in MarkSweep #84

Closed
wks opened this issue Jul 9, 2024 · 4 comments · Fixed by mmtk/ruby#84 or #92
Closed

abandoned_in_gc not empty in MarkSweep #84

wks opened this issue Jul 9, 2024 · 4 comments · Fixed by mmtk/ruby#84 or #92

Comments

@wks
Copy link
Collaborator

wks commented Jul 9, 2024

We observed this assertion error when running MarkSweep: https://github.com/mmtk/mmtk-ruby/actions/runs/9853731634/job/27204907289?pr=83

  | [2024[2024-07-09T08:43:46Z INFO  mmtk::util::heap::gc_trigger] [POLL] ms: Triggering collection (10987/10559 pag[2024-07-09T08:43:46Z INFO  mmtk::scheduler::scheduler] End of GC (1016/12098 pages, took 132 ms)
  | bootstraptest.test_io.rb_90_693.rb:16:in 'IO.copy_stream': Interrupt
  | 	from bootstraptest.test_io.rb_90_693.rb:16:in '<main>'
  |  MMTk GC thread panicked.  This is a bug.
  | panicked at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:150:13:
  | assertion failed: self.available[i].is_empty()
  |    0: mmtk_ruby::handle_gc_thread_panic
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/lib.rs:101:14
  |    1: mmtk_ruby::set_panic_hook::{{closure}}
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/lib.rs:130:13
  |    2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
  |    3: std::panicking::rust_panic_with_hook
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
  |    4: std::panicking::begin_panic_handler::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:651:13
  |    5: std::sys_common::backtrace::__rust_end_short_backtrace
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  |    6: rust_begin_unwind
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  |    7: core::panicking::panic_fmt
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  |    8: core::panicking::panic
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:144:5
  |    9: mmtk::policy::marksweepspace::native_ms::global::AbandonedBlockLists::assert_empty
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:150:13
  |   10: mmtk::policy::marksweepspace::native_ms::global::MarkSweepSpace<VM>::prepare
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:341:9
  |   11: <mmtk::plan::marksweep::global::MarkSweep<VM> as mmtk::plan::global::Plan>::prepare
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/plan/marksweep/global.rs:60:9
  |   12: <mmtk::scheduler::gc_work::Prepare<C> as mmtk::scheduler::work::GCWork<<C as mmtk::scheduler::work::GCWorkContext>::VM>>::do_work
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/gc_work.rs:60:9
  |   13: mmtk::scheduler::work::GCWork::do_work_with_stat
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/work.rs:45:9
  |   14: mmtk::scheduler::worker::GCWorker<VM>::run
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/worker.rs:244:13
  |   15: mmtk::memory_manager::start_worker
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/memory_manager.rs:492:5
  |   16: <mmtk_ruby::collection::VMCollection as mmtk::vm::collection::Collection<mmtk_ruby::Ruby>>::spawn_gc_thread::{{closure}}
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/collection.rs:54:21
  |   17: std::sys_common::backtrace::__rust_begin_short_backtrace
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:155:18
  |   18: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:529:17
  |   19: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panic/unwind_safe.rs:272:9
  |   20: std::panicking::try::do_call
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:554:40
  |   21: __rust_try
  |   22: std::panicking::try
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:518:19
  |   23: std::panic::catch_unwind
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panic.rs:142:14
  |   24: std::thread::Builder::spawn_unchecked_::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:528:30
  |   25: core::ops::function::FnOnce::call_once{{vtable.shim}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ops/function.rs:250:5
  |   26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  |   27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  |   28: std::sys::pal::unix::thread::Thread::new::thread_start
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys/pal/unix/thread.rs:108:17
  |   29: <unknown>
  |   30: <unknown>
  | 
    megacontent-copy_stream
@eileencodes
Copy link

This is reproducible on macos if you set the max heap lower. If you're on macos this will segv almost every time. I turned off the rust log so we don't get other failures due to stdout.

make -j12 test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_LOG=none RUST_BACKTRACE=1

I then narrowed it down to two tests by running not in parallel. I can reproduce this with a combination of test/ruby/test_objectspace.rb and test/fiddle/test_c_struct_entry.rb

make test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk --mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_BACKTRACE=1 RUBY_CODESIGN=1 TESTS="test/ruby/test_objectspace.rb test/fiddle/test_c_struct_entry.rb"

@eileencodes
Copy link

eileencodes commented Jul 31, 2024

I've narrowed this down to two tests those files. TestObjectSpace#test_finalizer_thread_raise and Fiddle::TestCStructEntity#test_free_with_func. I then narrowed that down even more - all the second test needs to do to trigger this crash is call GC.start.

Here's a really small repro script. It doesn't actually require lowering the memory to repro now. ./ruby --mmtk-plan=MarkSweep test.rb is enough.

def test_finalizer_thread_raise
  Thread.new do
  end
  GC.start
end

def test_free_with_func
  GC.start
end

test_finalizer_thread_raise
test_free_with_func

@wks
Copy link
Collaborator Author

wks commented Aug 1, 2024

The bug is in the Ruby binding. The return value of ActivePlan::number_of_mutators() is sometimes different from the number of mutators returned from ActivePlan::mutators().

  • ActivePlan::number_of_mutators() simply returns GET_VM()->ractor.main_ractor->threads.cnt
  • ActivePlan::mutators() will iterate through all threads in main_ractor->threads.set, and visit only threads that have non-NULL th->mutator.

Ruby caches native threads, and threads->set includes both actively running Ruby threads and threads that finished executing Ruby threads and are waiting to be reused.

Function thread_create_core calls rb_ractor_living_threads_insert to increment r->threads.cnt just before creating the native thread, and the newly created native thread will not set th->mutator until thread_start_func_2 calls rb_mmtk_bind_mutator. So if the main thread triggers GC using GC.start, it will see threads.cnt == 2, but the thread has not acquired a mutator struct from MMTk-core, yet. That'll trigger the bug.

The proper fix is to use the same loop in the current rb_mmtk_get_mutators for both counting the number of mutators and visiting each mutator.

wks added a commit to wks/mmtk-core that referenced this issue Aug 1, 2024
This commit adds assertions about the number of mutators and the list of
mutators returned from the VM binding.

Also asserts that epilogues are not silently dropped before they are
executed.  It could happen if a buggy VM binding returns a number of
mutators that does not match the actual list of mutators.

This should detect bugs like mmtk/mmtk-ruby#84
wks added a commit to wks/mmtk-core that referenced this issue Aug 1, 2024
This commit adds assertions about the number of mutators and the list of
mutators returned from the VM binding.

Also asserts that epilogues are not silently dropped before they are
executed.  It could happen if a buggy VM binding returns a number of
mutators that does not match the actual list of mutators.

This should detect bugs like mmtk/mmtk-ruby#84
wks added a commit to wks/mmtk-ruby that referenced this issue Aug 1, 2024
Updated the `ruby` repo revision to fix a bug where the number of
mutators returned from `ActivePlan::number_of_mutators` does not match
the number of mutators returned from `ActivePlan::mutators`.

Also updated the `mmtk-core` repo revision.  More assertions are added
so that such bugs can be detected earlier.

mmtk-core PR: mmtk/mmtk-core#1182
ruby PR: mmtk/ruby#84

Fixes: mmtk#84
github-merge-queue bot pushed a commit to mmtk/mmtk-core that referenced this issue Aug 1, 2024
This commit adds assertions about the number of mutators and the list of
mutators returned from the VM binding.

Also asserts that epilogues are not silently dropped before they are
executed. It could happen if a buggy VM binding returns a number of
mutators that does not match the actual list of mutators.

This should detect bugs like mmtk/mmtk-ruby#84
github-merge-queue bot pushed a commit to mmtk/mmtk-core that referenced this issue Aug 1, 2024
This commit adds assertions about the number of mutators and the list of
mutators returned from the VM binding.

Also asserts that epilogues are not silently dropped before they are
executed. It could happen if a buggy VM binding returns a number of
mutators that does not match the actual list of mutators.

This should detect bugs like mmtk/mmtk-ruby#84
wks added a commit to mmtk/ruby that referenced this issue Aug 1, 2024
It previously returns the number of threads, including cached threads
not executing Ruby code.  That is inconsistent with the number of
mutators returned from rb_mmtk_get_mutators.  This commit reimplementa
rb_mmtk_number_of_mutators with rb_mmtk_get_mutators to ensure they
see the same number of mutators.

Fixes: mmtk/mmtk-ruby#84
@wks
Copy link
Collaborator Author

wks commented Aug 1, 2024

@eileencodes The PR mmtk/ruby#84 for the ruby repo fixes the bug. It has been merged into the dev/mmtk-overrides-default branch. I also cherry-picked it onto the mmtk branch of the https://github.com/mmtk/ruby repo so that you can merge or rebase mmtk/ruby#80 with the mmtk branch.

The PR #92 for the mmtk-ruby repo is not necessary for fixing the bug in ActivePlan::number_of_mutators. It affects the ruby revision the mmtk-core CI is testing against, and updates to a newer mmtk-core revision that contains assertions that can detect this kind of bug. But currently we are having some problem with a random crash in cargo clippy when running the CI for Darwin (Mac OS). So I merged the ruby repo PR first. If you have any insights about cargo clippy crashing on Mac OS, please share your experiences with us because we have been having this kind of problem for a while. See discussion at: https://mmtk.zulipchat.com/#narrow/stream/262673-mmtk-core/topic/clippy.20failing.20on.20darwin/near/455616980

github-merge-queue bot pushed a commit to mmtk/mmtk-core that referenced this issue Aug 2, 2024
This commit adds assertions about the number of mutators and the list of
mutators returned from the VM binding.

Also asserts that epilogues are not silently dropped before they are
executed. It could happen if a buggy VM binding returns a number of
mutators that does not match the actual list of mutators.

This should detect bugs like mmtk/mmtk-ruby#84
mmtkgc-bot added a commit that referenced this issue Aug 2, 2024
Updated the `ruby` repo revision to fix a bug where the number of
mutators returned from `ActivePlan::number_of_mutators` does not match
the number of mutators returned from `ActivePlan::mutators`.

Also updated the `mmtk-core` repo revision. More assertions are added so
that such bugs can be detected earlier.

mmtk-core PR: mmtk/mmtk-core#1182
ruby PR: mmtk/ruby#84

Fixes: #84

---------

Co-authored-by: mmtkgc-bot <mmtkgc.bot@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants