All workers parked when some open buckets are not empty. #792

Closed
wks opened this issue Apr 15, 2023 · 3 comments

wks commented Apr 15, 2023

PR #782 may still have a synchronisation issue.

In rare cases, the coordinator may hit an assertion failure when all workers have parked.

Link: https://github.com/mmtk/mmtk-openjdk/actions/runs/4696973080/jobs/8327558441#step:4:7314

++ build/linux-x86_64-normal-server-fastdebug/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xms500M -Xmx500M -jar /home/runner/work/mmtk-openjdk/mmtk-openjdk/.github/scripts/../../repos/openjdk/dacapo/dacapo-2006-10-MR2.jar eclipse
[2023-04-14T06:49:53Z INFO  mmtk::memory_manager] Initialized MMTk with GenImmix
===== DaCapo eclipse starting =====
<setting up workspace...>
<creating projects..............................................................>

... more lines omitted here

[2023-04-14T06:50:22Z INFO  mmtk::util::heap::gc_trigger] [POLL] los: Triggering collection (128068/128000 pages)
[2023-04-14T06:50:22Z INFO  mmtk::plan::generational::global] Nursery GC
[2023-04-14T06:50:23Z INFO  mmtk::scheduler::gc_work] End of GC (15036/128000 pages, took 437 ms)
[2023-04-14T06:50:24Z INFO  mmtk::util::heap::gc_trigger] [POLL] nursery: Triggering collection (128014/128000 pages)
[2023-04-14T06:50:24Z INFO  mmtk::plan::generational::global] Nursery GC
[2023-04-14T06:50:25Z INFO  mmtk::scheduler::gc_work] End of GC (17233/128000 pages, took 392 ms)
[2023-04-14T06:50:26Z INFO  mmtk::util::heap::gc_trigger] [POLL] nursery: Triggering collection (128010/128000 pages)
[2023-04-14T06:50:26Z INFO  mmtk::plan::generational::global] Nursery GC
[2023-04-14T06:50:26Z INFO  mmtk::scheduler::gc_work] End of GC (18867/128000 pages, took 457 ms)
<performing AST tests...>
	AST creation: org.eclipse.jdt.internal.compiler.parser
[2023-04-14T06:50:29Z INFO  mmtk::util::heap::gc_trigger] [POLL] nursery: Triggering collection (128015/128000 pages)
[2023-04-14T06:50:29Z INFO  mmtk::plan::generational::global] Nursery GC
thread '<unnamed>' panicked at 'assertion failed: self.scheduler.all_activated_buckets_are_empty()', /home/runner/.cargo/git/checkouts/mmtk-core-4a4b541a6bb74ff9/b935a28/src/scheduler/controller.rs:95:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
./.github/scripts/ci-test-only-normal.sh: line 66: 54478 Aborted                 (core dumped) build/linux-x86_64-normal-server-$DEBUG_LEVEL/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xms500M -Xmx500M -jar $DACAPO_PATH/dacapo-2006-10-MR2.jar eclipse
Error: Process completed with exit code 134.

It may be related to generational GC, but I cannot reproduce it locally.

qinsoon commented Apr 17, 2023

I am seeing this as well with Sticky Immix in Julia (running on the GitHub CI runner).

wks commented Apr 17, 2023

With the following patch applied to mmtk-core, the error is almost guaranteed to be reproducible.

diff --git a/src/scheduler/scheduler.rs b/src/scheduler/scheduler.rs
index 8dbf8ba39..8e49bcf57 100644
--- a/src/scheduler/scheduler.rs
+++ b/src/scheduler/scheduler.rs
@@ -138,6 +138,11 @@ impl<VM: VMBinding> GCWorkScheduler<VM> {
         // Prepare global/collectors/mutators
         self.work_buckets[WorkBucketStage::Prepare].add(Prepare::<C>::new(plan));
 
+        warn!("Prepare added! Sleeping...");
+        let timeout = std::time::Duration::from_millis(100);
+        std::thread::sleep(timeout);
+        warn!("Waken. Oh! I have just added Prepare! Continuing...");
+
         // Release global/collectors/mutators
         self.work_buckets[WorkBucketStage::Release].add(Release::<C>::new(plan));
 
@@ -388,6 +393,10 @@ impl<VM: VMBinding> GCWorkScheduler<VM> {
             let all_parked = sync.inc_parked_workers();
 
             if all_parked {
+                warn!("Worker {} observed all parked.  Sleeping...", worker.ordinal);
+                let timeout = std::time::Duration::from_millis(100);
+                std::thread::sleep(timeout);
+                warn!("Worker {} waken! Oh I observed all parked.  Continuing...", worker.ordinal);
                 // If all workers are parked, enter "group sleeping" and notify controller.
                 sync.group_sleep = true;
                 debug!("Entered group-sleeping state");

It doesn't need a generational plan. Running lusearch with the Immix plan is enough.

I'll analyse the bug later.

wks commented Apr 18, 2023

I think this is what happened. Assume there is only one worker. (Yes, it can be reproduced with MMTK_THREADS=1.)

  1. GC is triggered.
  2. The coordinator thread executed the ScheduleCollection work packet.
    a. It added the StopMutators work packet into the Unconstrained bucket.
    b. It added the Prepare work packet to the Prepare bucket.
    c. Just at this moment, the OS did a context switch and paused the coordinator. (Emulated by the sleep in schedule_common_work.)
  3. The worker quickly executed all available work packets.
    a. It executed StopMutators, which opened the Prepare bucket,
    b. and it continued to drain all packets in the Unconstrained and Prepare buckets.
    c. Then the worker had nothing to do. It entered poll_slow, grabbed the WorkerMonitor lock, parked, and found it was the last parked worker. But just at this moment, the OS did another context switch and paused the worker. (Emulated by the sleep in poll_slow.)
  4. The coordinator woke up. It finished executing the ScheduleCollection work packet and wanted to notify the workers. But the worker was still holding the WorkerMonitor lock in poll_slow, so the coordinator yielded.
  5. The OS resumed the worker. The worker set group_sleep to true, sent a message to the coordinator that all workers had parked, and called wait() (releasing the WorkerMonitor lock).
  6. (Important) The coordinator acquired the WorkerMonitor lock, set group_sleep to false, and called notify_all(). The OS did a context switch and paused the coordinator on the notify_all() call.
  7. The worker was notified and resumed from wait(). At this time, no new buckets were open, so it parked again and found it was the last parked worker. Then the OS did a context switch again (emulated by the sleep in poll_slow).
  8. (Important) The coordinator continued and found the message that all workers had parked. It opened the Closure bucket and tried to notify the workers. Again the coordinator was blocked on the WorkerMonitor lock.
  9. The OS resumed the worker. It set group_sleep to true, sent an "all workers parked" message to the coordinator, and called wait() (releasing the WorkerMonitor lock).
  10. The coordinator acquired the WorkerMonitor lock. It found that all workers had parked. At this moment, the Closure bucket was already open, but it was not empty. The assertion failed (see the sketch below).
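
For reference, here is a minimal sketch of the assumption that the failing assertion encodes. Only the name all_activated_buckets_are_empty comes from controller.rs; the surrounding types and the bucket-opening step are an illustrative model, not the actual mmtk-core code.

// Model of the coordinator's reaction to an "all workers parked" message.
struct Bucket {
    activated: bool,            // "open" in the description above
    packets: Vec<&'static str>, // stand-in for queued work packets
}

struct SchedulerModel {
    buckets: Vec<Bucket>,
}

impl SchedulerModel {
    fn all_activated_buckets_are_empty(&self) -> bool {
        self.buckets
            .iter()
            .filter(|b| b.activated)
            .all(|b| b.packets.is_empty())
    }

    // Called when the coordinator sees "all workers parked".
    fn on_all_workers_parked(&mut self) {
        // Assumption: a fully parked workforce implies every open bucket has
        // been drained.  In the interleaving above, the Closure bucket was
        // opened between the two "all parked" messages and already contains
        // packets, so the second message arrives while an open bucket is
        // non-empty and this assertion fires.
        assert!(self.all_activated_buckets_are_empty());

        // Only then does the coordinator open the next bucket (e.g. Closure).
        if let Some(next) = self.buckets.iter_mut().find(|b| !b.activated) {
            next.activated = true;
        }
    }
}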

The immediate cause of this problem is that the coordinator received two consecutive "all workers parked" messages, even though the message is carried by a boolean variable, ChannelSync::all_workers_parked, and is supposed to be idempotent.
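
A sketch of that message mechanism, assuming the channel looks roughly like the ChannelSync::all_workers_parked boolean mentioned above (a model for illustration, not the actual mmtk-core implementation):

use std::sync::{Condvar, Mutex};

// A boolean flag coalesces repeated notifications, but it carries no ordering
// information: the coordinator cannot tell whether the workers parked before
// or after it opened a new bucket.
struct ChannelSync {
    all_workers_parked: bool,
}

struct Channel {
    sync: Mutex<ChannelSync>,
    cond: Condvar,
}

impl Channel {
    // Worker side: the last worker to park raises the flag.
    fn send_all_parked(&self) {
        self.sync.lock().unwrap().all_workers_parked = true;
        self.cond.notify_one();
    }

    // Coordinator side: wait for the flag, then clear it.  If the worker
    // parks again right after this returns (steps 7-9 above), the flag is
    // simply raised a second time, and the coordinator will act on it again
    // even though it has opened the Closure bucket in between.
    fn wait_all_parked(&self) {
        let mut sync = self.sync.lock().unwrap();
        while !sync.all_workers_parked {
            sync = self.cond.wait(sync).unwrap();
        }
        sync.all_workers_parked = false;
    }
}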

And the worker sent the message twice because of the "notify_all" after executing ScheduleCollection.

  1. After the worker drained all packets in Unconstrained and Prepare, it sent the first "all workers parked" message to the coordinator, and entered "group sleep".
  2. The coordinator notified the worker because it finished executing the ScheduleCollection work packet.
  3. The worker woke up and found no new buckets opened and no new packets added. But then the OS paused it.
  4. The coordinator received the first "all workers parked" message, and opened the Closure bucket.
  5. The worker was resumed by the OS, sent the second "all workers parked" message, and entered "group sleep" again.
  6. The coordinator received the second "all workers parked" message, and the assertion failed because the Closure bucket was not empty.

Step 2 is the culprit.

  • The coordinator notified the workers because it wanted them to execute the new packets added by ScheduleCollection.
  • But the workers assumed they were notified because new buckets had been opened. They worked so fast that they drained everything ScheduleCollection had added before the coordinator's notification arrived (see the sketch below).
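
A hedged sketch of that mismatch (illustrative only; the helper functions are hypothetical stand-ins, and the real logic lives in GCWorkScheduler and poll_slow):

// Coordinator side: after executing the ScheduleCollection coordinator
// packet, it wakes the workers so they pick up the packets it just added.
// The *intent* of this notification is "new packets were added".
fn coordinator_after_schedule_collection() {
    notify_workers();
}

// Worker side: a wake-up carries no reason.  The worker only re-scans the
// open buckets; if it already drained everything ScheduleCollection added
// before the notification arrived, it finds nothing, parks again, and
// reports "all workers parked" a second time.
fn worker_on_wake_up() {
    if !poll_open_buckets() {
        park_and_report_all_parked_if_last();
    }
}

// Hypothetical stand-ins for the real scheduler operations.
fn notify_workers() {}
fn poll_open_buckets() -> bool {
    false
}
fn park_and_report_all_parked_if_last() {}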

wks closed this as completed in 43e8a92 on Apr 27, 2023