Fix race condition in cycle detector block sent handling #3666

SeanTAllen · 2020-09-26T12:37:28Z

This fixes a bug introduced with the recent short-lived actor garbage
collection improvements.

Here's the race:

- Cycle detector is on
- Cycle detector is aware of an actor

1. Actor finishes a scheduler run
2. Actor marks its queue as empty
3. Actor has an rc of 0, so it sends a BLOCK message to the cycle detector

Between steps 2 and 3, the cycle detector sends an IS_BLOCKED message to
the actor. Because the actor's queue was just marked empty, that message
reschedules the actor on the cycle detector's scheduler thread.

Because in step 3 we sent a BLOCK message and the rc is 0 (and can't change),
the cycle detector is going to delete the actor when it processes that
BLOCK message.

However, the actor now also sits on another scheduler's queue. Either of the
following could happen:

- The actor runs again before deletion and sends another BLOCK message to the
  cycle detector.
- The actor is deleted by the cycle detector before the other scheduler runs it.

Either way, hilarity in the form of segfaults or other race condition oddities
will occur.
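To make the interleaving concrete, here is a minimal single-threaded replay of it in C. Everything in it (`model_actor`, `send_message`, `cd_process_block`, `replay_race`) is a hypothetical simplification, not ponyc's real actor struct or runtime API, and it does not show how this PR fixes the race; it only walks the steps above in order and checks the end state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one actor. These fields are
   illustrative assumptions, not ponyc's real actor struct. */
typedef struct {
  int rc;              /* reference count; 0 means no other actor can reach it */
  bool queue_empty;    /* set when the actor marks its queue as empty */
  int times_scheduled; /* number of scheduler queues currently holding it */
  bool deleted;        /* set when the cycle detector frees it */
} model_actor;

/* Sending any message (e.g. IS_BLOCKED) to an actor whose queue is marked
   empty reschedules it on the sender's scheduler thread. */
static void send_message(model_actor *a) {
  assert(!a->deleted); /* sending to a freed actor would be use-after-free */
  if (a->queue_empty) {
    a->queue_empty = false;
    a->times_scheduled++;
  }
}

/* The cycle detector handling BLOCK: with rc == 0 it deletes the actor. */
static void cd_process_block(model_actor *a) {
  if (a->rc == 0)
    a->deleted = true;
}

/* Replays the problematic interleaving step by step. Returns true if the
   actor ends up deleted while a scheduler queue still holds it. */
static bool replay_race(void) {
  model_actor a = {.rc = 0, .queue_empty = false,
                   .times_scheduled = 0, .deleted = false};

  /* 1. The actor finishes a scheduler run and is on no queue. */
  /* 2. The actor marks its queue as empty. */
  a.queue_empty = true;

  /* Between 2 and 3: the cycle detector sends IS_BLOCKED, which
     reschedules the actor on the cycle detector's scheduler thread. */
  send_message(&a);

  /* 3. rc is 0, so the actor sends BLOCK; the cycle detector later
     processes it and deletes the actor. */
  cd_process_block(&a);

  return a.deleted && a.times_scheduled > 0;
}
```

Calling `replay_race()` returns true: the model actor ends up marked deleted while `times_scheduled` is still 1, which is the second outcome listed above (a scheduler later picks up an actor that no longer exists).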

As the comment in actor.c says:

If we mark the queue as empty, then it is no longer safe to do any
operations on this actor that aren't concurrency safe unless, the actor
has an rc of 0 and the cycle detector isn't aware of the actor's
existence.

Prior to this commit, sending a BLOCK message to the cycle detector (cd) after
marking the queue as empty was not concurrency safe because of the race
condition described above.
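The rule in the quoted actor.c comment boils down to a two-input predicate. A minimal C sketch, assuming the actor's rc and whether the cycle detector knows about it are both available at the point the queue is marked empty; `unsafe_ops_allowed_after_markempty` and `check_race_scenario` are hypothetical names, not functions in ponyc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of the rule quoted from actor.c: after an actor
   marks its queue as empty, operations that are not concurrency safe are
   allowed only if its rc is 0 AND the cycle detector doesn't know about it. */
static bool unsafe_ops_allowed_after_markempty(size_t rc, bool cd_aware) {
  return rc == 0 && !cd_aware;
}

/* The racy case from this PR: rc is 0, but the cycle detector is aware of
   the actor, so sending BLOCK after marking the queue empty breaks the rule. */
static void check_race_scenario(void) {
  assert(!unsafe_ops_allowed_after_markempty(0, true));
}
```

An rc of 0 alone is not enough: as long as the cycle detector is aware of the actor, it can still send the actor a message (IS_BLOCKED) and reschedule it.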

SeanTAllen added the "changelog - fixed" label Sep 26, 2020

SeanTAllen requested a review from dipinhora September 26, 2020 12:37

SeanTAllen mentioned this pull request Sep 26, 2020

Trigger GC for actors when they tell the cycle detector they're blocked #3278

Merged

dipinhora approved these changes Sep 26, 2020

SeanTAllen merged commit 765e11d into master Sep 26, 2020

SeanTAllen deleted the seantallen/cd-block-sent-race-condition branch September 26, 2020 15:19

github-actions bot pushed a commit that referenced this pull request Sep 26, 2020

Updates release notes for PR #3666

d41a1c3

github-actions bot pushed a commit that referenced this pull request Sep 26, 2020

Update CHANGELOG for PR #3666

c34307e
