Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in cycle detector block sent handling #3666

Merged
merged 1 commit into from
Sep 26, 2020

Conversation

SeanTAllen
Copy link
Member

This fixes a bug introduced with the recent short-lived actor garbage
collection improvements.

Here's the race:

  • Cycle detector is on
  • Cycle detector is aware of an actor
  1. Actor finishes a scheduler run
  2. Actor marks queue as empty
  3. Actor has rc of 0 so it sends a BLOCK message to the cycle detector

Between 2 and 3, the cycle detector sends an IS_BLOCKED message to
the actor. The actor is rescheduled on the same scheduler thread as the
cycle detector.

Because in 3, we sent a BLOCK message and the rc is 0 (and can't change),
the cycle detector is going to delete the actor when it processes the
BLOCK message.

However, the actor exists on another schedulers queue. Either of the following
could happen:

  • Actor runs again before deletion, sends another BLOCK message to the cycle
    detector.
  • Actor is deleted by the cycle detector before the other scheduler runs it

Either way, hilarity in the form of segfaults or other race condition oddities
will occur.

As the comment in actor.c says:

If we mark the queue as empty, then it is no longer safe to do any
operations on this actor that aren't concurrency safe unless, the actor
has an rc of 0 and the cycle detector isn't aware of the actor's
existence.

Prior to this commit, sending a block message to the cd after marking the
queue as empty was not concurrency safe due to the aforementioned race
condition.

This fixes a bug introduced with the recent short-lived actor garbage
collection improvements.

Here's the race:

- Cycle detector is on
- Cycle detector is aware of an actor

1. Actor finishes a scheduler run
2. Actor marks queue as empty
3. Actor has rc of 0 so it sends a BLOCK message to the cycle detector

Between 2 and 3, the cycle detector sends an IS_BLOCKED message to
the actor. The actor is rescheduled on the same scheduler thread as the
cycle detector.

Because in 3, we sent a BLOCK message and the rc is 0 (and can't change),
the cycle detector is going to delete the actor when it processes the
BLOCK message.

However, the actor exists on another schedulers queue. Either of the following
could happen:

- Actor runs again before deletion, sends another BLOCK message to the cycle
  detector.
- Actor is deleted by the cycle detector before the other scheduler runs it

Either way, hilarity in the form of segfaults or other race condition oddities
will occur.

As the comment in actor.c says:

If we mark the queue as empty, then it is no longer safe to do any
operations on this actor that aren't concurrency safe unless, the actor
has an rc of 0 and the cycle detector isn't aware of the actor's
existence.

Prior to this commit, sending a block message to the cd after marking the
queue as empty was not concurrency safe due to the aforementioned race
condition.
@SeanTAllen SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Sep 26, 2020
@SeanTAllen SeanTAllen merged commit 765e11d into master Sep 26, 2020
@SeanTAllen SeanTAllen deleted the seantallen/cd-block-sent-race-condition branch September 26, 2020 15:19
github-actions bot pushed a commit that referenced this pull request Sep 26, 2020
github-actions bot pushed a commit that referenced this pull request Sep 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants