Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some exchange tasks can't exit when query is canceled #4229

Closed
bestwoody opened this issue Mar 11, 2022 · 1 comment · Fixed by #4219
Closed

Some exchange tasks can't exit when query is canceled #4229

bestwoody opened this issue Mar 11, 2022 · 1 comment · Fixed by #4219
Assignees
Labels

Comments

@bestwoody
Copy link
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1.Do a lot of querys.
2.Cancel that querys.
3.Repeat step1、2.

2. What did you expect to see? (Required)

Thread of Exchange task exit.

3. What did you see instead (Required)

Thread of Exchange task didn't exit.

4. What is your TiFlash version? (Required)

v6.0.0 alpha

@fuzhe1989
Copy link
Contributor

A real-life example:

  1. A TiFlash server received a CancelMPPTask request which wants to cancel query Q.
  2. MPPTaskManager::cancelMPPQuery found Q had 4 tasks: T1, T2, T3, and T4. It then started 4 threads each for cancelling one task in parallel. Notice that now the CancelMPPTask thread was holding shared_ptr of T1, T2, T3, and T4. It won't release shared_ptrs until all task cancelling threads done.
  3. In the meanwhile, T2 (MPPTask) catched an exception and finished. Notice that it didn't explicitly cancel it's input streams since the design was input streams would be destructed during the destruction of MPPTask. However T2 wasn't destructed, because one shared_ptr of it was held by the CancelMPPTask thread.
  4. There was a local exchange tunnel from T4 (the sender side) to T2 (the receiver side). By design T4 could not exit silently, it should first close the tunnel, then wait until T2 acknowledged (by MPPTunnel::consumerFinish).
  5. However T2 couldn't acknowledge this, because all it's ExchangeReceiver threads were blocking at pushing packets to computing input streams. Notice that T2 was finished so no input streams could consume these packets. And before T2 was destructed nobody could tell the ExchangeRecever threads that the task was finished. They would block until T2 was destructed.
  6. Now we know T2's ExchangeReceiver was blocking so T4 had to wait. This blocked T4's cancel thread too. So MPPTaskManager::cancelMPPQuery couldn't release shared_ptrs it was holding, which contains T2's. DeadLock!

A more simple version:

  1. T2 was finished, but one of its shared_ptr was alive and waiting for the cancellation of T4.
  2. T4 already closed its MPPTunnels but one of it was a local tunnel and need to wait the consumer (which is T2's ExchangeReceiver) finish.
  3. T2's ExchangeReceiver was blocking at pushing packets. But T2 was finished so there's no consumer.
  4. T2's destruction could break the cycle but it couldn't, because it's waiting for the cancellation of T4.

ti-chi-bot pushed a commit that referenced this issue Mar 11, 2022
ti-chi-bot pushed a commit that referenced this issue Mar 15, 2022
JaySon-Huang pushed a commit to JaySon-Huang/tiflash that referenced this issue Mar 17, 2022
windtalker added a commit that referenced this issue Apr 1, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 14, 2022
windtalker added a commit that referenced this issue Apr 14, 2022
ti-chi-bot added a commit that referenced this issue Apr 14, 2022
ti-chi-bot added a commit that referenced this issue Apr 14, 2022
ti-chi-bot added a commit that referenced this issue Apr 22, 2022
ti-chi-bot added a commit that referenced this issue Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants