MPPTask destruct when holding the lock of MPPTaskManager, and cause TiFlash hang forever #4954

Closed
windtalker opened this issue May 21, 2022 · 0 comments · Fixed by #4958

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Consider an MPP query that has two MPPTasks on one TiFlash node, task1 and task2, where task1 reads data from task2 through a local tunnel.

  1. task1 and task2 are both dispatched to TiFlash. After prepare and preprocess, both wait for the scheduler, so they are pushed into the waiting_tasks queue.
  2. The scheduler lets task2 run, then finds that the thread usage exceeds the hard limit, so it marks the schedule state of task1 as EXCEEDED.
  3. task1 finds that its schedule state is EXCEEDED, throws an exception, and unregisters itself from MPPTaskManager. Note that unregistering only removes task1 from task_map; the waiting_tasks queue still holds a reference to task1, and after runImpl finishes, the reference in waiting_tasks is the last reference to task1 in the system.
  4. TiDB finds that task1 has failed, then sends a CancelMPPTask request to TiFlash to cancel the MPP query.
  5. CancelMPPQuery acquires the lock of MPPTaskManager, then calls scheduler->deleteQuery. Inside deleteQuery, it removes task1 from waiting_tasks. Since the entry in waiting_tasks is the last reference of the shared ptr, task1 is destructed as soon as it is removed from waiting_tasks.
  6. When task1 is destructed, its ExchangeReceiver waits for the reading thread to exit. For task1, the reading thread in ExchangeReceiver is the local read thread, which tries to read data from task2. The local read thread can exit only after task2 finishes or task1/task2 is cancelled, but
    • task2 cannot finish because task1 is not reading its output
    • task1 and task2 cannot be cancelled, because the cancel thread is now waiting for task1 to be destructed

So a deadlock happens, and since CancelMPPQuery holds the lock of MPPTaskManager, no more queries can be served.
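
The hazard in steps 5 and 6 can be shown with a minimal single-threaded C++ sketch. The `Task` and `Manager` types and the functions `cancelInsideLock`, `cancelOutsideLock`, and `lastTaskDestroyedUnderLock` are hypothetical stand-ins, not TiFlash's real classes, and this is not necessarily the change made in #4958. It only shows that erasing the last `shared_ptr` reference inside a critical section runs the destructor while the lock is held, and that moving the references out first lets the destructor run after the lock is released.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct Manager;

// Hypothetical stand-in for MPPTask. Its destructor records whether the
// manager lock was held at destruction time.
struct Task {
    Manager& mgr;
    explicit Task(Manager& m) : mgr(m) {}
    ~Task();
};

// Hypothetical stand-in for MPPTaskManager plus its scheduler queue.
struct Manager {
    std::mutex mu;
    bool lock_held = false;                  // true only while mu is held
    std::vector<bool> destroyed_under_lock;  // one entry per destroyed Task
    std::deque<std::shared_ptr<Task>> waiting_tasks;

    // Risky pattern: the last reference is dropped inside the critical
    // section, so ~Task runs while mu is still held.
    void cancelInsideLock() {
        std::lock_guard<std::mutex> guard(mu);
        lock_held = true;
        waiting_tasks.clear();
        lock_held = false;
    }

    // Safer pattern: move the references out under the lock, release the
    // lock, then let the destructors run.
    void cancelOutsideLock() {
        std::deque<std::shared_ptr<Task>> doomed;
        {
            std::lock_guard<std::mutex> guard(mu);
            lock_held = true;
            doomed.swap(waiting_tasks);
            lock_held = false;
        }
        doomed.clear();  // ~Task runs here, without mu
    }
};

Task::~Task() { mgr.destroyed_under_lock.push_back(mgr.lock_held); }

// Returns true if the only queued task was destroyed while the lock was held.
bool lastTaskDestroyedUnderLock(bool defer_destruction) {
    Manager m;
    m.waiting_tasks.push_back(std::make_shared<Task>(m));
    if (defer_destruction)
        m.cancelOutsideLock();
    else
        m.cancelInsideLock();
    assert(m.destroyed_under_lock.size() == 1);
    return m.destroyed_under_lock.back();
}
```

Calling `lastTaskDestroyedUnderLock(false)` exercises the pattern described in this report, and `lastTaskDestroyedUnderLock(true)` exercises the deferred variant. In a real system the destructor would block on another thread, as ExchangeReceiver does, which is what turns the first pattern into a deadlock.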

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

@windtalker windtalker added the type/bug The issue is confirmed as a bug. label May 21, 2022
@windtalker windtalker changed the title MPPTask destruct when holding the lock if MPPTaskManager, and cause TiFlash hang forever MPPTask destruct when holding the lock of MPPTaskManager, and cause TiFlash hang forever May 21, 2022
ti-chi-bot pushed a commit that referenced this issue May 23, 2022
…uled task with exceeded state from the waiting tasks queue (#4958)

close #4954
ti-chi-bot added a commit that referenced this issue May 24, 2022
…uled task with exceeded state from the waiting tasks queue (#4958) (#4975)

close #4954