dispute-coordinator is stalling during huge load #6710
Comments
Things to check:
Results of former investigations: #5884
Just for the first thing to check: for our situation I think the dispute coordinator gets stuck in a "loop/cycle" and stops processing signals, and the queue fills up. I only found a single use case for sending a message from outside subsystem context, namely getting the inherent data from the provisioner: https://github.com/paritytech/polkadot/blob/master/node/core/parachains-inherent/src/lib.rs#L78 I believe this is not related to what we see.
So this is not about signals? Only about messages, meaning the message queue would need to be filled up (2000 messages) and, on top of that, we would block the overseer for 10s without processing any message? Or more precisely, not processing enough messages to unblock everybody (including the overseer) within 10 seconds? So realistically, the time for processing a single message has to be in the seconds range for this to happen.

Or is there another explanation? Yes there is: if unblocking wasn't fair. If we unblock one task by processing a message, but it immediately sends another one and we keep servicing that one (servicing means enqueuing, not processing), then we would starve other senders and this could also trigger the timeout. Anything else? How do channels work, do they guarantee some fairness?
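To make the first scenario concrete, here is a scaled-down sketch (capacity 4 instead of ~2000, and plain futures bounded mpsc is only an assumption about the underlying channel type, not the actual metered overseer channels): once the consumer stops draining its queue, further sends simply stay pending, which is the precondition for a 10s send timeout upstream.

```rust
use futures::channel::mpsc;
use futures::{FutureExt, SinkExt};

fn main() {
    futures::executor::block_on(async {
        // Scaled-down stand-in for a subsystem queue whose consumer has
        // stopped draining it.
        let (mut tx, _rx) = mpsc::channel::<u32>(4);

        for i in 0..16u32 {
            // now_or_never() completes only if the send finishes immediately.
            // Once the buffer is full it returns None, i.e. a plain
            // send().await would stay pending until the receiver drains the
            // queue - which is what lets an upstream send timeout fire.
            match tx.send(i).now_or_never() {
                Some(Ok(())) => println!("enqueued message {i}"),
                Some(Err(e)) => println!("channel closed: {e}"),
                None => {
                    println!("queue full at message {i}: send would block");
                    break;
                }
            }
        }
    });
}
```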
There is a separate bounded signal queue of size 64, and subsystems always recv() signals before messages (signals have higher priority). The subsystem stalled error is due to this queue being full. I am not sure what the dispute coordinator is doing (maybe waiting on …)
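For illustration, a minimal sketch of that "signals before messages" behavior. The Signal/Message types and the run loop below are made up for the example; the real subsystem loop is macro-generated by the overseer, so treat this only as an approximation of the idea:

```rust
use futures::channel::mpsc;
use futures::{select_biased, StreamExt};

// Hypothetical stand-ins for overseer signals and subsystem messages.
enum Signal { ActiveLeaves, BlockFinalized, Conclude }
enum Message { ImportStatements, DetermineUndisputedChain }

// A biased select always polls the small, bounded signal queue first, so a
// subsystem only falls behind on signals (and eventually fills the size-64
// queue) if it stops polling this loop altogether.
async fn run(
    mut signals: mpsc::Receiver<Signal>,
    mut messages: mpsc::Receiver<Message>,
) {
    loop {
        select_biased! {
            sig = signals.next() => match sig {
                Some(Signal::Conclude) | None => break,
                Some(_signal) => { /* handle ActiveLeaves / BlockFinalized */ }
            },
            msg = messages.next() => match msg {
                Some(_message) => { /* process the message */ }
                None => break,
            },
        }
    }
}
```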
Yes, the overseer would be blocked >10s in …
Not sure if I fully understand what you mean :) I'll give it a shot still! So, AFAIK an unblocked task can still be blocked intentionally by the tokio runtime at any await point, even if the future being awaited was Ready. This is to prevent such starvation scenarios.
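A small sketch of that behavior, assuming (as the tokio docs describe) that its cooperative budget is consumed by its mpsc channels: the task draining a channel that is always Ready still gets forced to yield at the await point, so another task on the same single-threaded runtime is not starved.

```rust
use tokio::sync::mpsc;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // Pre-fill the channel so every recv() below is immediately Ready.
    for i in 0..100_000u32 {
        tx.send(i).unwrap();
    }

    // A task that drains the channel in a tight loop of Ready recv()s.
    let drain = tokio::spawn(async move {
        while let Some(_msg) = rx.recv().await {
            // Each recv() consumes cooperative budget; once the budget is
            // spent, tokio suspends this task at the await point even though
            // more data is available.
        }
    });

    // Let the drain task run. On a single-threaded runtime, main would starve
    // here if the drain loop never yielded.
    tokio::task::yield_now().await;
    println!("main ran again while the channel was still non-empty");

    drop(tx); // close the channel so the drain loop can finish
    drain.await.unwrap();
}
```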
The stalled error can happen in two cases: a signal send timeout or a message send timeout. I will make a PR to improve diagnostics here, as it should definitely include two things:
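As an illustration of the message-send half of that, a hedged sketch (the error type and helper are hypothetical, not the actual overseer code) of how a bounded send that does not complete within a deadline can be surfaced as a "subsystem stalled" error:

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::timeout;

// Hypothetical error mirroring the SubsystemStalled diagnosis: the send did
// not complete within the deadline because the receiver's queue is full and
// is not being drained.
#[derive(Debug)]
struct SubsystemStalled(&'static str);

// Hypothetical helper: send into a bounded queue, but treat a send that stays
// pending for longer than `deadline` as a stall of the receiving subsystem.
async fn send_or_stall<T>(
    tx: &mpsc::Sender<T>,
    msg: T,
    deadline: Duration,
    subsystem: &'static str,
) -> Result<(), SubsystemStalled> {
    timeout(deadline, tx.send(msg))
        .await
        .map_err(|_elapsed| SubsystemStalled(subsystem))? // deadline exceeded
        .map_err(|_closed| SubsystemStalled(subsystem))   // receiver dropped
}
```

With the 10s deadline discussed above, such a check only fires when the receiving side has not freed a single slot in its queue for the full 10 seconds.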
I was unable to reproduce the issue exactly as described here. #6808 will probably further improve the performance of …
While testing #6161 a bad runtime was deployed on Versi, which suddenly caused a finality stall. The assumption is that, due to a failing PVF check, all validators raised a dispute for basically every new block.
The symptoms observed are:
- dispute-coordinator failed with `Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))` (Grafana link).

Try recreating this issue by deploying at least 400 (or 500, 600) validators on Versi and causing each of them to raise a dispute.
Thanks to @s0me0ne-unkn0wn for reporting the issue and @sandreim for the initial investigation.