dispute-coordinator is stalling during huge load #6710
Comments
Things to check:
Results of former investigations: #5884
Just for the first thing to check: for our situation I think the dispute coordinator gets stuck in a "loop/cycle" and stops processing signals, and the queue fills up. I only found a single use case for sending a message from outside subsystem context, namely getting the inherent data from the provisioner: https://github.com/paritytech/polkadot/blob/master/node/core/parachains-inherent/src/lib.rs#L78 I believe this is not related to what we see.
So this is not about signals? Only about messages, meaning the message queue would need to be filled up (2000 messages) and, on top of that, we would block the overseer for 10s without processing any message? Or more precisely, not processing enough messages to unblock everybody (including the overseer) within 10 seconds? So realistically, the time for processing a single message has to be in the seconds range for this to happen.

Or is there another explanation? Yes there is: if unblocking wasn't fair. If we unblock one task by processing a message, but it immediately sends another one and we keep servicing that one (servicing means enqueuing, not processing), then we would starve other senders and this could also trigger the timeout. Anything else? How do channels work, do they guarantee some fairness?
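To make the first scenario concrete, here is a scaled-down sketch (capacity 4 instead of ~2000, and plain futures bounded mpsc is only an assumption about the underlying channel type, not the actual metered overseer channels): once the consumer stops draining its queue, further sends simply stay pending, which is the precondition for a 10s send timeout upstream.

```rust
use futures::channel::mpsc;
use futures::{FutureExt, SinkExt};

fn main() {
    futures::executor::block_on(async {
        // Scaled-down stand-in for a subsystem queue whose consumer has
        // stopped draining it.
        let (mut tx, _rx) = mpsc::channel::<u32>(4);

        for i in 0..16u32 {
            // now_or_never() completes only if the send finishes immediately.
            // Once the buffer is full it returns None, i.e. a plain
            // send().await would stay pending until the receiver drains the
            // queue - which is what lets an upstream send timeout fire.
            match tx.send(i).now_or_never() {
                Some(Ok(())) => println!("enqueued message {i}"),
                Some(Err(e)) => println!("channel closed: {e}"),
                None => {
                    println!("queue full at message {i}: send would block");
                    break;
                }
            }
        }
    });
}
```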
There is a separate bounded signal queue of size 64, and subsystems always recv() signals before messages (signals have higher priority). The subsystem stalled error is due to this queue being full. I am not sure what the dispute coordinator is doing (maybe waiting on …)
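For illustration, a minimal sketch of that "signals before messages" behavior. The Signal/Message types and the run loop below are made up for the example; the real subsystem loop is macro-generated by the overseer, so treat this only as an approximation of the idea:

```rust
use futures::channel::mpsc;
use futures::{select_biased, StreamExt};

// Hypothetical stand-ins for overseer signals and subsystem messages.
enum Signal { ActiveLeaves, BlockFinalized, Conclude }
enum Message { ImportStatements, DetermineUndisputedChain }

// A biased select always polls the small, bounded signal queue first, so a
// subsystem only falls behind on signals (and eventually fills the size-64
// queue) if it stops polling this loop altogether.
async fn run(
    mut signals: mpsc::Receiver<Signal>,
    mut messages: mpsc::Receiver<Message>,
) {
    loop {
        select_biased! {
            sig = signals.next() => match sig {
                Some(Signal::Conclude) | None => break,
                Some(_signal) => { /* handle ActiveLeaves / BlockFinalized */ }
            },
            msg = messages.next() => match msg {
                Some(_message) => { /* process the message */ }
                None => break,
            },
        }
    }
}
```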
Yes, the overseer would be blocked >10s in …
Not sure if I fully understand what you mean :) I'll give it a shot still! So, AFAIK an unblocked task can still be blocked intentionally by the tokio runtime at any await point, even if the future being awaited was Ready. This is to prevent such starvation scenarios.
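A small sketch of that behavior, assuming (as the tokio docs describe) that its cooperative budget is consumed by its mpsc channels: the task draining a channel that is always Ready still gets forced to yield at the await point, so another task on the same single-threaded runtime is not starved.

```rust
use tokio::sync::mpsc;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // Pre-fill the channel so every recv() below is immediately Ready.
    for i in 0..100_000u32 {
        tx.send(i).unwrap();
    }

    // A task that drains the channel in a tight loop of Ready recv()s.
    let drain = tokio::spawn(async move {
        while let Some(_msg) = rx.recv().await {
            // Each recv() consumes cooperative budget; once the budget is
            // spent, tokio suspends this task at the await point even though
            // more data is available.
        }
    });

    // Let the drain task run. On a single-threaded runtime, main would starve
    // here if the drain loop never yielded.
    tokio::task::yield_now().await;
    println!("main ran again while the channel was still non-empty");

    drop(tx); // close the channel so the drain loop can finish
    drain.await.unwrap();
}
```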
The stalled error can happen in two cases: a signal send timeout or a message send timeout. I will make a PR to improve diagnostics here, as it should definitely include two things:
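As an illustration of the message-send half of that, a hedged sketch (the error type and helper are hypothetical, not the actual overseer code) of how a bounded send that does not complete within a deadline can be surfaced as a "subsystem stalled" error:

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::timeout;

// Hypothetical error mirroring the SubsystemStalled diagnosis: the send did
// not complete within the deadline because the receiver's queue is full and
// is not being drained.
#[derive(Debug)]
struct SubsystemStalled(&'static str);

// Hypothetical helper: send into a bounded queue, but treat a send that stays
// pending for longer than `deadline` as a stall of the receiving subsystem.
async fn send_or_stall<T>(
    tx: &mpsc::Sender<T>,
    msg: T,
    deadline: Duration,
    subsystem: &'static str,
) -> Result<(), SubsystemStalled> {
    timeout(deadline, tx.send(msg))
        .await
        .map_err(|_elapsed| SubsystemStalled(subsystem))? // deadline exceeded
        .map_err(|_closed| SubsystemStalled(subsystem))   // receiver dropped
}
```

With the 10s deadline discussed above, such a check only fires when the receiving side has not freed a single slot in its queue for the full 10 seconds.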
I was unable to reproduce the issue exactly as described here. #6808 will probably further improve the performance of …
While testing #6161 a bad runtime was deployed on Versi, which suddenly caused a finality stall. The assumption is that, due to a failing PVF check, all validators raised a dispute for basically every new block.
The symptoms observed are:
- dispute-coordinator failed with `Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem"))` (Grafana link).

Try recreating this issue by deploying at least 400 (or 500, 600) validators on Versi and causing each of them to raise a dispute.
Thanks to @s0me0ne-unkn0wn for reporting the issue and @sandreim for the initial investigation.