This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

dispute-coordinator is stalling during huge load #6710

Closed
tdimitrov opened this issue Feb 14, 2023 · 7 comments
Labels
T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.

Comments

@tdimitrov
Contributor

While testing #6161, a bad runtime was deployed on Versi, which suddenly caused a finality stall. The assumption is that, due to a failing PVF check, all validators raised a dispute for basically every new block.

The symptoms observed are:

  • Node has restarted - this caused the finality stall.
  • dispute-coordinator failed with Overseer exited with error err=Generated(SubsystemStalled("dispute-coordinator-subsystem")). Grafana link
  • CPU usage was huge: Grafana link

Try recreating this issue by deploying at least 400 (or 500, 600) validators on Versi and causing each of them to raise a dispute.

Thanks to @s0me0ne-unkn0wn for reporting the issue and @sandreim for the initial investigation.

@tdimitrov tdimitrov added the T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. label Feb 14, 2023
@eskimor
Member

eskimor commented Feb 14, 2023

Things to check:

  • When does the overseer decide that a subsystem is stalled? This gives an indication of how badly the dispute-coordinator was really doing.
  • What could trigger the dispute-coordinator to stall?
    • What is taking so long? - Can we fix it?
    • Why isn't back pressure working correctly?
    • Why isn't the bottleneck PVF execution or availability recovery?
  • Can some of the above already be checked with the incident that already happened? E.g. checking out whether some existing metrics already give some idea, would be worth a shot.

Results of former investigations: #5884

@sandreim
Contributor

sandreim commented Feb 14, 2023

Just for the first thing to check:
The orchestra implementation identifies stalled subsystems when sending messages (only the ones we send using the Overseer handle, not those coming from a subsystem context) or signals. There is a timeout (10s) which represents the maximum amount of time we wait on the bounded signal/message queue to send the message. If the timeout fires, we declare the destination subsystem stalled.
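
To make the mechanism concrete, here is a minimal sketch of "declare the receiver stalled if a bounded send does not complete within a timeout". It is not the orchestra code: orchestra is async, while this sketch uses blocking std channels and polling, and the names `SendError` and `send_with_timeout` are hypothetical.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch; not the orchestra API.
#[derive(Debug, PartialEq)]
pub enum SendError {
    /// The bounded queue stayed full for the whole timeout.
    SubsystemStalled(&'static str),
    /// The receiving side was dropped.
    Disconnected,
}

/// Try to enqueue `msg` on a bounded queue, giving up once `timeout` elapses.
/// A full queue for the entire timeout is treated as a stalled receiver.
pub fn send_with_timeout<T>(
    tx: &SyncSender<T>,
    mut msg: T,
    timeout: Duration,
    subsystem: &'static str,
) -> Result<(), SendError> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Disconnected(_)) => return Err(SendError::Disconnected),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(SendError::SubsystemStalled(subsystem));
                }
                // Take the message back and retry shortly.
                msg = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

With a queue created by `std::sync::mpsc::sync_channel(2)` and a receiver that never calls `recv()`, the third call to `send_with_timeout` returns `Err(SendError::SubsystemStalled(..))` once the timeout has passed. A single `recv()` on the other side is enough to unblock the next send.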

For our situation I think the dispute coordinator gets stuck in a "loop/cycle" and stops processing signals and the queue fills up.

I only found a single use case for sending a message from outside a subsystem context: getting the inherent data from the provisioner: https://github.com/paritytech/polkadot/blob/master/node/core/parachains-inherent/src/lib.rs#L78 I believe this is not related to what we see.

@eskimor
Member

eskimor commented Feb 14, 2023

So this is not about signals? Only about messages, meaning the message queue would need to be filled up (2000 messages) and on top of that we would block the overseer for 10s, without processing any message? Or more precisely not processing enough messages to unblock everybody (including the overseer) within 10 seconds? So realistically, the time for processing a single message has to be in the seconds range for this to happen. Or is there another explanation?

Yes there is: If unblocking wasn't fair. If we unblock one task, by processing a message but it immediately sends another one and we would keep servicing that one (servicing means enqueuing - not processing), then we would starve other senders and this could also trigger the timeout.

Anything else? How do channels work, do they guarantee some fairness?

@sandreim
Contributor

sandreim commented Feb 14, 2023

There is a separate bounded signal queue of size 64, and subsystems always recv() signals before messages (signals have higher priority). The subsystem stalled error is due to this queue being full. I am not sure what the dispute coordinator is doing (maybe waiting on ApprovalVotingMessage::GetApprovalSignaturesForCandidate), but it is not calling recv() on its queue for at least 10s.
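
The "signals before messages" rule can be sketched with two queues and a receive function that always checks the signal queue first. This is only an illustration with blocking std channels; `Incoming` and `poll_prioritized` are hypothetical names, not the subsystem context API.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// What a subsystem pulls from its queues in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Incoming<S, M> {
    Signal(S),
    Message(M),
}

/// Non-blocking poll that always serves the signal queue before the
/// message queue. Returns `None` when nothing is available on either queue.
pub fn poll_prioritized<S, M>(
    signals: &Receiver<S>,
    messages: &Receiver<M>,
) -> Option<Incoming<S, M>> {
    match signals.try_recv() {
        Ok(signal) => return Some(Incoming::Signal(signal)),
        // No signal pending (or signal sender gone): fall through to messages.
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
    }
    messages.try_recv().ok().map(Incoming::Message)
}
```

If a message is enqueued first and a signal second, `poll_prioritized` still returns the signal first. The flip side is the failure mode discussed here: a subsystem that never polls leaves both queues filling up, and the smaller signal queue fills first.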

@sandreim
Contributor

> So this is not about signals? Only about messages, meaning the message queue would need to be filled up (2000 messages) and on top of that we would block the overseer for 10s, without processing any message? Or more precisely not processing enough messages to unblock everybody (including the overseer) within 10 seconds? So realistically, the time for processing a single message has to be in the seconds range for this to happen. Or is there another explanation?

Yes, the overseer would be blocked for more than 10s in broadcast_signal when sending to the dispute coordinator. The overseer could only be unblocked by the dispute coordinator calling ctx.recv(), but as we see, that never happens.

> Yes there is: If unblocking wasn't fair. If we unblock one task, by processing a message but it immediately sends another one and we would keep servicing that one (servicing means enqueuing - not processing), then we would starve other senders and this could also trigger the timeout.

> Anything else? How do channels work, do they guarantee some fairness?

Not sure I fully understand what you mean :) but I'll give it a shot. AFAIK, the tokio runtime can intentionally suspend an otherwise runnable task at an await point, even if the future being awaited was Ready. This is to prevent such starvation scenarios.

@vstakhov
Contributor

The stalled error can happen in two cases: a signal send timeout or a message send timeout. I will make a PR to improve diagnostics here, as the error should definitely include two things:

  • signal or message - which of the two caused the stall
  • message/signal type - which exact message or signal caused it
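
As a rough illustration of what such a diagnostic could carry, here is a sketch of an error value that records both pieces of information. The types `StalledOn` and `StallReport` and the message format are assumptions made for this example; they are not taken from orchestra or from the planned PR.

```rust
use std::fmt;

/// Which kind of send hit the timeout (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StalledOn {
    Signal,
    Message,
}

/// Hypothetical diagnostic record for a stalled subsystem.
#[derive(Debug, Clone, PartialEq)]
pub struct StallReport {
    /// Name of the destination subsystem.
    pub subsystem: &'static str,
    /// Whether a signal or a message send timed out.
    pub stalled_on: StalledOn,
    /// Type name of the signal or message that could not be enqueued.
    pub item_type: &'static str,
}

impl fmt::Display for StallReport {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let kind = match self.stalled_on {
            StalledOn::Signal => "signal",
            StalledOn::Message => "message",
        };
        write!(
            f,
            "{} stalled: {} send timed out for {}",
            self.subsystem, kind, self.item_type
        )
    }
}
```

A report built with `subsystem: "dispute-coordinator-subsystem"`, `stalled_on: StalledOn::Signal` and `item_type: "ActiveLeavesUpdate"` formats as `dispute-coordinator-subsystem stalled: signal send timed out for ActiveLeavesUpdate`, which answers both questions in a single log line.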

@tdimitrov
Contributor Author

I was unable to reproduce the issue exactly as described here. dispute-coordinator really does stall, because it doesn't process ActiveLeavesUpdate fast enough under load. But the stall only happens after hours with 4 malus nodes in the network, and only when the validators are running the test branch used at the time of the incident (#6530). I was unable to reproduce the issue with the latest master.

#6808 will probably further improve the performance of dispute-coordinator under load.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Disputes + Slashing + Rewards Mar 1, 2023