
Remove the timeout system #73

Open
ordian opened this issue Jan 29, 2024 · 11 comments
@ordian
Member

ordian commented Jan 29, 2024

> The issue for validators definitely deserves a proper fix (and repro). I think the whole system with timeouts is a bandaid by itself; IIRC it was implemented as a poor man's detection of deadlocks between subsystems, and in that case shutting down makes sense. However, I don't think it's unreasonable for subsystems to sometimes take a long time to process messages (especially if run in a VM that shares CPU resources), so I would question this mechanism in the first place. If it is caused by an actual deadlock, that definitely needs to be fixed, along with better prevention mechanisms.

Originally posted by @ordian in paritytech/polkadot-sdk#1730 (comment)

The issue the timeouts were trying to solve: when a subsystem is either deadlocked or unreasonably slow, we should detect that and shut down the node gracefully.

IMHO we should not shut down on slowness (it might be caused by blips in some other system), and I think we need a more robust deadlock prevention mechanism. So I'm suggesting removing the timeouts completely once a better deadlock prevention mechanism is in place.

The dotgraph was a useful feature, but it has false positives, since AFAIK it doesn't distinguish between bounded and unbounded channels, which is why we're getting these annoying warnings about cycles on the polkadot-sdk build.

@eskimor
Member

eskimor commented Jan 29, 2024

A watchdog mechanism is actually the only practical way to detect deadlocks or other kinds of malfunctions. I doubt there is a better way. If timeouts trigger when they should not, then they are likely too strict. They can be quite generous (and I think they already are), but if the node is stuck, you really want to have it restarted sooner rather than later.

@ordian
Member Author

ordian commented Jan 29, 2024

> A watchdog mechanism is actually the only practical way to detect deadlocks

I disagree here. (Most types of) deadlocks can be detected statically: you construct a bipartite graph of threads and resources (be it a Mutex or something else) and check for cycles (this is what dotgraph does, btw).
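
To illustrate the idea (a minimal hand-rolled sketch with hypothetical task and resource names, not what dotgraph actually emits): build the wait-for graph and run a DFS; a node revisited while it is still on the DFS stack closes a cycle, i.e. a potential deadlock.

```rust
use std::collections::{HashMap, HashSet};

/// A node in the wait-for graph: a task (subsystem) or a resource
/// (channel, Mutex, ...) it can block on.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Node {
    Task(&'static str),
    Resource(&'static str),
}

/// DFS cycle detection: a node revisited while still on the current
/// DFS stack closes a cycle.
fn has_cycle(edges: &HashMap<Node, Vec<Node>>) -> bool {
    fn visit(
        n: Node,
        edges: &HashMap<Node, Vec<Node>>,
        done: &mut HashSet<Node>,
        stack: &mut HashSet<Node>,
    ) -> bool {
        if stack.contains(&n) {
            return true; // back edge: cycle found
        }
        if !done.insert(n) {
            return false; // already fully explored, cannot be on a new cycle
        }
        stack.insert(n);
        let cyclic = edges
            .get(&n)
            .into_iter()
            .flatten()
            .any(|&next| visit(next, edges, done, stack));
        stack.remove(&n);
        cyclic
    }

    let mut done = HashSet::new();
    edges
        .keys()
        .any(|&n| visit(n, edges, &mut done, &mut HashSet::new()))
}

fn main() {
    // Hypothetical example: A waits on a channel served by B, which in
    // turn waits on a channel served by A.
    let mut g = HashMap::new();
    g.insert(Node::Task("A"), vec![Node::Resource("to_b")]);
    g.insert(Node::Resource("to_b"), vec![Node::Task("B")]);
    g.insert(Node::Task("B"), vec![Node::Resource("to_a")]);
    g.insert(Node::Resource("to_a"), vec![Node::Task("A")]);
    assert!(has_cycle(&g));
}
```

The useful refinement is exactly the bounded/unbounded distinction mentioned above: sends over unbounded channels cannot block the sender, so a checker should only count edges over bounded channels towards a deadlock cycle.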

@eskimor
Member

eskimor commented Jan 29, 2024

Not every cycle is a deadlock, and not every malfunction is an actual deadlock. A fully generic solution would amount to solving the halting problem.

Enhancing the architecture to avoid the need for cycles is certainly a worthwhile exploratory target, but it doesn't fully solve the issue.

@alexggh
Contributor

alexggh commented Jan 30, 2024

> A watchdog mechanism is actually the only practical way to detect deadlocks or other kinds of malfunctions

The interesting part is that we have most of the pieces in place to implement the watchdog: all subsystems receive signals and need to process them, so we could most likely detect quite easily when a subsystem stops processing signals, and assert on that.
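
A rough sketch of that kind of signal-liveness watchdog (all names hypothetical, not orchestra's actual API): the subsystem bumps a shared heartbeat whenever it handles a signal, and a separate task flags it when the heartbeat goes stale.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Shared liveness marker: bumped by the subsystem on every processed
/// signal, inspected by the watchdog.
struct Heartbeat {
    start: Instant,
    last_signal_ms: AtomicU64,
}

impl Heartbeat {
    fn new() -> Arc<Self> {
        Arc::new(Self { start: Instant::now(), last_signal_ms: AtomicU64::new(0) })
    }

    /// Call this each time a signal is handled.
    fn bump(&self) {
        let now = self.start.elapsed().as_millis() as u64;
        self.last_signal_ms.store(now, Ordering::Relaxed);
    }

    /// Time since the last processed signal.
    fn staleness(&self) -> Duration {
        let last = Duration::from_millis(self.last_signal_ms.load(Ordering::Relaxed));
        self.start.elapsed().saturating_sub(last)
    }
}

/// Watchdog: if no signal was processed within `limit`, consider the
/// subsystem stuck (a real node might start a graceful shutdown here).
async fn watchdog(hb: Arc<Heartbeat>, limit: Duration) {
    let mut tick = tokio::time::interval(limit / 2);
    loop {
        tick.tick().await;
        if hb.staleness() > limit {
            eprintln!("subsystem stopped processing signals ({:?} ago)", hb.staleness());
        }
    }
}

#[tokio::main]
async fn main() {
    let hb = Heartbeat::new();
    tokio::spawn(watchdog(hb.clone(), Duration::from_secs(10)));
    // In the subsystem's event loop: call hb.bump() on every received signal.
    hb.bump();
}
```

The point is that the liveness criterion becomes "did you acknowledge a signal recently", not "is your message queue draining", which sidesteps the slow-but-healthy case.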

@eskimor
Member

eskimor commented Jan 30, 2024

> The interesting part is that we have most of the pieces in place to implement the watchdog: all subsystems receive signals and need to process them, so we could most likely detect quite easily when a subsystem stops processing signals, and assert on that.

Isn't this exactly what we already have? We send signals and queue them up to a certain point. Eventually the queue fills, we block, and we time out*). If the subsystem responds to signals in a reasonable time frame, we never block and therefore also never time out. And now, with the separate recv_signal, we even have a way for subsystems to empty the queue in case they really have some uninterruptible long-running task.

*) This is at least my understanding of how this works; it has been a while since I last read the code. If my understanding is off, then this is a different discussion.

@alexggh
Contributor

alexggh commented Jan 30, 2024

The way I see it, the difference between signals and normal messages is that signals are few and they are prioritised in recv. So while for messages you can time out because the queue is full and you are slow at processing them, for signals there is no good reason to fail to process them in time (or is there? :D).

So that's why I think it is still reasonable to time out on signal sending, but not on messages.
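
A minimal illustration of that prioritisation (made-up channel types, not orchestra's actual receive loop): a biased tokio::select! polls the signal arm first, so a pending signal is always handled before any backlog of regular messages.

```rust
use tokio::sync::mpsc;

/// Prioritised receive loop: with `biased`, select! polls arms top to
/// bottom, so pending signals always win over regular messages.
async fn run(
    mut signals: mpsc::Receiver<&'static str>,
    mut messages: mpsc::Receiver<&'static str>,
) {
    loop {
        tokio::select! {
            biased;
            Some(sig) = signals.recv() => println!("signal: {sig}"),
            Some(msg) = messages.recv() => println!("message: {msg}"),
            else => break, // both channels closed and drained
        }
    }
}

#[tokio::main]
async fn main() {
    let (sig_tx, sig_rx) = mpsc::channel(8);
    let (msg_tx, msg_rx) = mpsc::channel(1024);
    msg_tx.send("work item").await.unwrap();
    sig_tx.send("ActiveLeavesUpdate").await.unwrap();
    drop((sig_tx, msg_tx));
    run(sig_rx, msg_rx).await; // prints the signal first
}
```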

@ordian
Member Author

ordian commented Jan 30, 2024

I was also thinking of removing the signal queue and replacing it with just the latest state. The subsystems need to handle missed block imports anyway and walk down the chain to the latest known block using determine_new_heads, and missing some finalized notifications due to slowness is also not bad: we want to know the latest finalized number anyway.
So we could use something like https://docs.rs/tokio/latest/tokio/sync/index.html#watch-channel there.
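
A rough sketch of what that could look like, assuming a hypothetical View type that carries the full latest state (all active leaves plus the latest finalized number) rather than deltas:

```rust
use std::collections::HashSet;
use tokio::sync::watch;

/// Hypothetical latest-state payload: the full current view, not a delta.
#[derive(Clone, Default)]
struct View {
    active_leaves: HashSet<[u8; 32]>, // bounded by a small constant in practice
    finalized_number: u32,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(View::default());

    // Subsystem side: `changed()` wakes on new state, `borrow_and_update()`
    // yields only the most recent view; intermediate updates are coalesced,
    // so state updates can never pile up behind a slow subsystem.
    let subsystem = tokio::spawn(async move {
        while rx.changed().await.is_ok() {
            let view = rx.borrow_and_update().clone();
            // Walk back from `view.active_leaves` to the last block we saw
            // (roughly what `determine_new_heads` does today).
            println!("finalized: {}", view.finalized_number);
        }
    });

    // Overseer side: publish the full new view on every import/finality event.
    tx.send(View { finalized_number: 42, ..View::default() }).unwrap();
    drop(tx); // closing the sender ends the subsystem loop
    subsystem.await.unwrap();
}
```

Because a watch channel stores only the most recent value, a slow subsystem simply observes fewer intermediate views instead of backing up a queue; the trade-offs, as discussed below, are losing the implicit watchdog and losing delta-based updates.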

@eskimor
Member

eskimor commented Feb 2, 2024

> I was also thinking of removing the signal queue and replacing it with just the latest state. The subsystems need to handle missed block imports anyway and walk down the chain to the latest known block using determine_new_heads, and missing some finalized notifications due to slowness is also not bad: we want to know the latest finalized number anyway.
> So we could use something like https://docs.rs/tokio/latest/tokio/sync/index.html#watch-channel there.

That is interesting and does make a lot of sense! We would lose the watchdog though 😬 But other than that, I think this could help with performance and might also lead to cleaner code in some places.

@drahnr
Collaborator

drahnr commented Feb 3, 2024

The watch channel will break the active leaves concept, so competing blocks will be difficult to model, IIUC. Finality is the easy case, since there is no wrong answer there, but tracking active leaves is different.

@ordian
Member Author

ordian commented Feb 3, 2024

> The watch channel will break the active leaves concept, so competing blocks will be difficult to model, IIUC.

Not if we send all currently active leaves (our view). And that set is bounded by a small constant, so it shouldn't be a problem, IIUC.

@drahnr
Collaborator

drahnr commented Feb 4, 2024

> > The watch channel will break the active leaves concept, so competing blocks will be difficult to model, IIUC.
>
> Not if we send all currently active leaves (our view). And that set is bounded by a small constant, so it shouldn't be a problem, IIUC.

That shifts the actual update calculation to the individual subsystem, which is fair. I'm just trying to understand what implications would arise for gensyn and our current usage. Off the top of my head it'd be feasible if really needed, although not necessarily desirable: our numbers are much, much larger (ballpark thousands at the lower end, likely a few orders of magnitude higher), so deltas help us compress them a lot.

Let me ponder it for a bit; my hunch is still that a subsystem unable to process signals fast enough has a borked design and should be fixed.

A watchdog is, by design, a last resort. If timeouts are not generous enough, that points either to timeouts that are too small (e.g. CPU shares not taken into account, IO congestion, ...) or, again, to a borked subsystem architecture.

An improved deadlock-prevention mechanism is feasible: by limiting the individual enum variants a subsystem can send, rather than only restricting it to a specific enum, we would improve a) the generated graph and b) compile-time errors; see the sketch below.
It won't fix the issue of missing path coverage from receiving a message to sending one on to another subsystem. Nor should we necessarily try to solve a foundational CS problem with orchestra :)
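
As a sketch of that direction (all names hypothetical, not orchestra's generated code): encode per-variant send permissions as trait impls, so an illegal send becomes a compile-time error and the generated graph only contains edges that can actually occur.

```rust
/// Hypothetical message types of two subsystems.
struct ChainApiMessage;
#[allow(dead_code)] // only referenced in the comments below
struct AvailabilityMessage;

/// A subsystem's sender only implements `SendMsg<M>` for the message
/// types it is declared to send.
trait SendMsg<M> {
    fn send(&mut self, msg: M);
}

/// Sender handle generated for the (hypothetical) backing subsystem.
struct BackingSender;

impl SendMsg<ChainApiMessage> for BackingSender {
    fn send(&mut self, _msg: ChainApiMessage) {
        // enqueue on the channel towards the chain-api subsystem
    }
}
// Deliberately no `impl SendMsg<AvailabilityMessage> for BackingSender`.

fn main() {
    let mut sender = BackingSender;
    sender.send(ChainApiMessage);
    // sender.send(AvailabilityMessage); // compile error: trait not implemented
}
```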

tl;dr not a fan, it'd make our life (a little) harder
