Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

[node] Add additional sanitization checks to overseer #3427

Closed
6 of 8 tasks
drahnr opened this issue Jul 7, 2021 · 4 comments
Closed
6 of 8 tasks

[node] Add additional sanitization checks to overseer #3427

drahnr opened this issue Jul 7, 2021 · 4 comments
Assignees
Labels
J0-enhancement An additional feature request. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.

Comments

@drahnr
Copy link
Contributor

drahnr commented Jul 7, 2021

@drahnr drahnr added the J0-enhancement An additional feature request. label Jul 7, 2021
@drahnr drahnr self-assigned this Jul 7, 2021
@drahnr drahnr changed the title Add additional sanitization checks to overseer [node] Add additional sanitization checks to overseer Jul 7, 2021
@rphmeier
Copy link
Contributor

rphmeier commented Jul 9, 2021

annotated subsystems which messages are sent by a subsystem

This and everything under it is amazing. Looking forward to it

@drahnr drahnr mentioned this issue Jul 15, 2021
2 tasks
@bkchr
Copy link
Member

bkchr commented Sep 9, 2021

* inject a `unbounded` channel into each cycle, to prevent potential dead-locks until there is a more systematic approach with prioritization

Why do we try to work around the fundamental issue and not solve it properly? The fundamental issue here is that oneshot channels are used inside these messages and that we await them, blocking the entire subsystem. IMO oneshot channels should be forbidden in these messages or only allowed in very rare cases where we can proof that no dead lock ever happens.

Instead of oneshot channels there could just be a closure that constructs a AllMessages message and this message would be dispatched again.

Yes that would complicate the design of some subsystems, but it would make it much easier to reason about the implementation of a subsystem and problematic code pieces.

@drahnr
Copy link
Contributor Author

drahnr commented Sep 9, 2021

* inject a `unbounded` channel into each cycle, to prevent potential dead-locks until there is a more systematic approach with prioritization

Why do we try to work around the fundamental issue and not solve it properly? The fundamental issue here is that oneshot channels are used inside these messages and that we await them, blocking the entire subsystem. IMO oneshot channels should be forbidden in these messages or only allowed in very rare cases where we can proof that no dead lock ever happens.

Instead of oneshot channels there could just be a closure that constructs a AllMessages message and this message would be dispatched again.

Yes that would complicate the design of some subsystems, but it would make it much easier to reason about the implementation of a subsystem and problematic code pieces.

In principle you are correct.
The above was meant as low-impact change, and that's the precise advantage of it - not to re-architect half of our subsystems logic.

I also think that spawning additional tasks whenever there is a response channel could get the same effect with less impact. Note that the unbounded channel also solves potential congestion due to loops in our subsystem communication, which is something that is no easier with AllMessages-based request-response. I am happy to discuss this further.

IMO we should get the observability up for oneshots before making grand changes - hence implementing #3825, #3824, and #3648 to get a top down view of our entire system and which oneshots are actually delayed, and then based on that make a decision what to do about it.

@bkchr
Copy link
Member

bkchr commented Sep 10, 2021

The above was meant as low-impact change, and that's the precise advantage of it - not to re-architect half of our subsystems logic.

A low impact change would be to just use unbounded channels for all subsystem communications. This problem exists now since more than 1 year and isn't solved. And the more we roll this out, the more we get reports of users nodes going down.

Note that the unbounded channel also solves potential congestion due to loops in our subsystem communication, which is something that is no easier with AllMessages-based request-response.

Not sure how this solves congestion, it just trades memory usage for execution time. However, as one subsystem probably doesn't send 100 messages to another susbsystem at once, more like it receives one, processes the message and sends a new message to other subsystem, the cognestion is probably not happening.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
J0-enhancement An additional feature request. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
None yet
Development

No branches or pull requests

5 participants