
Revisit and unify error handling in subsystems #752

Open
ordian opened this issue Nov 25, 2022 · 7 comments
Labels
I4-refactor Code needs refactoring.

Comments

@ordian
Member

ordian commented Nov 25, 2022

After some discussion with @s0me0ne-unkn0wn about paritytech/polkadot#6161 (comment), we've noticed some inconsistencies in how we treat different errors with respect to fatality.

For example, https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/subsystem-util/src/lib.rs#L85-L87

doesn't derive #[fatality::fatality(splitable)], and the caller treats all errors as non-fatal, whereas some of them, like Oneshot, can indicate that the node is shutting down. There's a sub-module errors that does it better:
https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/subsystem-util/src/runtime/error.rs#L25-L27
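
For reference, a minimal sketch of that splitable style, assuming the fatality crate's derive; the second variant and the error messages are illustrative rather than the exact definitions in subsystem-util:

```rust
use futures::channel::oneshot;

// Sketch only: the derive marks cancellation as fatal while leaving
// per-request failures recoverable, so callers can treat the two differently.
#[fatality::fatality(splitable)]
pub enum Error {
    /// The runtime API subsystem dropped the sender, which usually means
    /// the node is shutting down, so it is marked fatal.
    #[fatal]
    #[error("runtime request got canceled")]
    RuntimeRequestCanceled(oneshot::Canceled),

    /// A per-request failure the caller can just log and move on from.
    #[error("runtime API returned an unexpected response")]
    BadResponse,
}
```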

Another thing we've noticed is a lot of duplication in some places: https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/core/backing/src/error.rs#L49-L56

Probably we don't really need to distinguish between all of them.

@slumber
Contributor

slumber commented Nov 25, 2022

Probably we don't really need to distinguish between all of them.

I think this is useful for logging.

@s0me0ne-unkn0wn
Contributor

I think this is useful for logging.

In general, yes. In this particular case, I'm not sure. oneshot::Canceled means the node is going down, and the error is propagated through the whole call stack; I want to believe that nobody cares which exact part of each subsystem it hits.

The point is that if we want strongly typed errors, we also want to convert between them when the fatal ones propagate. When a runtime API request results in subsystem_util::runtime::Error::RuntimeRequestCanceled, the subsystem wants to convert it into its own error, say backing::Error, in a convenient way (e.g. by implementing From<runtime::Error> for backing::Error) and propagate it further to the caller, who shouldn't care much whether it happened in a ValidateFromChainState or a StoreAvailableData call, because it only wants to propagate it further too. It doesn't make much sense to differentiate them as we're going down anyway, and the logs will be flooded with canceled subsystem requests if the node is working hard at the moment.
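
As a rough sketch of that conversion (the enum and variant names are illustrative, not the actual backing definitions, and the crate path is as assumed from node/subsystem-util):

```rust
use polkadot_node_subsystem_util::runtime;

// Sketch only: a single coarse variant with a `#[from]` conversion, instead of
// one variant per call site (ValidateFromChainState, StoreAvailableData, ...).
#[derive(thiserror::Error, Debug)]
pub enum Error {
    #[error(transparent)]
    Runtime(#[from] runtime::Error),
}

// Call sites then just propagate with `?`, and the caller receives one
// `Error::Runtime` regardless of which request was canceled.
```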

And if we do differentiate them and want to add one more variant like backing::Error::ExecutorParamsRequest(oneshot::Canceled), we'll also have to fix all the callers, which doesn't make much sense to me.

The subsystem will still be able to log the exact position where the error happened if that's important in any particular location, but it shouldn't overburden the caller with information that is too specific to it.

Does it make sense?

@slumber
Contributor

slumber commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down. I don't see where it is coming from.

The line number in the log would resolve the issue, but I'm not sure if it's available in non-debug builds.

@ordian
Member Author

ordian commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down.

Here it does, in all three cases.

@s0me0ne-unkn0wn
Contributor

@slumber I mean, if we fail during, say, ValidateFromChainState phase, instead of returning backing::Error::ValidateFromChainState, we could just .map_err(|e| { log_error!("Screwed up in ValidateFromChainState"); backing::Error::GeneralRuntimeCommunicationError(e) })?

So we still log exactly what happened and let the caller know that something went wrong, but without details that are redundant to the caller.
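
A minimal sketch of that pattern (the error type, variant, log target, and request are all illustrative, and tracing::error! stands in for whatever logging macro the subsystem actually uses):

```rust
use futures::channel::oneshot;

const LOG_TARGET: &str = "parachain::backing";

// Hypothetical coarse error, standing in for something like
// `backing::Error::GeneralRuntimeCommunicationError` from the comment above.
#[derive(thiserror::Error, Debug)]
enum Error {
    #[error("runtime communication failed")]
    RuntimeCommunication(#[source] oneshot::Canceled),
}

async fn validate_from_chain_state(rx: oneshot::Receiver<bool>) -> Result<bool, Error> {
    // Log the precise call site here, but hand the caller only the coarse variant.
    rx.await.map_err(|e| {
        tracing::error!(target: LOG_TARGET, error = ?e, "ValidateFromChainState request was canceled");
        Error::RuntimeCommunication(e)
    })
}
```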

@slumber
Contributor

slumber commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down.

Here it does, in all three cases.

But in general it's not the rule. This issue doesn't target backing in particular, right?

@ordian
Member Author

ordian commented Nov 26, 2022

I think this holds in most cases where we make a request to a subsystem that doesn't cancel its tasks (i.e. one that isn't job-based). We could have a generic fatal error SubsystemIsShuttingDown(subsystem_name: &'static str) instead of carrying this useless oneshot::Canceled inner field.
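
Roughly, as a sketch (this variant doesn't exist anywhere yet; the shape is just illustrative):

```rust
// Sketch only: one generic fatal variant carrying the name of the unreachable
// subsystem, instead of a per-call-site `oneshot::Canceled` field.
#[fatality::fatality(splitable)]
pub enum Error {
    #[fatal]
    #[error("{0} subsystem is shutting down")]
    SubsystemIsShuttingDown(&'static str),
}

// Callers would then map cancellations explicitly, e.g.:
// rx.await.map_err(|_| Error::SubsystemIsShuttingDown("candidate-backing"))?;
```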

This issue doesn't target backing in particular, right?

Yes. For example, in av-store we have a custom error type with a manual is_fatal implementation instead of using fatality.
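
For contrast, a sketch of that manual style (the variants are illustrative, not the actual av-store error):

```rust
use futures::channel::oneshot;

// Sketch only: fatality is decided in a hand-written method rather than via
// the `#[fatal]` markers that the `fatality` derive provides.
#[derive(thiserror::Error, Debug)]
pub enum Error {
    #[error("oneshot channel got canceled")]
    Oneshot(#[from] oneshot::Canceled),

    #[error("bad input data")]
    BadInput,
}

impl Error {
    pub fn is_fatal(&self) -> bool {
        match self {
            Self::Oneshot(_) => true,
            Self::BadInput => false,
        }
    }
}
```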

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
s0me0ne-unkn0wn mentioned this issue Jan 17, 2024