
Revisit and unify error handling in subsystems #752

Open
ordian opened this issue Nov 25, 2022 · 7 comments
Labels
I4-refactor Code needs refactoring.

Comments

@ordian
Member

ordian commented Nov 25, 2022

After some discussion with @s0me0ne-unkn0wn about paritytech/polkadot#6161 (comment), we've noticed some inconsistencies in how we treat different errors with respect to fatality.

For example, https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/subsystem-util/src/lib.rs#L85-L87

doesn't derive #[fatality::fatality(splitable)], and the caller treats all errors as non-fatal, whereas some of them, like Oneshot, can indicate that the node is shutting down. There's a sub-module errors that does it better:
https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/subsystem-util/src/runtime/error.rs#L25-L27
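
For reference, a minimal sketch of that splitable style, assuming the fatality crate's derive; the second variant and the error messages are illustrative rather than the exact definitions in subsystem-util:

```rust
use futures::channel::oneshot;

// Sketch only: the derive marks cancellation as fatal while leaving
// per-request failures recoverable, so callers can treat the two differently.
#[fatality::fatality(splitable)]
pub enum Error {
    /// The runtime API subsystem dropped the sender, which usually means
    /// the node is shutting down, so it is marked fatal.
    #[fatal]
    #[error("runtime request got canceled")]
    RuntimeRequestCanceled(oneshot::Canceled),

    /// A per-request failure the caller can just log and move on from.
    #[error("runtime API returned an unexpected response")]
    BadResponse,
}
```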

Another thing we've noticed is a lot of duplication in some places: https://github.com/paritytech/polkadot/blob/795b20c7150b04759ebddadd419e2f26fa382519/node/core/backing/src/error.rs#L49-L56

Probably we don't really need to distinguish between all of them.

@slumber
Contributor

slumber commented Nov 25, 2022

Probably we don't really need to distinguish between all of them.

I think this is useful for logging.

@s0me0ne-unkn0wn
Contributor

I think this is useful for logging.

In general, yes. In this particular case, I'm not sure. oneshot::Canceled means the node is going down, and the error is propagated through the whole call stack; I want to believe that nobody cares which exact part of each subsystem it hits.

The point is that if we want strongly typed errors, we also want to convert between them when the fatal ones propagate. When a runtime API request results in subsystem_util::runtime::Error::RuntimeRequestCanceled, the subsystem wants to convert it into its own error, say backing::Error, in a convenient way (e.g. by implementing From<runtime::Error> for backing::Error) and propagate it further to the caller, who shouldn't care much whether it happened in a ValidateFromChainState or a StoreAvailableData call, because it only wants to propagate it further too. It doesn't make much sense to differentiate them as we're going down anyway, and the logs will be flooded with canceled subsystem requests if the node is working hard at the moment.
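
As a rough sketch of that conversion (the enum and variant names are illustrative, not the actual backing definitions, and the crate path is as assumed from node/subsystem-util):

```rust
use polkadot_node_subsystem_util::runtime;

// Sketch only: a single coarse variant with a `#[from]` conversion, instead of
// one variant per call site (ValidateFromChainState, StoreAvailableData, ...).
#[derive(thiserror::Error, Debug)]
pub enum Error {
    #[error(transparent)]
    Runtime(#[from] runtime::Error),
}

// Call sites then just propagate with `?`, and the caller receives one
// `Error::Runtime` regardless of which request was canceled.
```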

And if we do differentiate them and want to add one more variant like backing::Error::ExecutorParamsRequest(oneshot::Canceled), we'll also have to fix all the callers, which doesn't make much sense to me.

The subsystem will still be able to log the exact position where the error happened if that's important in any particular location, but it shouldn't overburden the caller with information that is too specific to it.

Does it make sense?

@slumber
Contributor

slumber commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down. I don't see where it is coming from.

The line number in the log would resolve the issue, but I'm not sure if it's available in non-debug builds.

@ordian
Member Author

ordian commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down.

Here it does, in all three cases.

@s0me0ne-unkn0wn
Contributor

@slumber I mean, if we fail during, say, ValidateFromChainState phase, instead of returning backing::Error::ValidateFromChainState, we could just .map_err(|e| { log_error!("Screwed up in ValidateFromChainState"); backing::Error::GeneralRuntimeCommunicationError(e) })?

So we still log exactly what happened and let the caller know that something went wrong, but without details that are redundant to the caller.
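
A minimal sketch of that pattern (the error type, variant, log target, and request are all illustrative, and tracing::error! stands in for whatever logging macro the subsystem actually uses):

```rust
use futures::channel::oneshot;

const LOG_TARGET: &str = "parachain::backing";

// Hypothetical coarse error, standing in for something like
// `backing::Error::GeneralRuntimeCommunicationError` from the comment above.
#[derive(thiserror::Error, Debug)]
enum Error {
    #[error("runtime communication failed")]
    RuntimeCommunication(#[source] oneshot::Canceled),
}

async fn validate_from_chain_state(rx: oneshot::Receiver<bool>) -> Result<bool, Error> {
    // Log the precise call site here, but hand the caller only the coarse variant.
    rx.await.map_err(|e| {
        tracing::error!(target: LOG_TARGET, error = ?e, "ValidateFromChainState request was canceled");
        Error::RuntimeCommunication(e)
    })
}
```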

@slumber
Contributor

slumber commented Nov 26, 2022

oneshot::Canceled doesn't necessarily mean the node is going down.

Here it does, in all three cases.

But in general it's not the rule. This issue doesn't target backing in particular, right?

@ordian
Member Author

ordian commented Nov 26, 2022

I think this holds in most cases where we make a request to a subsystem that doesn't cancel its tasks (i.e. one that isn't job-based). We could have a generic fatal error SubsystemIsShuttingDown(subsystem_name: &'static str) instead of carrying this useless oneshot::Canceled inner field.
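
Roughly, as a sketch (this variant doesn't exist anywhere yet; the shape is just illustrative):

```rust
// Sketch only: one generic fatal variant carrying the name of the unreachable
// subsystem, instead of a per-call-site `oneshot::Canceled` field.
#[fatality::fatality(splitable)]
pub enum Error {
    #[fatal]
    #[error("{0} subsystem is shutting down")]
    SubsystemIsShuttingDown(&'static str),
}

// Callers would then map cancellations explicitly, e.g.:
// rx.await.map_err(|_| Error::SubsystemIsShuttingDown("candidate-backing"))?;
```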

This issue doesn't target backing in particular, right?

Yes. For example, in av-store we have a custom error type with a manual is_fatal implementation instead of using fatality.
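
For contrast, a sketch of that manual style (the variants are illustrative, not the actual av-store error):

```rust
use futures::channel::oneshot;

// Sketch only: fatality is decided in a hand-written method rather than via
// the `#[fatal]` markers that the `fatality` derive provides.
#[derive(thiserror::Error, Debug)]
pub enum Error {
    #[error("oneshot channel got canceled")]
    Oneshot(#[from] oneshot::Canceled),

    #[error("bad input data")]
    BadInput,
}

impl Error {
    pub fn is_fatal(&self) -> bool {
        match self {
            Self::Oneshot(_) => true,
            Self::BadInput => false,
        }
    }
}
```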

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
s0me0ne-unkn0wn mentioned this issue Jan 17, 2024