Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor availability-recovery strategies #1457

Merged
merged 15 commits into from
Sep 20, 2023

Conversation

alindima
Copy link
Contributor

@alindima alindima commented Sep 8, 2023

Refactors availability-recovery strategies to allow for easily adding new hotpaths and failover mechanisms.

The new interface allows for chaining multiple RecoveryStrategy-es together, to cleanly express the relationship between them and share state and code where neccessary/possible:

This was done in order to aid in implementing new hotpaths like systematic chunks recovery and fetching from approval checkers.

Thanks to this design, intermediate state can be shared between the strategies. For example, if the systematic chunks recovery retrieved less than the needed amount of chunks, pass them over to the next FetchChunks strategy, which will only need to recover the remaining number of chunks.

Draft example of how a systematic chunk recovery strategy would look: 667d870 (notice how easy it was to add and reuse code)

Note that this PR doesn't itself add any new strategy, it should fully preserve backwards compatiblity in terms of functionality. Follow-up PRs to add new strategies will come.

@alindima alindima added A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix T0-node This PR/Issue is related to the topic “node”. I4-refactor Code needs refactoring. labels Sep 8, 2023
@alindima alindima changed the title RFC: Refactor availability-recovery strategies Refactor availability-recovery strategies Sep 8, 2023
Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alindima . I like the direction this is going. There are a few things I think can be improved and also we should add some tests to cover usage of multiple strategies.

polkadot/node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
/// Intermediate/common data that must be passed between `RecoveryStrategy`s belonging to the
/// same `RecoveryTask`.
pub struct State {
/// Chunks received so far.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move the RecoveryParams here. These shouldn't change across strategies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could, but I purposefully only placed mutable data into the State.
as you said, params don't change across strategies, so we only borrow them immutably. That's why I left them separate.
They are still shared across strategies, but by virtue of the RecoveryTask struct

polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved
@alindima
Copy link
Contributor Author

Thanks @alindima . I like the direction this is going.

Thanks!

There are a few things I think can be improved and also we should add some tests to cover usage of multiple strategies.

In regards to tests:
This PR shouldn't change any of the logic concerning availability-recovery. It's meant to be preserve functionality while allowing to add more hotpaths and optimisations later on (which will come accompanied by a lot of tests).

I saw that there are arguably a good amount of tests that already exercise the multiple recovery paths.

Sounds good for now?

Copy link
Member

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable

polkadot/node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved
@@ -222,7 +222,7 @@ impl State {
sender
.send_message(NetworkBridgeTxMessage::SendRequests(
requests,
IfDisconnected::TryConnect,
IfDisconnected::ImmediateError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! according to what I read on the issue, the cumulus 0002-pov_recovery test should be failing with the current code of the PR, right? but it's not

I'll roll back to using TryConnect just to be safe in this PR.
I'll think about the optimisations we can do in upcoming PRs, because I'd rather have an incremental approach and keep this refactoring backwards-compatible. @eskimor @sandreim sounds good?

If so, can we have this PR merged if it looks good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sounds good. The fact that test is passing with ImmediateError is concerning, please open an issue - this needs further investigation cc @skunert

Copy link
Contributor

@skunert skunert Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this a bit, to me it looks like zombienet omits some of the args we pass to the collator. In the logs, I see no trace of the --reserved-only flag that we pass in the toml. paritytech/zombienet#1360

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #1647 for tracking purposes

Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @alindima

polkadot/node/network/availability-recovery/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/network/availability-recovery/src/task.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix I4-refactor Code needs refactoring. T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants