cumulus/client: added external rpc connection retry logic #5515

Merged

Conversation

Contributor

@iulianbarbu iulianbarbu commented Aug 29, 2024

Description

Adds retry logic that makes the RPC relay chain interface more reliable for the case of a collator connecting to external RPC servers.

Closes #5514
Closes #4278

The final solution is still being debated in #5514, so what this PR addresses might change (e.g. #4278 might require a more advanced approach).

Integration

Users that start collators should barely notice a difference, since the retry logic only applies when the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before the collators start, so the issue isn't visible.

Review Notes

The added retry logic retries the connection to the RPC servers (there can be multiple). It lives at the level of the cumulus/client/relay-chain-rpc-interface module, and is most relevant to the RPC client logic (ClientManager). The retry logic is not configurable: it tries to connect to the RPC servers up to 5 times, with an exponential backoff between iterations, starting at a 1 second wait and ending at 16 seconds. The same logic is applied when an existing connection to an RPC server is dropped. There is a ReconnectingWebsocketWorker which ensures there is connectivity to at least one RPC node, and the retry logic strengthens this by insisting on trying the whole RPC server list up to 5 times.
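
Purely as an illustration of the shape described above (not the PR's actual implementation; try_connect, the constants, and the example URLs are placeholders), a bounded retry loop with exponential backoff over a URL list could look roughly like this:

```rust
use std::time::Duration;

// Illustrative constants: the description above mentions 5 attempts and a
// 1s..16s exponential backoff; the exact values here are not the PR's code.
const DEFAULT_EXTERNAL_RPC_CONN_RETRIES: usize = 5;
const INITIAL_BACKOFF: Duration = Duration::from_secs(1);

// Stand-in for the real connection attempt (building a jsonrpsee WebSocket
// client in the actual code); here it always fails so the retry path runs.
async fn try_connect(url: &str) -> Result<String, String> {
    Err(format!("{url} unreachable"))
}

async fn connect_with_retries(urls: &[String], starting_position: usize) -> Option<String> {
    let mut backoff = INITIAL_BACKOFF;
    for iteration in 0..DEFAULT_EXTERNAL_RPC_CONN_RETRIES {
        if iteration > 0 {
            // Reached the end of the URL list on the previous pass: back off
            // (exponentially) before retrying the entire list once more.
            tokio::time::sleep(backoff).await;
            backoff *= 2;
        }
        // One round-robin pass over all URLs, starting from `starting_position`.
        for offset in 0..urls.len() {
            let index = (starting_position + offset) % urls.len();
            let url = &urls[index];
            println!("Trying to connect to next external relaychain node. iteration={iteration} index={index} url={url}");
            if let Ok(conn) = try_connect(url).await {
                return Some(conn);
            }
        }
    }
    None
}

#[tokio::main]
async fn main() {
    let urls = vec!["ws://127.0.0.1:9944/".to_string(), "ws://127.0.0.1:9945/".to_string()];
    if connect_with_retries(&urls, 0).await.is_none() {
        eprintln!("Retrying to connect to any external relaychain node failed.");
    }
}
```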

Testing

• This was tested manually by starting zombienet natively based on 0006-rpc_collator_builds_blocks.toml and observing that the collators no longer fail:

  zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

• Added a unit test that exercises the retry logic for a client connection to a server that comes online after 10 seconds. The retry logic can wait for as long as 30 seconds, but that felt like too much for a unit test. I'm just being conscious of CI time if CI runs this test, and I'm happy to see suggestions around it. I'm also not sure whether it runs in CI; I haven't figured that out entirely yet. The test can be considered an integration test too, but it exercises crate-internal implementation, not the public API. A plain-TCP sketch of the same test shape follows this list.
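
As referenced above, a rough sketch of what such a "server comes online after a delay" test can look like, using plain TCP instead of the crate's real RPC client so it stays self-contained (the port, delays, and names are made up for the example, and this is not the PR's actual test):

```rust
use std::time::Duration;
use tokio::net::{TcpListener, TcpStream};
use tokio::time::{sleep, timeout};

// Keep retrying a TCP connect with a short pause until it succeeds.
async fn connect_until_available(addr: &str) -> TcpStream {
    loop {
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(_) => sleep(Duration::from_secs(1)).await,
        }
    }
}

#[tokio::test]
async fn client_connects_once_server_comes_online() {
    let addr = "127.0.0.1:18484"; // arbitrary, hypothetical test port

    // Server task: only starts listening after 3 seconds, simulating an RPC
    // node that is not yet online when the collator starts.
    let server = tokio::spawn(async move {
        sleep(Duration::from_secs(3)).await;
        let listener = TcpListener::bind(addr).await.expect("bind failed");
        let _ = listener.accept().await;
    });

    // Client side: the retry loop must succeed, with the whole wait capped at
    // 8 seconds so an overloaded CI machine still has some margin.
    let connected = timeout(Duration::from_secs(8), connect_until_available(addr)).await;
    assert!(connected.is_ok(), "client never connected to the late server");

    server.await.expect("server task panicked");
}
```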

Example collator logs after the change:

2024-08-29 14:28:11.730  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"

@iulianbarbu iulianbarbu added R0-silent Changes should not be mentioned in any release notes T9-cumulus This PR/Issue is related to cumulus. labels Aug 29, 2024
@iulianbarbu iulianbarbu self-assigned this Aug 29, 2024
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 6 times, most recently from 51cd9b4 to 3434bb1 on August 29, 2024 13:59
@iulianbarbu iulianbarbu removed the R0-silent Changes should not be mentioned in any release notes label Aug 29, 2024
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 3 times, most recently from 83b2520 to a9fa9cc on August 29, 2024 14:22
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch 3 times, most recently from 4733819 to 7e82332 on August 30, 2024 08:22
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@iulianbarbu iulianbarbu force-pushed the minimal-node-retry-conn-to-external-rpc branch from 7e82332 to 8e18ede on August 30, 2024 09:22
@iulianbarbu iulianbarbu marked this pull request as ready for review August 30, 2024 11:31
// If we reached the end of the urls list, backoff before retrying
// connections to the entire list once more.
let Ok(current_iteration) = (counter / urls.len()).try_into() else {
tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting...");
Contributor

@michalkucharczyk michalkucharczyk Aug 30, 2024

Suggested change
tracing::error!(target: LOG_TARGET, "Too many connection attempts to the RPC servers, aborting...");
tracing::error!(target: LOG_TARGET, "Too many failed connection attempts to the RPC servers, aborting...");

Contributor

hm, will this error be ever printed?

Contributor

I think we need an extra check to print the error if the loop concluded without an actual connection.

Contributor Author

hm, will this error be ever printed?

In practice it should never be printed, but if we ever see the log then something's weird (either the retry logic is iterating too many times or memory got corrupted at runtime).

I think we need extra check to print the error if loop concluded w/o actual connection.

You mean outside of the loop if not concluding with a connection? I added a log here: d47a965. Let me know if that's what you mean.

Contributor

So when talking about too many attempts, we're actually talking about ~2^64 too many? Does it make sense to impose such a limit? I.e., how much time would have to pass to actually hit it?

Contributor

We should never hit this branch since this will loop for a maximum of urls.len() * DEFAULT_EXTERNAL_RPC_CONN_RETRIES times.

// time to catch the RPC server online and connect to it.
let conn_res = tokio::spawn(async move {
tokio::time::timeout(
Duration::from_secs(8),
Contributor

@michalkucharczyk michalkucharczyk Aug 30, 2024

CI tests are run on quite overloaded machines. 1s margin may not be enough, but let's see how it goes.

Contributor Author

Pushed a simpler version of the test here: ee68d04.

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@@ -112,6 +144,8 @@ async fn connect_next_available_rpc_server(
Err(err) => tracing::debug!(target: LOG_TARGET, url, ?err, "Unable to connect."),
};
}

tracing::error!(target: LOG_TARGET, "Retrying to connect to any external relaychain node failed.");
Contributor

nit: should we maybe display the list of nodes?

Contributor Author

That's individually displayed with each iteration for each url. I think we should be fine.

Contributor

Assuming info level is enabled, which may not be the case for defaults. Anyway, it is minor.

Contributor

@skunert skunert left a comment

Nice job!
I think this is really nice to have. We can run into situations where some subsystems try to make calls and wait for a longer time due to retries. I am not sure all places handle that super gracefully, but it still sounds a lot better than crashing.

let index = (starting_position + counter) % urls.len();
tracing::info!(
target: LOG_TARGET,
current_iteration,
Contributor

Nit: From the user's perspective we should probably just print the attempt here and summarize index and iteration.
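
Purely to illustrate the nit (values and layout are made up, and this is not the PR's final code), iteration and index can be folded into a single 1-based attempt counter that is friendlier to read in logs:

```rust
fn main() {
    let urls_len = 3; // e.g. three external RPC URLs
    let retries = 5; // passes over the list
    // Flat counter over all passes, as in the reviewed loop
    // (starting_position is omitted here for brevity).
    for counter in 0..(urls_len * retries) {
        let current_iteration = counter / urls_len;
        let index = counter % urls_len;
        // Single user-facing counter instead of (iteration, index):
        let attempt = counter + 1;
        println!(
            "attempt {attempt}/{} (was iteration={current_iteration}, index={index})",
            urls_len * retries
        );
    }
}
```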

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Contributor

@lexnv lexnv left a comment

LGTM! Nice job here! This should make things better 👍

@github-actions github-actions bot requested a review from lexnv September 2, 2024 16:04
@paritytech-cicd-pr

The CI pipeline was cancelled due to a failure in one of the required jobs.
Job name: test-linux-stable 2/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7221535

Contributor

@skunert skunert left a comment

Once we have two reviews we need to merge fast; every time you merge master it re-requests our review.

@iulianbarbu
Contributor Author

Once we have two reviews we need to merge fast; every time you merge master it re-requests our review.

Yes, sorry about that. I'm trying to figure out how to fix it. I requested to join a few GH teams; hopefully once that's done, the next PRs will not have this issue.


@skunert skunert added this pull request to the merge queue Sep 3, 2024
Merged via the queue into paritytech:master with commit 4d2f793 Sep 3, 2024
185 of 187 checks passed
@iulianbarbu iulianbarbu deleted the minimal-node-retry-conn-to-external-rpc branch September 3, 2024 19:26
x3c41a pushed a commit that referenced this pull request Sep 4, 2024
Labels
T9-cumulus This PR/Issue is related to cumulus.
7 participants