
Local Collator Panic - Connecting to Rococo with Validator RPC #4167

Closed · 2 tasks done
SBalaguer opened this issue Apr 17, 2024 · 17 comments · Fixed by #4471
Labels: I2-bug The node fails to follow expected behavior. · I10-unconfirmed Issue might be valid, but it's not yet known.

@SBalaguer
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

I'm testing out registering a Parachain on Rococo and getting some blocks validated using onDemandAssignmentProvider. The objective is to reproduce what a new builder could potentially do when registering and testing the system for the first time.

In order to achieve this, I'm running a collator leveraging an RPC connection to Rococo instead of running a Rococo validator directly within the collator. I'm doing this by passing the flag --relay-chain-rpc-url "wss://rococo-rpc.polkadot.io" to my collator.

At the beginning everything seems to work fine, and I even manage to produce blocks on demand with my parachain. However, at some point the collator panics. The message looks like this (full logs from alice and bob attached):

2024-04-17 10:52:22.165 ERROR tokio-runtime-worker overseer: [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker sc_service::task_manager: [Relaychain] Essential task `overseer` failed. Shutting down service.
2024-04-17 10:52:22.165 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker parachain::collation-generation: [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-04-17 10:52:22.167 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Error: Service(Other("Essential task failed."))

The parachain (runtime and node) I'm testing is from the OpenZeppelin Generic Template.

More logs 👇

alice-logs-extra-logging.txt
bob-logs-extra-logging.txt

Steps to reproduce

  1. Locally build a parachain using the OpenZeppelin Generic Template.
  2. Register Parachain on Rococo.
  3. Run collators as follows on your local machine (no cloud or fancy devops setup). I ran one for alice and one for bob with the same flags (changing ports), although I noticed the same behavior when running only one (it worked because of the force-authoring flag).
./target/release/parachain-template-node \
    --bob \
    --collator \
    --force-authoring \
    --chain raw-parachain-chainspec.json \
    --base-path tmp/bob \
    --port 40333 \
    --rpc-port 8845 \
    --relay-chain-rpc-url "wss://rococo-rpc.polkadot.io"
  4. Get a block validated on Rococo by running the extrinsic OnDemandAssignmentProvider.placeOrderAllowDeath (see the polkadot.js sketch after this list).
  5. Wait :)
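For step 4, a minimal polkadot.js sketch could look like the following. This is only a sketch under assumptions: the para id, the account seed, and the max amount are placeholders, and the para must already be registered as an on-demand para with a funded account available to sign the order.

import { ApiPromise, WsProvider, Keyring } from '@polkadot/api';

const PARA_ID = 4435;                  // placeholder: your registered para id
const MAX_AMOUNT = 1_000_000_000_000;  // placeholder: max price (in plancks) you are willing to pay for the core

async function main() {
  const api = await ApiPromise.create({ provider: new WsProvider('wss://rococo-rpc.polkadot.io') });
  const signer = new Keyring({ type: 'sr25519' }).addFromUri('//YourFundedAccount'); // placeholder seed

  // Place a single on-demand order; the collator should then author one block that gets included.
  const unsub = await api.tx.onDemandAssignmentProvider
    .placeOrderAllowDeath(MAX_AMOUNT, PARA_ID)
    .signAndSend(signer, ({ status }) => {
      if (status.isInBlock) {
        console.log(`order included at ${status.asInBlock.toHex()}`);
        unsub();
      }
    });
}

main().catch(console.error);

Each successful order should result in roughly one parachain block being backed and included on the relay chain.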
@SBalaguer SBalaguer added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Apr 17, 2024
@skunert skunert self-assigned this Apr 17, 2024
@skunert
Contributor

skunert commented Apr 18, 2024

I did a general sanity test today, running an RPC-based collator locally with the OpenZeppelin template. The base case of producing blocks worked (for several hours), so if this crashes for you, something else may be wrong.

I tried setting up your exact scenario with OnDemandAssignmentProvider.placeOrderAllowDeath locally by spinning up Rococo and the broker chain, but I did not get it to work yet. Will come back to this.

@skunert skunert added this to SDK Node Apr 19, 2024
@github-project-automation github-project-automation bot moved this to backlog in SDK Node Apr 19, 2024
@skunert skunert moved this from backlog to in progress in SDK Node Apr 19, 2024
@skunert
Copy link
Contributor

skunert commented Apr 22, 2024

I was able to try the on-demand scenario.
Every time I ordered a core via OnDemandAssignmentProvider.placeOrderAllowDeath, the included block on chain advanced by 1. No crashes for me.

After checking your logs again I noticed that I built the OpenZeppelin template using cargo build --release --features async-backing, whereas you built your node without async-backing, AFAIS.

In general the situation is not yet optimal when it comes to collating for these on-demand cores. The lookahead collator tries to fill the unincluded segment by building multiple blocks, but we only have one spot, so the node will have a higher block locally than what is included in the relay chain.
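For anyone who wants to observe this gap, a minimal polkadot.js sketch along these lines could compare the local best block with the head included on the relay chain. The endpoints and para id below are placeholders, and the decode assumes the para is registered and uses the standard parachain header format:

import { ApiPromise, WsProvider } from '@polkadot/api';

const PARA_ID = 4435; // placeholder: your para id

async function main() {
  const relay = await ApiPromise.create({ provider: new WsProvider('wss://rococo-rpc.polkadot.io') });
  const para = await ApiPromise.create({ provider: new WsProvider('ws://127.0.0.1:8845') });

  // Head of the parachain as last included on the relay chain (SCALE-encoded parachain header).
  const headData = (await relay.query.paras.heads(PARA_ID)).unwrap();
  const includedHeader = relay.createType('Header', headData.toHex());

  // Best block the collator knows about locally.
  const localHeader = await para.rpc.chain.getHeader();

  console.log(`included on relay: #${includedHeader.number}, local best: #${localHeader.number}`);
  await relay.disconnect();
  await para.disconnect();
}

main().catch(console.error);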

@SBalaguer Can I ask you to try two things?

  • Can you reproduce the issue while running with -lparachain=trace? We want even more logs from all the subsystems.
  • Can you try to build your node with --features async-backing and see if it still crashes? (This also builds the runtime with changed parameters, so you will need to re-register).

@stakeworld
Contributor

stakeworld commented Apr 24, 2024

Hi, a thought, please ignore if irrelevant: I have the feeling there is a problem in the interfacing between the relay chain and the system chain. Before the overseer crashes it logs Skipping importing block as new best block, because there already exists a best block with an higher number, and after that everything seems to go downhill:

2024-04-17 10:52:12.135 DEBUG tokio-runtime-worker cumulus-consensus: [Parachain] Importing block as new best for parachain. hash=0xf7f59624594399d8a4f63328a0ced0b4cea345fc56e2bf8c4fc546589333759b
2024-04-17 10:52:12.135 DEBUG tokio-runtime-worker cumulus-consensus: [Parachain] Skipping importing block as new best block, because there already exists a best block with an higher number best_number=4 block_number=3

On a system chain collator I also saw the overseer error and crash when the relay chain was not available for a moment, which made the collator fail with a similar error:
overseer-crash.txt

@Curu24

Curu24 commented Apr 24, 2024

Hello,
after discussing in the "system parachain collators" channel on Element, I'm adding to this thread as it is probably connected.

Our collectives collator started crashing yesterday at 6pm UTC, without any prior intervention, with the following error:

Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="chain-api" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Parachain] Essential task `aura` failed. Shutting down service.
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Protocol command streams have been shut down
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Essential task `overseer` failed. Shutting down service.
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: Error: Service(Other("Essential task failed."))
Apr 23 18:21:20 collectives-polkadot systemd[1]: polkadot.service: Main process exited, code=exited, status=1/FAILURE

This repeated every few minutes, and some other system chain collators crashed from time to time as well. After approximately 4 hours everything went back to normal. All the collators share the same datacenter and use a remote relay chain.
Here is an example of the systemd service flags:

ExecStart=/usr/local/bin/polkadot \
  --name Polkadotters \
  --base-path '/var/lib/collectives' \
  --telemetry-url 'wss://telemetry.polkadot.io/submit/ 0' \
  --wasm-execution Compiled \
  --collator \
  --chain /var/lib/collectives/chainspec.json \
  --trie-cache-size 0 \
  --relay-chain-rpc-url ws://57.129.1.46:9944 \
  --relay-chain-rpc-url ws://178.18.244.255:9944 \

As we have other collators in different datacenters that use the same remote relay RPC endpoints, this suggests there might have been some minor connectivity problems in this particular datacenter, which caused the remote relay connections to fail.

@skunert
Contributor

skunert commented Apr 25, 2024

Thanks for the reports!

Hi, a thought, please ignore if irrelevant: I have the feeling there is a problem in the interfacing between the relay chain and the system chain. Before the overseer crashes it logs Skipping importing block as new best block, because there already exists a best block with an higher number, and after that everything seems to go downhill

This message is not suspicious at all. It just means that we already know of a higher best block locally than what is included in the relay chain, hence we skip setting a lower best block.

Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=1 url="wss://ksm-rpc.stakeworld.io:443/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=0 url="ws://localhost:9800/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Unable to reconnect, stopping worker. Unable to find valid external RPC server, shutting down.

It's true that this crash happened because both relay chain nodes you specified were unavailable. This is expected, and we see proper error messages beforehand. But I think we should improve this; I opened an issue for tracking: #4278

@Curu24 Your error seems very relevant, can you post more logs? Which collator version are you running?

@skunert
Contributor

skunert commented Apr 25, 2024

Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))

So far this has only been reported on collators (and not full nodes). Also, it does not seem to be related to the usage of on-demand cores. I did a quick skim of the collator-protocol subsystem, but nothing immediately struck the eye.

@sandreim Since we are seeing this just now, could this be related to the recent changes regarding elastic-scaling?

@Curu24

Curu24 commented Apr 25, 2024

@Curu24 Your error seems very relevant, can you post more logs? Which collator version are you running?

Running 1.10.1. Here is part of the logs from when the errors started: https://pastebin.com/raw/K2rEUgMW
Let me know if that is enough.

@skunert
Contributor

skunert commented May 2, 2024

Quick update: I came back to this and was finally able to reproduce the issue. The problem is that collator-protocol is taking a long time for one iteration of its core loop. It looks like prospective-parachains is performing runtime calls via RPC for every relay parent and every parachain, which leads to a ton of calls. They take a while via RPC, so I have seen one iteration of the collator-protocol loop take over 30s, at which point the node terminates.
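A rough back-of-the-envelope sketch of why this adds up to a stall; the leaf count, para count, and RPC latency below are assumptions for illustration, not measurements:

// One backing-state runtime call is made per (active leaf, para) pair.
const activeLeaves = 3;      // relay chain forks the collator currently tracks (assumed)
const scheduledParas = 50;   // paras the subsystem iterates over (assumed)
const rpcRoundTripMs = 250;  // latency of one runtime API call over a remote RPC connection (assumed)

const perLoopMs = activeLeaves * scheduledParas * rpcRoundTripMs;
console.log(`${perLoopMs / 1000}s per collator-protocol loop iteration`); // 37.5s with these numbers

With an in-process relay chain node these calls are cheap; over a remote RPC connection the round-trips dominate and one loop iteration can easily exceed the stall threshold.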

@skunert
Contributor

skunert commented May 2, 2024

Specifically this line seems to be the culprit, fetching the backing state for every para id for every activated leaf:

let backing_state = fetch_backing_state(&mut *ctx, hash, para).await?;

@alindima I propose we introduce a parameter on prospective-parachains to specify a para id. Collators could specify their para id there, and the subsystem would only maintain itself for that single para instead of all scheduled ones. This would reduce the load immensely. Does that sound reasonable?

@alindima
Contributor

alindima commented May 8, 2024

I was quite surprised to see the collators having the prospective-parachains subsystem activated. Digging a bit, I found it's only used by the CollationGeneration subsystem, to fetch the minimum relay parents the collator can build collations on.

It's indeed overkill to run the entire subsystem just for this piece of information.

We can either do as you suggest and add a parameter to the prospective-parachains subsystem, or directly use the ChainApi for this information and duplicate this bit of code in collation-generation.

@skunert
Contributor

skunert commented May 8, 2024

directly use the ChainApi for this information and duplicate this bit of code in the collation-generation

I like this 👍. IMO not running the prospective-parachains subsystem is preferable to adding "collator-specific" logic to it.

github-merge-queue bot pushed a commit that referenced this issue May 21, 2024
Implements #4429

Collators only need to maintain the implicit view for the paraid they
are collating on.
In this case, bypass prospective-parachains entirely. It's still useful
to use the GetMinimumRelayParents message from prospective-parachains
for validators, because the data is already present there.

This enables us to entirely remove the subsystem from collators, which
consumed resources needlessly

Aims to resolve #4167 

TODO:
- [x] fix unit tests
@github-project-automation github-project-automation bot moved this from in progress to done in SDK Node May 21, 2024
Ank4n pushed a commit that referenced this issue May 21, 2024
hitchhooker pushed a commit to ibp-network/polkadot-sdk that referenced this issue Jun 5, 2024
@Dinonard
Contributor

@skunert sorry for the tag in a closed issue, but I wanted to avoid opening another for a follow-up question.

We've started experiencing similar issues after upgrading to use polkadot-release-v1.9.0 recently, on nodes that rely on the relay chain RPC interface. The fix you linked is available from polkadot-release-v1.13.0 as far as I can see.

For our particular setup, we have 8 separate Shiden nodes communicating with two Kusama nodes (private, only used by us). Before the upgrade it worked fine, but now the errors above, followed by a crash, happen constantly.

I'd like to ask: was there more discussion about this issue somewhere else?
Maybe there is a rule of thumb for how many parachain nodes each relay chain node should be able to handle?
(I'm aware this is strongly influenced by HW & networking, but it still might be worth asking 🙂)

@skunert
Contributor

skunert commented Jul 25, 2024

We've started experiencing similar issues after upgrading to use polkadot-release-v1.9.0 recently, on nodes that rely on the relay chain RPC interface. The fix you linked is available from polkadot-release-v1.13.0 as far as I can see.

From which version did you upgrade and what is the exact error message?
If it is the one from #4167 (comment), the solution is indeed to upgrade to the fix.

The problem is only loosely related to the relay chain nodes themselves; it lies in a subsystem that was included in the collator. That subsystem was doing a lot of RPC calls, which led to a stall. The fix in the linked PRs is to not include that subsystem.

In general your setup sounds reasonable, 4 parachain nodes connecting via RPC to a relay chain node seems fine to me.

@Dinonard
Contributor

From which version did you upgrade and what is the exact error message? If it is the one from #4167 (comment), the solution is indeed to upgrade to the fix.
In general your setup sounds reasonable, 4 parachain nodes connecting via RPC to a relay chain node seems fine to me.

We've upgraded from v1.3.0 to v1.9.0.
The error message is the same as in the comment you pointed out.

It's actually 8 parachain collators/nodes connecting to two relay chain nodes 🙂.
URLs of both Kusama nodes are passed to each Shiden client via --relay-chain-rpc-url args.

The problem is only loosely related to the relay chain nodes themselves; it lies in a subsystem that was included in the collator. That subsystem was doing a lot of RPC calls, which led to a stall. The fix in the linked PRs is to not include that subsystem.

Maybe I misunderstood, but from your comment I figured the issue was that the subsystem was sending too many RPC calls (as you repeated now), but also that the relay couldn't handle the load, i.e. reply in a timely manner.

We also have a testnet parachain which relies on relay chain RPC, but in that case only a single node uses it and we haven't had any problems there (the client code is exactly the same).

@skunert
Contributor

skunert commented Jul 29, 2024

It's actually 8 parachain collators/nodes connecting to two relay chain nodes 🙂.

Yes, I was assuming that you provided a different order so that 4 connect to the first relay node and 4 to the other. In case one fails, all 8 would connect to the same one.

Maybe I misunderstood, but from your comment I figured the issue was that the subsystem was sending too many RPC calls (as you repeated now), but also that the relay couldn't handle the load, i.e. reply in a timely manner.

Yes, it's a mix of both. But the main problem was the high number of queries sent. If your nodes are running exclusively for you, I would expect them to be able to handle the normal load.

We also have a testnet parachain which relies on relay chain RPC, but in that case only a single node uses it and we haven't had any problems there (the client code is exactly the same).

The subsystem that was running on the collators was sending a lot of requests per parachain registered on the relay chain. This means that for testnets where you only have a couple of chains it's fine. But on Kusama quite a few are registered, so it sends a lot more requests.
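To get a feel for that multiplier, one could count the registered paras directly. A tiny polkadot.js sketch, assuming the public Kusama endpoint, and only a rough proxy for the set of paras the subsystem was iterating over:

import { ApiPromise, WsProvider } from '@polkadot/api';

async function main() {
  const api = await ApiPromise.create({ provider: new WsProvider('wss://kusama-rpc.polkadot.io') });
  // Every registered para has an entry in paras.paraLifecycles; more paras means more per-leaf calls.
  const lifecycles = await api.query.paras.paraLifecycles.entries();
  console.log(`${lifecycles.length} paras registered on Kusama`);
  await api.disconnect();
}

main().catch(console.error);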

@Dinonard
Contributor

The subsystem that was running on the collators was sending a lot of requests per parachain registered on the relay chain. This means that for testnets where you only have a couple of chains it's fine. But on Kusama quite a few are registered, so it sends a lot more requests.

I see, missed it before! Thanks for this explanation!

Yes, I was assuming that you provided a different order so that 4 connect to the first relay node and 4 to the other. In case one fails, all 8 would connect to the same one.

I wrongly assumed that a random connection is picked out of the two, but I've checked the code and see that's not true 🙈.
Thanks again for this!

TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
@Mihajlo-Pavlovic

Hi @Dinonard, we encountered the same issue when updating binaries to v1.9.0. Did you manage to find a fix for it? If you found an alternative solution or workaround, I'd appreciate hearing how you resolved it. Thanks in advance!
