Local Collator Panic - Connecting to Rococo with Validator RPC #4167

SBalaguer · 2024-04-17T09:10:33Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Description of bug

I'm testing out registering a parachain on Rococo and getting some blocks validated using onDemandAssignmentProvider. The objective is to reproduce what a new builder could potentially do when registering and testing the system for the first time.

In order to achieve this, I'm running a collator leveraging an RPC connection to Rococo instead of running a Rococo validator directly within the collator. I'm doing this by passing the flag --relay-chain-rpc-url "wss://rococo-rpc.polkadot.io" to my collator.

At the beginning everything seems to work fine, and I even manage to produce blocks on demand with my parachain. However, at some point the collator panics. The message looks like this (full logs from alice and bob attached):

2024-04-17 10:52:22.165 ERROR tokio-runtime-worker overseer: [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker sc_service::task_manager: [Relaychain] Essential task `overseer` failed. Shutting down service.
2024-04-17 10:52:22.165 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker parachain::collation-generation: [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2024-04-17 10:52:22.166 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2024-04-17 10:52:22.167 ERROR tokio-runtime-worker polkadot_overseer: [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Error: Service(Other("Essential task failed."))

The parachain (runtime and node) I'm testing is from the OpenZeppelin Generic Template.

More logs 👇

alice-logs-extra-logging.txt
bob-logs-extra-logging.txt

Steps to reproduce

Locally build a parachain using the OpenZeppelin Generic Template.
Register Parachain on Rococo.
Run collators as follows on your local machine (no cloud or fancy devops setup). I ran one for alice and one for bob with the same flags (changing ports), although I noticed the same behavior when running only one (which worked because of the --force-authoring flag).

./target/release/parachain-template-node \
    --bob \
    --collator \
    --force-authoring \
    --chain raw-parachain-chainspec.json \
    --base-path tmp/bob \
    --port 40333 \
    --rpc-port 8845 \
    --relay-chain-rpc-url "wss://rococo-rpc.polkadot.io"

Get a block validated on Rococo by running the extrinsic OnDemandAssignmentProvider.placeOrderAllowDeath.
Wait :)


skunert · 2024-04-18T14:56:31Z

I did a general sanity test today by testing locally with an rpc-based collator and the openzeppelin template. The base case of it producing blocks worked (for several hours). So if this crashes for you maybe something else is wrong.

I tried setting up your exact scenario with OnDemandAssignmentProvider.placeOrderAllowDeath locally by setting up rococo and the broker chain, but I did not get it to work yet. Will come back to this.

skunert · 2024-04-22T09:22:25Z

I was able to try the ondemand scenario.
Every time I ordered a core via OnDemandAssignmentProvider.placeOrderAllowDeath, the included block on chain advanced by 1. No crashes for me.

After checking your logs again I noticed that I built the openzeppelin template using cargo build --release --features async-backing, whereas you built your node without async-backing AFAIS.

In general the situation is not yet optimal when it comes to collating for these on-demand cores. The lookahead collator tries to fill the unincluded segment by building multiple blocks, but we only have one spot, so the node will have a higher block locally than what is included in the relay chain.

@SBalaguer Can I ask you to try two things?

Can you reproduce the old issue while having -lparachain=trace? We want even more logs from all the subsystems.
Can you try to build your node with --features async-backing and see if it still crashes? (This also builds the runtime with changed parameters, so you will need to re-register).

stakeworld · 2024-04-24T06:56:28Z

Hi, a thought, please ignore if irrelevant: I have the feeling there is a problem in the interface between the relay chain and the system chain. Before the overseer crashes, it logs "Skipping importing block as new best block, because there already exists a best block with an higher number", and after that everything seems to go downhill.

2024-04-17 10:52:12.135 DEBUG tokio-runtime-worker cumulus-consensus: [Parachain] Importing block as new best for parachain. hash=0xf7f59624594399d8a4f63328a0ced0b4cea345fc56e2bf8c4fc546589333759b
2024-04-17 10:52:12.135 DEBUG tokio-runtime-worker cumulus-consensus: [Parachain] Skipping importing block as new best block, because there already exists a best block with an higher number best_number=4 block_number=3

On a system chain collator I also saw the overseer error when the relay chain was unavailable for a moment, which made the collator crash with a similar error:
overseer-crash.txt

Curu24 · 2024-04-24T11:45:51Z

Hello,
after discussing in the "system parachain collators" channel on Element, I'm adding to this thread as it is probably connected.

Our collectives collator started crashing yesterday at 6pm UTC without any prior intervention, with the following error:

Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="network-bridge-rx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="chain-api" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="network-bridge-tx" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="availability-recovery" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] subsystem exited with error subsystem="runtime-api" err=Generated(Context("Signal channel is terminated and empty."))
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Parachain] Essential task `aura` failed. Shutting down service.
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Protocol command streams have been shut down
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Essential task `overseer` failed. Shutting down service.
Apr 23 18:21:20 collectives-polkadot polkadot[601753]: Error: Service(Other("Essential task failed."))
Apr 23 18:21:20 collectives-polkadot systemd[1]: polkadot.service: Main process exited, code=exited, status=1/FAILURE

This repeated every few minutes, and some other system chain collators crashed from time to time as well. After approximately 4 hours everything went back to normal. All the collators share the same datacenter and use a remote relay chain.
Here is an example of the systemd service flags:

ExecStart=/usr/local/bin/polkadot \
  --name Polkadotters \
  --base-path '/var/lib/collectives' \
  --telemetry-url 'wss://telemetry.polkadot.io/submit/ 0' \
  --wasm-execution Compiled \
  --collator \
  --chain /var/lib/collectives/chainspec.json \
  --trie-cache-size 0 \
  --relay-chain-rpc-url ws://57.129.1.46:9944 \
  --relay-chain-rpc-url ws://178.18.244.255:9944

As we have other collators in different datacenters that use the same remote relay rpc endpoints, it suggests that there might have been some minor connectivity problems in this particular datacenter which caused remote relays to fail.

skunert · 2024-04-25T07:46:52Z

Thanks for the reports!

Hi, a thought, please ignore if irrelevant; i'm having the feeling there is a problem between the relaychain and systemchain interfacing; before the overseer crashes it gives Skipping importing block as new best block, because there already exists a best block with an higher number and after that everything seems to go downhill

This message is not suspicious at all. It just means that we already know of a higher best block locally than what is included in the relay chain, hence we skip setting a lower best block.
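
As a rough illustration of that check (a sketch inferred from the log message above, not the actual cumulus code; the function and parameter names are made up for this example):

```rust
/// Decides whether a parachain block reported as included by the relay chain
/// should be skipped as the new best block. Hypothetical helper: the parameter
/// names mirror the `best_number` and `block_number` fields of the debug log.
fn skip_new_best(best_number: u32, block_number: u32) -> bool {
    // The node already has a higher best block locally, so the included
    // block must not replace it.
    block_number < best_number
}
```

With the values from the log, `skip_new_best(4, 3)` is `true`: the local best block is #4 while the relay chain has only included #3.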

Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=1 url="wss://ksm-rpc.stakeworld.io:443/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=0 url="ws://localhost:9800/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Unable to reconnect, stopping worker. Unable to find valid external RPC server, shutting down.

It's true that this crash happened because both relay chain nodes that you specified were unavailable. This is expected and we see these proper error messages beforehand. But I think we should improve this, so I opened an issue for tracking: #4278

@Curu24 Your error seems very relevant, can you post more logs? Which collator version are you running?

skunert · 2024-04-25T08:50:27Z

Apr 23 18:21:20 collectives-polkadot polkadot[601753]: 2024-04-23 18:21:20 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("collator-protocol-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))

So until now this has only been reported on collators (and not full nodes). Also, it does not seem to be related to the usage of on-demand cores. I did a quick skim of the collator-protocol subsystem but nothing immediately caught my eye.

@sandreim Since we are seeing this just now, could this be related to the recent changes regarding elastic-scaling?

Curu24 · 2024-04-25T13:16:54Z

Thanks for the reports!

Hi, a thought, please ignore if irrelevant; i'm having the feeling there is a problem between the relaychain and systemchain interfacing; before the overseer crashes it gives Skipping importing block as new best block, because there already exists a best block with an higher number and after that everything seems to go downhill

This message is not suspicious at all. It just means that we already know of a higher best block locally than what is included in the relay chain, hence we skip setting a lower bestblock.

Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=1 url="wss://ksm-rpc.stakeworld.io:443/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Trying to connect to next external relaychain node. index=0 url="ws://localhost:9800/"
Fri 2024-04-19 07:14:33 CEST sw-col-bri.service[3000723]: 2024-04-19 07:14:33 [Parachain] Unable to reconnect, stopping worker. Unable to find valid external RPC server, shutting down.

Its true that this crash happened because both relay chain nodes that you specified were unavailable. This is expected and we see these proper error messages beforehand. But I think we should improve this, I opened a issue for tracking: #4278

@Curu24 Your error seems very relevant, can you post more logs? Which collator version are you running?

Running 1.10.1. Sending a part of the logs when errors started here: https://pastebin.com/raw/K2rEUgMW
Let me know if it is enough.

skunert · 2024-05-02T17:35:58Z

Quick update: Came back to this and was finally able to reproduce the issue. The problem is that collator-protocol is taking a long time for one iteration of its core loop. It looks like prospective-parachains is performing runtime calls via RPC for every relay parent and every parachain, which leads to a ton of calls. They take a while via RPC, so I have seen that one loop in collator-protocol takes over 30s and the node terminates.
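
For a feel of the numbers, here is a back-of-the-envelope sketch. It assumes one runtime call per activated leaf and para id, issued one after another over RPC; the helper name and the figures in the note below are illustrative assumptions, not measurements.

```rust
use std::time::Duration;

/// Estimates the time one loop iteration spends waiting on runtime calls when
/// a call is made for every (activated leaf, para id) pair and each call costs
/// one RPC round trip. Hypothetical helper, not part of polkadot-sdk.
fn estimated_rpc_time(activated_leaves: u32, para_ids: u32, round_trip: Duration) -> Duration {
    // Total number of calls: one per leaf per para (saturating to avoid overflow).
    let calls = activated_leaves.saturating_mul(para_ids);
    // Assumption: the calls are awaited sequentially, so their latencies add up.
    round_trip.saturating_mul(calls)
}
```

For example, 2 leaves, 100 paras and a 150 ms round trip already add up to 30 s, which is the range where the node terminated in my tests.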

skunert · 2024-05-02T18:12:42Z

Specifically this line seems to be the culprit, fetching the backing state for every para id for every activated leaf:

polkadot-sdk/polkadot/node/core/prospective-parachains/src/lib.rs

Line 236 in 6580101

let backing_state = fetch_backing_state(&mut *ctx, hash, para).await?;

@alindima I propose we introduce a parameter on prospective-parachains to specify a para id. Collators could specify their para id there and the subsystem would only maintain itself for that single para instead of all scheduled ones. This would reduce the load immensely. Does that sound reasonable?
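
A minimal sketch of what such a parameter could look like. Everything here is hypothetical (`ParaId` as a plain integer, the `only_para` option, the function name); it only illustrates the filtering idea and is not the real prospective-parachains code.

```rust
/// Para ids are modelled as plain integers in this sketch.
type ParaId = u32;

/// Returns the para ids for which backing state would be fetched on one
/// activated leaf. `only_para` stands in for the proposed parameter.
fn paras_to_track(scheduled: &[ParaId], only_para: Option<ParaId>) -> Vec<ParaId> {
    match only_para {
        // Collator case: track only its own para, and only if it is scheduled.
        Some(own) => scheduled.iter().copied().filter(|p| *p == own).collect(),
        // Validator case: keep tracking every scheduled para.
        None => scheduled.to_vec(),
    }
}
```

With `only_para` set, the number of fetches per leaf drops from the number of scheduled paras to at most one.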

alindima · 2024-05-08T07:10:27Z

I was quite surprised to see the collators having the prospective-parachains subsystem activated. Digging a bit, it's only used in the CollationGeneration subsystem, to fetch the minimum relay parents the collator can build collations on.

It's indeed quite an overkill to run the entire subsystem just for this piece of information.

We can either do as you suggest and add a parameter for the prospective-parachains subsystem or directly use the ChainApi for this information and duplicate this bit of code in the collation-generation.

skunert · 2024-05-08T12:25:37Z

directly use the ChainApi for this information and duplicate this bit of code in the collation-generation

I like this 👍. IMO not running the prospective-parachains subsystem is preferable to adding "collator-specific" logic to it.

Implements #4429. Collators only need to maintain the implicit view for the paraid they are collating on. In this case, bypass prospective-parachains entirely. It's still useful to use the GetMinimumRelayParents message from prospective-parachains for validators, because the data is already present there. This enables us to entirely remove the subsystem from collators, which consumed resources needlessly. Aims to resolve #4167. TODO: - [x] fix unit tests


Dinonard · 2024-07-24T15:41:17Z

@skunert sorry for the tag in a closed issue, but I wanted to avoid opening another for a follow-up question.

We've started experiencing similar issues after upgrading to use polkadot-release-v1.9.0 recently, on nodes that rely on the relay chain RPC interface. The fix you linked is available from polkadot-release-v1.13.0 as far as I can see.

For our particular setup, we have 8 separate Shiden nodes communicating with two Kusama nodes (private, only used by us). Before the upgrade it worked fine, but now the errors above, followed by a crash, happen constantly.

I'd like to ask: was there more discussion about this issue somewhere else?
Maybe a rule of thumb about how many parachain nodes each relay chain node should be able to handle?
(I'm aware this is strongly influenced by HW & networking, but still might be worth asking 🙂 )

skunert · 2024-07-25T12:55:22Z

We've started experiencing similar issues after upgrading to use polkadot-release-v1.9.0 recently, on nodes that rely on
the relay chain RPC interface. The fix you linked is available from polkadot-release-v1.13.0 as far as I can see.

From which version did you upgrade and what is the exact error message?
If it is the one from #4167 (comment), the solution is indeed to upgrade to the fix.

The problem is only loosely related to the relay chain nodes themselves. The problem is a subsystem that was contained in the collator. It was doing a lot of RPC calls which led to a stall in that subsystem. The fix in the linked PRs is to not include that subsystem.

In general your setup sounds reasonable, 4 parachain nodes connecting via RPC to a relay chain node seems fine to me.

Dinonard · 2024-07-25T13:10:27Z

From which version did you upgrade and what is the exact error message? If is the one from #4167 (comment) the solution is indeed to upgrade to the fix.
In general your setup sounds reasonable, 4 parachain nodes connecting via RPC to a relay chain node seems fine to me.

We've upgraded from v1.3.0 to v1.9.0.
The error message is the same as in the comment you pointed out.

It's actually 8 parachain collators/nodes connecting to two relay chain nodes 🙂.
URLs of both Kusama nodes are passed to each Shiden client via --relay-chain-rpc-url args.

The problem is only loosely related to the relay chain nodes themselves. The problem is a subsystem that was contained in the collator. It was doing a lot of RPC calls which led to a stall in that subsystem. The fix in the linked PRs is to not include that subsystem.

Maybe I misunderstood, but from your comment I gathered that the issue was that the subsystem was sending too many RPC calls (same as you repeated now), but also that the relay node couldn't handle the load, i.e. reply in a timely manner.

We also have a testnet parachain which relies on the relay chain RPC, but in this case only a single node uses it and we haven't had any problems there (the client code is exactly the same).

skunert · 2024-07-29T13:24:36Z

It's actually 8 parachain collators/nodes connecting to two relay chain nodes 🙂.

Yes, I was assuming that you provided a different order so that 4 connect to the first relay node and 4 to the other. In case one fails, all 8 would connect to the same.
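
One way to get that split, sketched under the assumption that each node tries the URLs in the order they are passed via --relay-chain-rpc-url: rotate the list by node index. The helper and the URLs in the note below are hypothetical.

```rust
/// Returns the relay chain RPC URLs in the order node `node_index` should pass
/// them, rotating the list so that consecutive nodes start with different
/// endpoints. Hypothetical helper for generating per-node flags.
fn rpc_url_order(urls: &[&str], node_index: usize) -> Vec<String> {
    // Nothing to rotate, and this also avoids a modulo by zero below.
    if urls.is_empty() {
        return Vec::new();
    }
    let shift = node_index % urls.len();
    urls.iter()
        .cycle() // repeat the list endlessly
        .skip(shift) // start at this node's preferred endpoint
        .take(urls.len()) // keep every URL exactly once
        .map(|u| u.to_string())
        .collect()
}
```

With two URLs such as ws://kusama-a:9944 and ws://kusama-b:9944, even-numbered nodes start with the first relay node and odd-numbered nodes with the second, so 8 nodes split 4/4 while both relay nodes are up.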

Maybe I misunderstood, but from your comment I figured out the issue was that the subsystem was sending too many RPC calls (same as you repeated now) but also that the relay couldn't handle the load, i.e. reply in timely manner.

Yes, it's a mix of both. But the main problem was the high number of queries sent. If you have your nodes exclusively running for you, I would expect them to be able to handle the normal load.

We also have a testnet parachain which relies on relay chain RPC but in this case only single node uses it and we haven't had any problems there (the client code is exactly the same).

The subsystem that was running on the collators was sending a lot of requests per parachain registered on the relay chain. This means that for testnets where you only have a couple of chains it's fine. But on Kusama quite a few are registered, so it's sending a lot more requests.

Dinonard · 2024-07-29T14:08:00Z

The subsystem that was running on the collators was sending a lot of requests per parachain registered on the relay chain. This means that for testnets where you only have a couple chains its fine. But on kusama quite a few are registered, so its sending a lot more requests.

I see, missed it before! Thanks for this explanation!

Yes, I was assuming that you provided a different order so that 4 connect to the first relay node and 4 to the other. In case one fails, all 8 would connect to the same.

I wrongly assumed that a random connection is picked out of the two but I've checked the code and see that's not true 🙈.
Thanks again for this!


Mihajlo-Pavlovic · 2024-08-13T10:11:36Z

Hi @Dinonard, we encountered the same issue after updating binaries to v1.9.0. Did you manage to find a fix for it? If you found an alternative solution or workaround, I'd appreciate hearing how you resolved it. Thanks in advance!

SBalaguer added labels I2-bug (The node fails to follow expected behavior.) and I10-unconfirmed (Issue might be valid, but it's not yet known.) Apr 17, 2024

skunert self-assigned this Apr 17, 2024

skunert added this to SDK Node Apr 19, 2024

github-project-automation bot moved this to backlog in SDK Node Apr 19, 2024

skunert moved this from backlog to in progress in SDK Node Apr 19, 2024

This was referenced May 10, 2024

Remove prospective-parachains subsystem from collator nodes #4429

Closed

Remove the prospective-parachains subsystem from collators #4471

Merged

alindima closed this as completed in #4471 May 21, 2024

github-project-automation bot moved this from in progress to done in SDK Node May 21, 2024

skunert mentioned this issue Jun 3, 2024

Cant sync a kusama-people node from scratch with 1.12, working with 1.11, regression? #4614

Closed
