
Kusama nodes stop sync, p2p instability. #6696

Closed
rvalle opened this issue Feb 10, 2023 · 75 comments

Comments

@rvalle

rvalle commented Feb 10, 2023

Hi!

We run archive nodes for Kusama and Polkadot, which we use to generate Polkawatch.app data. Our usage pattern is similar to that of an "indexer".

Normally we run our nodes with parameters that minimize resource usage: in-peers 4, out-peers 2, max parallel downloads 1. This setup has worked well for us for over a year.
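For reference, that low-resource profile corresponds roughly to the following flags (a minimal sketch; the chain and pruning options shown are just our archive setup, and the full commands are further down in this thread):

    polkadot --chain kusama --pruning archive --in-peers 4 --out-peers 2 --max-parallel-downloads 1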

Last week, running (Docker) version v0.9.33, our Kusama node stopped syncing altogether. After updating to v0.9.37 the issue persisted.

After discussing with Paranodes we got it to sync again by raising the p2p parameters to in-peers 20, out-peers 20, max parallel downloads 10.

Note that this represents a fivefold increase in resources for the node, yet the chain seems to sync much more slowly, often falling a few blocks behind what we observe on Polkadot.js.

Our Polkadot node is still running on (Docker) v0.9.25 and syncs like a Swiss clock.

The most tangible metric that we can share to show the change of behavior is the following chart:

Polkadot p2p issue

In yellow, a Polkadot archive node: peers (2+4) are stable at 6-7 all the time. Note that for the entire week peers never dropped below 6.
In green, a Kusama node: last week, with the same settings as the Polkadot node, it stopped syncing. After updating the settings it syncs, yet the number of connected peers is very volatile and sync is slower than Polkadot despite the fivefold increase in resources.

If I had to guess, I would say something is up at the p2p level, like connections being terminated and the node constantly replacing peers, or something similar.

@altonen
Contributor

altonen commented Feb 10, 2023

This may be related to paritytech/polkadot-sdk#528, where syncing basically grinds to a halt (although in that case it's warp sync). It's currently under investigation.

@rvalle
Author

rvalle commented Feb 10, 2023

@altonen good to see it is already identified. I am happy to test a fix, when available, and check whether the behavior reverts to normal.

@karen-cb

Also seeing the same issue (for Kusama RPC nodes) after upgrading to v0.9.37. The node starts out with >10 peers, then loses peers over the next 30 minutes and eventually ends up with 0-2 peers.

Running the node with the --validator flag results in ~40 peers; I am interested in how peering in validator mode differs from RPC nodes.

@karen-cb

Oh actually, perhaps it's because of this (excerpt from the CLI help output):

      --listen-addr <LISTEN_ADDR>...
          Listen on this multiaddress.

          By default: If `--validator` is passed: `/ip4/0.0.0.0/tcp/<port>` and `/ip6/[::]/tcp/<port>`. Otherwise: `/ip4/0.0.0.0/tcp/<port>/ws` and `/ip6/[::]/tcp/<port>/ws`.

From the p2p debug logs, it seems like my RPC node refuses to connect to wss addresses:

2023-02-14 22:19:23.026 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Reach failure for Some(PeerId("12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W")) through "/dns/kusama-connect-0.parity.io/tcp/443/wss/p2p/12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W": Other(Custom { kind: Other, error: Other(A(A(A(A(MultiaddrNotSupported("/ip4/34.89.208.238/tcp/443/wss/p2p/12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W")))))) })    
2023-02-14 22:19:23.026 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Dial failure for PeerId("12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W")    
2023-02-14 22:19:23.026 TRACE tokio-runtime-worker sub-libp2p: PSM <= Dropped(12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W, SetId(0))    
2023-02-14 22:19:23.026 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W"): Failed to negotiate transport protocol(s): [(/dns/kusama-connect-0.parity.io/tcp/443/wss/p2p/12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W: : Unsupported resolved address: /ip4/34.89.208.238/tcp/443/wss/p2p/12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W: Unsupported resolved address: /ip4/34.89.208.238/tcp/443/wss/p2p/12D3KooWBjxpFhVNM9poSsMEfdnXJaSWSZQ7otK9aV1SPA9zJp5W)]    
2023-02-14 22:19:23.026 TRACE tokio-runtime-worker sub-libp2p: PSM => Connect(12D3KooWMwX9J1zHdQXYhr7w7taZmfpTpcBrNDfkBjcJcnPPL5Am, SetId(0)): Will start to connect at until Instant { tv_sec: 10155590, tv_nsec: 584340114 }

The fix paritytech/substrate#13152 is merged into master, but is not included in polkadot-v0.9.37.

@altonen
Contributor

altonen commented Feb 17, 2023

@rvalle can you provide logs for both the yellow (stable peer count) and green (unstable peer count) nodes with -lsync,sub-libp2p=trace, along with the full commands you use to launch both nodes?

@rvalle
Author

rvalle commented Mar 22, 2023

Updated to 0.9.39 and the problem persists.

@altonen I am now capturing those logs....

RPC Node FT6 - Old Release, Stable P2P with Few Connections

            "Cmd": [
                "--name",
                "privazio-substrate-node",
                "--chain",
                "polkadot",
                "--pruning",
                "archive",
                "--rpc-cors",
                "all",
                "--rpc-external",
                "--ws-external",
                "--prometheus-external",
                "--out-peers",
                "2",
                "--in-peers",
                "4",
                "--max-parallel-downloads",
                "1",
                "-lsync,sub-libp2p=trace"
            ],
            "Image": "repo.vpc.vtk.lan/parity/polkadot:v0.9.25",

RPC Node FT8 - Recent Release, Unstable P2P, Requires Many Connections

            "Cmd": [
                "--name",
                "privazio-kusama-archive-node",
                "--chain",
                "kusama",
                "--pruning",
                "archive",
                "--rpc-cors",
                "all",
                "--rpc-external",
                "--ws-external",
                "--prometheus-external",
                "--out-peers",
                "20",
                "--in-peers",
                "20",
                "--max-parallel-downloads",
                "10",
                "-lsync,sub-libp2p=trace"
            ],
            "Image": "repo.vpc.vtk.lan/parity/polkadot:v0.9.39",

And here is a sample, 10K log lines for each node:

p2p-issue-logs.zip

@altonen
Contributor

altonen commented Mar 23, 2023

I went through the logs but couldn't find anything that would explain why the peers disconnect; we definitely need to improve our logging for cases like this. There was something weird in the logs where a substream would be opened and then closed almost immediately, without any apparent reason why. I need to see if I can reproduce that behavior locally.

I did try syncing with --in-peers 4 --out-peers 2 --max-parallel-downloads 1 (latest master) and the node seems to be syncing fine, with a stable peer count of 6 for hours. Is there a particular point in syncing when the peers disconnected, or how did you observe syncing stopping?

@rvalle
Author

rvalle commented Mar 23, 2023

@altonen if it is something related to my environment, then I would say it is firewall related.

Is it possible that a new port has been introduced that needs to be allowed out?

I am going to try the same test you ran and see what I get.

@rvalle
Author

rvalle commented Mar 23, 2023

It is very strange: the behavior persists, and the old node works like a clock:
Screenshot 2023-03-23 at 17-33-20 Edit panel - Polkadot Node Monitoring - Dashboards - Grafana

In yellow, the old release; in green, the recent one.

@rvalle
Author

rvalle commented Mar 23, 2023

Mine are archive nodes, but that probably does not matter.
Network-wise we are in an air-gapped environment, so outbound ports need to be explicitly opened, and HTTP goes through a proxy.
That is probably the only difference between my environment and yours. I wonder if something was introduced in that sense.

@rvalle
Author

rvalle commented Mar 23, 2023

Maybe I should try to dissect this problem and find out exactly in which version it appears.

@altonen
Contributor

altonen commented Mar 23, 2023

Maybe I should try to dissect this problem and find out exactly in which version it appears.

If you can do that, it would be very helpful. I cannot get v0.9.25 to compile, so I'm out of luck in that sense. Anything you can share would be helpful.

That is probably the only difference between my environment and yours. I wonder if something was introduced in that sense.

This could possibly explain it; there have also been a few libp2p releases which might have changed things in that regard. But I'm not sure what has changed that makes the peer count so volatile.

@nicolasochem

I see a similar problem with 0.9.40. I originally reported the issue (with logs) in paritytech/substrate#6771

The node has peers but does not receive new blocks. This is with 4 vCPUs and 16 GiB RAM, but I have 2 validators running per VM.

│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:17:42 💤 Idle (26 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 524.8kiB/s ⬆ 512.6kiB/s                                                                              │
│ 2023-03-23 21:17:47 💤 Idle (22 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 451.2kiB/s ⬆ 448.1kiB/s                                                                              │
│ 2023-03-23 21:17:47 ❌ Error while dialing /dns/telemetry-backend.w3f.community/tcp/443/x-parity-wss/%2Fsubmit: Custom { kind: Other, error: Timeout }                                                           │
│ 2023-03-23 21:17:52 💤 Idle (22 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 472.4kiB/s ⬆ 459.4kiB/s                                                                              │
│ 2023-03-23 21:17:57 💤 Idle (22 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 448.7kiB/s ⬆ 445.0kiB/s                                                                              │
│ 2023-03-23 21:18:02 💤 Idle (22 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 396.7kiB/s ⬆ 392.3kiB/s                                                                              │
│ 2023-03-23 21:18:07 💤 Idle (21 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 409.0kiB/s ⬆ 409.6kiB/s                                                                              │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:11 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:12 💤 Idle (22 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 436.6kiB/s ⬆ 438.9kiB/s                                                                              │
│ 2023-03-23 21:18:17 💤 Idle (20 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 404.1kiB/s ⬆ 407.2kiB/s                                                                              │
│ 2023-03-23 21:18:22 💤 Idle (20 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 433.7kiB/s ⬆ 433.9kiB/s                                                                              │
│ 2023-03-23 21:18:27 💤 Idle (20 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 456.0kiB/s ⬆ 456.7kiB/s                                                                              │
│ 2023-03-23 21:18:27 ❌ Error while dialing /dns/telemetry-backend.w3f.community/tcp/443/x-parity-wss/%2Fsubmit: Custom { kind: Other, error: Timeout }                                                           │
│ 2023-03-23 21:18:32 💤 Idle (20 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 450.9kiB/s ⬆ 445.0kiB/s                                                                              │
│ 2023-03-23 21:18:37 💤 Idle (21 peers), best: #14775381 (0x803f…c677), finalized #14775308 (0xb05b…134b), ⬇ 465.9kiB/s ⬆ 455.9kiB/s                                                                              │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                                                               │
│ 2023-03-23 21:18:41 Accepting new connection 1/100                                                                                                                               

@drskalman
Contributor

I've been having the same problem since a few versions ago (maybe 0.9.36), and upgrading to 0.9.40 worsened the issue. Generally it appears when a new version is released, and as more nodes install the new version it goes away. Nodes which are not in Europe get stuck more often. Increasing the number of peers seems to improve the situation.

@rvalle
Author

rvalle commented Mar 24, 2023

@drskalman what is your setup? Are you running a cluster?

@rvalle
Author

rvalle commented Mar 24, 2023

@altonen: I have tracked down the problem, and it was due to the cluster networking / NAT configuration.

Both reported nodes were resolving the same external address: public_ip/tcp/30333/NODE_ID

I reviewed the setup and ensured that:

  • each node is running on a different port
  • each node resolves a different external address (public_ip/tcp/node_p2p_port)
  • egress port range covers typical port usage as seen in traces: 30000-40000 tcp
  • an explicit port forward rule exists for each node to its p2p port

As a result, p2p works fine in all the involved node versions, with a low connection count, and the 6 peers are reached almost instantly.
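To make that concrete, a rough sketch of the corrected per-node layout (image tag, ports and the <public_ip> placeholder are illustrative; we rely on the auto-detected external address rather than passing --public-addr explicitly):

    # one distinct p2p port per node, published 1:1 through Docker and the NAT
    docker run -d -p 30333:30333 parity/polkadot:v0.9.39 --chain polkadot --pruning archive --port 30333
    docker run -d -p 30334:30334 parity/polkadot:v0.9.39 --chain kusama --pruning archive --port 30334
    # optionally, pin the advertised address instead of relying on detection:
    #   --public-addr /ip4/<public_ip>/tcp/30333 (node 1), /ip4/<public_ip>/tcp/30334 (node 2)
    # plus one explicit NAT port-forward rule per node: public_ip:30333 -> node 1, public_ip:30334 -> node 2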

The only thing I can't get my head around is how we managed to sync 2 archive nodes with the previous NAT setup (!!)

I guess this is the reason why I did not look into networking in depth before. Sorry for that.

@rvalle rvalle closed this as completed Mar 24, 2023
@altonen
Contributor

altonen commented Mar 24, 2023

@rvalle it's nice to hear that the issue is now resolved. I will be looking more into this NAT issue soon but there are other problems right now that have higher priority. Something has changed either in Substrate or in libp2p that has caused the problem but after a brief look into the commit history of sc-network, I can't see any obvious culprit.

@bkchr
Member

bkchr commented Mar 27, 2023

@rvalle could you share your previous configuration with us, as detailed as possible? We would like to reproduce the issue to debug it. Anything you have for us, a Terraform script or whatever from the previous "broken" configuration, would be a help! Ty!

@nicolasochem

@bkchr you're not asking me, but I am seeing the same issue with polkadot-k8s when running several polkadot nodes in the same VM, while using a network load balancer provisioned by the cloud provider and explicitly passing the public IP address to the chart (which configures it in the node with --public-addr). Try replicating with this.

@bkchr
Member

bkchr commented Mar 27, 2023

@nicolasochem could you give some more detailed instructions on how to do this?

@rvalle
Author

rvalle commented Mar 28, 2023

actually @bkchr, I am still having issues...

In the previous configuration I had 3 archive nodes (3xDOT 1xKSM) behind NAT, and I had only explicitly redirected port 30333 to the first node.

We use docker for deployment of nodes and ansible to automate the configuration and deployment.

With this configuration, all nodes managed to sync the full archive, which seems surprising. I do remember, though, that the first one was way faster than the other two on the initial sync.

I made the changes mentioned above, which involved using ports 30333-30335 (a different one per node) and setting independent port forward rules. I take care to check the detected public external address and make sure there are no collisions (no node gets the same external address as another).

During the weekend one of the Polkadot nodes got bricked with:

2023-03-27 11:02:44 💔 Verification failed for block 0x15dc5ac1e227dedb134812fab0c14f4b8ba65891546fb9fd9d74ab3d8af030ce received from peer: 12D3KooWHrpHpjsVKRaRGTcPPJQPgWgoJADMnVmkqVPe3UuNsEjb, "Runtime code error: ':code' hash not found"

This is surprising because the other Polkadot node with the exact same configuration did not have the problem; I used its DB to rsync over and un-brick the broken one.

Now I am looking into why our nodes keep falling behind in syncing; our alert robot is very busy. This is from yesterday:

Screenshot from 2023-03-28 09-54-52

Now I am ensuring that all nodes are deployed with specific Keys/PeerIDs from Ansible, and looking into more potential network issues.

It is difficult to diagnose anything when so many things are at play. However, during January and February our nodes ran without a single issue despite the poor NAT configuration.

@altonen
Contributor

altonen commented Mar 28, 2023

@rvalle are you able to get the nodes to sync if you play with --in-peers/--out-peers by decreasing the former and increasing the latter? Or are you having syncing issues with this --in-peers 4 --out-peers 2 config you mentioned earlier?

@rvalle
Author

rvalle commented Mar 28, 2023

@altonen the easiest way to get a node to catch up is a simple restart. In seconds it can sync up thousands of blocks.

I don't know what is going on... perhaps a lot of the peers out there are not in sync? Perhaps the node's peer-to-peer library does not replace peers when they are out of sync.

Can you reproduce this situation?

I think peer settings help, but the issue still persists, despite the fact that during the first months of the year we had very limited peer settings and sync worked like a clock. So I am testing both types of settings on different nodes.

In the errors above, the Kusama node was running with 20 in / 20 out / 10 downloads, and Polkadot with 6 in / 2 out / 2 downloads.

@rvalle rvalle reopened this Mar 28, 2023
@altonen
Contributor

altonen commented Mar 28, 2023

Do you have access to an environment where you can run the node without any extra NAT configurations and without having to run the node inside a Docker container, i.e., just vanilla target/release/polkadot ...?

One thing that could be contributing to syncing falling behind is a problem we're facing with NetworkWorker getting overloaded. Can you check what substrate_unbounded_channel_len says for mpsc_network_worker in Grafana?
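If Grafana is not at hand, the same counter can be read straight from the node's Prometheus endpoint; a quick probe, assuming the default Prometheus port 9615 (the exact label layout of the metric may differ between versions):

    curl -s http://localhost:9615/metrics | grep mpsc_network_worker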

I'm trying to see if I can reproduce it but so far no luck

@rvalle
Author

rvalle commented Mar 28, 2023

@altonen is this what you need?

Screenshot 2023-03-28 at 11-17-11 View panel - Polkadot Node Monitoring - Dashboards - Grafana

Note that I have, obviously, been restarting nodes all the time.... in particular yesterday afternoon.

@rvalle
Author

rvalle commented Mar 28, 2023

And this is how we can represent our current problem, looking at the evolution of the finalized block over time:

Screenshot 2023-03-28 at 11-35-14 Polkadot Node Monitoring - Dashboards - Grafana

Synchronization is not fluid

@altonen
Contributor

altonen commented Apr 3, 2023

Actually, now that I'm looking at the numbers again, they're not completely horrible, because they're reported in different units. One is reported in megabits per second and the other in kibibytes per second. There's still a large enough difference to warrant further research, but it's not as bad as I initially thought.

@rvalle
Author

rvalle commented Apr 3, 2023

@altonen I would compare bandwidth usage against previous versions... and in particular once sync has been achieved. I mean, just in case we are talking about some kind of regression.

I cannot remember any kind of bandwidth issue in the last year, not even when fully syncing the archive nodes, which take TBs. We also performed the initial download with full p2p settings, as it took a long time.

On the other hand, I notice high bandwidth usage while the chain is stalled. Say, 3 Mb/s and stalled at the same time. The node can stay stalled there for hours. If you then restart it, it will sync literally in seconds, before bandwidth usage has fully ramped up. Which is very strange.

Yes I will add the p2p settings and collect some logs for you.

Did you test running your node with bandwidth limits? You could limit it to 2x the reported figure, or 1.5x. I think it would be ideal if you could reproduce the condition, so you can study it in detail.

With regard to bandwidth usage, I guess the key point is to keep a healthy balance between the traffic received and the write-to-disk bandwidth. For example, how much bandwidth is consumed to write 1 MB to chain/node storage?

Perhaps it affects the node only after it is fully synced, and perhaps it affects only archive nodes.

@altonen
Contributor

altonen commented Apr 3, 2023

I did compare between versions, going all the way back to v0.9.33, but didn't see anything alarming. I suspected there might have been something fishy going on between v0.9.36 and v0.9.37, since in the latter release we updated libp2p, but there was nothing weird in the bandwidth usage between those versions. Furthermore, if Polkadot syncs fine on the same machine but Kusama doesn't, and since they run the same networking implementation, I'm leaning towards some network characteristic of Kusama negatively affecting syncing.

I will experiment with artificially slowing down the speeds, but I think the answer in the case of lower network speeds is to reduce the network consumption of the node by reducing peer counts and, in the (near) future, the number of blocks per request.

On the other hand, I notice high bandwidth usage while the chain is also stalled. Say, 3Mb/s and stalled at the same time. The server can be stalled there for hour(s). If you then restart the node, it will sync literally in seconds, before bandwidth usage has fully ramped up. Which is very strange.

This is an interesting observation. If the node was in sync at one point and then fell behind, it could be that there is a bug in the keep-up sync. But it's very interesting because, if syncing has stalled, what is the 3 Mbps worth of bandwidth being consumed on? I'll try to reproduce this.

@nicolasochem

It's not only Kusama. I am pretty sure I saw the same behavior on Polkadot (lots of peers but no new blocks received).

@altonen
Contributor

altonen commented Apr 3, 2023

I may have found something. I still don't experience syncing stalling completely, but I do sometimes see the syncing speed drop to 0.0 bps for a brief period of time even though there are incoming blocks. I decided to debug that in the hope that it would reveal something, and it definitely did. Here are some debug prints when receiving block responses from peers:

2023-04-04 00:06:16 BlockResponse 0 from 12D3KooWKvdDyRKqUfSAaUCbYiLwKY8uK3wDWpCuy2FiDLbkPTDJ with 64 blocks  (10142104..10142041)
2023-04-04 00:06:16 downloading new
2023-04-04 00:06:16 remove previous range
2023-04-04 00:06:16 insert blocks into block queue
2023-04-04 00:06:16 insert 64 blocks, starting at 10142041, 32
2023-04-04 00:06:16 64 blocks inserted, returned None, total len 33
2023-04-04 00:06:16 total amount of queued entries 33, max 2112, min 33
2023-04-04 00:06:16 0 blocks ready for import
2023-04-04 00:06:16 import 0 blocks, queue length 0
2023-04-04 00:06:16 got block response
2023-04-04 00:06:16 BlockResponse 0 from 12D3KooWCrqHSt4Tv62kYd95NkkwWde87BzypGwBfaZC4jV1e1eP with 1 blocks  (10140696)
2023-04-04 00:06:16 import 0 blocks, queue length 0
2023-04-04 00:06:16 got block response
2023-04-04 00:06:16 BlockResponse 0 from 12D3KooWMbLAGqEwyNi29J16i6xmgtFgPw4cN74qUcGma2Qd2Go4 with 64 blocks  (10142168..10142105)
2023-04-04 00:06:16 downloading new
2023-04-04 00:06:16 remove previous range
2023-04-04 00:06:16 insert blocks into block queue
2023-04-04 00:06:16 insert 64 blocks, starting at 10142105, 32
2023-04-04 00:06:16 64 blocks inserted, returned None, total len 33
2023-04-04 00:06:16 total amount of queued entries 33, max 2112, min 33
2023-04-04 00:06:16 0 blocks ready for import
2023-04-04 00:06:16 import 0 blocks, queue length 0
2023-04-04 00:06:16 got block response
2023-04-04 00:06:16 BlockResponse 0 from 12D3KooWFN2mhgpkJsDBuNuE5427AcDrsib8EoqGMZmkxWwx3Md4 with 64 blocks  (10142040..10141977)
2023-04-04 00:06:16 downloading new
2023-04-04 00:06:16 remove previous range
2023-04-04 00:06:16 insert blocks into block queue
2023-04-04 00:06:16 insert 64 blocks, starting at 10141977, 32
2023-04-04 00:06:16 64 blocks inserted, returned None, total len 33
2023-04-04 00:06:16 total amount of queued entries 33, max 2112, min 33
2023-04-04 00:06:16 0 blocks ready for import
2023-04-04 00:06:16 import 0 blocks, queue length 0

At some point the node gets into a loop where it adds blocks to the queue but the queue length stays at 33. It happens because the code calls clear_peer_download() here before inserting the newly downloaded block range into the queue, which removes the previous state downloaded from the peer. So basically it received a bunch of data from the peer, then received some more in another request, removed the previously downloaded data, and scheduled another block request for the peer. Sometimes it's able to escape this loop and progress. I don't know yet why this happens, but it doesn't look right and it could explain why syncing keeps stalling.

Either the peer is getting incorrectly scheduled for another block request, which then erases the previous, now-completed state, or some operation that previously added the downloaded state into the queue is being omitted.

@rvalle
Author

rvalle commented Apr 4, 2023

@nicolasochem you are right.
@altonen finally the Polkadot node stalled last night.

Screenshot 2023-04-04 at 08-20-06 Polkadot Node Monitoring - Dashboards - Grafana

I restarted it and it re-synced in 4 minutes.

@altonen your finding sounds very promising, but note that I don't get 0 bps reported during stalls. However, I am using the latest Docker image, 0.9.40, not the main branch.

I will get you those logs today....

@bkchr
Member

bkchr commented Apr 4, 2023

So basically it received bunch of data from the peer, then received some more in another request, removes the previously downloaded data and schedules another block request for the peer.

But we always only send one request per peer?

I don't know yet why this happens but it doesn't look right and it could explain why syncing keeps stalling.

I always explained this to myself that there is probably some chunk/range we are waiting to download that is coming from a slow peer, and thus all the other blocks are getting scheduled.

@altonen
Contributor

altonen commented Apr 4, 2023

@bkchr

But we always only send one request per peer?

I think there is a bug, either multiple requests or some other state issue.

I always explained this to myself that there is probably some chunk/range we are waiting to download that is coming from a slow peer and thus, all the other blocks are getting scheduled.

Yeah, exactly. The issue is not with the cache itself, I believe, but with something in the block request code.

@rvalle

I don't understand. Are you saying that when syncing stalls it is still able to import stuff? It's showing Syncing x.x Bps where x.x > 0 and it still stalls. That doesn't make sense unless I'm conflating two different numbers.

@bkchr
Member

bkchr commented Apr 4, 2023

I think there is a bug, either multiple requests or some other state issue.

Maybe we should pass the start_block to clear_peer_download to detect these issues?

@rvalle
Author

rvalle commented Apr 4, 2023

@altonen

I don't understand. Are you saying that when syncing stalls it is still able to import stuff? It's showing Syncing x.x Bps where x.x > 0 and it still stalls. That doesn't make sense unless I'm conflating two different numbers.

Look at the logs screenshot. When the node is stalled, the block counters do not increase (sometimes for many hours), yet the bandwidth counters always show a normal rate.

like this: https://user-images.githubusercontent.com/412837/228508301-32c27fd5-a744-42f0-8549-5606e8d64e2d.png

@altonen
Contributor

altonen commented Apr 4, 2023

I think there is a bug, either multiple requests or some other state issue.

Maybe we should pass the start_block to clear_peer_download to detect these issues?

I think in general we need more logging in the entire crate. I remember I fixed one race condition in the ancestry search a few months ago that had been there for who knows how long, and it was only reproducible in a virtual terminal, so it could be that converting syncing into an async runner is what surfaces this issue. Or it's something in the recent block request refactoring. I'm looking into it.

@rvalle can you check in Grafana whether you're getting notifications on the /block-announces/1 protocol? If not, that would explain why the syncing is idling. If you are getting them but nothing is downloaded, then it's something else. Not sure what yet.
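A quick way to check outside of Grafana is to watch the node's Prometheus output for the block-announce notification counters; a rough probe, again assuming the default Prometheus port 9615 (the exact metric names vary across releases):

    watch -n 30 "curl -s http://localhost:9615/metrics | grep block-announces"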

@rvalle
Author

rvalle commented Apr 4, 2023

@altonen here are the logs.

All-Messages-search-result-ft8.zip

@altonen
Contributor

altonen commented Apr 4, 2023

Thanks. I'm looking at the logs, and the node is getting block announcements, requesting the block bodies, receiving them, and then importing the received blocks, so it seems to be working. At least according to these logs, the best and finalized blocks are increasing.

Is there something wrong with the timestamps? The time keeps jumping backwards and forwards, which makes it look like the best block is first finalized at X and then further in the logs it's finalized at X - N, which doesn't make sense.

@rvalle
Author

rvalle commented Apr 4, 2023

@altonen let me look into it.

@rvalle
Author

rvalle commented Apr 5, 2023

Is there something wrong with the timestamps?

@altonen I can see that I left the traces on for too long and the log server was flooded with 4 GB of logs... it is likely that there was some mess. I will repeat the trace but keep it to a few minutes only. Let's see if I can get you a cleaner cut.

can you check in Grafana if you're getting notifications from /block-announces/1 protocol?

No, I have not set up any alerting; we use a higher-level system. I will have a look at adding it.

@altonen
Contributor

altonen commented Apr 5, 2023

But even if there were some issues with the logs, they still showed that the best and finalized blocks were increasing. Are you able to provide logs where that doesn't happen, e.g., from node "ft6", which was idling in this image: #6696 (comment)?

@rvalle
Author

rvalle commented Apr 5, 2023

@altonen that is a good point. I will try to capture a segment in both circumstances, while experiencing the issue and while not.

@rvalle
Author

rvalle commented Apr 5, 2023

@altonen is it possible to change these log flags on a running node? Remember that when we restart the node it syncs up quickly...

@altonen
Contributor

altonen commented Apr 5, 2023

Hmm, I see the issue. I don't think it's possible, sadly. You could configure systemd-logrotate if you're running out of space; also, logging just -lsync=trace may be fine.

@bkchr changing logging during runtime is not possible, right?

@rvalle
Author

rvalle commented Apr 5, 2023

Yes, I looked it up... we have Graylog, already integrated and automated... I will see if I can somehow work around this...

@bkchr
Member

bkchr commented Apr 5, 2023

@bkchr changing logging during runtime is not possible, right?

Running with --enable-log-reloading and then using the RPC system_addLogFilter to add new log filters should work.
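For example, something along these lines should enable the sync traces at runtime (a sketch, assuming the node was started with --enable-log-reloading and its HTTP RPC is reachable on the default port 9933 with unsafe RPC methods allowed):

    curl -s -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","id":1,"method":"system_addLogFilter","params":["sync=trace,sub-libp2p=trace"]}' \
      http://localhost:9933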

@rvalle
Author

rvalle commented Apr 20, 2023

Thanks @bkchr, I managed to capture the logs while stuck. I had to allow unsafe RPC as it is a remote node.

cc @altonen

All-Messages-search-result-dot-stuck.zip

I will now try to get another trace file without syncing issues.

@altonen
Contributor

altonen commented Apr 20, 2023

Thanks @rvalle

From the logs it looks like your node is not getting any block announcements. We were able to reproduce this issue and have a fix for it, which will be included in the next release. This will hopefully fix the issue for you as well.

@rvalle
Author

rvalle commented May 9, 2023

I upgraded to 0.9.42, now testing and collecting metrics. First impression is good.

@rvalle
Author

rvalle commented May 10, 2023

After monitoring for over 16h, it all seems very good:

Sync is now smooth, even with very small p2p settings (in 2, out 2, downloads 1), which was our original setting for the archive application. We updated yesterday around 15:00:

KSM Finalized before and after the update

The bandwidth is now about 1/3 of the previous version's, and much more in line with the figures reported in the logs.

KSM Archive Node Net

@altonen @bkchr I think you can close the issue.

@altonen
Contributor

altonen commented May 10, 2023

@rvalle very nice to hear that it works
