Kusama nodes stop sync, p2p instability. #6696
This may be related to paritytech/polkadot-sdk#528 where the syncing basically grinds to a halt (although in that case it's warp sync). It's currently under investigation.
@altonen good to see it is already identified. I am happy to test a fix, when available, and check whether behavior reverts to normal.
Also seeing the same issue (for Kusama RPC nodes) after upgrading to the new version. Running the node with
Oh actually, perhaps it's because of this (excerpts from CLI output):
From the p2p debug logs, it seems like my RPC node refuses to connect with
The fix paritytech/substrate#13152 is merged to
@rvalle can you give logs of both yellow (stable peer count) and green (unstable peer count) with
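The exact logging flag requested above is cut off; presumably it is the usual syncing/networking trace targets. A hedged sketch of how such a capture could be started (the targets and log file name are assumptions, not quoted from the thread):

```sh
# Assumed invocation for capturing sync/networking trace logs;
# the exact targets asked for above are cut off, these are the usual ones.
polkadot --chain kusama \
  --log sync=trace,sub-libp2p=trace \
  2>&1 | tee ft6-trace.log   # "ft6" file name is illustrative only
```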
Updated to the new release. @altonen I am now capturing those logs...

RPC Node FT6 - Old Release, Stable P2P with Few Connections
RPC Node FT8 - Recent Release, Unstable P2P requires many connections
And here is a sample, 10K log lines for each node:
I went through the logs but couldn't find anything that would explain why the peers disconnect; we definitely need to improve our logging for cases like this. There was something weird in the logs where a substream would be opened and then closed almost immediately without any apparent reason as to why that happened. I need to see if I can reproduce that behavior locally. I did try syncing with
@altonen if it is something related to my environment, then I would say it is firewall related. Is it possible that a new port has been introduced that needs to be allowed out? I am just going to try your same test and see what I get.
Mine are archive nodes, but that probably does not matter.
Maybe I should try to dissect this problem, and find out exactly in which version it appears.
If you can do that, it would be very helpful. I cannot get the
I see a similar problem with 0.9.40. I originally reported the issue (with logs) in paritytech/substrate#6771 ("Node has peers but does not receive new blocks"). This is with 4 vCPUs and 16Gi RAM, but I have 2 validators running per VM.
I'm having the same problem since a few versions ago (maybe 0.9.36), and upgrading to 0.9.40 worsened the issue. Generally it appears when a new version is released, and as more nodes install the new version it goes away. Nodes which are not in Europe get stuck more often. Increasing the number of peers seems to improve the situation.
@drskalman how is your setup? Are you running a cluster?
@altonen: I have tracked down the problem and it was due to the cluster networking / NAT configuration. Both reported nodes were resolving the same external address: public_ip/tcp/30333/NODE_ID. I reviewed the setup and ensured that:
As a result p2p works fine in all the involved node versions, with a low connection number, and the 6 peers are reached almost instantly. The only thing I just can't get my head around is how it is possible that we managed to sync 2 archive nodes with the previous NAT setup (!!). I guess this is the reason why I did not look into networking in depth before. Sorry for that.
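For reference, a rough sketch of the kind of per-node separation described above, with one distinct p2p port per container and a matching forward rule on the router. Container names, image tag and port numbers are illustrative, not taken from the actual setup:

```sh
# One distinct p2p port per node, mapped 1:1 through Docker and the NAT router.
docker run -d --name ksm-archive   -p 30333:30333 parity/polkadot:latest --chain kusama   --port 30333
docker run -d --name dot-archive-1 -p 30334:30334 parity/polkadot:latest --chain polkadot --port 30334
docker run -d --name dot-archive-2 -p 30335:30335 parity/polkadot:latest --chain polkadot --port 30335
# Plus one router rule per node, public_ip:3033X -> host:3033X,
# so no two nodes end up advertising the same external address.
```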
@rvalle it's nice to hear that the issue is now resolved. I will be looking more into this NAT issue soon but there are other problems right now that have higher priority. Something has changed either in Substrate or in libp2p that has caused the problem but after a brief look into the commit history of
@rvalle could you share with us your previous configuration in as much detail as possible? We would like to reproduce the issue to debug it. Anything you have for us, a Terraform script or whatever from the previous "broken" configuration, would be a help for us! Ty!
@bkchr you're not asking me, but I am seeing the same issue with polkadot-k8s when running several polkadot nodes in the same VM, while using a network load balancer provisioned by the cloud provider and explicitly passing the public IP address to the chart (which configures it in the node with
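The flag name is cut off above; presumably it is the `--public-addr` option. A hedged sketch of what explicitly advertising the load balancer's public IP looks like (the IP address and port are placeholders):

```sh
# Advertise the load balancer / NAT public address explicitly so peers
# dial the right endpoint (203.0.113.10 is a placeholder documentation IP).
polkadot --chain kusama \
  --listen-addr /ip4/0.0.0.0/tcp/30333 \
  --public-addr /ip4/203.0.113.10/tcp/30333
```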
@nicolasochem could you give some more detailed instructions on how to do this?
Actually @bkchr, I am still having issues... In the previous configuration I had 3 archive nodes (3x DOT, 1x KSM) behind NAT, and I had only explicitly redirected the first node's port 30333 to that node. We use Docker for deployment of the nodes and Ansible to automate the configuration and deployment. With this configuration all nodes managed to sync the full archive, which seems surprising; I can remember, though, that the first one was way faster than the other two on first sync. I did the changes mentioned above, which involved using ports 30333-30335 (a different one per node) and also setting independent port-forward rules. I take care of looking at the detected public external address and check that there are no collisions (no node is getting the same external address as another). During the weekend one of the Polkadot nodes got bricked with:
Which is surprising because the other Polkadot node with the exact same configuration did not have the problem; I used its DB to rsync the other one and un-brick it. Now I am looking into why our nodes keep falling behind in syncing; our alert robot is very busy, this is from yesterday: Now I am ensuring that all nodes are deployed with specific keys/peer IDs from Ansible, and looking into more potential network issues. It is difficult to diagnose anything when so many things are at play. However, during January and February our nodes ran without a single issue despite the poor NAT configuration.
@rvalle are you able to get the nodes to sync if you play with
@altonen the easiest way to get a node to catch up is a simple restart. In seconds it can sync up thousands of blocks. I don't know what is going on... perhaps a lot of peers out there are not in sync? Perhaps the node's peer-to-peer lib does not renew peers when they are out of sync. Can you reproduce this situation? I think peer settings help, but the issue still persists. Besides, during the first months of the year we had very limited peer settings and sync worked like clockwork. So I am testing both types of settings on different nodes. On the errors above, the Kusama node was running with 20 in, 20 out, 10 downloads; Polkadot with 6 in, 2 out and 2 downloads.
Do you have access to an environment where you can run the node without any extra NAT configuration and without having to run the node inside a Docker container, i.e., just vanilla? One thing that could be contributing to syncing falling behind is a problem we're facing with. I'm trying to see if I can reproduce it but so far no luck.
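The command referred to as "vanilla" above is cut off; presumably it is just a bare invocation with no Docker or NAT in between, something along these lines (the base path is an arbitrary choice for isolation, everything else is left at defaults):

```sh
# Bare-metal reproduction attempt: no container, no NAT, default networking and peer limits.
polkadot --chain kusama --base-path /tmp/ksm-sync-test
```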
@altonen is this what you need? Note that I have, obviously, been restarting nodes all the time... in particular yesterday afternoon.
Actually now that I'm looking at the numbers again, they're not completely horrible, because they're reported in different units. One is reported in megabits per second and the other in kibibytes per second. There's still a large enough difference to warrant further research, but not as bad as I initially thought.
@altonen I would compare bandwidth usage versus previous versions... and in particular once sync has been achieved. I mean, just in case we are talking about some kind of regression. I cannot remember any kind of bandwidth issues in the last year, not even when fully syncing the archive nodes, which take TBs. Also, the initial download we performed with full p2p settings, as it took a long time. On the other hand, I notice high bandwidth usage while the chain is also stalled. Say, 3 Mb/s and stalled at the same time. The server can be stalled there for hour(s). If you then restart the node, it will sync literally in seconds, before bandwidth usage has fully ramped up. Which is very strange. Yes, I will add the p2p settings and collect some logs for you. Did you test running your node with bandwidth limits? You could limit to 2x the reported figure, or 1.5x. I think it would be ideal if you can reproduce the condition, so you can study it in detail. With regards to bandwidth usage, I guess the key point is to keep a healthy balance between the traffic received and the write-to-disk bandwidth. For example, how much bandwidth do you consume to write 1 MB of chain/node storage? Perhaps it affects the node only after it is fully synced, and perhaps it affects only archive nodes.
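Not something from the thread, but one way to run the suggested bandwidth-limit experiment is Linux traffic shaping. A rough sketch under assumptions (interface name and rate are placeholders, and this shapes egress only):

```sh
# Cap egress on eth0 to roughly 1.5-2x the rate the node reports (6 Mbit/s assumed here).
sudo tc qdisc add dev eth0 root tbf rate 6mbit burst 32kbit latency 400ms
# ... run the node and observe sync behavior under the cap ...
# Remove the cap again after the test.
sudo tc qdisc del dev eth0 root
```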
I did compare between versions, went all the way back to v0.9.33, but didn't see anything alarming. I suspected there might have been something fishy going on between v0.9.36 and v0.9.37 since in the latter release we updated libp2p, but there was nothing weird in the bandwidth usage between those versions. Furthermore, if syncing works with Polkadot but not with Kusama, and since they're running the same network implementation, I'm leaning towards there being some network characteristic in Kusama that negatively affects syncing if Polkadot syncs fine on the same machine but Kusama doesn't. I will experiment with artificially slowing down the speeds, but I think the answer in the case of lower network speeds is to reduce the network consumption of the node by reducing peer counts and, in the (near) future, the number of blocks per request.
This is an interesting observation. If the node was in sync at one point and then fell behind, it could be that there is a bug in the keep-up sync. But it's very interesting because, if syncing has stalled, what is the 3 Mbps worth of bandwidth being consumed on? I'll try to get this to reproduce.
It's not only Kusama. I am pretty sure I saw the same behavior on Polkadot (lots of peers but no new blocks received).
I may have found something. I still don't experience syncing stalling completely, but I do sometimes see the sync speed drop to 0.0 bps for a brief period of time even though there are incoming blocks. I decided to debug that with the hope that it would reveal something, and it definitely did; here are some debug prints when receiving block responses from peers:
At some point the node gets into this loop where it adds blocks to the queue but the queue length stays at 33. It happens because of a call the code makes at that point: the peer may be getting incorrectly scheduled for another block request, which then erases the previous, now-completed state, or there is some operation omitted that previously added the downloaded state into the queue.
@nicolasochem you are right. I restarted it and it re-synced in 4 minutes. @altonen your finding sounds very promising, but note that I don't get 0 bps reported during stalls. However, I am using the latest Docker image; I will get you those logs today...
But we always only send one request per peer?
I always explained this to myself as there probably being some chunk/range we are waiting to download that is coming from a slow peer, and thus all the other blocks are getting scheduled.
I think there is a bug, either multiple requests or some other state issue.
Yeah exactly. The issue is not with the cache itself, I believe, but with something in the block request code. I don't understand. Are you saying that when syncing stalls it is still able to import stuff? It's showing
Maybe we should pass the
Look at the logs screenshot. When the node is stalled, the block counters do not increase (sometimes for many hours), yet the bandwidth counters always show a normal rate, like this: https://user-images.githubusercontent.com/412837/228508301-32c27fd5-a744-42f0-8549-5606e8d64e2d.png
I think in general we need more logging in the entire crate. I remember I fixed one race condition in ancestry search a few months ago that had been there for who knows how long, and it was only reproducible in a virtual terminal, so it could be that converting syncing into an async runner is what surfaces this issue. Or something in the recent block request refactoring. I'm looking into it. @rvalle can you check in Grafana if you're getting notifications from
@altonen here are the logs.
Thanks. I'm looking at the logs and the node is getting block announcements, requesting the block bodies, receiving them and then importing the received blocks, so it seems to be working. At least according to these logs the best and finalized blocks are increasing. Is there something wrong with the timestamps? The time keeps jumping backwards and forwards, which then looks like the best block is first finalized at X and then further in the logs it's finalized at X - N, which doesn't make sense.
@altonen let me look into it.
@altonen I can see that I left the traces on for too long and the log server was flooded with 4 GB of logs... it is likely that there was some mess. I will repeat the trace but keep it on for a few minutes only. Let's see if I can get you a cleaner cut.
No, I have not set up any alerting. I mean, we use a higher-level system. I will have a look at adding it.
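For the alerting idea, a minimal sketch that polls the node's own Prometheus endpoint (default port 9615) and flags a stalled best block. The `substrate_block_height` metric is the standard one exposed by Substrate-based nodes; the interval and wiring into a higher-level system are left as assumptions:

```sh
#!/usr/bin/env bash
# Minimal stall check: alert if the best block has not advanced
# between two samples taken 60 seconds apart.
get_best() {
  curl -s http://localhost:9615/metrics \
    | grep '^substrate_block_height' | grep '"best"' \
    | awk '{print $2}'
}
a=$(get_best); sleep 60; b=$(get_best)
[ "$a" = "$b" ] && echo "ALERT: best block stuck at height $a"
```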
But even if there were some issues with the logs, they still showed that best and finalized blocks were increasing. Are you able to give logs where that doesn't happen, e.g., from node "ft6" which was idling in this image: #6696 (comment)?
@altonen that is a good point. I will try to capture a segment in both circumstances, while experiencing the issue and while not.
@altonen is it possible to change these log flags on a running node? Remember that when we restart the node it will sync up quickly...
Hmm, I see the issue. I don't think it's possible, sadly. You could configure systemd-logrotate if you're running out of space, or just reduce what is being logged. @bkchr changing logging during runtime is not possible, right?
Yes, I looked it up... we have Graylog, already integrated and automated... I will see if I can somehow work around this...
Running with
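The reply above is truncated. Judging by the follow-up ("I had to allow unsecure RPC"), it presumably points at Substrate's unsafe `system_addLogFilter` RPC, which raises log levels on a running node without a restart. A hedged sketch under that assumption, using the default HTTP RPC port of that era (9933):

```sh
# Requires the node to expose unsafe RPC methods (e.g. --rpc-methods unsafe),
# which matches why unsafe RPC had to be allowed on the remote node.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"system_addLogFilter","params":["sync=trace"]}' \
  http://localhost:9933
# Repeat with other directives (e.g. "sub-libp2p=trace") as needed;
# system_resetLogFilter reverts to the filters the node was started with.
```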
Thanks @bkchr, I managed to capture the logs while stuck; I had to allow unsafe RPC as it is a remote node. cc @altonen All-Messages-search-result-dot-stuck.zip I will now try to get another trace file without syncing issues.
I upgraded to
After monitoring for over 16h, it all looks very good: sync is now smooth, even with very small p2p settings (in 2, out 2, download 1), which was our original setting for the archive application, updated yesterday around 15:00. The bandwidth is now about 1/3 of the previous version's, and much more in line with the figures reported in the logs.
@rvalle very nice to hear that it works
Hi!
We run archive nodes for Kusama and Polkadot, which we use to generate Polkawatch.app data. Our use pattern would be similar to that of an "indexer".
Normally we run our nodes with parameters that minimize resource usage: peers in 4, peers out 2, max downloads 1 (roughly the flags sketched below). This setup has worked well for us for over 1 year.
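As a rough illustration, these settings map onto the standard Substrate networking flags along these lines; the exact flags used are not shown in the issue, so treat this as an assumption:

```sh
# Assumed mapping of "peers in 4, peers out 2, max downloads 1" to CLI flags.
polkadot --chain kusama --pruning archive \
  --in-peers 4 --out-peers 2 --max-parallel-downloads 1
```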
Last week, running on (Docker) version v0.9.33, our Kusama node stopped syncing altogether. After updating to v0.9.37 the issue persisted.
After discussing with Paranodes we got it to sync again by raising p2p parameters to in 20, out 20, max downloads 10.
Note that this represents a 5x increase in resources for the node, yet the blockchain seems to sync much slower, often falling a few blocks behind what we observe on Polkadot.js.
Our Polkadot node is still running on (docker) v0.9.25 and syncs like a Swiss clock.
The most tangible metric that we can share to show the change of behavior is the following chart:
In yellow, a Polkadot archive node: peers (2+4) are stable at 6-7 all the time. Note that for the entire week peers never dropped below 6.
In green, a Kusama node: last week, with the same settings as Polkadot, it stopped syncing; after updating the settings it is syncing, yet the number of connected peers is very volatile and sync is slower than Polkadot despite the 5x increase in resources.
If I had to guess, I would say something is up at the p2p level, like connections being terminated and the node constantly replacing peers, or something similar.