Node massively misses paravalidation votes. #3613
Comments
CC @alexggh
Looking at the logs, it seems block-sync times are off. I see plenty of occasions where blocks are imported 1-2s after their creation, which will make you miss statements because your node won't participate. To give you a few examples, block 19797352 gets created at ... After restart and downgrade to ...
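One way to sanity-check this from your own logs (a sketch, assuming the standard Substrate "Imported #N (0x…)" log lines with timestamps; adjust the pattern and path to your setup):

```bash
# Hypothetical check: list recent block-import log lines and eyeball their
# timestamps against the 6-second slot cadence; imports landing 1-2s after
# slot start are the symptom described above. "polkadot.log" is a placeholder.
grep 'Imported #' polkadot.log | tail -n 20
```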
How did you reach this conclusion? Logs
Roping in Node SDK as well: @andresilva.
Can we get more data on who is affected by this, and what machines + operating systems + versions everybody affected is running? Also, what are the "similar" problems? How similar? It would be good if they could share their data in as much detail as possible as well.
Also, how does CPU/network usage compare between those versions for you?
Thanks for the quick reply, and I apologize for not being able to reply right away. I don't think the restart helped, because when this problem was first discovered (on February 2, on our Kusama node), I reloaded version 1.6.0, but the node kept skipping votes. The problem only disappeared when I reverted to version 1.5.0. Logs and screenshots of statistics are attached. Events on February 2:
Hi, I am attaching an archive with hardware and operating system data. Please note that from communicating with other 1KV program members it has already become clear that the problem happened on different operating systems and on powerful servers. Unfortunately, I don't have data on the difference in CPU/network usage between these versions, but I have requested more data from other 1KV members.
The node is for sure running with a synchronized clock? Otherwise, we will need ...
Also, do you run any special configuration? Any CLI arguments?
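A quick way to verify clock sync on a systemd-based Linux host is shown below (the reporting node runs FreeBSD, where one would query ntpd or chrony instead):

```bash
# Check whether the system clock is NTP-synchronized (systemd hosts).
timedatectl status | grep -i synchronized
# Expected output: "System clock synchronized: yes"
```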
I see you first sent us an analysis on Polkadot and then an analysis on Kusama. The reason I ask is that on Kusama we do have item 2) from #3314 (comment), which is currently not root-caused and seems like the issue you are having. On Kusama you do see 0 backing rewards, which is consistent with the bug we are trying to track down, but on Polkadot you are still getting 20% of the rewards, so these might be different issues.
Ok, looked at the ... Not clear to me how you got into that state, but I don't think your authority would be discovered by others with just that external address. @altonen @dmitry-markin, thoughts, or is it just a red herring?
It shouldn't have a WebSocket listening address if the node is run as a validator. Might be worth running the node with ...
Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs.
Ok, looked a bit more closely @anignatev, and it seems your node starts with different roles:
When it is ...
Yeah, it's not detected dynamically. The role is based on passing --validator.
Yeah, ignore my previous comment; it seems the log file contains logs from multiple nodes. I assume infsysgroup is the one of interest, @anignatev?
@tunkas this seems fresh, have you got a link in app.turbo.flakes to your validator?
Yes, this is from 18 hours ago.
@tunkas Since it seems this happens often to you, would you be able to run your node with -lparachain=debug,
or even -lparachain=trace if it is not too much to ask,
and give me the logs for a time when that happens?
Here are the logs from this morning, nothing dubious in them imho. But that one was not as bad as the one last night.
Could you explain more about this flag? Thanks!
These are arguments to the polkadot binary. Both will considerably increase the size of your logfiles.
Ok, understood. So I just add them in my service file alongside other arguments? For Ubuntu it would be '-lparachain=trace'?
70 GB might not be enough if it runs for a long time; let's use debug, not trace. Yes, adding it next to the other arguments should be enough.
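For illustration, a minimal sketch of such an invocation (the chain, name, and other flags are placeholders for whatever your service file already passes):

```bash
# Hypothetical validator invocation with the parachain debug log target added.
polkadot --validator --chain kusama --name my-validator -lparachain=debug
```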
OK, I'll try to time it for my next entrance into the active set, and will report back here with logs if I get the same result again 👍 Thanks again for your assistance!
Ignore the different startup modes. To ensure uninterrupted operation, we have a main and a backup server for each node. When we were experimenting, we transferred the validation to another server, and this one was switched to database-synchronization-only mode. I apologize for missing this detail; I am attaching the missing log (Feb 2 14:09:10 - Feb 2 18:52:18). CLI arguments KUSAMA: ... CLI arguments POLKADOT: ...
In my case, the node missed all votes and authored one block.
What does the node say about hardware specs?
In particular, what does it say about minimum requirements being met?
We're obviously not talking about the same metric.
The polkadot binary defaults to 8 outgoing connections and 32 inbound, for a combined maximum of 40. From `polkadot --help`:
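The entries in question are presumably the peer-count options; an approximate paraphrase (not verbatim --help output, and the wording may differ between versions):

```
--out-peers <COUNT>   Number of outgoing connections to maintain   [default: 8]
--in-peers <COUNT>    Maximum number of inbound full-node peers    [default: 32]
```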
This is the peer count I'm referring to, which I understand to be the peers I have a direct TCP connection with, which is likely a subset of the active validator set. Actually, that is an interesting question: if my validator is not directly talking to the 2 other members of the validator group to which it belongs, can it participate? Is it failing to connect to those specific validators for some reason?
No, that's on top of the reserved peer set, and in the reserved peer set for the validation protocol we add all authorities in the active set, plus the next ones and the ones from the past few sessions. So your node will always be connected to all authorities in the active set.
Which logs, and how are you counting that?
I did not collect stats in Prometheus on this new server. I am now collecting stats and I'll revert as soon as I have anything further.
OK, `netstat -an | grep ESTABLISHED | wc` confirms I have about 1700 connections on this host (with 1 KSM and 1 DOT validator running), so this checks out. This suggests that the peer count reported in syslog is not very meaningful. Every 5 seconds the polkadot binary logs a message like this one to syslog: ...
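(As an aside, to attribute socket counts to a specific process rather than the whole host, one could use iproute2's ss; a sketch, with the PID as a placeholder:)

```bash
# Count established TCP connections owned by one polkadot process.
ss -tanp state established | grep 'pid=12345' | wc -l
```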
I have restarted my node with -lparachain=debug; however, I haven't seen the issue frequently, so I'm not sure when it might pay off.
Thank you, let me know when/if it happens; I can't reproduce it on our test networks, so any data I get on this is good data.
Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.
So I noticed my monitoring software is tracking TCP connection stats; this chart shows a spike starting right around the beginning of session 37483, the one my validator struggled with. Both ESTABLISHED and TIME_WAIT jump at the beginning of the session. The TIME_WAIT count stays high for one hour, then drops back down to normal levels. TCP connections stay in TIME_WAIT for 60 seconds after they are closed, so this steady high number for a full hour suggests a lot of churn in TCP connections for some reason. Maybe my node was somehow triggering bans/disconnects from the nodes it was talking to?

The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally; I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?
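For anyone who wants to watch this kind of state churn live, a quick sketch with standard iproute2 tools:

```bash
# Print a count of TCP sockets per state every 5 seconds.
watch -n 5 "ss -tan | awk 'NR>1 {print \$1}' | sort | uniq -c"
```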
I found a different Kusama session where my validator participated (successfully) and one of the other nodes in the group scored 0. This was also a case where the 'bad' validator was just joining the active set. Take a look at https://apps.turboflakes.io/?chain=kusama&app=onet#/validator/5CHw7xnARj89p8xQAC86d5e7mjU6jFgsApbM8PWxp5NKbV1k, session 37531. My node, which was in the same validator group as the bad node, has logs shown below.
1) This was another case where two different block hashes were logged as the start of the epoch.
2) There is an additional log entry here about "Trying to remove unknown reserved node...", which might be referencing the validator that got a 0 score for the session.
3) The new authority set isn't applied by my node until 15 seconds later. Perhaps it is rejecting the other member of the validator group because it does not yet recognize it as a member of the authority set?
OK, so we were definitely talking about different peers :)
Yes, I noticed the same; the consensus was that it shouldn't matter, but now that it has happened twice I'm not so sure anymore. Interestingly enough, if you look at session 37531 you will find 3 more nodes in the same situation, so I don't think that's a coincidence anymore.
Regarding the high TIME_WAIT count: ...
Understood about TIME_WAIT. I think the fact that a lot of connections are being terminated might be significant, since it started at the beginning of the bad session and ended as soon as the session was over. Just speculating, but perhaps my validator is repeatedly trying to connect and one or more other validators are repeatedly disconnecting it for some reason. I wonder if the log message I pointed out in my previous comment is significant?
Yeah, ignore that one, it is a red herring; it has been there since like forever.
I left this server running, so the node ID has not changed, and we just got another F when becoming active.
@dcolley ok, what timezone are you using? Could you please provide the logs as well.
They are running versions with async backing because they are getting points.
@alexggh Prometheus dashboard running on a UK server, and running UTC on the node - will load logs tonight and will share logs covering all times.
Ok, that's even worse: it seems like for 2h your node's connectivity is really bad, from 12:30 until 14:30, when it reaches the expected peer count. Also, I don't want to be a nag, but can you double-check the timezone used in your browser (right-hand corner)? Grafana usually uses the timezone of the connecting device; that has bitten me several times. I would have expected 13:30 or 14:30 to be the time when your node enters the active set.
@dcolley: Can you also get me the values for the below metric around the time this problem happened? Thank you!
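(The metric name itself was lost from the thread; for anyone following along, metric values can be pulled from the node's Prometheus endpoint, a sketch with a placeholder name:)

```bash
# Fetch current values from the node's Prometheus exporter (default port 9615).
# METRIC_NAME is a placeholder; substitute the metric being asked about.
curl -s http://localhost:9615/metrics | grep '^METRIC_NAME'
```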
Ok, I think I figured out why this happens for new nodes entering the active set. Long story short, the node will start advertising its AuthorityId on the DHT only after it becomes active, so the other nodes won't be able to discover it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files
Yes, they confirm the findings from here: #3613 (comment).
Hi guys, there seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.
Yes, the problem has been there since forever; it will get fixed with #3722 in the next runtime release for Kusama. On a side note, there is no reason for you to run ...
This is great news! Yes, we have already uncommented the update scripts... Thank you!
We are running on 1.9.0 and I just got another F.
The issue with low grades in the first session after becoming active will be fixed with the 1.2 runtime upgrade.
The 1.2 runtime got deployed with the fix for this issue; closing it.
Description of bug
Starting from version 1.6.0, the node massively misses paravalidation votes. The problem is present in versions 1.6.0, 1.7.0, 1.7.1, and 1.8.0. We returned to version 1.5.0 and everything works correctly.
Attached are the logs of the validator node:
18:36 The node became active, version 1.8.0 was running.
18:44 We reverted to version 1.5.0.
```
$ rustup show
Default host: x86_64-unknown-freebsd
rustup home: /root/.rustup

installed toolchains
stable-x86_64-unknown-freebsd (default)
nightly-x86_64-unknown-freebsd

installed targets for active toolchain
wasm32-unknown-unknown
x86_64-unknown-freebsd

active toolchain
stable-x86_64-unknown-freebsd (default)
rustc 1.76.0 (07dca489a 2024-02-04)
```
P.S. Other Linux-based members have also complained about a similar problem.
polkadot.log.gz
Steps to reproduce