Node massively misses paravalidation votes. #3613

Closed
anignatev opened this issue Mar 7, 2024 · 71 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@anignatev

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Starting from version 1.6.0, the node massively misses paravalidation votes. The problem is in versions 1.6.0, 1.7.0, 1.7.1, 1.8.0. We reverted to version 1.5.0 and everything works correctly.

Attached are the logs of the validator node:
18:36 The node became active, version 1.8.0 was running.
18:44 We reverted to version 1.5.0.


rustup show

Default host: x86_64-unknown-freebsd
rustup home: /root/.rustup

installed toolchains

stable-x86_64-unknown-freebsd (default)
nightly-x86_64-unknown-freebsd

installed targets for active toolchain

wasm32-unknown-unknown
x86_64-unknown-freebsd

active toolchain

stable-x86_64-unknown-freebsd (default)
rustc 1.76.0 (07dca489a 2024-02-04)

P.S. Other members running Linux-based systems have also reported a similar problem.

polkadot.log.gz

IMG_7557

Steps to reproduce

  1. Compile a version containing the bug from source.
  2. Launch it and wait for your node to become active.
  3. Analyze the paravalidation session results with a statistics system (for example, apps.turboflakes.io).
@anignatev anignatev added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Mar 7, 2024
@sandreim
Contributor

sandreim commented Mar 7, 2024

CC @alexggh

@alexggh
Contributor

alexggh commented Mar 8, 2024

Looking at the logs, it seems block-sync times are off. I see plenty of occasions where blocks are imported 1-2 s after their creation, which will make you miss statements because your node won't participate.

To give you a few examples, block 19797352 gets created at 07/03/2024, 18:38:48 and your node imports it at 18:38:58 (✨ Imported #19797353), so you definitely missed statements there. Same with 19797358: it gets created at 39:24 and your node imports it 3 seconds later at 39:27, so clearly you will miss statements in the next block.

After the restart and downgrade to 1.5 things seem to have improved (you can spot blocks being imported at seconds that divide by 6, which is close to block creation time). I'm not sure if it was 1.5 itself that fixed it or simply the fact that you restarted.

The problem is in versions 1.6.0, 1.7.0, 1.7.1, 1.8.0

How did you reach this conclusion?
Maybe it was just the restart that improved the situation?
What version were you running in the logs, before downgrading to 1.5.0? 1.8.0

Logs

  2810,65: Mar  7 18:37:07 dt2 polkadot[88598]: 2024-03-07 18:37:06 ✨ Imported #19797335 (0x0186…ee4b)
  2845,71: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ♻️  Reorg on #19797335,0x0186…ee4b to #19797335,0x1b13…153e, common ancestor #19797334,0xd98d…a2b7
  2846,65: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ✨ Imported #19797335 (0x1b13…153e)
  2859,65: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ✨ Imported #19797336 (0xeccd…80a9)
  2874,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797337 (0x378d…6357)
  2887,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797337 (0x18ff…4c74)
  2900,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797338 (0x5f95…59a1)
  2913,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797338 (0x95bc…9caf)
  2926,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797339 (0x4a80…f99a)
  2940,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:09 ✨ Imported #19797340 (0xbe6c…3600)
  2952,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:09 ✨ Imported #19797341 (0x5cb9…8039)
  2968,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0x323f…2114)
  2979,71: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ♻️  Reorg on #19797346,0x323f…2114 to #19797346,0x47ac…8d52, common ancestor #19797345,0x25b3…d1a5
  2980,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0x47ac…8d52)
  2981,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0xd213…279d)
  2993,71: Mar  7 18:38:18 dt2 polkadot[88598]: 2024-03-07 18:38:18 ♻️  Reorg on #19797346,0x47ac…8d52 to #19797347,0xe363…4ca6, common ancestor #19797345,0x25b3…d1a5
  2994,65: Mar  7 18:38:18 dt2 polkadot[88598]: 2024-03-07 18:38:18 ✨ Imported #19797347 (0xe363…4ca6)
 
 .... snipped  cut ...

  3051,65: Mar  7 18:38:42 dt2 polkadot[88598]: 2024-03-07 18:38:42 ✨ Imported #19797351 (0x70fa…6010)
  3059,65: Mar  7 18:38:58 dt2 polkadot[88598]: 2024-03-07 18:38:58 ✨ Imported #19797352 (0x8b53…c651)
  3062,65: Mar  7 18:38:58 dt2 polkadot[88598]: 2024-03-07 18:38:58 ✨ Imported #19797353 (0x5f6e…2d15)
  3065,65: Mar  7 18:39:00 dt2 polkadot[88598]: 2024-03-07 18:39:00 ✨ Imported #19797354 (0x7eaf…04db)
  3069,65: Mar  7 18:39:06 dt2 polkadot[88598]: 2024-03-07 18:39:06 ✨ Imported #19797355 (0x69a4…8428)
  3076,65: Mar  7 18:39:12 dt2 polkadot[88598]: 2024-03-07 18:39:12 ✨ Imported #19797356 (0xd45e…ae38)
  3083,65: Mar  7 18:39:19 dt2 polkadot[88598]: 2024-03-07 18:39:19 ✨ Imported #19797357 (0xfa3f…926e)
  3090,65: Mar  7 18:39:27 dt2 polkadot[88598]: 2024-03-07 18:39:27 ✨ Imported #19797358 (0x478a…c9a5)
  3095,65: Mar  7 18:39:30 dt2 polkadot[88598]: 2024-03-07 18:39:30 ✨ Imported #19797359 (0xa414…fed8)
  3100,65: Mar  7 18:39:36 dt2 polkadot[88598]: 2024-03-07 18:39:36 ✨ Imported #19797360 (0xe055…aeb9)
  3108,65: Mar  7 18:39:42 dt2 polkadot[88598]: 2024-03-07 18:39:42 ✨ Imported #19797361 (0xade9…1855)
  3113,65: Mar  7 18:39:42 dt2 polkadot[88598]: 2024-03-07 18:39:42 ✨ Imported #19797361 (0xe9ff…dcc2)
  3119,65: Mar  7 18:39:49 dt2 polkadot[88598]: 2024-03-07 18:39:48 ✨ Imported #19797362 (0x33be…a219)
  3129,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797363 (0xe732…99c4)
  3134,71: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ♻️  Reorg on #19797363,0xe732…99c4 to #19797363,0x231e…3c60, common ancestor #19797362,0x33be…a219
  3135,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797363 (0x231e…3c60)
  3140,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797364 (0xfefa…794d)
  3146,65: Mar  7 18:40:14 dt2 polkadot[88598]: 2024-03-07 18:40:14 ✨ Imported #19797365 (0xa17b…2e81)
  3152,65: Mar  7 18:40:14 dt2 polkadot[88598]: 2024-03-07 18:40:14 ✨ Imported #19797366 (0x1092…344e)
  3158,65: Mar  7 18:40:20 dt2 polkadot[88598]: 2024-03-07 18:40:20 ✨ Imported #19797367 (0x7a44…7fef)
  3170,65: Mar  7 18:41:50 dt2 polkadot[88598]: 2024-03-07 18:41:50 ✨ Imported #19797368 (0xa70e…7b08)
  3173,71: Mar  7 18:41:51 dt2 polkadot[88598]: 2024-03-07 18:41:51 ♻️  Reorg on #19797368,0xa70e…7b08 to #19797368,0x1068…b2c9, common ancestor #19797367,0x7a44…7fef
  3174,65: Mar  7 18:41:51 dt2 polkadot[88598]: 2024-03-07 18:41:51 ✨ Imported #19797368 (0x1068…b2c9)
  3197,65: Mar  7 18:42:02 dt2 polkadot[88598]: 2024-03-07 18:42:02 ✨ Imported #19797383 (0xc816…d7d4)
  3198,65: Mar  7 18:42:02 dt2 polkadot[88598]: 2024-03-07 18:42:02 ✨ Imported #19797384 (0x8d4c…56df)
  3201,65: Mar  7 18:42:09 dt2 polkadot[88598]: 2024-03-07 18:42:09 ✨ Imported #19797385 (0x3835…8a04)
  3202,58: Mar  7 18:44:22 dt2 polkadot[59947]: 2024-03-07 18:44:22 Parity Polkadot
  3267,65: Mar  7 18:48:37 dt2 polkadot[59947]: 2024-03-07 18:48:36 ✨ Imported #19797443 (0x0a81…0a24)
  5450,65: Mar  7 18:48:42 dt2 polkadot[59947]: 2024-03-07 18:48:42 ✨ Imported #19797451 (0x0981…308e)
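For anyone who wants to run the same check against their own journal, here is a rough shell sketch (assumptions on my side: a systemd unit named polkadot, syslog-style journal lines as above, and the relay chain's 6-second slot time; the time window is just the one from this log excerpt):

# Rough sketch, not an official tool: show how far each block import lands from a 6 s slot boundary.
# Large or drifting offsets mean the import came too late to back statements on top of that block.
journalctl -u polkadot --since "2024-03-07 18:36:00" --until "2024-03-07 18:45:00" \
  | grep 'Imported #' \
  | awk '{ split($3, t, ":"); print $0, " -> offset from slot boundary:", t[3] % 6, "s" }'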

@eskimor
Member

eskimor commented Mar 8, 2024

Roping in Node SDK as well: @andresilva.

@eskimor
Member

eskimor commented Mar 8, 2024

Can we get more data on who is affected by this and what machines + operating systems + versions everybody affected is running? Also, what are the "similar" problems? How similar? It would be good if they could share their data in as much detail as possible as well.

@eskimor
Member

eskimor commented Mar 8, 2024

Also how does CPU/network usage compare between those versions for you?

@anignatev
Author

anignatev commented Mar 10, 2024

Looking at the logs, it seems block-sync times are off. I see plenty of occasions where blocks are imported 1-2 s after their creation, which will make you miss statements because your node won't participate.

To give you a few examples, block 19797352 gets created at 07/03/2024, 18:38:48 and your node imports it at 18:38:58 (✨ Imported #19797353), so you definitely missed statements there. Same with 19797358: it gets created at 39:24 and your node imports it 3 seconds later at 39:27, so clearly you will miss statements in the next block.

After the restart and downgrade to 1.5 things seem to have improved (you can spot blocks being imported at seconds that divide by 6, which is close to block creation time). I'm not sure if it was 1.5 itself that fixed it or simply the fact that you restarted.

The problem is in versions 1.6.0, 1.7.0, 1.7.1, 1.8.0

How did you reach this conclusion? Maybe it was just the restart that improved the situation? What version were you running in the logs, before downgrading to 1.5.0? 1.8.0

Thanks for the quick reply; I apologize for not being able to reply right away.

I don't think the restart helped, because when this problem was first discovered (on February 2, on our Kusama node), I restarted version 1.6.0, but the node kept skipping votes. The problem only disappeared when I reverted to version 1.5.0. Logs and screenshots of the statistics are attached.

Events on February 2:
00:04:59 - Node updated to version 1.6.0 but not yet active.
15:28:00 - Node moved to the active state; it missed a lot of votes and received an F rating.
18:47:59 - Node was rebooted, version still 1.6.0.
18:52:36 - Reverted to version 1.5.0; starting with session 36645 it worked normally.
kusama.log.gz
IMG_7145 2

@anignatev
Author

Can we get more data on who is affected by this and what machines + operating systems + versions everybody affected is running? Also, what are the "similar" problems? How similar? It would be good if they could share their data in as much detail as possible as well.

Hi, I am attaching an archive with hardware and operating system data. Please note that from talking with other 1KV program members it has already become clear that the problem occurs on different operating systems and on powerful servers. Unfortunately, I don't have data on the difference in CPU/network usage between these versions, but I have requested more data from other 1KV members.

system_info.zip

@bkchr
Member

bkchr commented Mar 10, 2024

and your node imports at 18:38:58

Is the node definitely running with a synchronized clock?

Otherwise, we will need sync=trace logs (be aware that this generates a lot of logs)
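For the clock question, a quick check on a systemd-based host looks something like this (a sketch assuming systemd-timesyncd or chrony; adjust to whatever NTP client is actually in use):

# Is the system clock NTP-synchronized?
timedatectl status | grep -E 'System clock synchronized|NTP service'

# If chrony is in use, show the current offset from the NTP sources:
chronyc tracking | grep -E 'System time|Last offset'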

@bkchr
Member

bkchr commented Mar 10, 2024

Also, do you run any special configuration? Any CLI arguments?

@alexggh
Contributor

alexggh commented Mar 11, 2024

I see you first sent us an analysis on Polkadot and then an analysis on Kusama. The reason I ask is that on Kusama we do have this item 2) #3314 (comment), which is currently not root-caused and which looks like the issue you are having. On Kusama you do see 0 backing rewards, which is consistent with the bug we are trying to track down, but on Polkadot you are still getting 20% of the rewards, so these might be different issues.

@alexggh
Contributor

alexggh commented Mar 11, 2024

Ok, looked at the kusama.log.gz. The logs are really sparse, but what I noticed is that you restarted your node at 14:09:44 before it entered the active set, and at that point the node seems to have just a WebSocket external address, and not a P2P one; see the ws/p2p prefix instead of just p2p.

It's not clear to me how you got into that state, but I don't think your authority would be discovered by others with just that external address. @altonen @dmitry-markin thoughts, or is it just a red herring?

Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 00:05:07 ks1 polkadot[2100]: 2024-02-02 00:05:07 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30333/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 14:09:47 ks1 polkadot[81405]: 2024-02-02 14:09:47 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30333/ws/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 18:48:02 ks1 polkadot[6035]: 2024-02-02 18:48:02 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/ws/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 ✌️  version 1.5.0-e5b2adac7ad
Feb  2 18:52:42 ks1 polkadot[6543]: 2024-02-02 18:52:42 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 ✌️  version 1.5.0-e5b2adac7ad
Feb  2 18:54:32 ks1 polkadot[6734]: 2024-02-02 18:54:32 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ

@altonen
Contributor

altonen commented Mar 11, 2024

It shouldn't have a WebSocket listening address if the node is run as a validator. Might be worth running the node with -lsub-libp2p=trace,libp2p=trace to figure out why that's happening.

@tunkas

tunkas commented Mar 11, 2024

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs.
It happened on my Kusama validator, in the first session I paravalidated after getting into the active set.
I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. Not sure if this had any influence on resolving the matter, since I still missed all the votes within that session.
The next session went without missing a significant amount of votes.
Here are the logs for that session and the subsequent one.
I'm running the latest version, of course.
Kusama error report.txt

@alexggh
Contributor

alexggh commented Mar 11, 2024

Ok, looked a bit more closely @anignatev and it seems your node starts with different roles:

00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY

When it is FULL it doesn't seem to work. Did you change the CLI arguments somehow? Why would it use a different role? I don't think it is detected dynamically, or is it @bkchr?

@bkchr
Member

bkchr commented Mar 11, 2024

I don't think it is detected dynamically, or is it @bkchr?

Yeah, not detected dynamically. The role is based on passing --validator.

@alexggh
Contributor

alexggh commented Mar 11, 2024

Yeah, ignore my previous comment; it seems the log file contains logs from multiple nodes. I assume infsysgroup is the one of interest, @anignatev?


Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 🏷  Node name: infsysgroup
Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 🏷  Node name: overrated-marble-6094
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 🏷  Node name: uptight-sister-2530
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 🏷  Node name: infsysgroup
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 🏷  Node name: infsysgroup
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY

@alexggh
Contributor

alexggh commented Mar 11, 2024

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs. It happened on my Kusama validator, in the first session I paravalidated after getting into the active set. I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. Not sure if this had any influence on resolving the matter, since I still missed all the votes within that session. The next session went without missing a significant amount of votes. Here are the logs for that session and the subsequent one. I'm running the latest version, of course. Kusama error report.txt

@tunkas this seems fresh; have you got a link in apps.turboflakes.io to your validator?

@tunkas

tunkas commented Mar 11, 2024

Yes, this is from 18 hours ago.
I had another one 5 hours ago :(
Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE
Sending the latest logs soon also.

@alexggh
Contributor

alexggh commented Mar 11, 2024

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

@tunkas

tunkas commented Mar 11, 2024

Here are the logs from this morning; nothing dubious in them imho. But that one was not as bad as the one last night.
Kusama report 2.txt

@tunkas

tunkas commented Mar 11, 2024

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

Could you explain more about this flag?
What does it do exactly?
Where do I put it in my service file?

Thanks!

@alexggh
Contributor

alexggh commented Mar 11, 2024

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

Could you explain more about this flag? What does it do exactly? Where do I put it in my service file?

Thanks!

These are arguments to the polkadot binary; they raise the accepted log levels. debug is the minimum level we need to be able to understand what is going on; trace is the most verbose.

Both will considerably increase the size of your log files.

@tunkas

tunkas commented Mar 11, 2024

Ok, understood.
What would be the amount of this 'significant increase'? Is it something that would require me to upgrade my disk (my current remaining space is around 70 GB)?

So I just add them in my service file alongside the other arguments? For Ubuntu it would be '-lparachain=trace'?
Thanks!

@alexggh
Contributor

alexggh commented Mar 11, 2024

Ok, understood. What would be the amount of this 'significant increase'? Is it something that would require me to upgrade my disk (my current remaining space is around 70 GB)?

So I just add them in my service file alongside the other arguments? For Ubuntu it would be '-lparachain=trace'? Thanks!

70 GB might not be enough if it runs for a long time; let's use debug, not trace.

Yes, adding it next to the other arguments should be enough.
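To make the "next to the other arguments" part concrete for a systemd-managed node, here is a hedged sketch; the unit name, binary path and all flags other than -lparachain=debug are illustrative, not taken from this thread:

# /etc/systemd/system/polkadot.service (excerpt) - append the log target to ExecStart
[Service]
ExecStart=/usr/local/bin/polkadot --validator --chain kusama \
    --name my-validator \
    -lparachain=debug

# reload the unit and restart so the new log level takes effect
systemctl daemon-reload
systemctl restart polkadot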

@tunkas

tunkas commented Mar 11, 2024

OK, I'll try to time it for my next entrance into the active set, and will report back here with logs if I get the same result again 👍

Thanks again for your assistance!

@anignatev
Author

Yeah, ignore my previous comment; it seems the log file contains logs from multiple nodes. I assume infsysgroup is the one of interest, @anignatev?


Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 🏷  Node name: infsysgroup
Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 🏷  Node name: overrated-marble-6094
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 🏷  Node name: uptight-sister-2530
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 🏷  Node name: infsysgroup
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 🏷  Node name: infsysgroup
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY

Ignore the different startup modes. To ensure uninterrupted operation, we have a main and a backup server for each node. While we were experimenting, we transferred validation to the other server, and this one was switched to a database-synchronization-only mode. I apologize for not mentioning this earlier; I am attaching the missing log (Feb 2 14:09:10 - Feb 2 18:52:18).

kusama_another_server.log.gz

CLI arguments KUSAMA:
/usr/local/bin/polkadot --database paritydb --no-hardware-benchmarks --validator --name infsysgroup --chain=kusama
--keystore-path ${keystore_dir} --base-path ${data_dir} --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' --insecure-validator-i-know-what-i-do --port=30331

CLI arguments POLKADOT:
/usr/local/bin/polkadot --database paritydb --validator --name infsysgroup
--keystore-path ${keystore_dir} --base-path ${data_dir} --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' --insecure-validator-i-know-what-i-do --port=30330

@vilija19

In my case, the node missed all votes and authored one block.
It happens when a non-active validator becomes active and is assigned as a para-validator in the first session.
The next session was normal - no missed votes.
Node version: 1.8.0.
Anomaly in the logs during the first session:

Mar 07 18:40:13 ns547523 polkadot[3156224]: 2024-03-07 18:40:13 Cluster has too many pending statements, something wrong with our connection to our group peers
Mar 07 18:40:13 ns547523 polkadot[3156224]: Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(60): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}, ValidatorIndex(61): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}, ValidatorIndex(62): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}} parent_hash=0x53a67567c66031913d84f4a96ca7531abc6c510098bec9bc5216c50656fc026d

Mar 07 18:45:43 ns547523 polkadot[4035256]: 2024-03-07 18:45:43 received notification from 12D3KooWFqS1BZn3g8nc6NuMtLtrQMjoRmF3CzkjVduTTXBNaUPS who had been earlier refused by `SyncingEngine
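If other operators want to check whether they hit the same condition, grepping the journal for that warning is a quick test (a sketch assuming a systemd unit named polkadot):

# count occurrences of the warning over the last day
journalctl -u polkadot --since "24 hours ago" | grep -c 'Cluster has too many pending statements'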

@eskimor
Member

eskimor commented Mar 11, 2024

What does the node say about hardware specs?

2024-03-11 21:45:50 🏁 CPU score: 1.04 GiBs    
2024-03-11 21:45:50 🏁 Memory score: 10.77 GiBs    
2024-03-11 21:45:50 🏁 Disk score (seq. writes): 898.52 MiBs    
2024-03-11 21:45:50 🏁 Disk score (rand. writes): 265.97 MiBs    

In particular, what does it say about minimum requirements being met?
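For reference, the scores and the minimum-requirements check can be pulled like this (a hedged sketch; the benchmark subcommand and the exact wording of the startup warning may differ between releases):

# scores and any hardware warning printed at startup
journalctl -u polkadot | grep -E 'CPU score|Memory score|Disk score|minim.*requirements'

# re-run the hardware benchmark against the reference machine
polkadot benchmark machine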

@tunkas

tunkas commented Mar 13, 2024

We're obviously not talking about the same metric.
The peers I'm talking about are visible in almost every line of our logs, and I never saw the count higher than 40, as Orange mentioned above.

@validorange

The polkadot binary defaults to 8 outgoing connections and 32 inbound, for a combined maximum of 40.

From polkadot --help:

      --out-peers <COUNT>
          Number of outgoing connections we're trying to maintain

          [default: 8]

      --in-peers <COUNT>
          Maximum number of inbound full nodes peers

          [default: 32]

This is the peer count I'm referring to, which I understand to be the peers I have a direct tcp connection with, which is likely a subset of the active validator set. Actually that is an interesting question -- if my validator is not directly talking to the 2 other members of the validator group to which it belongs, can it participate? Is it failing to connect to those specific validators for some reason?

@alexggh
Contributor

alexggh commented Mar 13, 2024

This is the peer count I'm referring to, which I understand to be the peers I have a direct tcp connection with, which is likely a subset of the active validator set. Actually that is an interesting question -- if my validator is not directly talking to the 2 other members of the validator group to which it belongs, can it participate? Is it failing to connect to those specific validators for some reason?

No, that's on top of the reserved peer set: in the reserved peer set for the validation protocol we add all authorities in the active set + the next ones and the ones in the past few sessions. So your node will always be connected to all authorities in the active set.

The peers I'm talking about are visible in almost every line of our logs, and I never saw the count higher than 40, as Orange mentioned above.

Which logs, and how are you counting that?

@dcolley
Contributor

dcolley commented Mar 13, 2024

@dcolley any chance you can look at this metric polkadot_parachain_peer_count{protocol=~"validation/.*"} around that time?

I did not collect stats in Prometheus on this new server. I am collecting stats now and will report back as soon as I have anything further.
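Until Prometheus has history, the raw gauge can also be read straight off the node's own metrics endpoint (a sketch assuming the default Prometheus port 9615 and that the endpoint is enabled):

curl -s http://127.0.0.1:9615/metrics | grep '^polkadot_parachain_peer_count'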

@validorange

OK, netstat -an | grep ESTABLISHED | wc confirms I have about 1700 connections on this host (with 1 KSM and 1 DOT validator running) so this checks out. This suggests that the peer count reported in syslog is not very meaningful.

Every 5 seconds the polkadot binary logs a message like this one to syslog:

Mar 13 14:46:58 vo-ash1 polkadot[1011]: 2024-03-13 14:46:58 💤 Idle (17 peers), best: #19882624 (0x1457…0e8d), finalized #19882620 (0xa6c2…14f0), ⬇ 506.7kiB/s ⬆ 321.5kiB/s

I have restarted my node with -lparachain=debug, however I haven't seen the issue frequently so I'm not sure when it might pay off.
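A quick way to break the same netstat view down by TCP state (a sketch using ss from iproute2; the counts cover every process on the host, so both validators are included):

# count sockets per TCP state
ss -tan | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'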

@alexggh
Contributor

alexggh commented Mar 13, 2024

I have restarted my node with -lparachain=debug, however I haven't seen the issue frequently so I'm not sure when it might pay off.

Thank you, let me know when/if it happens. I can't reproduce it on our test networks, so any data I get on this is good data.

Idle (17 peers)

Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.

@validorange

image

So I noticed my monitoring software is tracking TCP connection stats; this chart shows a spike starting right around the beginning of session 37483, the one my validator struggled with. Both ESTABLISHED and TIME_WAIT jump at the beginning of the session. The TIME_WAIT count stays high for one hour, then drops back down to normal levels. TCP connections stay in TIME_WAIT for 60 seconds after they are closed, so this steady high number for a full hour suggests maybe a lot of churn in TCP connections for some reason? Maybe my node was somehow triggering bans/disconnects from the nodes it was talking to?

The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally. I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?

Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 👶 New epoch 37483 launching at block 0xc12a…4a9b (block slot 284983479 >= start slot 284983479).
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 👶 Next epoch starts at slot 284984079
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 ✨ Imported #22203701 (0xc12a…4a9b)
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 Node uses the following topology indices index_in_gossip_topology=Some(ValidatorIndex(162)) index_in_parachain_authorities=Some(Active(ValidatorIndex(162)))
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 👶 New epoch 37483 launching at block 0x1366…0905 (block slot 284983479 >= start slot 284983479).
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 👶 Next epoch starts at slot 284984079
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 ♻️  Reorg on #22203701,0xc12a…4a9b to #22203701,0x1366…0905, common ancestor #22203700,0x089c…d11e
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 ✨ Imported #22203701 (0x1366…0905)
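A rough way to count how often an epoch start is announced on more than one fork, as in the log above (a sketch that relies on the "New epoch N launching" wording and a systemd unit named polkadot):

journalctl -u polkadot | grep 'New epoch' \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "epoch") print $(i + 1) }' \
  | sort | uniq -c | awk '$1 > 1 { print "epoch", $2, "announced", $1, "times" }'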

@validorange

I found a different kusama session where my validator participated (successfully) and one of the other nodes in the group scored 0. This was also a case where the 'bad' validator was just joining the active set. Take a look at https://apps.turboflakes.io/?chain=kusama&app=onet#/validator/5CHw7xnARj89p8xQAC86d5e7mjU6jFgsApbM8PWxp5NKbV1k, session 37531.

My node, which was in the same validator group as the bad node, has logs shown below. 1) This was another case where two different block hashes were logged as the start of the epoch and 2) There is an additional log entry here about "Trying to remove unknown reserved node..." which might be referencing the validator that got a 0 score for the session. 3) The new authority set isn't applied by my node until 15 seconds later. Perhaps it is rejecting the other member of the validator group because it does not yet recognize it as a member of the authority set?


Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 👶 New epoch 37531 launching at block 0xc712…ec70 (block slot 285012279 >= start slot 285012279).
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 👶 Next epoch starts at slot 285012879
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 ✨ Imported #22232385 (0xc712…ec70)
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 Trying to remove unknown reserved node 12D3KooWMvwDw55sdVK5e5A8qFPthp3SQDk1RW4Bznd7PMf8D9Pd from SetId(3).
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 Node uses the following topology indices index_in_gossip_topology=Some(ValidatorIndex(225)) index_in_parachain_authorities=Some(Active(ValidatorIndex(225)))
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 👶 New epoch 37531 launching at block 0xc397…dfd5 (block slot 285012279 >= start slot 285012279).
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 👶 Next epoch starts at slot 285012879
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 ♻️  Reorg on #22232385,0xc712…ec70 to #22232385,0xc397…dfd5, common ancestor #22232384,0x4197…23f3
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 ✨ Imported #22232385 (0xc397…dfd5)
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 💤 Idle (25 peers), best: #22232385 (0xc397…dfd5), finalized #22232382 (0x4bb9…3a7a), ⬇ 3.5MiB/s ⬆ 5.4MiB/s
Mar 10 12:28:00 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:00 ✨ Imported #22232386 (0x4088…0257)
Mar 10 12:28:00 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:00 ✨ Imported #22232386 (0x7b76…309b)
Mar 10 12:28:01 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:01 💤 Idle (25 peers), best: #22232386 (0x4088…0257), finalized #22232383 (0x4cdf…e167), ⬇ 3.7MiB/s ⬆ 6.3MiB/s
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 💤 Idle (25 peers), best: #22232386 (0x4088…0257), finalized #22232384 (0x4197…23f3), ⬇ 3.0MiB/s ⬆ 3.8MiB/s
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 ✨ Imported #22232387 (0x6073…3506)
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 ✨ Imported #22232387 (0xdd7c…cfec)
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 👴 Applying authority set change scheduled at block #22232385
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 👴 Applying GRANDPA set change to new set with 1000 authorities
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 🥩 New Rounds for validator set id: 2953 with session_start 22232385
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 🥩 Concluded mandatory round #22232385
Mar 10 12:28:11 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:11 💤 Idle (25 peers), best: #22232387 (0x6073…3506), finalized #22232385 (0xc397…dfd5), ⬇ 4.2MiB/s ⬆ 4.3MiB/s
Mar 10 12:28:12 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:12 ✨ Imported #22232388 (0x85de…293c)

@tunkas

tunkas commented Mar 13, 2024

Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.

OK, so we were definitely talking about different peers :)

@alexggh
Contributor

alexggh commented Mar 14, 2024

The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally. I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?

Yes, I noticed the same; the consensus was that it shouldn't matter, but now that it has happened twice I'm not so sure anymore.

Interestingly enough, if you look at session 37531 you will find 3 more nodes in the same situation, so I don't think that's a coincidence anymore.

@sandreim
Contributor

sandreim commented Mar 14, 2024

Regarding the high TIME_WAIT: it indicates that the node is terminating a lot of connections. The purpose of this state is to ensure reliability; we make sure to receive the ACK of the connection termination.

@validorange

Understood about TIME_WAIT. I think the fact that a lot of connections are being terminated might be significant, since it started at the beginning of the bad session and ended as soon as the session was over. Just speculating, but perhaps my validator is repeatedly trying to connect and one or more other validators are repeatedly disconnecting it for some reason.

I wonder if the log message I pointed out in my previous comment is significant? Trying to remove unknown reserved node... I see this repeating in my logs about every 6 hours, which lines up with new authority sets. But it is consistently logged 15-20 seconds before the authority set is applied. Sometimes it is logged just once, other times it is logged dozens of times (all for different node hashes).

@alexggh
Contributor

alexggh commented Mar 14, 2024

Trying to remove unknown reserved node

Yeah, ignore that one, it is a red herring; it has been there since forever.

@dcolley
Contributor

dcolley commented Mar 14, 2024

I left this server running, so the node ID has not changed, and we just got another F when becoming active:
https://apps.turboflakes.io/?chain=kusama#/validator/HyLisujX7Cr6D7xzb6qadFdedLt8hmArB6ZVGJ6xsCUHqmx?mode=history

@dcolley
Contributor

dcolley commented Mar 14, 2024

What stats can I share from prom?
image

image

@dcolley
Contributor

dcolley commented Mar 14, 2024

Other validators in the group:
ANAMIX/03 is on 1.7.1
I can't tell what the other node is running

image

@alexggh
Contributor

alexggh commented Mar 14, 2024

@dcolley ok, what timezone are you using? Session 37627 seems to have started at 12:27:55.811 UTC; is that equivalent to 13:30 -> 14:30 on your dashboard?

Could you please provide the logs as well?

I can't tell what the other node is running

They are running versions with async backing because they are getting points.

@dcolley
Contributor

dcolley commented Mar 14, 2024

@alexggh The Prometheus dashboard is running on a UK server, and the node runs UTC - I will load the logs tonight and share logs covering all times.

@alexggh
Contributor

alexggh commented Mar 14, 2024

@alexggh The Prometheus dashboard is running on a UK server, and the node runs UTC - I will load the logs tonight and share logs covering all times.

Ok, that's even worse: it seems like for 2 h your node's connectivity is really bad, from 12:30 until 14:30 when it reaches the expected peer count.

Also, I don't want to be a nag, but can you double check the timezone shown in the right-hand corner of your browser? I know that bit me several times; Grafana usually uses the timezone of your connecting device. I would have expected 13:30 or 14:30 to be the time when your node enters the active set.

@alexggh
Contributor

alexggh commented Mar 15, 2024

@dcolley: Can you also get me the values for the below metric around the time this problem happened? Thank you!

substrate_authority_discovery_known_authorities_count{}
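To pull that metric for the window around the incident out of Prometheus, a query_range call along these lines should do (a sketch; the Prometheus address and the start/end times are placeholders):

curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=substrate_authority_discovery_known_authorities_count' \
  --data-urlencode 'start=2024-03-14T12:00:00Z' \
  --data-urlencode 'end=2024-03-14T15:00:00Z' \
  --data-urlencode 'step=60s'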

@alexggh
Contributor

alexggh commented Mar 17, 2024

Ok, I think I figured out why this happens for new nodes entering the active set. Long story short, the node will start advertising its AuthorityId on the DHT only after it becomes active, so the other nodes won't be able to discover it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files

@dcolley
Contributor

dcolley commented Mar 18, 2024

substrate_authority_discovery_known_authorities_count{}

Does this help?

image

@alexggh
Contributor

alexggh commented Mar 18, 2024

Does this help?

Yes, they confirm the findings from here: #3613 (comment).

@anignatev
Author

Hi guys,

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

IMG_7640

kusama_1.5.0.log.gz

@alexggh
Contributor

alexggh commented Mar 19, 2024

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

Yes, the problem has been there since forever; it will get fixed with #3722 in the next runtime release for Kusama.

On a side note, there is no reason for you to run 1.5.0 anymore, because it won't help.

@anignatev
Author

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

Yes, the problem has been there since forever; it will get fixed with #3722 in the next runtime release for Kusama.

On a side note, there is no reason for you to run 1.5.0 anymore, because it won't help.

This is great news! Yes, we have already uncommented the update scripts... Thank you!

@dcolley
Contributor

dcolley commented Mar 21, 2024

We are running on 1.9.0 and I just got another F.
Node ID has not changed and the upgrade was done days ago.
We had an A+ since the upgrade, I guess we were not paravalidating for that session.
https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ?mode=history

@eskimor
Member

eskimor commented Mar 21, 2024

We are running on 1.9.0 and I just got another F. Node ID has not changed and the upgrade was done days ago. We had an A+ since the upgrade, I guess we were not paravalidating for that session. https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ?mode=history

The issue with low grades in the first session after becoming active will be fixed with the 1.2 runtime upgrade.

@alexggh
Contributor

alexggh commented Apr 29, 2024

The 1.2 runtime got deployed with the fix for this issue; closing it.

@alexggh alexggh closed this as completed Apr 29, 2024