growing block production delay on epoch switch (ep360-362) #4421

Closed
gufmar opened this issue Sep 8, 2022 · 14 comments
Labels
status: needs more info
type: bug
user type: external

Comments

@gufmar
Contributor

gufmar commented Sep 8, 2022

Internal/External
External

Area
Other (any other topic: Delegation, Ranking, ...)

Summary
During the last two epoch transitions (361, 362), the period without new blocks has increased significantly again.

Expected behavior
No extraordinarily long delay between new blocks.

System info (please complete the following information):
Seen on all kinds of systems and setups: relays, block producers, node versions 1.35.3 and 1.34.1.

Screenshots and attachments

ep359>360
2022-09-08_09h02_40
2022-09-08_09h17_49

ep360>361
2022-09-08_09h04_15
2022-09-08_09h18_28

ep361>362
2022-09-08_09h05_05
2022-09-08_09h18_57

Additional context
On all 3 epoch transitions, one relay node ran 1.34.1 and all other block-height lines come from 1.35.3 nodes.
The CPU graph below each epoch transition is from one relay node.

@gufmar gufmar added the bug label Sep 8, 2022
@zandeck

zandeck commented Sep 8, 2022

You can see it on the explorer as well

image

image

@gufmar gufmar changed the title from "[BUG] -" to "[BUG] - growing block production delay on epoch switch (ep360-362)" Sep 8, 2022
@gufmar
Contributor Author

gufmar commented Sep 8, 2022

Looking at v6 vs. v7 blocks produced in the epochs, we see:

ep359: 60% v7 blocks, no extraordinary delay towards ep360
ep360: 88% v7 blocks, 5 minutes of delay towards ep361
ep361: 94% v7 blocks, 12 minutes of delay towards ep362
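
For illustration, a minimal sketch of how such per-epoch v7 shares can be computed. The (epoch, major protocol version) pairs below are placeholders, not real chain data; in practice they would come from db-sync or an explorer API (not shown here).

```python
# Placeholder sketch: compute the share of v7 blocks per epoch from
# (epoch, major protocol version) pairs. The pairs below are made up;
# real data would come from db-sync or an explorer API (not shown).
from collections import Counter, defaultdict

blocks = [
    (359, 6), (359, 7), (359, 7),
    (360, 6), (360, 7), (360, 7),
]

per_epoch = defaultdict(Counter)
for epoch, major_version in blocks:
    per_epoch[epoch][major_version] += 1

for epoch in sorted(per_epoch):
    counts = per_epoch[epoch]
    total = sum(counts.values())
    print(f"ep{epoch}: {100 * counts[7] / total:.0f}% v7 blocks")
```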

@njd42

njd42 commented Sep 8, 2022

It happens that the team are all together today.

We've had a detailed look at the logs so as to understand whether this was due to a change (e.g. due to software version) or an emergent interaction between the higher CPU load that occurs at the start of the epoch and the leadership schedule (basically a "height battle" between nodes that are under temporary processing strain).

Our conclusion, from those logs (and the other evidence), is that it was the latter.

Epoch 360/361 boundary

As can be seen with block 7728110, the system is diffusing and adopting blocks in its usual time range. The diffusion/adoption time of block 7728109 was around 40s, indicative of the extra CPU cost not just of performing the epoch boundary crossing, but also that the diffusion of such a block was slowed (blocks are only forwarded after having been adopted).

There was more than one candidate for block 7728108; they were created 16s apart. Node logs indicate other blocks were being created (as would be expected) that were not forwarded, as they were not better "candidates" for the chain extension (by the normal resolution rules).

Yes, 12 minutes is long, but 5-6 min for the first block adopted in an epoch is not that unusual. We don't see this as a bug.

We have already scheduled a meeting to review the situation at the next epoch boundary to double check our conclusion here.

It should be noted that this does not represent any risk to the integrity of chain growth.
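
For reference, a minimal sketch of the kind of diffusion/adoption measurement the 40s figure above refers to: the delay between a block's slot wall-clock time and the time each observing node adopted it, with the median taken across observers. The timestamps below are made up for illustration, not taken from the actual logs.

```python
# Hypothetical numbers, not taken from the real logs: delay between a block's
# slot wall-clock time and each observer's adoption time, plus the median.
from datetime import datetime, timezone
from statistics import median

slot_time = datetime(2022, 9, 7, 21, 45, 30, tzinfo=timezone.utc)
adoption_times = [
    datetime(2022, 9, 7, 21, 46, 5, tzinfo=timezone.utc),
    datetime(2022, 9, 7, 21, 46, 12, tzinfo=timezone.utc),
    datetime(2022, 9, 7, 21, 46, 14, tzinfo=timezone.utc),
]

delays = [(t - slot_time).total_seconds() for t in adoption_times]
print(f"median diffusion/adoption time: {median(delays):.0f}s")  # 42s with these made-up values
```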

@njd42 njd42 closed this as completed Sep 8, 2022
@njd42 njd42 removed the bug label Sep 8, 2022
@njd42 njd42 changed the title from "[BUG] - growing block production delay on epoch switch (ep360-362)" to "growing block production delay on epoch switch (ep360-362)" Sep 8, 2022
@rdlrt

rdlrt commented Sep 8, 2022

@njd42 If there's a follow-up meeting to discuss / review / confirm the above explanation, probably best to keep this open and not close it?

@papacarp

papacarp commented Sep 8, 2022

@njd42 I expanded the details in realtime view a bit for more context. Perhaps this will be useful for your follow-up meeting or for next epoch.
image

  1. It's important to note that in the height battle, the CFLOW block was only reported by 1 node, and that node was actually CFLOW itself. So it wasn't really a height battle; the CFLOW block never made it anywhere.
  2. The BASHO block was made at slot 674, but the first report of it was 29 seconds later, with a median prop time of 52 seconds. The full story is in the prop histogram for that block:
    image
    So under the "heavy load" hypothesis, this may demonstrate when the nodes were coming online after the epoch switch and starting to process blocks again. Note that this means epoch processing takes at minimum 674+29=703 seconds, with the bulk of the nodes done by 674+100=774 seconds after the epoch switch. So a little over 1 minute of variability between nodes, and over 11 minutes on average (see the arithmetic sketch after this list).
  3. The follow-on AZUR3 block is in slot 742, but had a very interesting prop delay chart, with almost all nodes reporting it at the same time, 742+40=782 seconds after the epoch switch:
    image
    I guess the hypothesis here is that AZUR3 was also finishing epoch processing and produced the block late, and it then got distributed quickly since everyone was ready. Might be good to have a look at AZUR3's logs to see what time they actually produced their block.
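
For illustration, a short sketch of the slot-offset arithmetic above, assuming the ep361>ep362 boundary fell at 2022-09-07 21:44:51 UTC (which matches the 21:45 UTC spike mentioned below), 1-second mainnet slots, and the slot numbers read as offsets into the new epoch:

```python
# Assumption: the ep361>ep362 boundary fell at 2022-09-07 21:44:51 UTC and
# mainnet slots are 1 second, so a slot number within the epoch is a plain
# second offset from the boundary.
from datetime import datetime, timedelta, timezone

EPOCH_362_START = datetime(2022, 9, 7, 21, 44, 51, tzinfo=timezone.utc)

def slot_offset_to_utc(slot_in_epoch: int, extra_seconds: int = 0) -> datetime:
    return EPOCH_362_START + timedelta(seconds=slot_in_epoch + extra_seconds)

# BASHO block, slot 674: first report 29 s later, bulk of nodes ~100 s later
print(slot_offset_to_utc(674, 29))   # 703 s after the boundary
print(slot_offset_to_utc(674, 100))  # 774 s after the boundary

# AZUR3 block, slot 742: reported by almost everyone ~40 s later
print(slot_offset_to_utc(742))       # 2022-09-07 21:57:13 UTC
print(slot_offset_to_utc(742, 40))   # 2022-09-07 21:57:53 UTC, 782 s after the boundary
```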

@gufmar
Contributor Author

gufmar commented Sep 8, 2022

Looking at the number of established TCP connections at that epoch-switch time, I would have expected a small decline, caused by the remote p2p logic dropping and re-establishing connections.
Instead I noticed a significant spike in TCP connections on all my public relays, starting at 21:45 UTC.

Screenshot_2022-09-08-16-00-01-79_e589e2e5eef7246e2b9e3e77d02e4167

If of interest, I can provide more detailed views.
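
For context, a minimal sketch of how the number of established TCP connections on the node's port could be sampled over time to spot such a spike. This is not the monitoring setup used for the graph above; the port 3001 and the use of psutil are assumptions.

```python
# Sketch: sample the number of established TCP connections on the node's port
# once a minute. Port 3001 and the psutil approach are assumptions, not the
# setup used for the graph above. May require elevated privileges on some OSes.
import time
from datetime import datetime, timezone

import psutil  # third-party: pip install psutil

NODE_PORT = 3001  # assumed relay listening port

def established_connections(port: int) -> int:
    return sum(
        1
        for c in psutil.net_connections(kind="tcp")
        if c.status == psutil.CONN_ESTABLISHED and c.laddr and c.laddr.port == port
    )

while True:
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"{now} established={established_connections(NODE_PORT)}")
    time.sleep(60)
```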

@CryptoBlocks-pro


AZUR3 produced the block at 2022-09-07T21:57:13.03Z
block71021542.txt

@CryptoBlocks-pro

CryptoBlocks-pro commented Sep 8, 2022

The logs do look a little strange for that AZUR3 block. The file attached above was grep-filtered to only lines mentioning that block. This one covers the whole time period, which does seem unusually long, spanning over 200 lines of logs.
block71021542-full-logs.txt

@papacarp

papacarp commented Sep 8, 2022

AZUR3 produced the block at 2022-09-07T21:57:13.03Z block71021542.txt

So you produced your block on time, but for whatever reason we had a rollback, and then your block was re-applied at 2022-09-07T21:57:53.15Z, which corresponds exactly to when everyone else applied your block as well. The "ExceededTimeLimit" and resulting "ErrorPolicySuspendConsumer" seem to cause problems.

@Straightpool

Straightpool commented Sep 8, 2022

I agree. One of my relays even "blacked out" for a minute:

image

While this relay was incapable of reporting cnode metrics, CPU was at max:
image

A different view:
image

For comparison, this is two epochs prior, on the 28th of August, also on 1.35.3:
image

According to the log files the relay never crashed, but the log is completely saturated with messages like these for pretty much the duration of the blackout:

{"app":[],"at":"2022-09-07T21:59:15.26Z","data":{"address":"","event":"ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) "....
{"app":[],"at":"2022-09-07T21:58:57.97Z","data":{"kind":"PeersFetch","peers":[{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr"...
{"app":[],"at":"2022-09-07T21:58:57.05Z","data":{"domain":"""","event":"Failed to start all required subscriptions","kind":"SubscriptionTrace"},"env":"1.35.3:950c4","....
{"app":[],"at":"2022-09-07T21:58:50.77Z","data":{"address":"","event":"ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (Handshake) (ServerAgency TokConfirm))) 20s","kind":"ErrorPolicyTrace"},"env":"1.35...

@StakeWithPride

This is still an issue today. 8m27s epoch boundary pause.

@3DMintLab

I had a lot of DNSErrors in my relay log during the switch to 363. Could this be related to the long pause?
[2022-09-11 20:47:27.71 UTC] Domain: "relays-new.cardano-mainnet.iohk.io" Application Exception: 18.133.43.175:3001 ExceededTimeLimit

@disassembler
Contributor

This is a ledger task that we are working on. We've benchmarked some things (https://github.com/input-output-hk/cardano-ledger/pulls?q=is%3Apr+is%3Aclosed+author%3ATimSheard) and, thanks to Frisby, have a great new plan: IntersectMBO/cardano-ledger#3141.

The reason for closing this was that it doesn't pertain directly to the node code base, and we're trying to clean up issues at the moment.
