growing block production delay on epoch switch (ep360-362) #4421
Comments
Looking at the v6 vs v7 blocks produced in the epochs: epoch 359 was roughly 60% v7 blocks, and there was no extraordinary delay going into epoch 360.
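(For context, a minimal sketch of how such a per-epoch version tally could be produced, assuming a CSV export of block headers with `epoch` and `proto_major` columns; the file name and column names are illustrative, not an existing tool or schema.)

```python
# Minimal sketch (assumptions): tally protocol major versions per epoch from a
# hypothetical CSV export of block headers ("blocks.csv" with columns
# "epoch" and "proto_major" -- names are illustrative only).
import csv
from collections import Counter, defaultdict

per_epoch = defaultdict(Counter)
with open("blocks.csv", newline="") as f:
    for row in csv.DictReader(f):
        per_epoch[int(row["epoch"])][row["proto_major"]] += 1

for epoch, versions in sorted(per_epoch.items()):
    total = sum(versions.values())
    shares = {v: f"{100 * n / total:.0f}%" for v, n in versions.items()}
    print(epoch, total, shares)
```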
It happens that the team are all together today. We've had a detailed look at the logs to understand whether this was due to a change (e.g. in software version) or an emergent interaction between the higher CPU load that occurs at the start of the epoch and the leadership schedule (basically a "height battle" between nodes that are under temporary processing strain). Our conclusion, from those logs and the other evidence, is that it was the latter:

- As can be seen from block 7728110, the system is diffusing and adopting blocks in its usual time range.
- The diffusion/adoption time of block 7728109 was around 40s, indicative of the extra CPU cost not just of performing the epoch boundary crossing but also of the slowed diffusion of such a block (blocks are only forwarded after having been adopted).
- There was more than one candidate for block 7728108; they were created 16s apart. Node logs indicate other blocks were being created (as would be expected) that were not forwarded because they were not better candidates for the chain extension by the normal resolution rules.

Yes, 12 minutes is long, but 5-6 min for the first block adopted in an epoch is not that unusual. We don't see this as a bug. We have already scheduled a meeting to review the situation at the next epoch boundary to double-check our conclusion here. It should be noted that this does not represent any risk to the integrity of chain growth.
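To make the timing argument concrete, here is a minimal sketch of the adoption-delay calculation described above, assuming per-block pairs of scheduled slot time and local adoption time pulled by hand from node logs. The timestamps below are illustrative placeholders consistent with the ~40s delay mentioned for block 7728109, not actual log values.

```python
# Minimal sketch (assumptions): compute per-block adoption delay from a small
# hand-collected table of (block_no, slot_time, adopted_time). The timestamps
# here are illustrative placeholders, NOT the real log entries.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

observations = [
    # (block number, scheduled slot time, time the local node adopted the block)
    (7728109, "2022-09-07T21:45:23Z", "2022-09-07T21:46:03Z"),  # illustrative ~40s delay
    (7728110, "2022-09-07T21:46:31Z", "2022-09-07T21:46:34Z"),  # illustrative "usual" delay
]

for block_no, slot_time, adopted_time in observations:
    delay = (parse(adopted_time) - parse(slot_time)).total_seconds()
    print(f"block {block_no}: adoption delay {delay:.1f}s")
```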
@njd42 If there's a follow-up meeting to discuss / review / confirm the above explanation, it's probably best to keep this open rather than close it?
@njd42 I expanded the details in the realtime view a bit for more context. Perhaps this will be useful for your follow-up meeting or for the next epoch.
Looking at the number of established TCP connections around that epoch switch, I would have expected some small decline, caused by the remote P2P logic dropping and re-establishing connections. If of interest, I can provide more detailed views.
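A minimal sketch of how such a connection count could be sampled around the epoch switch, assuming a Linux host with iproute2's `ss` available; the sampling interval and duration are arbitrary choices, not anything the node itself provides.

```python
# Minimal sketch (assumptions): sample the number of ESTABLISHED TCP sockets once
# per second around an epoch boundary, using iproute2's `ss` on a Linux host.
import subprocess
import time
from datetime import datetime, timezone

def established_count() -> int:
    out = subprocess.run(
        ["ss", "-nt", "state", "established"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # ss prints a column-header line first; each remaining line is one socket
    return max(len(out) - 1, 0)

for _ in range(600):  # sample for ~10 minutes spanning the boundary
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"{now} established={established_count()}", flush=True)
    time.sleep(1)
```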
AZUR3 produced the block at 2022-09-07T21:57:13.03Z
Logs do look a little strange for that AZUR3 block. The file attached above was grep-filtered for only the lines matching that block. This one is for the whole time period, which does seem unusually long, spanning over 200 lines of logs.
So you produced your block on time, but for whatever reason we had a rollback, and then your block was re-applied at 2022-09-07T21:57:53.15Z, which corresponds exactly to when everyone else applied your block as well. The "ExceededTimeLimit" and resulting "ErrorPolicySuspendConsumer" seem to cause problems.
This is still an issue today: an 8m27s epoch boundary pause.
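A minimal sketch of how such a pause can be measured, assuming a plain text file with one block adoption timestamp per line (collected from node logs or an explorer; `block_times.txt` is a hypothetical file name): the largest gap between consecutive timestamps is the boundary pause.

```python
# Minimal sketch (assumptions): given block timestamps collected around an epoch
# boundary (one ISO-8601 timestamp per line in a hypothetical "block_times.txt"),
# report the largest gap between consecutive blocks, e.g. the 8m27s pause above.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.strip().replace("Z", "+00:00"))

with open("block_times.txt") as f:
    times = sorted(parse(line) for line in f if line.strip())

gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(f"largest inter-block gap: {max(gaps) / 60:.1f} min across {len(times)} blocks")
```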
This is a ledger task that we are working on. We've benchmarked some things (https://github.com/input-output-hk/cardano-ledger/pulls?q=is%3Apr+is%3Aclosed+author%3ATimSheard) and, thanks to Frisby, we have a great new plan: IntersectMBO/cardano-ledger#3141. The reason for closing this is that it doesn't pertain directly to the node code base and we're trying to clean up issues at the moment.
Internal/External
External.
Area
Other: Any other topic (Delegation, Ranking, ...).
Summary
During the last two epoch transitions (361, 362), the period without new blocks has increased significantly again.
Expected behavior
No extraordinarily long delay between new blocks.
System info (please complete the following information):
Seen on all kinds of systems and setups: relays, block producers, node versions 1.35.3 and 1.34.1.
Screenshots and attachments
ep359>360
ep360>361
ep361>362
Additional context
On all 3 epoch transitions, one relay node ran on 1.34.1; all other block-height lines come from 1.35.3 nodes.
The CPU graph below each epoch transition is from one relay node.