growing block production delay on epoch switch (ep360-362) #4421
Comments
Looking at the v6 vs v7 blocks produced in the epochs: epoch 359 was roughly 60% v7 blocks, and there was no extraordinary delay going into epoch 360.
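(For context, a minimal sketch of how such a per-epoch version tally could be produced, assuming a CSV export of block headers with `epoch` and `proto_major` columns; the file name and column names are illustrative, not an existing tool or schema.)

```python
# Minimal sketch (assumptions): tally protocol major versions per epoch from a
# hypothetical CSV export of block headers ("blocks.csv" with columns
# "epoch" and "proto_major" -- names are illustrative only).
import csv
from collections import Counter, defaultdict

per_epoch = defaultdict(Counter)
with open("blocks.csv", newline="") as f:
    for row in csv.DictReader(f):
        per_epoch[int(row["epoch"])][row["proto_major"]] += 1

for epoch, versions in sorted(per_epoch.items()):
    total = sum(versions.values())
    shares = {v: f"{100 * n / total:.0f}%" for v, n in versions.items()}
    print(epoch, total, shares)
```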
It happens that the team are all together today. We've had a detailed look at the logs to understand whether this was due to a change (e.g. in software version) or an emergent interaction between the higher CPU load that occurs at the start of the epoch and the leadership schedule (basically a "height battle" between nodes that are under temporary processing strain). Our conclusion, from those logs and the other evidence, is that it was the latter:

- As can be seen from block 7728110, the system is diffusing and adopting blocks in its usual time range.
- The diffusion/adoption time of block 7728109 was around 40s, indicative of the extra CPU cost not just of performing the epoch boundary crossing but also of the slowed diffusion of such a block (blocks are only forwarded after having been adopted).
- There was more than one candidate for block 7728108; they were created 16s apart. Node logs indicate other blocks were being created (as would be expected) that were not forwarded because they were not better candidates for the chain extension by the normal resolution rules.

Yes, 12 minutes is long, but 5-6 min for the first block adopted in an epoch is not that unusual. We don't see this as a bug. We have already scheduled a meeting to review the situation at the next epoch boundary to double-check our conclusion here. It should be noted that this does not represent any risk to the integrity of chain growth.
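To make the timing argument concrete, here is a minimal sketch of the adoption-delay calculation described above, assuming per-block pairs of scheduled slot time and local adoption time pulled by hand from node logs. The timestamps below are illustrative placeholders consistent with the ~40s delay mentioned for block 7728109, not actual log values.

```python
# Minimal sketch (assumptions): compute per-block adoption delay from a small
# hand-collected table of (block_no, slot_time, adopted_time). The timestamps
# here are illustrative placeholders, NOT the real log entries.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

observations = [
    # (block number, scheduled slot time, time the local node adopted the block)
    (7728109, "2022-09-07T21:45:23Z", "2022-09-07T21:46:03Z"),  # illustrative ~40s delay
    (7728110, "2022-09-07T21:46:31Z", "2022-09-07T21:46:34Z"),  # illustrative "usual" delay
]

for block_no, slot_time, adopted_time in observations:
    delay = (parse(adopted_time) - parse(slot_time)).total_seconds()
    print(f"block {block_no}: adoption delay {delay:.1f}s")
```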
@njd42 If there's a follow-up meeting to discuss / review / confirm the above explanation, it's probably best to keep this open rather than close it?
@njd42 I expanded the details in the realtime view a bit for more context. Perhaps this will be useful for your follow-up meeting or for the next epoch.
Looking at the number of established TCP connections around that epoch switch, I would have expected some small decline, caused by the remote P2P logic dropping and re-establishing connections. If of interest, I can provide more detailed views.
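A minimal sketch of how such a connection count could be sampled around the epoch switch, assuming a Linux host with iproute2's `ss` available; the sampling interval and duration are arbitrary choices, not anything the node itself provides.

```python
# Minimal sketch (assumptions): sample the number of ESTABLISHED TCP sockets once
# per second around an epoch boundary, using iproute2's `ss` on a Linux host.
import subprocess
import time
from datetime import datetime, timezone

def established_count() -> int:
    out = subprocess.run(
        ["ss", "-nt", "state", "established"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # ss prints a column-header line first; each remaining line is one socket
    return max(len(out) - 1, 0)

for _ in range(600):  # sample for ~10 minutes spanning the boundary
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"{now} established={established_count()}", flush=True)
    time.sleep(1)
```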
AZUR3 produced the block at 2022-09-07T21:57:13.03Z
Logs do look a little strange for that AZUR3 block. The file attached above was grep-filtered for only the lines matching that block. This one is for the whole time period, which does seem unusually long, spanning over 200 lines of logs.
So you produced your block on time, but for whatever reason we had a rollback, and then your block was re-applied at 2022-09-07T21:57:53.15Z, which corresponds exactly to when everyone else applied your block as well. The "ExceededTimeLimit" and resulting "ErrorPolicySuspendConsumer" seem to cause problems.
This is still an issue today: an 8m27s epoch boundary pause.
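A minimal sketch of how such a pause can be measured, assuming a plain text file with one block adoption timestamp per line (collected from node logs or an explorer; `block_times.txt` is a hypothetical file name): the largest gap between consecutive timestamps is the boundary pause.

```python
# Minimal sketch (assumptions): given block timestamps collected around an epoch
# boundary (one ISO-8601 timestamp per line in a hypothetical "block_times.txt"),
# report the largest gap between consecutive blocks, e.g. the 8m27s pause above.
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.strip().replace("Z", "+00:00"))

with open("block_times.txt") as f:
    times = sorted(parse(line) for line in f if line.strip())

gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(f"largest inter-block gap: {max(gaps) / 60:.1f} min across {len(times)} blocks")
```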
This is a ledger task that we are working on. We've benchmarked some things (https://github.com/input-output-hk/cardano-ledger/pulls?q=is%3Apr+is%3Aclosed+author%3ATimSheard) and, thanks to Frisby, we have a great new plan: IntersectMBO/cardano-ledger#3141. The reason for closing this is that it doesn't pertain directly to the node code base and we're trying to clean up issues at the moment.
Internal/External
External.
Area
Other: Any other topic (Delegation, Ranking, ...).
Summary
During the last two epoch transitions (361, 362), the period without new blocks has increased significantly again.
Expected behavior
No extraordinarily long delay between new blocks.
System info (please complete the following information):
Seen on all kinds of systems and setups: relays, block producers, node versions 1.35.3 and 1.34.1.
Screenshots and attachments
ep359>360
ep360>361
ep361>362
Additional context
On all 3 epoch transitions, one relay node ran on 1.34.1; all other block-height lines come from 1.35.3 nodes.
The CPU graph below each epoch transition is from one relay node.