Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VC: Hardening and optimizing time handling. #4743

Merged
merged 8 commits into from
Apr 17, 2023
Merged

Conversation

cheatfate
Copy link
Contributor

No description provided.

@github-actions
Copy link

github-actions bot commented Mar 17, 2023

Unit Test Results

         9 files  ±0    1 074 suites  ±0   30m 18s ⏱️ - 1m 8s
  3 675 tests ±0    3 396 ✔️ ±0  279 💤 ±0  0 ±0 
15 668 runs  ±0  15 363 ✔️ ±0  305 💤 ±0  0 ±0 

Results for commit 447332f. ± Comparison against base commit 29b431e.

♻️ This comment has been updated with latest results.

@@ -69,19 +73,39 @@ proc fromNow*(c: BeaconClock, t: BeaconTime): tuple[inFuture: bool, offset: Dura
proc fromNow*(c: BeaconClock, slot: Slot): tuple[inFuture: bool, offset: Duration] =
c.fromNow(slot.start_beacon_time())

proc toDuration*(c: BeaconTime): Duration =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toDuration doesn't really make sense - BeaconTime is a point in time, you cannot logically convert it to a duration without a reference point - we don't really need toDuration since we always have a reference point available when working with time (most of the time, the slot but if need be, we can introduce GENESIS_BEACON_TIME just like we have GENESIS_SLOT)

Copy link
Contributor Author

@cheatfate cheatfate Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think that BeaconTime is a point at time, but it is not, because BeaconTime is a point in time with adjusted start. So when you convert it to Duration it means that you want to calculate distance to the start.
This procedure is used in this scenario:

info "Initializing beacon clock",
     genesis_time = vc.beaconGenesis.genesis_time,
     current_slot = "<n/a>", current_epoch = "<n/a>",
     time_to_genesis = currentTime.toDuration()

notice "Waiting for genesis",
       genesis_time = vc.beaconGenesis.genesis_time,
       time_to_genesis = currentTime.toDuration()

So do you want me to but this calculation everywhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, what this function returns is the "time-since-genesis-or-zero" - it's not a conversion (as toDuration implies in its name) which is not meaningful, but rather a time one-sided time-arithmetic-function - its correct signature would be fromGenesis(t: BeaconTime): tuple[future: bool, dur: Duration] - then the caller can choose whether to saturate it or not, but at least the name correctly represents what the function does.

Throwing away the sign here will lead to misuse because it relies on the caller to check "future-or-not" before calling this function which is easy to forget - it's better to return the sign and have the caller decide about discarding it.

chronos.nanoseconds(int64(c.ns_since_genesis))
else:
chronos.nanoseconds(int64(-c.ns_since_genesis))

proc durationToNextSlot*(c: BeaconClock): Duration =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would remove durationToNextSlot and fromNow - both of these are compound operations that mix "current time" from the global clock with "time arithmetic" which leads to problems and confusion in code that uses them, since getting the "current time" generally should only be done in a controlled way:

  • once per "block" of arithmetic (ie we should not get the current time twice when performing arithmetic because there's an inherent race condition when the clock between the calls)
  • not reused between await boundaries

therefore:

  • remove fromNow, durationToNext* etc
  • use now consistently when the current time is needed as the only function that accesses the global clock
  • introduce a time difference operation like durationFrom(a, b: BeaconTime): tuple[future: bool, dur: Duration] as is done in fromNow
  • replace all code in both bn and vc to use the above functions
    • fromNow becomes slot.start_beacon_time().durationFrom(clock.now())
    • durationToNextSlot becomes let now = clock.now(); saturate((now.slotOrZero() + 1).start_beacon_time().durationFrom(now)) more or less (slotOrZero is incorrect for pre-genesis - it needs to return duration to genesis for any time before genesis)
    • the above can be added as helpers that take a BeaconTime and do the saturation if need be, but these helpers should never access the clock itself
    • more durationFrom can be added for slot vs time etc to make it less verbose

durationFrom can also be durationTo or some other name or shape - doesn't matter which direction it goes in, but since Duration is undefined for negative values, we need to return a tuple - we also sometimes need to know that genesis hasn't happened yet, so we don't always want to saturate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed solution do not support negative Slot values which is happened when you are running before genesis. So it become impossible to provide such UX experience like it was done now. Because its impossible to supply negative Slot values it is very hard to find proper argument value for durationFrom/durationTo. Its why durationToNextSlot looks the most appropriate name for the procedure which is able to calculate duration to next slot which is either before or after genesis.

currentSlot = currentTime.slotOrZero()
currentEpoch = currentSlot.epoch()

if currentTime.ns_since_genesis >= 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the above removal of fromNow, this code can still use inFuture etc by calling durationFrom

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code do use only now() as you asked to do, nothing else. So why should i apply any arithmetic in here if i just trying to obtain current time and trying to understand is this time before genesis or after genesis? Why should i call to arithmetic operation?

Copy link
Member

@arnetheduck arnetheduck Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've generally avoided the use of specific time granularity in the code (ns_since_genesis exposes "nanoseconds") - ie we've strived to not expose the granularity of the clock in code that does time computations, instead using "abstract" types which keep this information as an internal implementation detail.

In fact, I'm not sure why ns_since_genesis is public at all - I would generally not have considered it something that callers should use / rely on.

That said, iff we want to use ns_since_genesis, it's important to verify that we use it consistently everywhere across the codebase (both bn and vc) - using "slightly different" ways of dealing with time in different parts of the codebase becomes technical debt.

let
currentTime = vc.beaconClock.now()
currentSlot = currentTime.slotOrZero()
chronosOffset = chronos.nanoseconds(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto - durationFrom gives inFuture so there is no need to revert to nanoseconds or other internal implementation details here

Copy link
Contributor Author

@cheatfate cheatfate Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chronosOffset do not related to any BeaconClock arithmetic, it just converts TimeDiff object. Otherwise it will be 2 calls to durationTo/durationFrom.

…fter Genesis, but also before Genesis.

Change VC pre-genesis behavior, add runPreGenesisWaitingLoop() and runGenesisWaitingLoop().
Add checkedWaitForSlot() and checkedWaitForNextSlot() to strictly check current time and print warnings.
Fix VC main loop to use checkedWaitForNextSlot().
Fix attestation_service to run attestations processing only until the end of the duty slot.
Change attestation_service main loop to use checkedWaitForNextSlot().
Change block_service to properly cancel all the pending proposer tasks.
Use checkedWaitForSlot to wait for block proposal.
Fix block_service waitForBlockPublished() to be compatible with BN.
Fix sync_committee_service to avoid asyncSpawn.
Fix sync_committee_service to run only until the end of the duty slot.
Fix sync_committee_service to use checkedWaitForNextSlot().
Fix aggregated attestation publishing missing delay.
Fix fallback service sync status spam.
Fix false `sync committee subnets subscription error`.

var currentSlot: Opt[Slot]
while true:
# This loop could look much more nicer/better, when
# https://github.com/nim-lang/Nim/issues/19911 will be fixed, so it could
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this bug is fixed now btw

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but i have never seen extended test suite about loops and exceptions in Nim language, so i wish to be safe here.

elif wallSlot > destinationSlot + SLOTS_PER_EPOCH:
if showLogs:
warn "Time moved forwards by more than an epoch, skipping ahead"
return Opt.some(wallSlot - SLOTS_PER_EPOCH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this return an "old" slot? the beacon node does this because it "replays" attestations for the old slots but I don't think the VC does? It's also inconsistent with the elif wallSlot > destination return

@arnetheduck arnetheduck enabled auto-merge (squash) April 17, 2023 19:00
@arnetheduck arnetheduck merged commit b511521 into unstable Apr 17, 2023
@arnetheduck arnetheduck deleted the vc-op-timeouts branch April 17, 2023 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants