-
Notifications
You must be signed in to change notification settings - Fork 747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check da_checker before doing a block lookup request #5681
Conversation
Squashed commit of the following: commit 4a23356 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 20:23:56 2024 +0900 Ensure consistent handling of lookup result commit 2a314ee Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 18:30:50 2024 +0900 Check da_checker before doing a block lookup request
) -> Result<bool, &'static str> { | ||
if self.chain.data_availability_checker.has_block(&block_root) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we want to use the reqresp_pre_import_cache
? Otherwise we'll miss blocks that we're in the process of verifying execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the check in stable
is essentially this because data_availability_checker.has_block()
used to check the processing cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lion agreed offline, so I've pushed the change here: 5db3e1b
@mergify queue |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at 01e4e35 |
* Fix execution integration tests (#5647) * update waiting status * revert to old nethermind version * Add electra presets to beacon API (#5630) * add presets to API * add extra fields to config spec in beacon API * remove unused * add mainnet presets for gnosis and fix minimal preset default values * Rename `Merge` to `Bellatrix` (#5601) * Rename Merge to Bellatrix * Remove tree-hash-cache which got readded from the rebase * Deterministic block generation for tests (#5654) * Deterministic block generation for tests * Electra other containers (#5652) * add new fields to execution payload and header * beacon state changes * partial beacon state * safe arith in upgrade to electra * initialize balances cache in interop genesis state * Revert "initialize balances cache in interop genesis state" This reverts commit c60b522. * always initialize balances cache if necessary in electra upgrade * build cache earlier * fix block test * per fork NUM_FIELDS_POW2 * Merge branch 'unstable' of https://github.com/sigp/lighthouse into electra-other-containers * fix lints * get fields based on post state, as is spec'd * fix type and move cache build * Add more electra helpers (#5653) * Add new helpers * Fix some stuff * Fix compilation errors * lint * Address review * Ignore gossip blob already imported (#5656) * Ignore gossip blob already imported * Beta compiler fix (#5659) * fix beta compiler compilation * remove unused import * Revert "remove unused import" This reverts commit 0bef36b. * Revert "fix beta compiler compilation" This reverts commit 23152cf. * rename ununsed fields * allow dead code on some error variants * remove unused blob download queue * add back debug to backfill error * more allow dead code on errors * fix(validator_client): raise soft fd limit (#4796) * fix(validator_client): raise soft fd limit * Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit * cargo lock * Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit * Proposer and attester slashing sse events (#5327) * default vc to block v3 endpoint and deprecate block-v3 flag * Merge branch 'unstable' of https://github.com/sigp/lighthouse into unstable * add proposer and attester event variants * add TOOOs * add tests, event triggers * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * revert * revert * remove double event tracking * Merge branch 'unstable' into proposer-and-attester-slashing-sse-events * remove todo, fix test * resolve merge conflicts * Merge branch 'proposer-and-attester-slashing-sse-events' of https://github.com/eserilev/lighthouse into proposer-and-attester-slashing-sse-events * leftover debugging * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * pin macos release runner to `macos-13` (#5665) * pin macos release runner to `macos-13` * Update .github/workflows/release.yml * Remove snapshot cache related code (#5661) * Remove snapshot cache and other references. * Fix default state cache size in docs * Remove cache miss comment entirely * Add state cache CLI tests * Uncomment self_hosted_runner after PR Merge #5137 (#5291) * Uncomment self_hosted_runner after PR Merge #5137 * Merge branch 'unstable' into fix_todo * Merge branch 'unstable' of https://github.com/sigp/lighthouse into fix_todo * Only `portable` builds (binaries) (#5615) * release workflow: portable builds by default * Delete outdated comment * Merge branch 'unstable' into portable-builds-binaries # Conflicts: # .github/workflows/release.yml * Merge parent and current sync lookups (#5655) * Drop lookup type trait for a simple arg * Drop reconstructed for processing * Send parent blocks one by one * Merge current and parent lookups * Merge current and parent lookups clean up todos * Merge current and parent lookups tests * Merge remote-tracking branch 'origin/unstable' into sync-merged-lookup * Merge branch 'unstable' of https://github.com/sigp/lighthouse into sync-merged-lookup * fix compile after merge * #5655 pr review (#26) * fix compile after merge * remove todos, fix typos etc * fix compile * stable rng * delete TODO and unfilled out test * make download result a struct * enums instead of bools as params * fix comment * Various fixes * Track ignored child components * Track dropped lookup reason as metric * fix test * add comment describing behavior of avail check error * update ordering * delete spammy log (#5672) * delete spammy log * Ensure block only range requests don't fail on download (#5675) * ensure pruned blobs don't fail on download * Typo * Improve ENR updates (#5483) * Improve ENR updates * forever fmt * Appease my old friend clippy * Merge network unstable * Check da_checker before doing a block lookup request (#5681) * Check da_checker before doing a block lookup request * Ensure consistent handling of lookup result * use req resp pre import cache rather than da checker * Update Cargo.lock (#5670) * update rust-yamux * update Cargo.lock * Merge branch 'unstable' of github.com:jxs/lighthouse into update-cargo * Merge branch 'unstable' of github.com:sigp/lighthouse into update-cargo * update to new libp2p versions * Add metric for current epoch total balance (#5688) * Add metric for current epoch total balance --------- Co-authored-by: realbigsean <sean@sigmaprime.io> Co-authored-by: Mac L <mjladson@pm.me> Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com> Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com> Co-authored-by: Roman Krasiuk <rokrassyuk@gmail.com> Co-authored-by: Eitan Seri-Levi <eserilev@ucsc.edu> Co-authored-by: antondlr <anton@delaruelle.net> Co-authored-by: Jimmy Chen <jchen.tc@gmail.com> Co-authored-by: Ærvin <arvin.morawej@gmail.com> Co-authored-by: Age Manning <Age@AgeManning.com> Co-authored-by: João Oliveira <hello@jxs.pt> Co-authored-by: Michael Sproul <michael@sigmaprime.io>
* Check da_checker before doing a block lookup request * Ensure consistent handling of lookup result * use req resp pre import cache rather than da checker
if self | ||
.chain | ||
.reqresp_pre_import_cache | ||
.read() | ||
.contains_key(&block_root) | ||
{ | ||
return Ok(false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the cause of the recent de-sync issues.
Consider the following sequence of events:
- Block arrives on gossip, we add it to the
reqresp_pre_import_cache
. - We send the block's payload to the EL for verification.
- We receive blobs on gossip and add them to the DA cache.
- The EL times out. Block processing is aborted as a result.
- Subsequent attempts to look up the block and re-process it are blocked by the entry in the
reqresp_pre_import_cache
. - We stall and are unable to reprocess the block until range sync and/or a blocks by range request is made in spite of the check above.
I've confirmed that this happened for block 0xfc4a1afe39d84a410c61f213cf15cb405fdd15b96ca54f0c826d4054671b1275 at slot 1574856 on Holesky. We get the block and all the blobs, then the EL times out and everything gets stuck:
May 04 05:31:20.803 DEBG Failed to verify execution payload error: ExecutionPayloadError(RequestFailed(EngineError(Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) })))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually scratch that, @jimmygchen pointed out that we remove from the reqresp_pre_import_cache
on failure here:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 2945 to 2950 in 63fad7e
let has_missing_components = | |
matches!(r, Ok(AvailabilityProcessingStatus::MissingComponents(_, _))); | |
if !has_missing_components { | |
self.reqresp_pre_import_cache.write().remove(block_root); | |
} | |
r |
Issue Addressed
Testing of v5.2.0 shows many more
rpc_block
than usual. This is due this sequence of events:UnknownBlockHashFromAttestation
is sent at least once every block (this is likely a bug and should be fixed separately)In current stable the single lookup checks against the da_checker if the block is known before sending a request, but after
this check was dropped by mistake
Proposed Changes
SyncNetworkContext::block_lookup_request
and skip the download if the block is already knownHowever, now we need to handle the case where a lookup goes directly from created -> completed. This takes a bit of refactoring, and to handle this gracefully and safely for all cases I have introduced the
LookupResult
with amust_use
directive.