Check da_checker before doing a block lookup request #5681

Merged
merged 3 commits into from
May 1, 2024

Conversation

@dapplion dapplion commented May 1, 2024

Issue Addressed

Testing of v5.2.0 shows many more rpc_block than usual. This is due to the following sequence of events:

  • Block is known to the da_checker but not yet imported into fork-choice
  • The event UnknownBlockHashFromAttestation is sent at least once every block (this is likely a bug and should be fixed separately)
  • A new lookup is created for the block; it downloads the block and sends it for processing

In current stable, the single lookup checks with the da_checker whether the block is already known before sending a request, but after

this check was dropped by mistake

Proposed Changes

  • Add a check in SyncNetworkContext::block_lookup_request and skip the download if the block is already known

However, now we need to handle the case where a lookup goes directly from created -> completed. This takes a bit of refactoring. To handle it gracefully and safely in all cases, I have introduced the LookupResult type with a must_use attribute.
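
As a rough illustration of the idea, here is a minimal sketch of a must_use result type for lookups. Only the name LookupResult comes from this PR; the variants, the Lookup struct, its fields, and the on_lookup_result helper are hypothetical and chosen for illustration, not taken from the Lighthouse code.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of driving a lookup forward one step.
/// `#[must_use]` makes the compiler warn when a caller drops the value
/// without inspecting it.
#[must_use = "the caller must check whether the lookup completed"]
#[derive(Debug, PartialEq, Eq)]
pub enum LookupResult {
    /// The lookup is still waiting on a download or on processing.
    Pending,
    /// The lookup finished and can be removed from the lookup map.
    Completed,
}

/// Hypothetical lookup state, reduced to what the sketch needs.
pub struct Lookup {
    block_already_known: bool,
    request_sent: bool,
}

impl Lookup {
    pub fn new(block_already_known: bool) -> Self {
        Self { block_already_known, request_sent: false }
    }

    /// Advance the lookup. A freshly created lookup whose block is already
    /// known goes straight to `Completed` without sending a request.
    pub fn continue_requests(&mut self) -> LookupResult {
        if self.block_already_known {
            return LookupResult::Completed;
        }
        self.request_sent = true;
        LookupResult::Pending
    }

    pub fn request_sent(&self) -> bool {
        self.request_sent
    }
}

/// Single place where every `LookupResult` is handled, so a completed
/// lookup is always dropped from the map.
pub fn on_lookup_result(lookups: &mut HashMap<u64, Lookup>, id: u64, result: LookupResult) {
    match result {
        LookupResult::Pending => {}
        LookupResult::Completed => {
            lookups.remove(&id);
        }
    }
}
```

Routing every result through one handler is one way to make the created -> completed transition safe: a caller that forgets to act on the result gets a compiler warning instead of silently leaking a finished lookup.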

@jimmygchen jimmygchen mentioned this pull request May 1, 2024
@jimmygchen jimmygchen added ready-for-review The code is ready for review v5.2.0 Q2 2024 labels May 1, 2024
jimmygchen added a commit that referenced this pull request May 1, 2024
Squashed commit of the following:

commit 4a23356
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 20:23:56 2024 +0900

    Ensure consistent handling of lookup result

commit 2a314ee
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Wed May 1 18:30:50 2024 +0900

    Check da_checker before doing a block lookup request
) -> Result<bool, &'static str> {
if self.chain.data_availability_checker.has_block(&block_root) {

Don't we want to use the reqresp_pre_import_cache? Otherwise we'll miss blocks whose execution we're still in the process of verifying.


the check in stable is essentially this because data_availability_checker.has_block() used to check the processing cache


lion agreed offline, so I've pushed the change here: 5db3e1b
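
For readers following the thread, the check agreed on here can be sketched roughly as below. This is a simplified, self-contained approximation: the Chain and SyncNetworkContext structs, the Hash256 alias, the sent_requests field and the cache value type are hypothetical stand-ins, and std::sync::RwLock is assumed in place of whatever lock type the real code uses.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Stand-in for a 32-byte block root (hypothetical alias).
type Hash256 = [u8; 32];

/// Hypothetical, heavily reduced stand-in for the chain handle.
pub struct Chain {
    /// Blocks received but not yet imported; the value type is a placeholder.
    pub reqresp_pre_import_cache: RwLock<HashMap<Hash256, ()>>,
}

pub struct SyncNetworkContext {
    pub chain: Chain,
    /// Roots for which a request was sent (kept for illustration only).
    pub sent_requests: Vec<Hash256>,
}

impl SyncNetworkContext {
    /// Returns `Ok(true)` if a request was sent and `Ok(false)` if the
    /// download was skipped because the block is already being processed.
    pub fn block_lookup_request(&mut self, block_root: Hash256) -> Result<bool, &'static str> {
        let already_known = self
            .chain
            .reqresp_pre_import_cache
            .read()
            .map_err(|_| "pre-import cache lock poisoned")?
            .contains_key(&block_root);
        if already_known {
            return Ok(false);
        }
        // In the real code a network request would be issued here.
        self.sent_requests.push(block_root);
        Ok(true)
    }
}
```

The boolean return value lets the caller distinguish "request sent" from "nothing to download", which is what allows a lookup to complete without ever issuing a request.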

@realbigsean realbigsean added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 1, 2024
@realbigsean

@mergify queue

mergify bot commented May 1, 2024

queue

The pull request has been merged automatically at 01e4e35

mergify bot added a commit that referenced this pull request May 1, 2024
@mergify mergify bot merged commit 01e4e35 into sigp:unstable May 1, 2024
27 checks passed
@dapplion dapplion deleted the lookup-sync-check-block-req branch May 1, 2024 14:54
ethDreamer added a commit that referenced this pull request May 2, 2024
* Fix execution integration tests (#5647)

* update waiting status

* revert to old nethermind version

* Add electra presets to beacon API (#5630)

* add presets to API

* add extra fields to config spec in beacon API

* remove unused

* add mainnet presets for gnosis and fix minimal preset default values

* Rename `Merge` to `Bellatrix` (#5601)

* Rename Merge to Bellatrix

* Remove tree-hash-cache which got readded from the rebase

* Deterministic block generation for tests (#5654)

* Deterministic block generation for tests

* Electra other containers (#5652)

* add new fields to execution payload and header

* beacon state changes

* partial beacon state

* safe arith in upgrade to electra

* initialize balances cache in interop genesis state

* Revert "initialize balances cache in interop genesis state"

This reverts commit c60b522.

* always initialize balances cache if necessary in electra upgrade

* build cache earlier

* fix block test

* per fork NUM_FIELDS_POW2

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into electra-other-containers

* fix lints

* get fields based on post state, as is spec'd

* fix type and move cache build

* Add more electra helpers (#5653)

* Add new helpers

* Fix some stuff

* Fix compilation errors

* lint

* Address review

* Ignore gossip blob already imported (#5656)

* Ignore gossip blob already imported

* Beta compiler fix (#5659)

* fix beta compiler compilation

* remove unused import

* Revert "remove unused import"

This reverts commit 0bef36b.

* Revert "fix beta compiler compilation"

This reverts commit 23152cf.

* rename ununsed fields

* allow dead code on some error variants

* remove unused blob download queue

* add back debug to backfill error

* more allow dead code on errors

* fix(validator_client): raise soft fd limit (#4796)

* fix(validator_client): raise soft fd limit

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit

* cargo lock

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit

* Proposer and attester slashing sse events (#5327)

* default vc to block v3 endpoint and deprecate block-v3 flag

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into unstable

* add proposer and attester event variants

* add TOOOs

* add tests, event triggers

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events

* revert

* revert

* remove double event tracking

* Merge branch 'unstable' into proposer-and-attester-slashing-sse-events

* remove todo, fix test

* resolve merge conflicts

* Merge branch 'proposer-and-attester-slashing-sse-events' of https://github.com/eserilev/lighthouse into proposer-and-attester-slashing-sse-events

* leftover debugging

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events

* pin macos release runner to `macos-13` (#5665)

* pin macos release runner to `macos-13`

* Update .github/workflows/release.yml

* Remove snapshot cache related code (#5661)

* Remove snapshot cache and other references.

* Fix default state cache size in docs

* Remove cache miss comment entirely

* Add state cache CLI tests

* Uncomment self_hosted_runner after PR Merge #5137 (#5291)

* Uncomment self_hosted_runner after PR Merge #5137

* Merge branch 'unstable' into fix_todo

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into fix_todo

* Only `portable` builds (binaries) (#5615)

* release workflow: portable builds by default

* Delete outdated comment

* Merge branch 'unstable' into portable-builds-binaries

# Conflicts:
#	.github/workflows/release.yml

* Merge parent and current sync lookups (#5655)

* Drop lookup type trait for a simple arg

* Drop reconstructed for processing

* Send parent blocks one by one

* Merge current and parent lookups

* Merge current and parent lookups clean up todos

* Merge current and parent lookups tests

* Merge remote-tracking branch 'origin/unstable' into sync-merged-lookup

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into sync-merged-lookup

* fix compile after merge

* #5655 pr review (#26)

* fix compile after merge

* remove todos, fix typos etc

* fix compile

* stable rng

* delete TODO and unfilled out test

* make download result a struct

* enums instead of bools as params

* fix comment

* Various fixes

* Track ignored child components

* Track dropped lookup reason as metric

* fix test

* add comment describing behavior of avail check error

*  update ordering

*  delete spammy log (#5672)

*  delete spammy log

* Ensure block only range requests don't fail on download (#5675)

* ensure pruned blobs don't fail on download

* Typo

* Improve ENR updates (#5483)

* Improve ENR updates

* forever fmt

* Appease my old friend clippy

* Merge network unstable

* Check da_checker before doing a block lookup request (#5681)

* Check da_checker before doing a block lookup request

* Ensure consistent handling of lookup result

* use req resp pre import cache rather than da checker

* Update Cargo.lock (#5670)

* update rust-yamux

* update Cargo.lock

* Merge branch 'unstable' of github.com:jxs/lighthouse into update-cargo

* Merge branch 'unstable' of github.com:sigp/lighthouse into update-cargo

* update to new libp2p versions

* Add metric for current epoch total balance (#5688)

* Add metric for current epoch total balance

---------

Co-authored-by: realbigsean <sean@sigmaprime.io>
Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
Co-authored-by: Roman Krasiuk <rokrassyuk@gmail.com>
Co-authored-by: Eitan Seri-Levi <eserilev@ucsc.edu>
Co-authored-by: antondlr <anton@delaruelle.net>
Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
Co-authored-by: Ærvin <arvin.morawej@gmail.com>
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: João Oliveira <hello@jxs.pt>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
ethDreamer pushed a commit to ethDreamer/lighthouse that referenced this pull request May 3, 2024
* Check da_checker before doing a block lookup request

* Ensure consistent handling of lookup result

* use req resp pre import cache rather than da checker
Comment on lines +309 to +316
    if self
        .chain
        .reqresp_pre_import_cache
        .read()
        .contains_key(&block_root)
    {
        return Ok(false);
    }

I think this is the cause of the recent de-sync issues.

Consider the following sequence of events:

  1. Block arrives on gossip, we add it to the reqresp_pre_import_cache.
  2. We send the block's payload to the EL for verification.
  3. We receive blobs on gossip and add them to the DA cache.
  4. The EL times out. Block processing is aborted as a result.
  5. Subsequent attempts to look up the block and re-process it are blocked by the entry in the reqresp_pre_import_cache.
  6. We stall and cannot reprocess the block until range sync kicks in and/or a blocks-by-range request is made despite the check above.

I've confirmed that this happened for block 0xfc4a1afe39d84a410c61f213cf15cb405fdd15b96ca54f0c826d4054671b1275 at slot 1574856 on Holesky. We get the block and all the blobs, then the EL times out and everything gets stuck:

May 04 05:31:20.803 DEBG Failed to verify execution payload error: ExecutionPayloadError(RequestFailed(EngineError(Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) })))


Actually scratch that, @jimmygchen pointed out that we remove from the reqresp_pre_import_cache on failure here:

    let has_missing_components =
        matches!(r, Ok(AvailabilityProcessingStatus::MissingComponents(_, _)));
    if !has_missing_components {
        self.reqresp_pre_import_cache.write().remove(block_root);
    }
    r
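
To make the effect of that cleanup concrete, here is a small self-contained sketch of the same logic. The Chain struct, the finish_processing name, the String error type, the Hash256 alias and the payload types of AvailabilityProcessingStatus are assumptions for illustration; only the matches! check and the cache removal mirror the snippet above.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Stand-in for a 32-byte block root (hypothetical alias).
type Hash256 = [u8; 32];

/// Reduced stand-in for the processing outcome; payload types are placeholders.
#[derive(Debug, PartialEq, Eq)]
pub enum AvailabilityProcessingStatus {
    /// The block is still waiting on components (placeholder slot, root).
    MissingComponents(u64, Hash256),
    /// The block was fully imported.
    Imported(Hash256),
}

/// Hypothetical, heavily reduced stand-in for the chain handle.
pub struct Chain {
    pub reqresp_pre_import_cache: RwLock<HashMap<Hash256, ()>>,
}

impl Chain {
    /// Pass the processing result through, dropping the pre-import cache
    /// entry unless the block is still waiting on missing components.
    /// Errors (such as an EL timeout) therefore also clear the entry.
    pub fn finish_processing(
        &self,
        block_root: &Hash256,
        r: Result<AvailabilityProcessingStatus, String>,
    ) -> Result<AvailabilityProcessingStatus, String> {
        let has_missing_components =
            matches!(r, Ok(AvailabilityProcessingStatus::MissingComponents(_, _)));
        if !has_missing_components {
            self.reqresp_pre_import_cache
                .write()
                .expect("pre-import cache lock poisoned")
                .remove(block_root);
        }
        r
    }
}
```

Under these assumptions, a failed execution check removes the entry, so a later lookup for the same root would not be skipped by the pre-import cache check.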

Labels
ready-for-merge This PR is ready to merge. v5.2.0 Q2 2024
4 participants