Prevent Overflow LRU Cache from Exploding #4801

Merged
merged 10 commits into from
Oct 11, 2023

ethDreamer
Member

Issue Addressed

In a nutshell, the problem with the OverflowLRUCache before this PR is that it caches an AvailabilityPendingExecutedBlock for each entry, which contains the entire BeaconState. If we end up with multiple forks and many unavailable blocks, this cache can have many entries, causing it to consume large amounts of memory.

I've addressed this by creating a small StateLRUCache which accepts the AvailabilityPendingExecutedBlock, moves the BeaconState to a very limited LRU cache, and returns a DietAvailabilityPendingExecutedBlock, which contains all the same data except that the BeaconState has been replaced with a root.

If many unavailable blocks are stored at the same time in the DataAvailabilityCache, the excess states will be dropped. If those states are needed later, they are recovered by loading the parent BeaconState from disk and replaying the block.
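
To make the mechanism above concrete, here is a minimal sketch using only the standard library. Every name in it (`State`, `Block`, `PendingBlock`, `DietPendingBlock`, `StateLruCache`, `state_root`, `apply_block`, and the `disk` map) is a hypothetical stand-in chosen for illustration, not the actual Lighthouse types or API, and the "root" is a toy hash rather than a real tree hash.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-ins for the real types; not the Lighthouse definitions.
type Root = u64;

#[derive(Clone, Debug, PartialEq)]
struct State {
    slot: u64,
    balance: u64,
}

#[derive(Clone, Debug)]
struct Block {
    slot: u64,
    parent_state_root: Root,
    deposit: u64,
}

/// Full entry: carries the entire post-state (the expensive part).
struct PendingBlock {
    block: Block,
    state: State,
}

/// "Diet" entry: same data, but the state is replaced by its root.
struct DietPendingBlock {
    block: Block,
    state_root: Root,
}

/// Toy root function (assumption: stands in for a real tree hash).
fn state_root(state: &State) -> Root {
    state.slot.wrapping_mul(1_000_003) ^ state.balance
}

/// Toy state transition (assumption: stands in for block replay).
fn apply_block(parent: &State, block: &Block) -> State {
    State { slot: block.slot, balance: parent.balance + block.deposit }
}

/// A very small LRU of states keyed by root.
struct StateLruCache {
    capacity: usize,
    states: HashMap<Root, State>,
    order: VecDeque<Root>, // front = least recently inserted
}

impl StateLruCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, states: HashMap::new(), order: VecDeque::new() }
    }

    /// Move the state into the bounded cache and hand back a diet block.
    fn put(&mut self, pending: PendingBlock) -> DietPendingBlock {
        let root = state_root(&pending.state);
        if self.states.insert(root, pending.state).is_none() {
            self.order.push_back(root);
        }
        // Drop the oldest states once the cache is over capacity.
        while self.order.len() > self.capacity {
            if let Some(evicted) = self.order.pop_front() {
                self.states.remove(&evicted);
            }
        }
        DietPendingBlock { block: pending.block, state_root: root }
    }

    /// Rebuild the full entry: use the cached state if it survived,
    /// otherwise load the parent state from `disk` and replay the block.
    fn recover(
        &mut self,
        diet: DietPendingBlock,
        disk: &HashMap<Root, State>,
    ) -> Option<PendingBlock> {
        let state = match self.states.remove(&diet.state_root) {
            Some(state) => {
                self.order.retain(|root| *root != diet.state_root);
                state
            }
            None => {
                let parent = disk.get(&diet.block.parent_state_root)?;
                apply_block(parent, &diet.block)
            }
        };
        Some(PendingBlock { block: diet.block, state })
    }
}
```

With a capacity of 1, storing two blocks on different forks keeps only the newer state in memory; recovering the older block falls back to the replay path and should yield the same state that was dropped.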

@ethDreamer ethDreamer changed the title Initial Commit of State LRU Cache Prevent Overflow LRU Cache from Exploding Oct 3, 2023
@ethDreamer ethDreamer added the ready-for-review The code is ready for review label Oct 3, 2023
Member

@michaelsproul michaelsproul left a comment

Nice! This is looking great on the whole. I just had a few minor perf tips and style tweaks to get in before we merge.

@michaelsproul michaelsproul added waiting-on-author The reviewer has suggested changes and awaits their implementation. optimization Something to make Lighthouse run more efficiently. and removed ready-for-review The code is ready for review labels Oct 4, 2023
bors bot pushed a commit that referenced this pull request Oct 5, 2023
## Issue Addressed

While reviewing #4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additional state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it.

## Proposed Changes

Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate.

## Additional Info

There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
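
As a rough illustration of the difference described above, the sketch below contrasts `take_while` with a peeking loop. To stay dependency-free it uses the standard library's `Peekable::next_if` in place of itertools' `peeking_take_while`; both avoid consuming the first element that fails the predicate. The helper names `lookup_take_while` and `lookup_peeking`, and the `(slot, root)` tuple layout, are assumptions made for this example and are not the block replayer's actual code.

```rust
use std::iter::Peekable;

/// Look up the root for `slot` using `take_while`.
/// `take_while` must pull the first failing element out of the iterator
/// to test it, and that element is then lost to later lookups.
fn lookup_take_while<I>(roots: &mut I, slot: u64) -> Option<u64>
where
    I: Iterator<Item = (u64, u64)>,
{
    roots
        .by_ref()
        .take_while(|(s, _)| *s <= slot)
        .find(|(s, _)| *s == slot)
        .map(|(_, root)| root)
}

/// Look up the root for `slot` while only peeking at the next element.
/// `next_if` leaves the element in place when the predicate fails.
fn lookup_peeking<I>(roots: &mut Peekable<I>, slot: u64) -> Option<u64>
where
    I: Iterator<Item = (u64, u64)>,
{
    while let Some((s, root)) = roots.next_if(|(s, _)| *s <= slot) {
        if s == slot {
            return Some(root);
        }
    }
    None
}
```

Given roots for slots 1, 3 and 4 (slot 2 is a gap), looking up slot 2 and then slot 3 misses both times with `lookup_take_while`, because the failed lookup for slot 2 consumes the entry for slot 3. With `lookup_peeking` the lookup for slot 2 still misses, but the entry for slot 3 stays in the iterator and the next lookup finds it.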
@ethDreamer ethDreamer added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Oct 6, 2023
.apply_blocks(vec![diet_executed_block.block.clone_as_blinded()], None)
.map(|block_replayer| block_replayer.into_state())
.and_then(|mut state| {
state
Member

If we're explicit about building all caches here, is it possible we're doing too much work? For example, if we're not verifying signatures, do we need the pubkey cache? And if we're not crossing an epoch boundary, do we need the next epoch's committee cache? I think all the caches should be built on demand, so maybe just removing the explicit building would be better. What do you think, @michaelsproul?

Member

Yeah, I guess in most cases we will need these caches when we process the next block, but in the case of a reorg or a block on a side chain, we may not. So it might be a bit more resilient if we don't build any caches here (less DoS risk).

In the past we've had some issues with caches not getting auto-built when they're required, but I think we're past that now, and hopefully Hydra helps flush out any cases we've missed.

Member Author

So I did some messing around and it looks like we only need to build the exit cache & the tree hash cache in order to have equality with the original state. Should I build those or build nothing?

Member

> I guess in most cases we will need these caches when we process the next block

Oh true. Well I guess the scenario this mechanism is designed for is when we have a bunch of heads. So if we don't get a next block on a head it'd still be wasted work, right? Also building caches on-demand as we get each head's "next block" might be better to spread out this work over a longer period of time.

> So I did some messing around and it looks like we only need to build the exit cache & the tree hash cache in order to have equality with the original state. Should I build those or build nothing?

Yeah, sure, building the ones we know we need makes sense to me.

Member Author

Okay, fixed.

@ethDreamer ethDreamer merged commit 8660043 into sigp:deneb-free-blobs Oct 11, 2023
21 of 24 checks passed
Labels
deneb optimization Something to make Lighthouse run more efficiently. ready-for-review The code is ready for review
3 participants