-
Notifications
You must be signed in to change notification settings - Fork 747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent Overflow LRU Cache from Exploding #4801
Prevent Overflow LRU Cache from Exploding #4801
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is looking great on the whole. I just had a few minor perf tips and style-tweaks to get in before we merge
beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
Outdated
Show resolved
Hide resolved
beacon_node/beacon_chain/src/data_availability_checker/state_lru_cache.rs
Outdated
Show resolved
Hide resolved
beacon_node/beacon_chain/src/data_availability_checker/state_lru_cache.rs
Outdated
Show resolved
Hide resolved
beacon_node/beacon_chain/src/data_availability_checker/state_lru_cache.rs
Outdated
Show resolved
Hide resolved
beacon_node/beacon_chain/src/data_availability_checker/state_lru_cache.rs
Outdated
Show resolved
Hide resolved
## Issue Addressed While reviewing #4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additonal state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it. ## Proposed Changes Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate. ## Additional Info There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
beacon_node/beacon_chain/src/data_availability_checker/state_lru_cache.rs
Outdated
Show resolved
Hide resolved
.apply_blocks(vec![diet_executed_block.block.clone_as_blinded()], None) | ||
.map(|block_replayer| block_replayer.into_state()) | ||
.and_then(|mut state| { | ||
state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're explicit about building all caches here, is it possible we're doing too much work? for example if we're not verifying signatures, do we need the pubkey cache? and if we're not crossing an epoch boundary do we need the next epoch's committee cache? I think all the caches should be built on demand, so maybe just removing the explicit building would be better. What do you think @michaelsproul
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess in most cases we will need these caches when we process the next block, but in the case of a reorg or a block on a side chain, we may not. So it might be a bit more resilient if we don't build any caches here (less DoS risk).
In the past we've had some issues with caches not getting auto-built when they're required, but I think we're past that now, and hopefully Hydra helps flush out any cases we've missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I did some messing around and it looks like we only need to build the exit cache & the tree hash cache in order to have equality with the original state. Should I build those or build nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in most cases we will need these caches when we process the next block
Oh true. Well I guess the scenario this mechanism is designed for is when we have a bunch of heads. So if we don't get a next block on a head it'd still be wasted work, right? Also building caches on-demand as we get each head's "next block" might be better to spread out this work over a longer period of time.
So I did some messing around and it looks like we only need to build the exit cache & the tree hash cache in order to have equality with the original state. Should I build those or build nothing?
Yea sure, building the ones we know we need makes sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay fixed
## Issue Addressed While reviewing sigp#4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additonal state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it. ## Proposed Changes Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate. ## Additional Info There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
## Issue Addressed While reviewing sigp#4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additonal state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it. ## Proposed Changes Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate. ## Additional Info There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
## Issue Addressed While reviewing sigp#4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additonal state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it. ## Proposed Changes Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate. ## Additional Info There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
Issue Addressed
In a nutshell, the problem with the
OverflowLRUCache
before this PR is that it caches anAvailabilityPendingExecutedBlock
for each entry, which contains the entireBeaconState
. If we end up with multiple forks and many unavailable blocks, this cache can have many entries, causing it to consume large amounts of memory.I've addressed this by creating a small
StateLRUCache
which will accept theAvailabilityPendingExecutedBlock
, move theBeaconState
to a very limited LRU cache and return aDietAvailablityPendingExecutedBlock
which contains all the same data except theBeaconState
has been replaced with a root.If many unavailable blocks are stored at the same time in the
DataAvailabiltyCache
, the excess states will be dropped. If those states are needed later they are recovered by loading the parentBeaconState
from disk and replaying the block.