Cleanup PVF artifact by cache limit and stale time #4662
Conversation
pub enum CleanupBy {
    // Inactive time after which artefact is deleted
    Time(Duration),
    // Max size in bytes. Reaching it older artefacts are deleted
The comment seems wrong; we delete the least recently used ones. At least to me, "older" usually means older in relation to creation time.
Fixed
for (k, v) in self.inner.iter() {
    if let ArtifactState::Prepared { ref path, last_time_needed, .. } = *v {
        if let Ok(metadata) = fs::metadata(path) {
What bothers me here is that we're running a (possibly large) number of synchronous blocking filesystem operations in `Artifacts::prune()`, which is called from the PVF host's main loop running on a non-blocking threadpool. In case of I/O problems the whole PVF host will stall. I think we should either make use of `tokio::fs::metadata()`, or, even better, store the artifact's size as a property of the artifact itself along with the other data, and then no filesystem access is required inside `prune()` at all.
Yeah, added the size to the artifact's state.
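For illustration only, a minimal sketch of what carrying the size in the artifact state could look like; the field layout here is an assumption, not copied from the diff:

```rust
use std::{path::PathBuf, time::SystemTime};

// Sketch only: the real `ArtifactState` has more variants and fields. The
// point is that `size` is recorded once, at preparation time, so `prune()`
// never has to call `fs::metadata()` on the host's main loop.
pub enum ArtifactState {
    Prepared {
        /// Path of the compiled artifact on disk.
        path: PathBuf,
        /// Last time the artifact was needed for execution.
        last_time_needed: SystemTime,
        /// On-disk size in bytes, captured when preparation finished.
        size: u64,
    },
}
```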
// Sort most recently used first, so that `pop()` removes the least
// recently used artifact.
artifact_sizes.sort_by_key(|&(_, _, _, last_time_needed)| std::cmp::Reverse(last_time_needed));

while total_size > *size_limit {
    let Some((artifact_id, path, size, _)) = artifact_sizes.pop() else { break };
    to_remove.push((artifact_id, path));
    total_size -= size;
}
A unit test to check the correctness of this behavior would definitely be appreciated :)
Added (and it saved me because the implementation was not correct)
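To show the shape of such a test, here is a self-contained sketch. The `prune_by_size` function below is a local stand-in for the PR's pruning logic, not the actual code, and the helper and test names are made up:

```rust
use std::time::{Duration, SystemTime};

/// Stand-in for the pruning logic under test: given (id, size,
/// last_time_needed) entries and a byte limit, return the ids to evict,
/// least recently used first, until the total size fits under the limit.
fn prune_by_size(mut entries: Vec<(u32, u64, SystemTime)>, limit: u64) -> Vec<u32> {
    // Most recently used first, so `pop()` yields the LRU entry.
    entries.sort_by_key(|&(_, _, t)| std::cmp::Reverse(t));
    let mut total: u64 = entries.iter().map(|&(_, size, _)| size).sum();
    let mut evicted = Vec::new();
    while total > limit {
        let Some((id, size, _)) = entries.pop() else { break };
        evicted.push(id);
        total -= size;
    }
    evicted
}

#[test]
fn evicts_lru_until_under_limit() {
    let t0 = SystemTime::UNIX_EPOCH;
    let entries = vec![
        (1, 512, t0),                            // least recently used
        (2, 512, t0 + Duration::from_secs(60)),
        (3, 512, t0 + Duration::from_secs(120)), // most recently used
    ];
    // 1536 bytes total against a 1024-byte limit: exactly the LRU entry
    // (id 1) must be evicted, and nothing else.
    assert_eq!(prune_by_size(entries, 1024), vec![1]);
}
```

A test like this catches the subtle failure mode discussed above: if the sort order is inverted, the most recently used artifact gets evicted instead, and the assertion fails.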
This reverts commit a7e043b.
if now
    .duration_since(last_time_needed)
    .map(|stale_time| stale_time < cleanup_config.min_stale_time)
    .unwrap_or(false)
From the `SystemTime::duration_since` docs: "Returns an Err if earlier is later than self, and the error contains how far from self the time is."
In case of error we should return true to break here; we don't want to delete things if time moves backwards for whatever reason.
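To make the failure mode concrete, a small self-contained snippet (with `min_stale_time` inlined as a plain `Duration`, since the real `cleanup_config` isn't available here) showing that `duration_since` errs exactly when the clock appears to have moved backwards, and that `unwrap_or(true)` then keeps the artifact:

```rust
use std::time::{Duration, SystemTime};

fn main() {
    let now = SystemTime::now();
    // Simulate a clock that moved backwards: `last_time_needed` ends up
    // in the future relative to `now`.
    let last_time_needed = now + Duration::from_secs(60);

    let min_stale_time = Duration::from_secs(24 * 60 * 60);

    // `duration_since` returns Err here, and `unwrap_or(true)` makes the
    // artifact count as recently used, so it is kept rather than deleted.
    let used_recently = now
        .duration_since(last_time_needed)
        .map(|stale_time| stale_time < min_stale_time)
        .unwrap_or(true);

    assert!(used_recently);
}
```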
good catch!
Looks alright to me.
The CI pipeline was cancelled due to the failure of one of the required jobs.
Nice, thank you!
let used_recently = now
    .duration_since(last_time_needed)
    .map(|stale_time| stale_time < cleanup_config.min_stale_time)
    .unwrap_or(true);
if used_recently {
    break;
}
Hmmm 🤔 So, the cache may grow unbounded as long as all the artifacts are fresh? I mean, this mimics the current behavior, so no new attack vectors are introduced here for sure, but we should probably review the old ones... I'm not calling for any change right now; it's just for discussion.
> the cache may grow unbounded as long as all the artifacts are fresh

I believe it grows no more than it does now, because we use the same 24h limit.
Yes, exactly. I'm just wondering if there's some scenario where someone buys some cheap coretime leftovers and starts pushing tons of PVFs around within 24 hours to overflow validators' disk space. Sounds unlikely, just paranoia, probably.
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
/// Remove artifacts older than the given TTL or the total artifacts size limit and return id
/// and path of the removed ones.
This doc comment needs fixing: we evict LRU artifacts only if we go over the cache limit.
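Something along these lines might be closer (a suggested wording, not the final text from the PR):

```rust
/// Remove the least recently used artifacts, but only while the total size
/// of all artifacts exceeds the cache limit, and only those that have been
/// stale for longer than the minimum stale time. Returns the id and path of
/// each removed artifact.
```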
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* master: (29 commits)
  Append overlay optimization. (#1223)
  finalization: Skip tree route calculation if no forks present (#4721)
  Remove unncessary call remove_from_peers_set (#4742)
  add pov-recovery unit tests and support for elastic scaling (#4733)
  approval-voting: Add no shows debug information (#4726)
  Revamp the Readme of the parachain template (#4713)
  Update README.md to move the PSVM link under a "Tooling" section under the "Releases" section (#4734)
  frame/proc-macro: Refactor code for better readability (#4712)
  Contracts: update wasmi to 0.32 (#3679)
  Backport style changes from P<>K bridge to R<>W bridge (#4732)
  New reference doc for Custom RPC V2 (#4654)
  Frame Pallets: Clean a lot of test setups (#4642)
  Fix occupied core handling (#4691)
  statement-distribution: Fix false warning (#4727)
  Update the README to include a link to the Polkadot SDK Version Manager (#4718)
  Cleanup PVF artifact by cache limit and stale time (#4662)
  Update link to a latest polkadot release (#4711)
  [CI] Delete cargo-deny config (#4677)
  fix build on MacOS: bump secp256k1 and secp256k1-sys to patched versions (#4709)
  Unify dependency aliases (#4633)
  ...
Part of paritytech#4324

We don't change the existing cleanup strategy; we extend it.
- We still don't touch artifacts that have been stale for less than 24h.
- We attempt pruning for the first time only when we hit the cache limit (10 GB).
- If it somehow happens that we hit 10 GB while the least used artifact has been stale for less than 24h, we don't remove it.

---------

Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Part of #4324

We don't change the existing cleanup strategy; we extend it.
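Put as code, the policy described above might be captured by a config like the following sketch. The struct and field names are assumptions based on the identifiers visible in the diff (`cleanup_config.min_stale_time`, the size limit), and the defaults are the 10 GB / 24 h values from the description:

```rust
use std::time::Duration;

/// Sketch of the cleanup policy: prune only once the total artifact size
/// exceeds `cache_limit`, and even then never touch artifacts that were
/// used within `min_stale_time`.
pub struct ArtifactsCleanupConfig {
    /// Maximum total on-disk size of all prepared artifacts (10 GB).
    cache_limit: u64,
    /// Artifacts used more recently than this are never pruned (24 h).
    min_stale_time: Duration,
}

impl Default for ArtifactsCleanupConfig {
    fn default() -> Self {
        Self {
            cache_limit: 10 * 1024 * 1024 * 1024,
            min_stale_time: Duration::from_secs(24 * 60 * 60),
        }
    }
}
```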