Reduce the number of objects allocated by SLM when listing the snapshots to retain #99953

tlrx · 2023-09-27T13:00:31Z

The SLM retention cleanup task SnapshotRetentionTask lists all snapshots in order to identify which snapshots to retain and which to delete. While doing so it retrieves the full snapshot information, including shard snapshot details and shard snapshot failures, only to later use just the snapshot metadata and snapshot timestamp to select the snapshots to retain. When the snapshots contain thousands of shards, this represents a lot of objects that are created unnecessarily, putting a lot of pressure on the garbage collector.

We should improve the way SLM retrieves snapshots to reduce these large object allocations. David made some interesting suggestions in the comments.

elasticsearchmachine · 2023-09-27T13:00:54Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2023-09-28T06:18:02Z

I was wondering if we should instead introduce a new master-node transport action, to be called by SLM, which calls BlobStoreRepository#getSnapshotInfo and SnapshotsService#deleteSnapshots itself, avoiding the need to accumulate the intermediate list of SnapshotInfo objects at all. We don't need most of the logic of the get-snapshots API here, and we could do some nicer things like more intelligent throttling if we weren't working directly with the client-facing get-snapshots and delete-snapshots actions.
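To make the allocation argument concrete, here is a minimal sketch of deciding per snapshot as each one is loaded, rather than accumulating a list of heavyweight objects first. It is a toy model only: `FullInfo`, `selectForDeletion`, the `loader` function, and the age-based cutoff are hypothetical names and rules chosen for illustration, not Elasticsearch classes or the behaviour of any real transport action.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Toy model; every name here is hypothetical and not part of Elasticsearch. */
final class StreamingRetentionSketch {

    /** Stand-in for a heavyweight per-snapshot object (shard details, failures, ...). */
    record FullInfo(String name, long startTimeMillis, List<String> shardDetails) {}

    /**
     * Loads one FullInfo at a time, decides immediately, and keeps only the names
     * of the snapshots to delete, so this method never holds more than one
     * FullInfo at once.
     */
    static List<String> selectForDeletion(
        Iterator<String> snapshotNames,
        Function<String, FullInfo> loader,
        long cutoffMillis
    ) {
        List<String> toDelete = new ArrayList<>();
        while (snapshotNames.hasNext()) {
            FullInfo info = loader.apply(snapshotNames.next());
            // Assumed rule for illustration: anything older than the cutoff is deleted.
            if (info.startTimeMillis() < cutoffMillis) {
                toDelete.add(info.name());
            }
            // 'info' is not stored anywhere, so it can be collected after this iteration.
        }
        return toDelete;
    }
}
```

The only state that grows with the number of snapshots is the list of names, which is small compared with the per-shard details carried by each full object.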

DaveCTurner · 2023-09-28T07:31:35Z

Another thing worth considering is whether we could refine SnapshotInfo#fromXContentInternal to use parser.skipChildren() on fields we aren't going to use. That's going to save a huge amount of allocation for cases where we just drop those fields later on.
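The idea of skipping unused fields during parsing can be illustrated with a toy pull parser. This sketch does not use the real parser from the comment above; `SkipChildrenSketch`, its pre-split token list, and `readStartTime` are hypothetical stand-ins that only mimic the usual contract of a `skipChildren()` call: when the cursor sits on the start of an object or array, move to the matching end without building anything for the content in between.

```java
import java.util.List;

/** Toy pull parser over pre-split tokens; a hypothetical stand-in, not a real parser API. */
final class SkipChildrenSketch {
    private final List<String> tokens;
    private int pos = 0;

    SkipChildrenSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    /** If the cursor is on "{" or "[", advance it to the matching close token; otherwise do nothing. */
    void skipChildren() {
        String current = tokens.get(pos);
        if (!current.equals("{") && !current.equals("[")) {
            return;
        }
        int depth = 1;
        while (depth > 0) {
            String next = tokens.get(++pos);
            if (next.equals("{") || next.equals("[")) {
                depth++;
            } else if (next.equals("}") || next.equals("]")) {
                depth--;
            }
        }
    }

    /** Reads only "start_time" from a top-level object and skips every other value. */
    long readStartTime() {
        long startTime = -1;
        pos++; // step past the opening "{"
        while (!tokens.get(pos).equals("}")) {
            String fieldName = tokens.get(pos++); // cursor is now on the field's value
            if (fieldName.equals("start_time")) {
                startTime = Long.parseLong(tokens.get(pos));
            } else {
                skipChildren(); // nested content is never turned into objects
            }
            pos++; // step past the value (or its closing token)
        }
        return startTime;
    }
}
```

In this toy, a large `failures` array costs only a walk over its tokens; no per-shard objects are created for it.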

tlrx · 2023-09-28T12:45:35Z

That makes perfect sense, thanks David.

A small refactoring to make elastic#99953 a little simpler: combine the logic for retrieving the snapshot info and filtering out the ineligible ones into a single function so we can replace it with a call to a dedicated client action in a followup.

We only ever test each instance of this predicate once, immediately after creating it, so we may as well just convert it into a regular method that returns `boolean` instead. More preliminary work before fixing elastic#99953.
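The shape of this refactoring, sketched with hypothetical names (`Snapshot`, `deletionPredicate`, `isEligibleForDeletion`) and an assumed age-based rule rather than the actual code from the PR:

```java
import java.util.function.Predicate;

/** Illustration only; these names and the age-based rule are assumptions. */
final class InlinePredicateSketch {

    record Snapshot(String name, long startTimeMillis) {}

    /** Before: builds a Predicate object that the caller tests once and then discards. */
    static Predicate<Snapshot> deletionPredicate(long cutoffMillis) {
        return snapshot -> snapshot.startTimeMillis() < cutoffMillis;
    }

    /** After: the same check as a plain method, with no intermediate object. */
    static boolean isEligibleForDeletion(Snapshot snapshot, long cutoffMillis) {
        return snapshot.startTimeMillis() < cutoffMillis;
    }
}
```

A call site changes from `deletionPredicate(cutoff).test(snapshot)` to `isEligibleForDeletion(snapshot, cutoff)`; the result is the same, and the method form is easier to read and to step through.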

There is no need to obtain `SnapshotInfo` for all snapshots in order to compute SLM retention. With this commit we move to computing it directly from the `RepositoryData` in most circumstances, and in rare situations where we must still retrieve `SnapshotInfo` blobs we make sure not to hold many in memory at once. Closes elastic#99953
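As a rough sketch of why lightweight metadata can be enough: if each snapshot is represented only by a name and a start time, a count-based retention decision needs nothing else. The `Summary` record, the `namesToDelete` method, and the "keep the newest N" rule are assumptions made for this example; they are not the logic of the commit, and the real retention configuration has more rules than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustration only; names and the simplified rule are hypothetical. */
final class LightweightRetentionSketch {

    /** Per-snapshot summary holding just what this simplified decision needs. */
    record Summary(String name, long startTimeMillis) {}

    /** Keeps the newest maxCount snapshots and returns the names of the rest, oldest first. */
    static List<String> namesToDelete(List<Summary> summaries, int maxCount) {
        List<Summary> sorted = new ArrayList<>(summaries);
        sorted.sort(Comparator.comparingLong(Summary::startTimeMillis));
        int excess = Math.max(0, sorted.size() - maxCount);
        List<String> result = new ArrayList<>();
        for (Summary summary : sorted.subList(0, excess)) {
            result.add(summary.name());
        }
        return result;
    }
}
```

Each `Summary` is a couple of fields, so even thousands of snapshots stay cheap to hold in memory, in contrast to full per-shard details.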

tlrx added the >enhancement and :Distributed Coordination/Snapshot/Restore labels Sep 27, 2023

tlrx mentioned this issue Sep 27, 2023: Fix Large Shard Count Scalability Issues #77466 (Open, 97 tasks)

elasticsearchmachine added the Team:Distributed (Obsolete) label Sep 27, 2023

tlrx changed the title ~~Add Parameter to not return all snapshot info in Get Snapshots API~~ Reduce the number of objects allocated by SLM when listing the snapshots to retain Sep 28, 2023

DaveCTurner mentioned this issue Sep 29, 2023: Move SLM eligibility check #100044 (Merged)

DaveCTurner mentioned this issue Sep 29, 2023: Inline SnapshotRetentionConfiguration#getSnapshotDeletionPredicate #100053 (Merged)

DaveCTurner mentioned this issue Sep 30, 2023: Compute SLM retention from RepositoryData #100092 (Merged)

elasticsearchmachine closed this as completed in #100092 Oct 2, 2023
