Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the number of objects allocated by SLM when listing the snapshots to retain #99953

Closed
Tracked by #77466
tlrx opened this issue Sep 27, 2023 · 4 comments · Fixed by #100092
Closed
Tracked by #77466

Reduce the number of objects allocated by SLM when listing the snapshots to retain #99953

tlrx opened this issue Sep 27, 2023 · 4 comments · Fixed by #100092
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@tlrx
Copy link
Member

tlrx commented Sep 27, 2023

The SLM retention clean up task SnapshotRetentionTask lists all snapshots in order to identify the snapshots to retain and the snapshots to delete. While doing so it retrieves the full snapshot information, including shard snapshot details and shard snapshot failures, to later only use snapshot metadata and snapshot timestamp to select the snapshots to retain. When the snapshots contain thousands of shards it represents of lot of objects that are unnecessary created, putting a lot of pressure on the garbage collector.

We should improve the way SLM retrieves snapshots to reduce the huge allocations of objects. David made some interesting suggestions in comments.

@tlrx tlrx added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 27, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 27, 2023
@DaveCTurner
Copy link
Contributor

I was wondering if we should instead introduce a new master-node transport action, to be called by SLM, which calls BlobStoreRepository#getSnapshotInfo and SnapshotsService#deleteSnapshots itself, avoiding the need to accumulate the intermediate list of SnapshotInfo objects at all. We don't need most of the logic of the get-snapshots API here, and we could do some nicer things like more intelligent throttling if we weren't working directly with the client-facing get-snapshots and delete-snapshots actions.

@DaveCTurner
Copy link
Contributor

Another thing worth considering is whether we could refine SnapshotInfo#fromXContentInternal to use parser.skipChildren() on fields we aren't going to use. That's going to save a huge amount of allocation for cases where we just drop those fields later on.

@tlrx
Copy link
Member Author

tlrx commented Sep 28, 2023

That makes perfect sense, thanks David.

@tlrx tlrx changed the title Add Parameter to not return all snapshot info in Get Snapshots API Reduce the number of objects allocated by SLM when listing the snapshots to retain Sep 28, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 29, 2023
A small refactoring to make elastic#99953 a little simpler: combine the logic
for retrieving the snapshot info and filtering out the ineligible ones
into a single function so we can replace it with a call to a dedicated
client action in a followup.
DaveCTurner added a commit that referenced this issue Sep 29, 2023
A small refactoring to make #99953 a little simpler: combine the logic
for retrieving the snapshot info and filtering out the ineligible ones
into a single function so we can replace it with a call to a dedicated
client action in a followup.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 29, 2023
We only ever test each instance this predicate once, immediately after
creating it, so we may as well just convert it into a regular method
that returns `boolean` instead.

More preliminary work before fixing elastic#99953
elasticsearchmachine pushed a commit that referenced this issue Sep 29, 2023
…100053)

We only ever test each instance of this predicate once, immediately
after creating it, so we may as well just convert it into a regular
method that returns `boolean` instead.

More preliminary work before fixing #99953
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 30, 2023
There is no need to obtain `SnapshotInfo` for all snapshots in order to
compute SLM retention. With this commit we move to computing it directly
from the `RepositoryData` in most circumstances, and in rare situations
where we must still retrieve `SnapshotInfo` blobs we make sure not to
hold many in memory at once.

Closes elastic#99953
piergm pushed a commit to piergm/elasticsearch that referenced this issue Oct 2, 2023
A small refactoring to make elastic#99953 a little simpler: combine the logic
for retrieving the snapshot info and filtering out the ineligible ones
into a single function so we can replace it with a call to a dedicated
client action in a followup.
piergm pushed a commit to piergm/elasticsearch that referenced this issue Oct 2, 2023
…lastic#100053)

We only ever test each instance of this predicate once, immediately
after creating it, so we may as well just convert it into a regular
method that returns `boolean` instead.

More preliminary work before fixing elastic#99953
elasticsearchmachine pushed a commit that referenced this issue Oct 2, 2023
There is no need to obtain `SnapshotInfo` for all snapshots in order to
compute SLM retention. With this commit we move to computing it directly
from the `RepositoryData` in most circumstances, and in rare situations
where we must still retrieve `SnapshotInfo` blobs we make sure not to
hold many in memory at once.

Closes #99953
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this issue Oct 2, 2023
A small refactoring to make elastic#99953 a little simpler: combine the logic
for retrieving the snapshot info and filtering out the ineligible ones
into a single function so we can replace it with a call to a dedicated
client action in a followup.
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this issue Oct 2, 2023
…lastic#100053)

We only ever test each instance of this predicate once, immediately
after creating it, so we may as well just convert it into a regular
method that returns `boolean` instead.

More preliminary work before fixing elastic#99953
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants