Snapshot Deletion Could Run more Concurrently to Snapshot Creation #82853
Labels: :Distributed Coordination/Snapshot/Restore, >enhancement, Team:Distributed (Obsolete)
Currently, snapshot deletion completely locks the repository it operates on, forbidding both shard level snapshot creates and snapshot finalisations for the entire duration of the delete.
This is a problem when executing very large deletes (that is, deletes with a large product of snapshot count * indices_per_snapshot) because blob stores like S3 are often somewhat rate limited in terms of how many blobs can be deleted per unit of time.
Without a change to the file structure in a repository, I don't think we can speed up deletes much end-to-end beyond the current state. What we can do, however, is considerably reduce the impact of a long-running delete with the following change.
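To get a feel for the scale, here is a rough back-of-the-envelope sketch. The function name, the `blobs_per_index` factor and the delete rate are all hypothetical values chosen for illustration; they are not measured figures for S3 or any other blob store.

```python
def estimated_delete_seconds(
    snapshot_count: int,
    indices_per_snapshot: int,
    blobs_per_index: int,       # assumption: average blobs to remove per index
    deletes_per_second: float,  # assumption: sustained blob-store delete rate
) -> float:
    """Rough lower bound on how long the blob deletion phase takes."""
    total_blobs = snapshot_count * indices_per_snapshot * blobs_per_index
    return total_blobs / deletes_per_second


# Hypothetical example: 1000 snapshots with 500 indices each, 10 blobs per
# index, and a store that sustains 1000 deletes per second.
print(estimated_delete_seconds(1000, 500, 10, 1000))  # prints 5000.0
```

Under these made-up numbers the repository would stay locked for well over an hour if the lock is held for the whole delete.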
Currently a snapshot delete runs as follows:
With the way deletes are currently implemented, this level of locking out other operations is unnecessary.
Switching the order of the last two steps to:
Would drastically reduce the impact of a slow delete without introducing any safety issues.
The reason for this being a safe change is as follows:
When we delete a snapshot, we do the following steps:
This also does not introduce an added risk of leaking blobs in any practically relevant case, because it fundamentally does not change the cleanup properties of deletes in step 2 above. We will always collect all unreferenced index folders and all unreferenced blobs in shard folders that get touched by the delete in step 2. Hence, if a delete fails to fully complete, the next delete will simply pick up where it left off. The mechanics of what we try to delete and when are not touched by this change. All we are changing is that other operations are allowed to continue while the delete of blobs that are known to never be referenced again runs concurrently with them.
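The effect of the reordering can be illustrated with a toy model. This is a minimal sketch and not the Elasticsearch implementation: `ToyRepository`, `locked_ticks` and `release_lock_early` are hypothetical names, time is simulated as one tick per step, and the model ignores writes from in-flight snapshots entirely.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyRepository:
    """Hypothetical in-memory stand-in for a snapshot repository."""

    snapshots: dict[str, set[str]]  # snapshot name -> blobs it references
    blobs: set[str]                 # blobs physically present in the store
    locked_ticks: int = 0           # simulated time other operations are blocked

    def delete_snapshot(self, name: str, release_lock_early: bool) -> set[str]:
        # Metadata phase, always under the repository lock: drop the snapshot
        # and collect every blob that no remaining snapshot references.
        self.locked_ticks += 1
        self.snapshots.pop(name, None)
        referenced = set().union(*self.snapshots.values())
        stale = self.blobs - referenced

        # Blob phase: one simulated tick per blob. It only counts as locked
        # time when the lock is held until the very end of the delete.
        for blob in sorted(stale):
            self.blobs.discard(blob)
            if not release_lock_early:
                self.locked_ticks += 1
        return stale


def make_repo() -> ToyRepository:
    return ToyRepository(
        snapshots={"snap-a": {"a1", "a2", "shared"}, "snap-b": {"b1", "shared"}},
        blobs={"a1", "a2", "b1", "shared"},
    )


current = make_repo()
current.delete_snapshot("snap-a", release_lock_early=False)
proposed = make_repo()
proposed.delete_snapshot("snap-a", release_lock_early=True)
print(current.locked_ticks, proposed.locked_ticks)  # prints "3 1"
```

Both variants remove exactly the same blobs; only the time spent holding the lock differs. Because the stale set is recomputed from the remaining metadata on every delete, a blob left behind by an earlier, interrupted delete is collected by the next one, which mirrors the "pick up where it left off" argument.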