[BUG] [Remote State] NoSuchFileException during stale manifest deletion within a cluster UUID in the upload path #10586

Closed
linuxpi opened this issue Oct 12, 2023 · 0 comments · Fixed by #10611
Labels: bug (Something isn't working), Cluster Manager, Storage:Remote, Storage (Issues and PRs relating to data and metadata storage), v2.12.0 (Issues and PRs related to version 2.12.0)


linuxpi commented Oct 12, 2023

Describe the bug
While uploading a new or updated cluster state to the remote store, we trigger an async task that deletes stale remote cluster metadata manifests. We retain the 10 most recent cluster metadata manifests in the remote store. During local testing with an S3 remote store, we saw the following exception:

[2023-10-12T01:01:42,262][INFO ][o.o.g.r.RemoteClusterStateService] [data3] Deleting stale cluster UUIDs data from remote [bukhtawa-cluster-1]
[2023-10-12T01:01:44,468][INFO ][o.o.g.r.RemoteClusterStateService] [data3] Deleting stale cluster UUIDs data from remote [bukhtawa-cluster-1]
[2023-10-12T01:01:48,757][ERROR][o.o.g.r.RemoteClusterStateService] [data3] Error while fetching Remote Cluster Metadata manifests
java.lang.IllegalStateException: Error while downloading cluster metadata - manifest__9223372036854775806__9223372036854775803__9223370339802184428
	at org.opensearch.gateway.remote.RemoteClusterStateService.fetchRemoteClusterMetadataManifest(RemoteClusterStateService.java:785) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.gateway.remote.RemoteClusterStateService.lambda$deleteClusterMetadata$13(RemoteClusterStateService.java:908) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
	at org.opensearch.gateway.remote.RemoteClusterStateService.deleteClusterMetadata(RemoteClusterStateService.java:907) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.gateway.remote.RemoteClusterStateService$2.onResponse(RemoteClusterStateService.java:861) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.gateway.remote.RemoteClusterStateService$2.onResponse(RemoteClusterStateService.java:857) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.blobstore.BlobContainer.listBlobsByPrefixInSortedOrder(BlobContainer.java:234) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.translog.transfer.BlobStoreTransferService.listAllInSortedOrder(BlobStoreTransferService.java:223) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.translog.transfer.BlobStoreTransferService.lambda$listAllInSortedOrderAsync$12(BlobStoreTransferService.java:233) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.nio.file.NoSuchFileException: Blob object [YnVraHRhd2EtY2x1c3Rlci0x/cluster-state/5R4YBmNSRmqUNphK29gK_Q/manifest/manifest__9223372036854775806__9223372036854775803__9223370339802184428] not found: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: PVHQJZVTZ4KFH6X5, Extended Request ID: kEqbTT6Z9/E5FWXehxe5xhaheyFJwhv/3Bvpw964Wjz0uE8InVyTaCuApi86cRiqsKD3iFh7RRA=)
	at org.opensearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:129) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:99) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:82) ~[?:?]
	at org.opensearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:149) ~[?:?]
	at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:129) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.gateway.remote.RemoteClusterStateService.fetchRemoteClusterMetadataManifest(RemoteClusterStateService.java:779) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 12 more
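
In the trace above, the `NoSuchFileException` from the S3 read surfaces as the cause of an `IllegalStateException` while the cleanup iterates over listed manifests. One way to make such a loop robust, shown here as a minimal sketch and not as the change made in #10611, is to treat a manifest that disappeared between listing and reading as already cleaned up. `ManifestFetcher`, `fetchIfPresent`, and `fetchAllPresent` are hypothetical names introduced for illustration; they are not OpenSearch APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.NoSuchFileException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class TolerantManifestReader {
    /** Hypothetical stand-in for a blob read; not an OpenSearch interface. */
    interface ManifestFetcher {
        String fetch(String manifestName) throws IOException;
    }

    /** Returns the manifest content, or empty if the blob no longer exists. */
    public static Optional<String> fetchIfPresent(ManifestFetcher fetcher, String name) {
        try {
            return Optional.of(fetcher.fetch(name));
        } catch (NoSuchFileException e) {
            // The blob was listed but has since been deleted (for example by
            // another cleanup run); skip it instead of failing the whole task.
            return Optional.empty();
        } catch (IOException e) {
            // Any other I/O failure is still reported to the caller.
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches every manifest that still exists, preserving the input order. */
    public static List<String> fetchAllPresent(ManifestFetcher fetcher, List<String> names) {
        List<String> contents = new ArrayList<>();
        for (String name : names) {
            fetchIfPresent(fetcher, name).ifPresent(contents::add);
        }
        return contents;
    }
}
```

Only the "not found" case is swallowed; other I/O errors still propagate, so a genuine read failure is not hidden.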

Expected behavior
Deleting stale cluster metadata manifests should not produce errors and must not delete unintended data, which would lead to data loss.
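
The expected behavior can be illustrated with a small sketch. It assumes that manifests are listed newest first, that only the entries beyond the 10 retained manifests are deleted, and that overlapping cleanup runs are the source of the error; the last point is an assumption, not something the report confirms. `StaleManifestCleaner`, `selectStale`, and `tryCleanup` are hypothetical names, and this is not necessarily how #10611 addresses the problem.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

class StaleManifestCleaner {
    // Retention count taken from the issue description.
    static final int MANIFESTS_TO_RETAIN = 10;

    // Assumption: at most one cleanup should run at a time.
    private final AtomicBoolean running = new AtomicBoolean(false);

    /** Given manifest names ordered newest first, returns those outside the retention window. */
    public static List<String> selectStale(List<String> newestFirst) {
        if (newestFirst.size() <= MANIFESTS_TO_RETAIN) {
            return List.of();
        }
        return List.copyOf(newestFirst.subList(MANIFESTS_TO_RETAIN, newestFirst.size()));
    }

    /**
     * Runs a cleanup unless one is already in progress.
     * Returns true if this call performed the cleanup, false if it was skipped.
     */
    public boolean tryCleanup(List<String> newestFirst, Consumer<List<String>> deleter) {
        if (!running.compareAndSet(false, true)) {
            return false; // another run holds the flag; skip rather than race with it
        }
        try {
            List<String> stale = selectStale(newestFirst);
            if (!stale.isEmpty()) {
                deleter.accept(stale);
            }
            return true;
        } finally {
            running.set(false);
        }
    }
}
```

The sketch runs the deletion synchronously. With an async listing and deletion, as in the report, the flag would need to be reset in the completion and failure callbacks instead of a `finally` block.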

@linuxpi added the bug and untriaged labels on Oct 12, 2023
@linuxpi added the Storage, Storage:Remote, v2.12.0, and Cluster Manager labels and removed the untriaged label on Oct 23, 2023