Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integration tests for remote store restore flow #8484

Merged

Conversation

BhumikaSaini-Amazon
Copy link
Contributor

@BhumikaSaini-Amazon BhumikaSaini-Amazon commented Jul 6, 2023

Description

Add the following test scenarios to the suite:

  1. Restoring an index with >= 1 replica - 1 for each of the restore source / data loss (if any) combinations (RSS only, RSS+RTS both, flushed data, refreshed data, unrefreshed data)
  2. Specifying indices using wildcards.
  3. Excluding indices using wildcards.
  4. Restoring multiple indices via single request.
  5. Restoring only few of the red remote-enabled indices.

Related Issues

#8483

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

@BhumikaSaini-Amazon
Copy link
Contributor Author

BhumikaSaini-Amazon commented Jul 7, 2023

testRTSRestoreWithCommittedDataNotAllRedRemoteIndices is failing even though testRemoteTranslogRestoreWithCommittedData passes. Need to check why.

Resolved, indexSingleDoc() was always using INDEX_NAME

Bhumika Saini added 3 commits July 20, 2023 01:41
…CommitDeletionWithoutInvokeFlush

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ashking94
Copy link
Member

@BhumikaSaini-Amazon Based on the analysis that I did, this is what I have found. This issue would not happen for indexes with just 1 shard as the node that holds the shard is stopped and there is no history of retention leases (of the shard that was residing on the stopped node) after we close index and start the remote store restore.

For the indexes that have more than 1 shard, this issue happens when the syncRetentionLeases async task (that runs periodically) persists the retention leases to the local disk.

protected void dispatchedShardOperationOnPrimary(
Request request,
IndexShard primary,
ActionListener<PrimaryResult<Request, Response>> listener
) {
ActionListener.completeWith(listener, () -> {
assert request.waitForActiveShards().equals(ActiveShardCount.NONE) : request.waitForActiveShards();
Objects.requireNonNull(request);
Objects.requireNonNull(primary);
primary.persistRetentionLeases();
return new WritePrimaryResult<>(request, new Response(), null, null, primary, getLogger());
});
}

Afterwards, on cluster state update for restoring the remote index, IndexShard is created and then recovery starts on the newly created IndexShard object.
public IndexShard createShard(
final ShardRouting shardRouting,
final SegmentReplicationCheckpointPublisher checkpointPublisher,
final PeerRecoveryTargetService recoveryTargetService,
final RecoveryListener recoveryListener,
final RepositoriesService repositoriesService,
final Consumer<IndexShard.ShardFailure> onShardFailure,
final Consumer<ShardId> globalCheckpointSyncer,
final RetentionLeaseSyncer retentionLeaseSyncer,
final DiscoveryNode targetNode,
final DiscoveryNode sourceNode,
final RemoteRefreshSegmentPressureService remoteRefreshSegmentPressureService
) throws IOException {
Objects.requireNonNull(retentionLeaseSyncer);
ensureChangesAllowed();
IndexService indexService = indexService(shardRouting.index());
assert indexService != null;
RecoveryState recoveryState = indexService.createRecoveryState(shardRouting, targetNode, sourceNode);
IndexShard indexShard = indexService.createShard(
shardRouting,
globalCheckpointSyncer,
retentionLeaseSyncer,
checkpointPublisher,
remoteRefreshSegmentPressureService
);
indexShard.addShardFailureCallback(onShardFailure);
indexShard.startRecovery(recoveryState, recoveryTargetService, recoveryListener, repositoriesService, mapping -> {
assert recoveryState.getRecoverySource().getType() == RecoverySource.Type.LOCAL_SHARDS
: "mapping update consumer only required by local shards recovery";
client.admin()
.indices()
.preparePutMapping()
.setConcreteIndex(shardRouting.index()) // concrete index - no name clash, it uses uuid
.setSource(mapping.source().string(), XContentType.JSON)
.get();
}, this);
return indexShard;
}

During the recovery from remote store, the innerOpenEngineAndTranslog method of IndexShard is invoked where the retentionLeases are restored from the local disk back to in-memory replication tracker.
private void innerOpenEngineAndTranslog(LongSupplier globalCheckpointSupplier, boolean syncFromRemote) throws IOException {
assert Thread.holdsLock(mutex) == false : "opening engine under mutex";
if (state != IndexShardState.RECOVERING) {
throw new IndexShardNotRecoveringException(shardId, state);
}
final EngineConfig config = newEngineConfig(globalCheckpointSupplier);
// we disable deletes since we allow for operations to be executed against the shard while recovering
// but we need to make sure we don't loose deletes until we are done recovering
config.setEnableGcDeletes(false);
updateRetentionLeasesOnReplica(loadRetentionLeases());
assert recoveryState.getRecoverySource().expectEmptyRetentionLeases() == false || getRetentionLeases().leases().isEmpty()
: "expected empty set of retention leases with recovery source ["
+ recoveryState.getRecoverySource()
+ "] but got "
+ getRetentionLeases();

This issue is flaky due to the nature of async sync retention leases. As for now, I am creating an issue to remove retention leases when index has remote translog enabled. Meanwhile, we can safely override the expectEmptyRetentionLeases of RemoteStoreRecoverySource class.

Bhumika Saini added 2 commits July 20, 2023 16:59
Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
…es once opensearch-project#8795 is resolved

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@sachinpkale sachinpkale merged commit 27a14c7 into opensearch-project:main Jul 20, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-8484-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 27a14c7c811f930b2b36c13f20e1371c6206dd51
# Push it to GitHub
git push --set-upstream origin backport/backport-8484-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-8484-to-2.x.

BhumikaSaini-Amazon pushed a commit to BhumikaSaini-Amazon/OpenSearch that referenced this pull request Jul 21, 2023
…ct#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
(cherry picked from commit 27a14c7)
BhumikaSaini-Amazon pushed a commit to BhumikaSaini-Amazon/OpenSearch that referenced this pull request Jul 21, 2023
…ct#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
(cherry picked from commit 27a14c7)
BhumikaSaini-Amazon pushed a commit to BhumikaSaini-Amazon/OpenSearch that referenced this pull request Jul 21, 2023
…ct#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
sachinpkale pushed a commit that referenced this pull request Jul 21, 2023
---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
…ct#8484)

Add integration tests for remote store restore flow (opensearch-project#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…ct#8484)

Add integration tests for remote store restore flow (opensearch-project#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…ct#8484)

Add integration tests for remote store restore flow (opensearch-project#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ct#8484)

Add integration tests for remote store restore flow (opensearch-project#8484)

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants