[Remote Store] Primary/Replica side changes to support Dual Replication #12821

Merged
merged 21 commits into from
Apr 2, 2024

@shourya035 shourya035 commented Mar 21, 2024

Description

This PR adds support for dual-mode replication during remote store migration. Changes made:

  • Change all occurrences of indexShard.isRemoteTranslogStoreEnabled to indexShard.indexSettings().isRemoteNode. This ensures that shards started on remote store enabled nodes perform remote-based operations in the replication flow.
  • Wire the DiscoveryNodes obtained from ClusterState into IndexShard. A Function implementation is passed to ReplicationTracker as a supplier to check whether each shard in a replication group is assigned to a remote or a docrep node.
  • Depending on whether the target shard is hosted on a remote enabled or a non-remote (docrep enabled) node, the ReplicationModeAwareProxy determines whether a replication action needs to be dropped or fanned out to the corresponding replica shard copy. Replication overrides are bypassed and all actions are fanned out if the target shard copy resides on a docrep enabled node.
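
For readers who want a concrete picture of the routing rule described in the last bullet, here is a minimal, self-contained sketch. It is not the code from this PR: the class name DualReplicationSketch, the two-value Mode enum, the chooseMode helper, and the node ids are all hypothetical simplifications. The only behavior it tries to mirror is the one stated above: a docrep target always gets the action fanned out, while a remote target keeps whatever override would normally apply.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified model; names do not correspond to OpenSearch classes.
final class DualReplicationSketch {
    // Stand-in for the outcome of a replication action on one target copy.
    enum Mode { FULL_REPLICATION, NO_REPLICATION }

    /**
     * Decide how a replication action is handled for one target shard copy.
     *
     * @param targetNodeId id of the node hosting the target copy
     * @param isRemoteNode lookup from node id to "remote store enabled?"
     * @param override     mode the action would use on a remote-backed copy
     */
    static Mode chooseMode(String targetNodeId, Function<String, Boolean> isRemoteNode, Mode override) {
        // Assumption of this sketch: a node missing from the lookup is treated as docrep.
        boolean remote = Boolean.TRUE.equals(isRemoteNode.apply(targetNodeId));
        if (!remote) {
            // Docrep target: bypass the override and always fan out.
            return Mode.FULL_REPLICATION;
        }
        // Remote target: keep the override, which may drop the action.
        return override;
    }

    public static void main(String[] args) {
        Map<String, Boolean> nodes = Map.of("node-remote", true, "node-docrep", false);
        Function<String, Boolean> isRemoteNode = nodes::get;
        System.out.println(chooseMode("node-remote", isRemoteNode, Mode.NO_REPLICATION));
        System.out.println(chooseMode("node-docrep", isRemoteNode, Mode.NO_REPLICATION));
    }
}
```

Passing the lookup as a Function keeps the decision independent of how node information is obtained, which is roughly the shape the second bullet describes for the supplier handed to ReplicationTracker.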

Added integration tests to test out the following flows:

  • Primary in remote, replica in docrep: the node hosting the primary uploads to the remote store, but the node hosting the replica does not.
  • Primary in remote, replicas in docrep and remote: the nodes hosting the primary and the remote replica upload to the remote store, but the node hosting the docrep replica does not.
  • Primary in remote fails over to a replica in docrep: the docrep shard continues indexing as is and does not upload anything to the remote store.
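
The expectations in these three flows can be summarized with a small model. This is only an illustrative sketch, not the integration tests added in the PR: the MixedModeUploadSketch class, its Copy type, and the node ids are hypothetical. It encodes a single rule taken from the list above: a node uploads to the remote store only if it is remote store enabled and hosts a shard copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three flows; not the PR's test code.
final class MixedModeUploadSketch {
    // One shard copy: the node that hosts it and whether that node is remote store enabled.
    static final class Copy {
        final String nodeId;
        final boolean onRemoteNode;

        Copy(String nodeId, boolean onRemoteNode) {
            this.nodeId = nodeId;
            this.onRemoteNode = onRemoteNode;
        }
    }

    // Returns the ids of the nodes expected to upload to the remote store.
    static List<String> uploadingNodes(List<Copy> copies) {
        List<String> result = new ArrayList<>();
        for (Copy copy : copies) {
            if (copy.onRemoteNode) {
                result.add(copy.nodeId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Flow 1: primary in remote, replica in docrep.
        System.out.println(uploadingNodes(List.of(new Copy("remote-1", true), new Copy("docrep-1", false))));
        // Flow 2: primary in remote, replicas in docrep and remote.
        System.out.println(uploadingNodes(List.of(
            new Copy("remote-1", true), new Copy("docrep-1", false), new Copy("remote-2", true))));
        // Flow 3: after failover, only the docrep copy remains and nothing is uploaded.
        System.out.println(uploadingNodes(List.of(new Copy("docrep-1", false))));
    }
}
```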

Old PR raised on @gbbafna's fork: gbbafna#152

Related Issues

Partially resolves: #12413

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Storage:Durability Issues and PRs related to the durability framework Storage:Remote labels Mar 21, 2024
@shourya035 shourya035 added skip-changelog and removed RFC Issues requesting major changes labels Mar 21, 2024

github-actions bot commented Mar 21, 2024

Compatibility status:

Checks if related components are compatible with change 0ed42b8

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

❌ Gradle check result for 163635b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 50cb08a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added the RFC Issues requesting major changes label Mar 21, 2024

❌ Gradle check result for 82f2562: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 8aabf1a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@shourya035 shourya035 self-assigned this Mar 21, 2024
@shourya035 shourya035 removed enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Storage:Remote Storage:Durability Issues and PRs related to the durability framework labels Mar 21, 2024

❌ Gradle check result for 0a5ef5b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Storage:Durability Issues and PRs related to the durability framework Storage:Remote labels Mar 22, 2024
@ashking94

@shourya035 Let's add one more use case for the integration tests:

  • Primary in remote, replica in docrep and remote. Primary on remote fails, remote replica should become the new primary?

@shourya035

@shourya035 Let's add one more use case for the integration tests:

  • Primary in remote, replica in docrep and remote. Primary on remote fails, remote replica should become the new primary?

@ashking94 We don't have that piece ready yet. As of now, if a primary in remote fails, the failover can happen to either the docrep or the remote copy. @ltaragi is working on this piece, wherein failover preference would be given to the remote copy over the docrep one.

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

github-actions bot commented Apr 1, 2024

❌ Gradle check result for 3a39232: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

github-actions bot commented Apr 1, 2024

✅ Gradle check result for 04fc5cb: SUCCESS

github-actions bot commented Apr 1, 2024

❌ Gradle check result for b11585a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

github-actions bot commented Apr 1, 2024

✅ Gradle check result for 19e7221: SUCCESS

@ashking94 ashking94 left a comment

lgtm, left some minor comments, pls address them.

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

github-actions bot commented Apr 2, 2024

❌ Gradle check result for 77798c7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

github-actions bot commented Apr 2, 2024

❕ Gradle check result for 0ed42b8: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteLargeBlob
      1 org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimarySingleReplica
      1 org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting
      1 org.opensearch.index.ShardIndexingPressureSettingsIT.classMethod
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals}

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@gbbafna gbbafna merged commit 8def8cb into opensearch-project:main Apr 2, 2024
30 of 32 checks passed
shourya035 added a commit to shourya035/OpenSearch that referenced this pull request Apr 2, 2024
…cation during Remote Store Migration (opensearch-project#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
shourya035 added a commit to shourya035/OpenSearch that referenced this pull request Apr 2, 2024
…cation during Remote Store Migration (opensearch-project#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
shourya035 added a commit to shourya035/OpenSearch that referenced this pull request Apr 17, 2024
…cation during Remote Store Migration (opensearch-project#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Apr 17, 2024
…port Dual Replication during Remote Store Migration (#13028)

* [Remote Store] Add Primary/Replica side changes to support Dual Replication during Remote Store Migration (#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

* Fix build due to multiple commits to same file causing compilation failure (#13019)

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>

---------

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…cation during Remote Store Migration (opensearch-project#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this pull request Apr 29, 2024
…cation during Remote Store Migration (opensearch-project#12821)

Signed-off-by: Shourya Dutta Biswas <114977491+shourya035@users.noreply.github.com>
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes skip-changelog Storage:Durability Issues and PRs related to the durability framework Storage:Remote
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[Remote Store] Design - Dual Mode Replication during Remote Store migration
4 participants