Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state #16763

Merged
merged 12 commits into from
Dec 10, 2024

@Pranshu-S Pranshu-S commented Dec 3, 2024

Description

During a node join, when a node carrying new repository metadata joins the cluster, the cluster-manager attempts to publish an updated cluster state that includes the node and its metadata. If the publish succeeds but the commit fails for some other reason (such as a network disruption or joining leader in term), the cluster enters a persistent cycle of NullPointerExceptions that prevents it from becoming stable. This happens because the publish updates the last accepted version and cluster state, but since the commit never runs, the cluster-state appliers are not executed. As a result, the repositories service falls out of sync with the repositories metadata in the cluster state. When the current cluster-manager (leader) then steps down and another cluster-manager is elected:

  1. The newly elected cluster-manager attempts to verify repository metadata as part of its leadership transition.
  2. It checks the metadata in the cluster state for the presence of a specific repository. If the repository exists, it attempts to fetch the corresponding object from the repositories service.
  3. Because the repositories service was never updated (the cluster-state appliers did not run), the lookup returns nothing and the validation throws a NullPointerException, which destabilizes the cluster-manager election and transition process.
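The failure mode in the steps above can be sketched in a few lines of Java. This is a minimal illustration and not the code changed by this PR: `RepositorySyncSketch`, `Repo`, `Service`, `validateUnguarded`, and `validateGuarded` are hypothetical names, and the real validation in OpenSearch is more involved. The sketch only shows why dereferencing a lookup result fails when the service lags the cluster state, and how skipping the missing entry avoids it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not OpenSearch classes.
final class RepositorySyncSketch {

    /** Stand-in for a repository object held by the repositories service. */
    static final class Repo {
        final String name;
        final String type;

        Repo(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Stand-in for the repositories service, which only changes when appliers run. */
    static final class Service {
        private final Map<String, Repo> applied = new HashMap<>();

        void applyClusterState(List<Repo> repos) {
            for (Repo repo : repos) {
                applied.put(repo.name, repo);
            }
        }

        Repo lookup(String name) {
            return applied.get(name); // null if the appliers never ran for this repository
        }
    }

    /** Unguarded check: dereferences the lookup result, so a missing entry throws NPE. */
    static boolean validateUnguarded(List<Repo> metadata, Service service) {
        for (Repo expected : metadata) {
            if (!service.lookup(expected.name).type.equals(expected.type)) {
                return false;
            }
        }
        return true;
    }

    /** Guarded check: skips repositories the service has not seen yet. */
    static boolean validateGuarded(List<Repo> metadata, Service service) {
        for (Repo expected : metadata) {
            Repo actual = service.lookup(expected.name);
            if (actual == null) {
                continue; // service is behind the cluster state; skip instead of failing
            }
            if (!actual.type.equals(expected.type)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Repo> metadata = List.of(new Repo("remote-repo", "fs"));
        Service outOfSync = new Service(); // publish succeeded, appliers never ran
        System.out.println("guarded validation passes: " + validateGuarded(metadata, outOfSync));
        try {
            validateUnguarded(metadata, outOfSync);
        } catch (NullPointerException e) {
            System.out.println("unguarded validation threw NullPointerException");
        }
    }
}
```

Once the appliers do run (modeled here by `applyClusterState`), both checks see the same repositories and the guarded version validates them as usual, so the skip only applies while the service is out of sync.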

Related Issues

Resolves #16762

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…Service is not in sync with cluster-state

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@github-actions github-actions bot added the bug (Something isn't working) and Cluster Manager labels Dec 3, 2024

github-actions bot commented Dec 3, 2024

❌ Gradle check result for 5c9d397: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 3, 2024

❌ Gradle check result for b884cfc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 3, 2024

❌ Gradle check result for 6ea3fd6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 3, 2024

❌ Gradle check result for a0c1abd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 3, 2024

❕ Gradle check result for ed9f5e7: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.13%. Comparing base (42dc22e) to head (de0b01a).
Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
...g/opensearch/repositories/RepositoriesService.java | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16763      +/-   ##
============================================
+ Coverage     72.05%   72.13%   +0.08%     
- Complexity    65183    65240      +57     
============================================
  Files          5318     5318              
  Lines        303993   304004      +11     
  Branches      43990    43992       +2     
============================================
+ Hits         219028   219307     +279     
+ Misses        67046    66710     -336     
- Partials      17919    17987      +68     

☔ View full report in Codecov by Sentry.

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 6, 2024

✅ Gradle check result for d009e53: SUCCESS

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 9, 2024

✅ Gradle check result for 9f8753f: SUCCESS

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@shwetathareja shwetathareja added the backport 2.x Backport to 2.x branch label Dec 9, 2024

github-actions bot commented Dec 9, 2024

❌ Gradle check result for dfb56d8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 9, 2024

❌ Gradle check result for e89ea10: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot commented Dec 9, 2024

✅ Gradle check result for de0b01a: SUCCESS

@shwetathareja shwetathareja merged commit da6eda7 into opensearch-project:main Dec 10, 2024
38 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16763-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 da6eda776a0c33f75da3645b04218c35d44d3aa7
# Push it to GitHub
git push --set-upstream origin backport/backport-16763-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16763-to-2.x.

Pranshu-S added a commit to Pranshu-S/OpenSearch that referenced this pull request Dec 10, 2024
…Service is not in sync with cluster-state (opensearch-project#16763)

* Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
shwetathareja pushed a commit that referenced this pull request Dec 10, 2024
…en RepositoriesService is not in sync with cluster-state (#16820)

* Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state (#16763)

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this pull request Dec 16, 2024
…Service is not in sync with cluster-state (opensearch-project#16763)

* Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this pull request Dec 16, 2024
…Service is not in sync with cluster-state (opensearch-project#16763)

* Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Projects
Status: ✅ Done
4 participants