Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip performOnPrimary step when executing PublishCheckpoint. #6366

Merged
merged 2 commits into from
Feb 20, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Feb 18, 2023

Description

This change fixes broken SegmentReplicationRelocationITs and unmutes them. The cause of the flaky tests is WAIT_UNTIL requests blocking relocation of primary shards because replicas have not refreshed. The issue is that primary shards cannot publish their checkpoints because PublishCheckpointAction is a ReplicationOperation, that first hits the primary and acquires a permit. Given we are blocking operations, the permit is delayed and the request never succeeds. This fixes the issue by ensuring PublishCheckpoint requests are not sent to the primary, and only to replicas within the replication group directly.

Issues Resolved

closes #6065

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix spotless.

Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermarkWithAutoReleaseEnabled

@codecov-commenter
Copy link

codecov-commenter commented Feb 18, 2023

Codecov Report

Merging #6366 (e360577) into main (36f5cfe) will decrease coverage by 0.08%.
The diff coverage is 5.88%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #6366      +/-   ##
============================================
- Coverage     70.76%   70.68%   -0.08%     
+ Complexity    59051    58953      -98     
============================================
  Files          4799     4799              
  Lines        282432   282438       +6     
  Branches      40716    40718       +2     
============================================
- Hits         199856   199638     -218     
- Misses        66147    66313     +166     
- Partials      16429    16487      +58     
Impacted Files Coverage Δ
...eplication/checkpoint/PublishCheckpointAction.java 23.43% <5.88%> (-4.15%) ⬇️
...g/opensearch/index/analysis/CharFilterFactory.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...ava/org/opensearch/action/NoSuchNodeException.java 0.00% <0.00%> (-50.00%) ⬇️
...a/org/opensearch/tasks/TaskCancelledException.java 50.00% <0.00%> (-50.00%) ⬇️
...cluster/coordination/PendingClusterStateStats.java 20.00% <0.00%> (-48.00%) ⬇️
.../java/org/opensearch/search/dfs/AggregatedDfs.java 51.61% <0.00%> (-45.17%) ⬇️
...ava/org/opensearch/search/dfs/DfsSearchResult.java 47.87% <0.00%> (-39.37%) ⬇️
...pensearch/action/ingest/DeletePipelineRequest.java 31.25% <0.00%> (-37.50%) ⬇️
...r/src/main/java/org/opensearch/http/HttpUtils.java 33.33% <0.00%> (-33.34%) ⬇️
... and 475 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

.put("index.number_of_replicas", replicaCount)
.put("index.refresh_interval", -1)
).get();
prepareCreate(INDEX_NAME, Settings.builder().put(SETTING_NUMBER_OF_REPLICAS, replicaCount)).get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

/**
* This test verifies happy path when primary shard is relocated newly added node (target) in the cluster. Before
* relocation and after relocation documents are indexed and documents are verified
*/
@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/5669")
Copy link
Member

@dreamer-89 dreamer-89 Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: For completenes perspective, do we need relocation tests for IMMEDIATE & default refresh policy ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good idea, I've added randomness to select a refresh policy in all of the tests in this class other than the explicit wait_until test.

@@ -146,12 +146,11 @@ public void handleResponse(ReplicationResponse response) {
timer.time()
)
);
task.setPhase("finished");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this setPhase call is removed ? If it is not useful, let's remove same during failure as well (on line 156)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not intentional, thanks for catching this, it is useful in tracking state in the ReplicationTask.

- Add random refresh policy to reloation ITs.
- add back finishing ReplicationTask.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Copy link
Member

@dreamer-89 dreamer-89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @mch2 for fixing the bug and updating tests here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationStatsIT.testSegmentReplicationStatsResponse

@mch2 mch2 merged commit 72314f7 into opensearch-project:main Feb 20, 2023
@mch2 mch2 deleted the publishCheckpoint branch February 20, 2023 23:44
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Feb 20, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6366-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 72314f75807f5eb72d7b178180dda72d92997dfd
# Push it to GitHub
git push --set-upstream origin backport/backport-6366-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6366-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Feb 21, 2023
…rch-project#6366)

* Skip performOnPrimary step when executing PublishCheckpoint.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix spotless.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR Feedback:
- Add random refresh policy to reloation ITs.
- add back finishing ReplicationTask.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit 72314f7)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Mar 4, 2023
…rch-project#6366)

* Skip performOnPrimary step when executing PublishCheckpoint.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Fix spotless.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR Feedback:
- Add random refresh policy to reloation ITs.
- add back finishing ReplicationTask.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Rishikesh1159 added a commit to Rishikesh1159/OpenSearch that referenced this pull request Mar 11, 2023
…6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Rishikesh1159 added a commit that referenced this pull request Mar 15, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637)

* Trigger Refresh on NRT Engine.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Revert changes made to PublishCheckpointAction in #6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Force flush on new elected primary after relocation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Remove unnecessary assertions

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding tests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Address comments

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix indentation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

---------

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 15, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637)

* Trigger Refresh on NRT Engine.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Revert changes made to PublishCheckpointAction in #6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Force flush on new elected primary after relocation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Remove unnecessary assertions

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding tests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Address comments

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix indentation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

---------

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
(cherry picked from commit 1e5d913)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Rishikesh1159 pushed a commit that referenced this pull request Mar 16, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637) (#6675)

* Trigger Refresh on NRT Engine.



* Revert changes made to PublishCheckpointAction in #6366



* Fix failing unit test



* Force flush on new elected primary after relocation.



* Fix failing unit test.



* Remove unnecessary assertions



* Adding tests.



* Address comments



* Fix indentation.



---------


(cherry picked from commit 1e5d913)

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this pull request Mar 24, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (opensearch-project#6637)

* Trigger Refresh on NRT Engine.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Revert changes made to PublishCheckpointAction in opensearch-project#6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Force flush on new elected primary after relocation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Remove unnecessary assertions

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding tests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Address comments

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix indentation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

---------

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] failing IT test : SegmentReplicationRelocationIT
3 participants