Cluster health call to throw decommissioned exception for local decommissioned node #6008

imRishN · 2023-01-25T14:57:13Z

Description

This PR adds a parameter to the local cluster health call that checks whether the node is decommissioned before its health is retrieved from the local cluster state.

Example requests and responses:

In a cluster with no decommissioned nodes

> curl "localhost:9200/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

> curl "localhost:9200/_cluster/health?pretty&local"                               
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

In a cluster with a decommissioned node, with the request sent to the decommissioned node

> curl "localhost:9200/_cluster/health?pretty&local"                                                            
{
  "cluster_name" : "runTask",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : false,
  "discovered_cluster_manager" : false,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : "NaN"
}

> curl "localhost:9200/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "node_decommissioned_exception",
        "reason" : "local node is decommissioned"
      }
    ],
    "type" : "node_decommissioned_exception",
    "reason" : "local node is decommissioned"
  },
  "status" : 422
}

> curl "localhost:9200/_cluster/health?pretty&ensure_local_node_commissioned" 
{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: not a local request to ensure local node commissioned;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: not a local request to ensure local node commissioned;"
  },
  "status" : 400
}

In a cluster with a decommissioned node, with the request sent to a non-decommissioned node

> curl "localhost:9201/_cluster/health?pretty&local"                              
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

> curl "localhost:9201/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
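For callers of this API, the responses above can be told apart roughly as follows. This is a minimal sketch rather than part of the PR: the class LocalHealthProbe, its methods, and the Outcome enum are hypothetical names, and the error check is a plain substring match on the exception type shown in the responses above instead of real JSON parsing.

```java
import java.net.URI;

/** Hypothetical client-side helper; none of these names exist in OpenSearch. */
final class LocalHealthProbe {
    enum Outcome { SERVING, DECOMMISSIONED, INVALID_REQUEST, UNKNOWN }

    /** Builds the local health URI, optionally adding the flag used in the examples above. */
    static URI healthUri(String host, int port, boolean ensureCommissioned) {
        String query = "pretty&local" + (ensureCommissioned ? "&ensure_local_node_commissioned" : "");
        return URI.create("http://" + host + ":" + port + "/_cluster/health?" + query);
    }

    /** Maps an HTTP status and raw response body to an outcome. */
    static Outcome classify(int status, String body) {
        if (status == 200) {
            return Outcome.SERVING;
        }
        // Keyed on the exception type rather than the numeric status code.
        if (body != null && body.contains("node_decommissioned_exception")) {
            return Outcome.DECOMMISSIONED;
        }
        if (status == 400) {
            return Outcome.INVALID_REQUEST;
        }
        return Outcome.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(healthUri("localhost", 9200, true));
        System.out.println(classify(422, "{\"error\":{\"type\":\"node_decommissioned_exception\"}}"));
    }
}
```

Matching on the exception type instead of the status code keeps such a caller working even if the status code returned for a decommissioned node changes.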

Issues Resolved

#4528

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff
- Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…missioned nodes Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-25T15:00:57Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/9889/
CommitID: bd1df17
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

…ort-fix

github-actions · 2023-01-26T05:57:06Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/9945/
CommitID: 6fc708d
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-26T06:48:53Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/9946/
CommitID: 2f9ea8a
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-26T08:24:34Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/9949/
CommitID: 69495e0
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-27T07:07:33Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❕
TEST FAILURES:

      1 org.opensearch.indices.replication.SegmentReplicationIT.testStartReplicaAfterPrimaryIndexesDocs

URL: https://build.ci.opensearch.org/job/gradle-check/10031/
CommitID: 4160593
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov-commenter · 2023-01-27T07:09:47Z

Codecov Report

Merging #6008 (7f715a5) into main (f9eb9bf) will increase coverage by 0.09%.
The diff coverage is 50.00%.


@@             Coverage Diff              @@
##               main    #6008      +/-   ##
============================================
+ Coverage     70.73%   70.83%   +0.09%     
- Complexity    58738    58775      +37     
============================================
  Files          4771     4771              
  Lines        280820   280840      +20     
  Branches      40568    40572       +4     
============================================
+ Hits         198645   198920     +275     
+ Misses        65865    65619     -246     
+ Partials      16310    16301       -9

Impacted Files	Coverage Δ
...in/cluster/health/ClusterHealthRequestBuilder.java	`27.77% <0.00%> (-1.64%)`	⬇️
...g/opensearch/cluster/coordination/Coordinator.java	`78.36% <ø> (-0.25%)`	⬇️
...ster/decommission/NodeDecommissionedException.java	`40.00% <0.00%> (-10.00%)`	⬇️
...n/cluster/health/TransportClusterHealthAction.java	`45.26% <20.00%> (-0.69%)`	⬇️
...ion/admin/cluster/health/ClusterHealthRequest.java	`80.16% <70.00%> (-1.82%)`	⬇️
.../action/admin/cluster/RestClusterHealthAction.java	`65.85% <100.00%> (+1.75%)`	⬆️
...luster/routing/allocation/RoutingExplanations.java	`41.37% <0.00%> (-58.63%)`	⬇️
.../java/org/opensearch/node/NodeClosedException.java	`50.00% <0.00%> (-50.00%)`	⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java	`55.00% <0.00%> (-45.00%)`	⬇️
...pensearch/indices/breaker/CircuitBreakerStats.java	`27.77% <0.00%> (-41.67%)`	⬇️
... and 478 more


Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-27T08:30:11Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❕
TEST FAILURES:

      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWhenAllNodesExceededHighWatermark
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermark

URL: https://build.ci.opensearch.org/job/gradle-check/10032/
CommitID: 50a52a6
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

imRishN · 2023-01-27T09:12:16Z

...rc/main/java/org/opensearch/action/support/clustermanager/ClusterManagerNodeReadRequest.java

+        if (out.getVersion().onOrAfter(Version.CURRENT)) {
+            out.writeBoolean(ensureLocalNodeCommissioned);
+        }


The build fails in the mixed-cluster test if the version check is put at 2.6 rather than CURRENT. Please suggest the correct way of doing this.

We will put 2.6 and then backport.
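The version gate being discussed follows a common wire-compatibility pattern: a new field is written only when the peer is known to understand it, and the reader applies the same check. The sketch below illustrates that pattern with plain java.io streams and an integer version id. VersionGatedFlag, FLAG_INTRODUCED, and the value 20600 are assumptions made for illustration; they stand in for OpenSearch's stream and Version classes and are not taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a version-gated request field. */
final class VersionGatedFlag {
    // Assumed numeric id of the first release that understands the new flag.
    static final int FLAG_INTRODUCED = 20600;

    /** Serializes the request; the new flag is written only for peers that can read it. */
    static byte[] write(boolean local, boolean ensureCommissioned, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(local); // pre-existing field, always written
            if (peerVersion >= FLAG_INTRODUCED) {
                out.writeBoolean(ensureCommissioned);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the flag back; older peers never sent it, so it defaults to false. */
    static boolean readFlag(byte[] payload, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readBoolean(); // skip the pre-existing field
            return peerVersion >= FLAG_INTRODUCED && in.readBoolean();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writer and reader must gate on the same version. If they disagree, the reader either consumes a byte that belongs to the next field or leaves one unread, which is the kind of failure a mixed-cluster test is meant to catch.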

github-actions · 2023-01-27T09:23:32Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/10033/
CommitID: 2b95f53

Bukhtawar · 2023-01-27T11:33:14Z

rest-api-spec/src/main/resources/rest-api-spec/api/cluster.health.json

@@ -111,6 +111,10 @@
      "awareness_attribute":{
        "type":"string",
        "description":"The awareness attribute for which the health is required"
+      },
+      "ensure_local_node_commissioned":{


Wondering whether we could use ensure_node_commissioned and only support it with _local to start with. Later we could extend this to any node_id.

There can only be two kinds of transport request: one which retrieves information from the node's local cluster state, and another which gets it from the leader's state. There is no mechanism to run this transport request on a specific node id. Hence, I feel this would ALWAYS run with the local param only.

Updating it to ensure_node_commissioned
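The constraint discussed here (the flag only makes sense together with local) matches the validation error shown in the description. A minimal sketch of that rule follows, assuming a hypothetical HealthRequestSketch class. The error text is copied from the example response above, but the class, its fields, and the list-of-errors return type are illustrative and not the PR's ClusterHealthRequest.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical request object; only the validation rule mirrors the discussion. */
final class HealthRequestSketch {
    private final boolean local;
    private final boolean ensureNodeCommissioned;

    HealthRequestSketch(boolean local, boolean ensureNodeCommissioned) {
        this.local = local;
        this.ensureNodeCommissioned = ensureNodeCommissioned;
    }

    /** Returns validation errors; an empty list means the request is valid. */
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        // The commission check reads local state, so it is rejected on non-local requests.
        if (ensureNodeCommissioned && !local) {
            errors.add("not a local request to ensure local node commissioned");
        }
        return errors;
    }
}
```

Rejecting the combination at validation time means the request fails with a 400 before any transport action runs, as in the example response above.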

Bukhtawar · 2023-01-27T11:36:49Z

...r/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java

@@ -134,7 +140,11 @@ protected void clusterManagerOperation(
        final ClusterState unusedState,
        final ActionListener<ClusterHealthResponse> listener
    ) {
-
+        if (request.ensureLocalNodeCommissioned()
+            && discovery instanceof Coordinator


Should discovery instanceof Coordinator be an assertion instead?

Asserts don't run in prod, and only the Coordinator has this node's commission status info. If a developer uses a different Discovery mechanism, it might break this. Hence the check is made directly.

Exactly, so tests should fail for a developer; in prod this is expected to be Coordinator.

Would it be fair to assume that Coordinator is the only discovery for all use cases? Can there be a plugin which writes its own Discovery model? I see something like this implemented in the Gateway service, where they assume discovery might not be an instance of Coordinator. But I get your point; for this change we can add asserts and also keep the if check, to be a little cautious.

if (discovery instanceof Coordinator) {
    recoveryRunnable = () -> clusterService.submitStateUpdateTask("local-gateway-elected-state", new RecoverStateUpdateTask());
} else {
    final Gateway gateway = new Gateway(settings, clusterService, listGatewayMetaState);
    recoveryRunnable = () -> gateway.performStateRecovery(new GatewayRecoveryListener());
}
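The compromise reached in this thread (assert for developers, plus a runtime check for safety) could look roughly like the sketch below. All types here are hypothetical stand-ins: DiscoverySketch, CoordinatorSketch, its localNodeCommissioned() accessor, and NodeDecommissionedSketchException are invented for illustration and are not the PR's actual classes or method names.

```java
/** Hypothetical stand-in for the Discovery abstraction. */
interface DiscoverySketch {}

/** Hypothetical stand-in for Coordinator, the only type that knows the commission status. */
final class CoordinatorSketch implements DiscoverySketch {
    private final boolean commissioned;

    CoordinatorSketch(boolean commissioned) {
        this.commissioned = commissioned;
    }

    boolean localNodeCommissioned() {
        return commissioned;
    }
}

/** Hypothetical stand-in for the exception thrown for a decommissioned node. */
final class NodeDecommissionedSketchException extends RuntimeException {
    NodeDecommissionedSketchException(String message) {
        super(message);
    }
}

final class CommissionGuard {
    /** Throws if the check was requested and the local node is decommissioned. */
    static void ensureCommissioned(boolean requested, DiscoverySketch discovery) {
        if (!requested) {
            return;
        }
        // Fails fast in tests (run with -ea) when a different discovery is wired in.
        assert discovery instanceof CoordinatorSketch : "expected a Coordinator-backed discovery";
        // Still guarded at runtime, because asserts are disabled in production.
        if (discovery instanceof CoordinatorSketch
            && !((CoordinatorSketch) discovery).localNodeCommissioned()) {
            throw new NodeDecommissionedSketchException("local node is decommissioned");
        }
    }
}
```

With this shape, a non-Coordinator discovery trips the assert under test but silently skips the commission check in production, which is the cautious behavior described above.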

Bukhtawar · 2023-01-27T11:39:16Z

server/src/main/java/org/opensearch/cluster/decommission/NodeDecommissionedException.java

+    @Override
+    public RestStatus status() {
+        return RestStatus.UNPROCESSABLE_ENTITY;
+    }
 }


424 HTTP Error code seems more appropriate

Sure, updated

Bukhtawar · 2023-01-27T11:40:59Z

Can you also verify what happens on a new node without cluster state that joins as decommissioned, versus an existing node with cluster state that gets decommissioned?

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-29T12:35:14Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❕
TEST FAILURES:

      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWithAReadOnlyBlock

URL: https://build.ci.opensearch.org/job/gradle-check/10123/
CommitID: 7f715a5
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Bukhtawar · 2023-01-29T13:53:51Z

CHANGELOG.md

@@ -48,6 +48,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - Changed http code on create index API with bad input raising NotXContentException from 500 to 400 ([#4773](https://github.com/opensearch-project/OpenSearch/pull/4773))
 - Change http code for DecommissioningFailedException from 500 to 400 ([#5283](https://github.com/opensearch-project/OpenSearch/pull/5283))
 - Require MediaType in Strings.toString API ([#6009](https://github.com/opensearch-project/OpenSearch/pull/6009))
+- Cluster health call to throw decommissioned exception for local decommissioned node([#6008](https://github.com/opensearch-project/OpenSearch/pull/6008))


Put this under the Unreleased 2.x section.

imRishN · 2023-01-29T13:57:14Z

Can you also verify what happens on a new node without cluster state that joins as decommissioned, versus an existing node with cluster state that gets decommissioned?

Both fail with the same expected error when the new param is set.

server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

github-actions · 2023-01-29T14:37:58Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/10125/
CommitID: 2b04e51
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

…missioned node (opensearch-project#6008) * Cluster health call to throw decommissioned exception for local decommissioned nodes Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

…missioned node (#6008) (#6059) * [Backport 2.x]Cluster health call to throw decommissioned exception for local decommissioned nodes Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Cluster health call to throw decommissioned exception for local decom…

bd1df17

…missioned nodes Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

imRishN added 2 commits January 26, 2023 11:09

Fix spotless check

99ae759

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Merge remote-tracking branch 'upstream/main' into decommission/transp…

6fc708d

…ort-fix

imRishN added 2 commits January 26, 2023 11:53

Fix param

fbd6d71

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Add changelog

2f9ea8a

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Fix version issue

69495e0

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Fix version issue

4160593

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Add integ test

50a52a6

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Add rest test

2b95f53

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

imRishN marked this pull request as ready for review January 27, 2023 09:06

imRishN requested review from reta, anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, gbbafna, setiah, kartg, kotwanikunal, mch2, nknize and owaiskazi19 as code owners January 27, 2023 09:06

imRishN requested review from adnapibar, Rishikesh1159, ryanbogan, saratvemulapalli, shwetathareja, dreamer-89, tlfeng, VachaShah and xuezhou25 as code owners January 27, 2023 09:06

imRishN commented Jan 27, 2023


Bukhtawar reviewed Jan 27, 2023


imRishN added 3 commits January 27, 2023 17:47

Comment fixes

5ff1c0f

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Refactor

670fd5e

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Spotless

7f715a5

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Bukhtawar reviewed Jan 29, 2023

server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java (outdated)

Resolve comments

2b04e51

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Bukhtawar approved these changes Jan 29, 2023


Bukhtawar merged commit 249f1a6 into opensearch-project:main Jan 29, 2023

imRishN mentioned this pull request Jan 29, 2023

[Backport -2.x] Cluster health call to throw decommissioned exception for local decommissioned node #6059

Merged

6 tasks
