
Cluster health call to throw decommissioned exception for local decommissioned node #6008

Merged 13 commits on Jan 29, 2023

Conversation

@imRishN (Member) commented Jan 25, 2023:

Description

This PR adds a parameter to the cluster health local call to check whether the node is decommissioned before retrieving its health from the local cluster state.

Example Request/Response -

  1. In a non-decommissioned cluster
> curl "localhost:9200/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

> curl "localhost:9200/_cluster/health?pretty&local"                               
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
  2. In a decommissioned cluster, request on a decommissioned node
> curl "localhost:9200/_cluster/health?pretty&local"                                                            
{
  "cluster_name" : "runTask",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : false,
  "discovered_cluster_manager" : false,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : "NaN"
}

> curl "localhost:9200/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "node_decommissioned_exception",
        "reason" : "local node is decommissioned"
      }
    ],
    "type" : "node_decommissioned_exception",
    "reason" : "local node is decommissioned"
  },
  "status" : 422
}

> curl "localhost:9200/_cluster/health?pretty&ensure_local_node_commissioned" 
{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: not a local request to ensure local node commissioned;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: not a local request to ensure local node commissioned;"
  },
  "status" : 400
}
  3. In a decommissioned cluster, request on a non-decommissioned node
> curl "localhost:9201/_cluster/health?pretty&local"                              
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

> curl "localhost:9201/_cluster/health?pretty&local&ensure_local_node_commissioned"
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
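The three cases above can be summarized with a minimal, self-contained sketch of the decision logic. This is a hypothetical model with plain booleans, not the actual OpenSearch implementation; the status codes follow the example responses (422 for `node_decommissioned_exception`, which the review later suggested changing to 424).

```java
// Hypothetical model of how ensure_local_node_commissioned interacts with
// the ?local flag and the node's commission status. Not OpenSearch code.
public class EnsureCommissionedSketch {
    public static int statusFor(boolean local, boolean ensureCommissioned, boolean decommissioned) {
        if (ensureCommissioned && !local) {
            return 400; // action_request_validation_exception: param is only valid with ?local
        }
        if (local && ensureCommissioned && decommissioned) {
            return 422; // node_decommissioned_exception: fail fast instead of serving stale health
        }
        return 200;     // normal cluster health response
    }

    public static void main(String[] args) {
        System.out.println(statusFor(true, true, false));  // commissioned node: 200
        System.out.println(statusFor(true, true, true));   // decommissioned node: 422
        System.out.println(statusFor(false, true, false)); // param without ?local: 400
    }
}
```

Without the new parameter, a local call on a decommissioned node still answers from its stale local state (the red, `"NaN"` response in case 2); with it, the caller gets an explicit error instead.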

Issues Resolved

#4528

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…missioned nodes

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions (Contributor): Gradle Check (Jenkins) run completed.




@github-actions (Contributor): Gradle Check (Jenkins) run completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testStartReplicaAfterPrimaryIndexesDocs

@codecov-commenter commented Jan 27, 2023:

Codecov Report

Merging #6008 (7f715a5) into main (f9eb9bf) will increase coverage by 0.09%.
The diff coverage is 50.00%.


@@             Coverage Diff              @@
##               main    #6008      +/-   ##
============================================
+ Coverage     70.73%   70.83%   +0.09%     
- Complexity    58738    58775      +37     
============================================
  Files          4771     4771              
  Lines        280820   280840      +20     
  Branches      40568    40572       +4     
============================================
+ Hits         198645   198920     +275     
+ Misses        65865    65619     -246     
+ Partials      16310    16301       -9     
Impacted Files Coverage Δ
...in/cluster/health/ClusterHealthRequestBuilder.java 27.77% <0.00%> (-1.64%) ⬇️
...g/opensearch/cluster/coordination/Coordinator.java 78.36% <ø> (-0.25%) ⬇️
...ster/decommission/NodeDecommissionedException.java 40.00% <0.00%> (-10.00%) ⬇️
...n/cluster/health/TransportClusterHealthAction.java 45.26% <20.00%> (-0.69%) ⬇️
...ion/admin/cluster/health/ClusterHealthRequest.java 80.16% <70.00%> (-1.82%) ⬇️
.../action/admin/cluster/RestClusterHealthAction.java 65.85% <100.00%> (+1.75%) ⬆️
...luster/routing/allocation/RoutingExplanations.java 41.37% <0.00%> (-58.63%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 55.00% <0.00%> (-45.00%) ⬇️
...pensearch/indices/breaker/CircuitBreakerStats.java 27.77% <0.00%> (-41.67%) ⬇️
... and 478 more


@github-actions (Contributor): Gradle Check (Jenkins) run completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWhenAllNodesExceededHighWatermark
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermark

Comment on lines 66 to 68
if (out.getVersion().onOrAfter(Version.CURRENT)) {
    out.writeBoolean(ensureLocalNodeCommissioned);
}
imRishN (Member, Author):

The build fails in the mixed-cluster test if the version check is put at 2.6 instead of Version.CURRENT. Please suggest the correct way of doing this.

Collaborator:

We will put 2.6 and then backport
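The wire-compatibility concern above can be sketched as a simple version gate: the new boolean is written only when the receiving node's wire version is at or after the version that introduced the field, so older peers never see bytes they cannot deserialize. This is a standalone model using integer version IDs, not the OpenSearch `Version`/`StreamOutput` API; `V_2_6_0` as the gate reflects the reviewer's suggestion, not the merged code.

```java
// Standalone model of version-gated serialization. In the real code the gate
// would be a Version constant (e.g. Version.V_2_6_0 per the review), checked
// via out.getVersion().onOrAfter(gate) before writing the new field.
public class VersionGateSketch {
    public static boolean shouldWriteFlag(int peerWireVersion, int gateVersion) {
        // Older peers would fail to deserialize an unexpected boolean,
        // so the field is written only when the peer is new enough.
        return peerWireVersion >= gateVersion;
    }

    public static void main(String[] args) {
        int gate = 2_06_00;                                 // stand-in for V_2_6_0
        System.out.println(shouldWriteFlag(2_06_00, gate)); // true: peer understands the field
        System.out.println(shouldWriteFlag(2_05_99, gate)); // false: skip for older peer
    }
}
```

Gating on `Version.CURRENT` passes only same-version peers, which is why the mixed-cluster test drove the choice; gating on 2.6 works once the change is backported to the 2.x line.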


@@ -111,6 +111,10 @@
"awareness_attribute":{
"type":"string",
"description":"The awareness attribute for which the health is required"
},
"ensure_local_node_commissioned":{
Collaborator:

Wondering whether we could use ensure_node_commissioned and only support it with _local to start with. Later we could extend this to any node_id.

imRishN (Member, Author) commented Jan 27, 2023:

There can only be two kinds of transport request: one that retrieves information from the node's local cluster state, and one that gets it from the leader's state. There is no mechanism to run a transport request against a specific node ID. Hence, I feel this would always run with the local param only.

imRishN (Member, Author):

Updating it to ensure_node_commissioned

@@ -134,7 +140,11 @@ protected void clusterManagerOperation(
final ClusterState unusedState,
final ActionListener<ClusterHealthResponse> listener
) {

if (request.ensureLocalNodeCommissioned()
&& discovery instanceof Coordinator
Collaborator:

Should discovery instanceof Coordinator be an assertion instead?

imRishN (Member, Author):

Asserts don't run in prod, and only the Coordinator has this node's commission status. If a developer uses a different Discovery mechanism it might break this; hence putting this check in directly.

Collaborator:

Exactly, so tests should fail for a developer; in prod this is expected to be Coordinator.

imRishN (Member, Author):

Would it be fair to assume that Coordinator is the only Discovery implementation for all use cases? Could a plugin provide its own Discovery implementation? I see something like this in the Gateway service, where they assume discovery might not be an instance of Coordinator. But I get your point; for this change we can add the assert and keep the if check as well, to be a little cautious:

if (discovery instanceof Coordinator) {
    recoveryRunnable = () -> clusterService.submitStateUpdateTask("local-gateway-elected-state", new RecoverStateUpdateTask());
} else {
    final Gateway gateway = new Gateway(settings, clusterService, listGatewayMetaState);
    recoveryRunnable = () -> gateway.performStateRecovery(new GatewayRecoveryListener());
}
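The pattern the thread converges on can be sketched in a self-contained way: an assert so tests fail fast when a non-Coordinator Discovery is wired in, plus a runtime instanceof guard so prod degrades safely. `Coordinator` and `Discovery` here are local stubs for illustration, not the real OpenSearch types.

```java
// Sketch of the belt-and-braces check agreed above. The assert catches a
// misconfigured Discovery during tests (run with -ea); the instanceof guard
// keeps prod safe if a plugin ever supplies a different implementation.
public class DiscoveryGuardSketch {
    interface Discovery {}

    static class Coordinator implements Discovery {
        final boolean localNodeCommissioned;
        Coordinator(boolean commissioned) { this.localNodeCommissioned = commissioned; }
    }

    public static boolean ensureCommissioned(Discovery discovery) {
        assert discovery instanceof Coordinator : "expected Coordinator-based discovery";
        if (discovery instanceof Coordinator && !((Coordinator) discovery).localNodeCommissioned) {
            throw new IllegalStateException("local node is decommissioned");
        }
        return true; // commissioned, or a non-Coordinator Discovery we cannot check
    }

    public static void main(String[] args) {
        System.out.println(ensureCommissioned(new Coordinator(true)));
        try {
            ensureCommissioned(new Coordinator(false));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```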

Comment on lines 33 to 37
@Override
public RestStatus status() {
    return RestStatus.UNPROCESSABLE_ENTITY;
}
}
Collaborator:

An HTTP 424 error code seems more appropriate.

imRishN (Member, Author):

Sure, updated

@Bukhtawar (Collaborator):

Can you also verify what happens on a new node w/o cluster state that joins as decommissioned, vs an existing node w/ cluster state getting decommissioned?

@github-actions (Contributor): Gradle Check (Jenkins) run completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWithAReadOnlyBlock

CHANGELOG.md Outdated
@@ -48,6 +48,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Changed http code on create index API with bad input raising NotXContentException from 500 to 400 ([#4773](https://github.com/opensearch-project/OpenSearch/pull/4773))
- Change http code for DecommissioningFailedException from 500 to 400 ([#5283](https://github.com/opensearch-project/OpenSearch/pull/5283))
- Require MediaType in Strings.toString API ([#6009](https://github.com/opensearch-project/OpenSearch/pull/6009))
- Cluster health call to throw decommissioned exception for local decommissioned node([#6008](https://github.com/opensearch-project/OpenSearch/pull/6008))
Collaborator:

Put this under the Unreleased 2.x section.

imRishN (Member, Author):

Ack

@imRishN (Member, Author) commented Jan 29, 2023:

> Can you also verify what happens on a new node w/o cluster state that joins as decommissioned, vs an existing node w/ cluster state getting decommissioned?

Both fail with the same expected error with the new param.


@Bukhtawar merged commit 249f1a6 into opensearch-project:main on Jan 29, 2023.
imRishN added a commit to imRishN/OpenSearch that referenced this pull request Jan 29, 2023
…missioned node (opensearch-project#6008)

* Cluster health call to throw decommissioned exception for local decommissioned nodes

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Jan 29, 2023
…missioned node (#6008) (#6059)

* [Backport 2.x]Cluster health call to throw decommissioned exception for local decommissioned nodes

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>