[BUG] In batch mode _cluster/allocation/explain API returns incorrect response #13990

Closed
SwethaGuptha opened this issue Jun 5, 2024 · 1 comment
Labels
bug Something isn't working Cluster Manager

Comments

@SwethaGuptha
Contributor

Describe the bug

The _cluster/allocation/explain API returns an incorrect response on clusters with batch mode enabled, because shard allocation explain requests are served by GatewayAllocator instead of ShardsBatchGatewayAllocator (see AllocatorFetchLogic, ExistingShardAllocatorSetting). A change in AllocationService is required so that it switches to ShardsBatchGatewayAllocator when batch mode is enabled.

Issue was identified by:
After enabling index.unassigned.node_left.delayed_timeout and taking down the nodes holding 2 replicas of a shard, the expected response from _cluster/allocation/explain was allocation_delayed, but the API returned awaiting_info instead.
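
The dispatch problem described above can be pictured with a small toy model. This is only an illustrative Python sketch, not OpenSearch code: `ToyAllocator`, `select_allocator_buggy`, and `select_allocator_fixed` are hypothetical names, and the decision attached to each allocator is simply the value this report observed (awaiting_info) or expected (allocation_delayed) on a batch-mode cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAllocator:
    """Hypothetical stand-in for an existing-shards allocator."""
    name: str
    # can_allocate value this report associates with the allocator
    # when it serves an explain request on a batch-mode cluster.
    decision_on_batch_cluster: str


# Values taken from this issue: observed (awaiting_info) vs expected (allocation_delayed).
GATEWAY = ToyAllocator("GatewayAllocator", "awaiting_info")
BATCH_GATEWAY = ToyAllocator("ShardsBatchGatewayAllocator", "allocation_delayed")


def select_allocator_buggy(batch_mode_enabled: bool) -> ToyAllocator:
    # Models the reported behavior: batch mode is ignored for explain requests.
    return GATEWAY


def select_allocator_fixed(batch_mode_enabled: bool) -> ToyAllocator:
    # Models the proposed change: switch allocators when batch mode is enabled.
    return BATCH_GATEWAY if batch_mode_enabled else GATEWAY
```

In this model the fix is purely a selection change: the non-batch path keeps using GatewayAllocator, and only batch-mode clusters are routed to ShardsBatchGatewayAllocator.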

Related component

Cluster Manager

To Reproduce

  1. Create a cluster with a dedicated master node and 10 data nodes.
  2. Create a test index with 2 primary shards and 3 replicas:
curl -X PUT "localhost:9200/test-ind?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 3
    }
  }
}'
  3. Enable the unassigned delayed_timeout setting:
curl -X PUT "localhost:9200/_all/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}'
  4. Get the nodes holding shards of the index:
curl localhost:9200/_cat/shards/test-ind
  5. Stop the OpenSearch process on 2 data nodes that hold replicas of shard 0.
  6. Get the allocation explanation for the shard:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '{
  "index": "test-ind",
  "shard": 0,
  "primary": false
}'
  7. Verify that the can_allocate field in the response is awaiting_info. The response looks like this:
{"index":"test-ind","shard":0,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"NODE_LEFT","at":"2024-06-05T05:33:16.753Z","details":"node_left [Bvu-mf5XSPu3DEmv9ndBgw]","last_allocation_status":"no_attempt"},"can_allocate":"awaiting_info","allocate_explanation":"cannot allocate because information about existing shard data is still being retrieved from some of the nodes","node_allocation_decisions":[{"node_id":"3YYYQYZLQaGck1tIOJ57xg","node_name":"517c7e06d65968c38f1a4140b265ccc4","
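
The explain request and the can_allocate check from the last two steps could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; the helper names (`build_explain_request`, `extract_can_allocate`, `fetch_can_allocate`) are made up for illustration, and the base URL assumes the same unauthenticated localhost:9200 endpoint used in the curl commands above.

```python
import json
import urllib.request


def build_explain_request(index, shard, primary, base_url="http://localhost:9200"):
    """Build the same request as the curl command: GET with a JSON body."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_cluster/allocation/explain",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def extract_can_allocate(response_text):
    """Return the can_allocate field of an explain response, or None if absent."""
    return json.loads(response_text).get("can_allocate")


def fetch_can_allocate(index, shard, primary, base_url="http://localhost:9200"):
    """Send the request to a running cluster and return its can_allocate value."""
    request = build_explain_request(index, shard, primary, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_can_allocate(response.read().decode("utf-8"))
```

Against the cluster from the steps above, `fetch_can_allocate("test-ind", 0, False)` would be the call corresponding to the curl command for shard 0.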

Expected behavior

The value of the can_allocate field in the response is allocation_delayed.
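
A check for this expectation might look like the sketch below. It assumes the expected decision is the allocation_delayed value named in the bug description and treats awaiting_info as the symptom of this bug; `classify_explain_response` and its return labels are hypothetical names chosen for the example.

```python
import json

EXPECTED_DECISION = "allocation_delayed"  # expected value per the bug description
BUG_DECISION = "awaiting_info"            # value observed in this report


def classify_explain_response(response_text):
    """Classify an explain response body by its can_allocate field."""
    decision = json.loads(response_text).get("can_allocate")
    if decision == EXPECTED_DECISION:
        return "ok"
    if decision == BUG_DECISION:
        return "bug_reproduced"
    return "unexpected"
```

Any other can_allocate value is reported as "unexpected" rather than guessed at, since the issue only discusses these two decisions.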

Additional Details

OpenSearch Version: 2.14

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7]
@SwethaGuptha Thanks for creating this issue, could you create a pull request to address?
