[Remote Store] _cat/recovery APIs provides inconsistent results #12047

Open
Bukhtawar opened this issue Jan 27, 2024 · 8 comments
Labels: bug, Storage:Remote, Storage:Resiliency

Comments

@Bukhtawar (Collaborator) commented Jan 27, 2024

Describe the bug

  1. When compared with the initializing/relocating shard counts from the _cluster/health API, _cat/recovery?active_only shows an inconsistent count of in-progress recoveries.
curl localhost:9200/_cluster/health?pretty   
{
  "cluster_name" : ":test-poc",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 200,
  "number_of_data_nodes" : 194,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 356,
  "active_shards" : 1045,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

% curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 41 35.5m peer translog xx.xx.26.41 9402b974b33058c1ae02a4b5661dda2e 172.16.33.240 c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%
  2. The translog download step doesn't populate the translog recovery stats (see the sketch after the output below).
curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 4  6.1m peer translog xx.xx.xx.xx 05c8acd9758c7d833fc7abd77ed74727 xx.xx.xx.xx  de189fe355a94cec9e526de75d404767 n/a n/a 192 192 100.0% 209 3361023615 3361023615 100.0% 3697263527 41654806 0 0.0%
opensearch-client-ingest-2024-03-04t11 41 6.1m peer translog xx.xx.xx.xx  9402b974b33058c1ae02a4b5661dda2e xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%
opensearch-client-ingest-2024-03-04t11 82 6.1m peer translog xx.xx.xx.xx fb81f7c57a903d39d463446784b6b4f7 xx.xx.xx.xx 36c64f30e810ee73e01f8b27f914112a n/a n/a 218 218 100.0% 218 3718572999 3718572999 100.0% 3718572999 41340427 0 0.0%
opensearch-client-ingest-2024-03-04t11 88 6.1m peer translog xx.xx.xx.xx  6ccbb5a732ebb55a5bb0cb7c68ba7fa7 xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 169 169 100.0% 186 3426163926 3426163926 100.0% 3707056645 41419312 0 0.0%
opensearch-client-ingest-2024-03-04t11 97 6.1m peer translog xx.xx.xx.xx 2ccb6b6c969be5ad2ba792fcba818c88 xx.xx.xx.xx 704ce836066d1b071d681a17814a37d7 n/a n/a 214 214 100.0% 231 3549799591 3549799591 100.0% 3966964286 39712007 0 0.0%
opensearch-client-ingest-2024-03-04t11 98 6.1m peer translog xx.xx.xx.xx  64cf55465863ce20799f8885ea335347 xx.xx.xx.xx 1eef064b01e3004c2363feb81e648a81 n/a n/a 158 158 100.0% 161 4744590420 4744590420 100.0% 4997318586 32743206 0 0.0%
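
For illustration, here is a rough Java sketch (not OpenSearch source code) of how the per-shard translog recovery stats behind the trailing translog_ops / translog_ops_recovered / translog_ops_percent columns could be dumped; it assumes a client() like the one available in internal cluster tests, and the index name is taken from the output above:

```java
// Rough sketch (illustration only): dump the per-shard translog recovery stats that back
// the trailing translog_ops / translog_ops_recovered / translog_ops_percent columns of
// _cat/recovery. Assumes imports of java.util.List, java.util.Map,
// org.opensearch.action.admin.indices.recovery.RecoveryResponse and
// org.opensearch.indices.recovery.RecoveryState.
RecoveryResponse response = client().admin().indices()
        .prepareRecoveries("opensearch-client-ingest-2024-03-04t11")
        .setActiveOnly(true)
        .get();

for (Map.Entry<String, List<RecoveryState>> entry : response.shardRecoveryStates().entrySet()) {
    for (RecoveryState state : entry.getValue()) {
        RecoveryState.Translog translog = state.getTranslog();
        System.out.printf("%s[%d] stage=%s translog total=%d recovered=%d (%.1f%%)%n",
                entry.getKey(), state.getShardId().id(), state.getStage(),
                translog.totalOperations(), translog.recoveredOperations(),
                translog.recoveredPercent());
    }
}
```

In the output above, the translog total is populated (tens of millions of operations) while the recovered count and percentage stay at 0 and 0.0% even several minutes into the recovery.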

Related component

Storage:Remote

Expected behavior

Consistent API results

@peternied (Member)

[Triage - attendees 1 2 3 4 5 6 7 8]
@Bukhtawar Thanks for filing this issue; this is a very confusing experience and it would be good to address.

@lukas-vlcek (Contributor) commented Feb 16, 2024

Hi, I would like to take this ticket.
Is there a way to reproduce the bug?

While conducting my investigation, I welcome your insights and recommendations on specific areas to focus on.

@Bukhtawar (Collaborator, Author)

Yes, I believe this should be reproducible on a multi-node setup hosting shards of a few hundred MBs: exclude the IP of one node so that the shards hosted on the excluded node start relocating.
Then compare the initializing/relocating counts from _cluster/health with the output of _cat/recovery?active_only to see the discrepancy in the counts.

@lukas-vlcek (Contributor) commented Feb 23, 2024

@Bukhtawar Thanks!
Could you elaborate a bit on "exclude IP of one node"? Do you mean excluding the node from shard allocation?

Would the following scenario be a good candidate?

Imagine a two-node cluster: Node 1 has three primary shards, Node 2 is empty.

flowchart TB
    Primary_A
    Primary_B
    Primary_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    end

Next, we exclude Node 1 from shard allocation:

PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.exclude._ip" : "_Node 1 IP_"
  }
}

This should (if I am not mistaken) trigger replication of all shards from Node 1 to Node 2.

flowchart TB
    Primary_A--"Replicating"-->Replica_A
    Primary_B--"Replicating"-->Replica_B
    Primary_C--"Replicating"-->Replica_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    Replica_A
    Replica_B
    Replica_C
    end

Now, while shards are being replicated, we can request _cluster/health and _cat/recovery?active_only (as discussed previously) and that should give us inconsistent counts, correct?

I assume we need shards of a "larger size" only because the replication activity must take some time (enough time for us to request the counts and compare them). How about throttling the recovery traffic instead? Then the shards could be quite small but would still take some time to move. Do you think this would also reproduce the issue?

The point is that, if throttling is possible, we should be able to implement a regular unit test.
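
To make the idea concrete, here is a rough sketch of the throttling I have in mind (assuming an internal cluster test where client() is available; the 50kb value is a guess and would need tuning against the shard size used by the test):

```java
// Rough sketch: throttle recovery traffic so that even small shards recover slowly enough
// to query _cluster/health and _cat/recovery?active_only while the recovery is in flight.
// Assumes imports of org.opensearch.common.settings.Settings and
// org.opensearch.indices.recovery.RecoverySettings.
client().admin().cluster().prepareUpdateSettings()
        .setTransientSettings(Settings.builder()
                // hypothetical rate; choose it relative to the shard size used by the test
                .put(RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING.getKey(), "50kb"))
        .get();
```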

@rramachand21 added the Storage:Resiliency label Feb 29, 2024
@rramachand21 moved this from 🆕 New to 🏗 In progress in Storage Project Board Feb 29, 2024
@lukas-vlcek (Contributor) commented Mar 14, 2024

Hi @Bukhtawar

I was looking at this and I found that the following integration test is already testing something very similar:

./server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexRecoveryIT.java

For example it has a test called testRerouteRecovery() that uses the following scenario:

  1. It starts a cluster with a single node (A)
  2. It creates a new index with a single shard and no replicas
  3. Then it adds a new node (B) to the cluster
  4. Then it slows down recoveries (using RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING)
  5. Then it forces relocation of the shard using an "admin.cluster.reroute" request (a slightly different strategy than discussed above, but it still triggers the recovery process)
  6. It checks the count of active (i.e. stage != DONE) shard recoveries, etc.
  7. ...

I have been experimenting with modifying some of the tests, adding an "admin.cluster.health" request to obtain the initializing and relocating shard counts, but so far I have not been able to spot/replicate the count discrepancy.

Do you think it could be because the size of the index in the test is quite small (just a couple of hundred KBs)? Though the test explicitly makes sure the counts are obtained while the recovery process is throttled and the shard recovery stage is not DONE (in other words, the counts are compared while the recovery is still running).
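
Roughly, the comparison looks like this (a simplified sketch, not the exact test code; INDEX_NAME is a placeholder for the index used in the IT, and client() comes from the internal cluster test):

```java
// Simplified sketch of the comparison; not the exact test code. Assumes imports of
// java.util.List, org.opensearch.action.admin.cluster.health.ClusterHealthResponse,
// org.opensearch.action.admin.indices.recovery.RecoveryResponse and
// org.opensearch.indices.recovery.RecoveryState.
ClusterHealthResponse health = client().admin().cluster().prepareHealth().get();
int movingShards = health.getInitializingShards() + health.getRelocatingShards();

RecoveryResponse recoveries = client().admin().indices()
        .prepareRecoveries(INDEX_NAME)
        .setActiveOnly(true)
        .get();
long activeRecoveries = recoveries.shardRecoveryStates().values().stream()
        .flatMap(List::stream)
        .filter(state -> state.getStage() != RecoveryState.Stage.DONE)
        .count();

// The discrepancy described in this issue would show up as these two numbers disagreeing
// while a recovery is still running; so far they have matched in my runs.
assertEquals(movingShards, activeRecoveries);
```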

However, there is still another question I wanted to ask. Did you have anything specific in mind when you said:

The translog download step doesn't populate the translog recovery stats

Can you elaborate on this please?

I will push a modification of the test tomorrow so that you can see what I mean.

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Mar 20, 2024
This is WIP to drive the discussion further, do not merge it!

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@lukas-vlcek (Contributor)

@Bukhtawar

Please see #12792
I believe this is a very detailed attempt to reproduce the issue. Unfortunately, it does not currently reproduce it (the test passes, which means the issue does not materialize).

Can you think of any hints about what to change in order to recreate the issue?

For example, do you think the shard recovery stage has to be Stage.TRANSLOG? Notice that in the IT the stage is currently Stage.INDEX.
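
If it helps, here is a small hedged sketch of how the IT could log which stage the active recoveries are actually in when the counts are sampled (same client()/INDEX_NAME assumptions as above):

```java
// Hedged sketch: group active recoveries by stage to see whether the test ever samples a
// recovery while it is in Stage.TRANSLOG rather than Stage.INDEX. Assumes imports of
// java.util.List, java.util.Map, java.util.stream.Collectors and the recovery classes above.
RecoveryResponse response = client().admin().indices()
        .prepareRecoveries(INDEX_NAME)
        .setActiveOnly(true)
        .get();
Map<RecoveryState.Stage, Long> activeByStage = response.shardRecoveryStates().values().stream()
        .flatMap(List::stream)
        .collect(Collectors.groupingBy(RecoveryState::getStage, Collectors.counting()));
logger.info("active recoveries by stage: {}", activeByStage);
```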

@Bukhtawar (Collaborator, Author)

Adding @sachinpkale for his thoughts as well. Will take a look shortly

@shourya035 (Member)

@sachinpkale @Bukhtawar This PR is waiting on your inputs. Can you bring this to closure?
