[Remote Store] _cat/recovery APIs provides inconsistent results #12047

Open
Bukhtawar opened this issue Jan 27, 2024 · 8 comments
Labels: bug, Storage:Remote, Storage:Resiliency

Comments

@Bukhtawar (Collaborator) commented Jan 27, 2024

Describe the bug

  1. When compared with the initializing/relocating shard counts from the _cluster/health API, _cat/recovery?active_only shows an inconsistent count of in-progress recoveries.
curl localhost:9200/_cluster/health?pretty   
{
  "cluster_name" : ":test-poc",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 200,
  "number_of_data_nodes" : 194,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 356,
  "active_shards" : 1045,
  "relocating_shards" : 6,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

% curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 41 35.5m peer translog xx.xx.26.41 9402b974b33058c1ae02a4b5661dda2e 172.16.33.240 c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%
  2. The translog download step doesn't populate the translog recovery stats (see the sketch after the output below).
curl localhost:9200/_cat/recovery?active_only
opensearch-client-ingest-2024-03-04t11 4  6.1m peer translog xx.xx.xx.xx 05c8acd9758c7d833fc7abd77ed74727 xx.xx.xx.xx  de189fe355a94cec9e526de75d404767 n/a n/a 192 192 100.0% 209 3361023615 3361023615 100.0% 3697263527 41654806 0 0.0%
opensearch-client-ingest-2024-03-04t11 41 6.1m peer translog xx.xx.xx.xx  9402b974b33058c1ae02a4b5661dda2e xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 136 136 100.0% 156 3053790354 3053790354 100.0% 3697028691 41390688 0 0.0%
opensearch-client-ingest-2024-03-04t11 82 6.1m peer translog xx.xx.xx.xx fb81f7c57a903d39d463446784b6b4f7 xx.xx.xx.xx 36c64f30e810ee73e01f8b27f914112a n/a n/a 218 218 100.0% 218 3718572999 3718572999 100.0% 3718572999 41340427 0 0.0%
opensearch-client-ingest-2024-03-04t11 88 6.1m peer translog xx.xx.xx.xx  6ccbb5a732ebb55a5bb0cb7c68ba7fa7 xx.xx.xx.xx c4557e0db97459e36fc9ca27c0dad06d n/a n/a 169 169 100.0% 186 3426163926 3426163926 100.0% 3707056645 41419312 0 0.0%
opensearch-client-ingest-2024-03-04t11 97 6.1m peer translog xx.xx.xx.xx 2ccb6b6c969be5ad2ba792fcba818c88 xx.xx.xx.xx 704ce836066d1b071d681a17814a37d7 n/a n/a 214 214 100.0% 231 3549799591 3549799591 100.0% 3966964286 39712007 0 0.0%
opensearch-client-ingest-2024-03-04t11 98 6.1m peer translog xx.xx.xx.xx  64cf55465863ce20799f8885ea335347 xx.xx.xx.xx 1eef064b01e3004c2363feb81e648a81 n/a n/a 158 158 100.0% 161 4744590420 4744590420 100.0% 4997318586 32743206 0 0.0%
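
For illustration, here is a rough Java sketch (not OpenSearch source code) of how the per-shard translog recovery stats behind the trailing translog_ops / translog_ops_recovered / translog_ops_percent columns could be dumped; it assumes a client() like the one available in internal cluster tests, and the index name is taken from the output above:

```java
// Rough sketch (illustration only): dump the per-shard translog recovery stats that back
// the trailing translog_ops / translog_ops_recovered / translog_ops_percent columns of
// _cat/recovery. Assumes imports of java.util.List, java.util.Map,
// org.opensearch.action.admin.indices.recovery.RecoveryResponse and
// org.opensearch.indices.recovery.RecoveryState.
RecoveryResponse response = client().admin().indices()
        .prepareRecoveries("opensearch-client-ingest-2024-03-04t11")
        .setActiveOnly(true)
        .get();

for (Map.Entry<String, List<RecoveryState>> entry : response.shardRecoveryStates().entrySet()) {
    for (RecoveryState state : entry.getValue()) {
        RecoveryState.Translog translog = state.getTranslog();
        System.out.printf("%s[%d] stage=%s translog total=%d recovered=%d (%.1f%%)%n",
                entry.getKey(), state.getShardId().id(), state.getStage(),
                translog.totalOperations(), translog.recoveredOperations(),
                translog.recoveredPercent());
    }
}
```

In the output above, the translog total is populated (tens of millions of operations) while the recovered count and percentage stay at 0 and 0.0% even several minutes into the recovery.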

Related component

Storage:Remote

Expected behavior

Consistent API results

@peternied (Member)

[Triage - attendees 1 2 3 4 5 6 7 8]
@Bukhtawar Thanks for filing this issue; this is a very confusing experience and it would be good to address.

@lukas-vlcek (Contributor) commented Feb 16, 2024

Hi, I would like to take this ticket.
Is there a way to reproduce the bug?

While conducting my investigation, I welcome your insights and recommendations on specific areas to focus on.

@Bukhtawar (Collaborator, Author)

Yes, I believe this should be reproducible on a multi-node setup hosting shards of a few hundred MBs: exclude the IP of one node so that the shards hosted on the excluded node start relocating.
Then compare the initializing/relocating counts from _cluster/health with the output of _cat/recovery?active_only to see the discrepancy in the counts.

@lukas-vlcek (Contributor) commented Feb 23, 2024

@Bukhtawar Thanks!
Could you elaborate a bit on "exclude IP of one node"? Do you mean excluding the node from shard allocation?

Would the following scenario be a good candidate?

Imagine a two-node cluster: Node 1 has three primary shards, Node 2 is empty.

flowchart TB
    Primary_A
    Primary_B
    Primary_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    end

Next, we exclude Node 1 from shard allocation:

PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.exclude._ip" : "_Node 1 IP_"
  }
}

This should (if I am not mistaken) trigger replication of all shards from Node 1 to Node 2.

flowchart TB
    Primary_A--"Replicating"-->Replica_A
    Primary_B--"Replicating"-->Replica_B
    Primary_C--"Replicating"-->Replica_C
    subgraph "Node 1"
    Primary_A
    Primary_B
    Primary_C
    end
    subgraph "Node 2"
    Replica_A
    Replica_B
    Replica_C
    end

Now, while shards are being replicated, we can request _cluster/health and _cat/recovery?active_only (as discussed previously) and that should give us inconsistent counts, correct?

I assume we need shards of a "larger size" only because the replication activity must take some time (enough time for us to request the counts and compare them). How about throttling the recovery traffic instead? Then the shards could be quite small but would still take some time to move. Do you think this would also reproduce the issue?

The point is that, if throttling is possible, we should be able to implement a regular unit test.
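
To make the idea concrete, here is a rough sketch of the throttling I have in mind (assuming an internal cluster test where client() is available; the 50kb value is a guess and would need tuning against the shard size used by the test):

```java
// Rough sketch: throttle recovery traffic so that even small shards recover slowly enough
// to query _cluster/health and _cat/recovery?active_only while the recovery is in flight.
// Assumes imports of org.opensearch.common.settings.Settings and
// org.opensearch.indices.recovery.RecoverySettings.
client().admin().cluster().prepareUpdateSettings()
        .setTransientSettings(Settings.builder()
                // hypothetical rate; choose it relative to the shard size used by the test
                .put(RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING.getKey(), "50kb"))
        .get();
```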

@rramachand21 added the Storage:Resiliency label Feb 29, 2024
@rramachand21 moved this from 🆕 New to 🏗 In progress in Storage Project Board Feb 29, 2024
@lukas-vlcek (Contributor) commented Mar 14, 2024

Hi @Bukhtawar

I was looking at this and I found that the following integration test is already testing something very similar:

./server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexRecoveryIT.java

For example it has a test called testRerouteRecovery() that uses the following scenario:

  1. It starts a cluster with a single node (A)
  2. It creates a new index with a single shard and no replicas
  3. Then it adds a new node (B) to the cluster
  4. Then it slows down recoveries (using RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING)
  5. Then it forces relocation of the shard using an "admin.cluster.reroute" request (a slightly different strategy than discussed above, but it still triggers the recovery process)
  6. It checks the count of active (i.e. stage != DONE) shard recoveries, etc.
  7. ...

I have been experimenting with modifying some of the tests, adding an "admin.cluster.health" request to obtain the initializing and relocating shard counts, but so far I have not been able to spot/replicate the count discrepancy.

Do you think it could be because the size of the index in the test is quite small (just a couple of hundred KBs)? Though the test explicitly makes sure the counts are obtained while the recovery process is throttled and the shard recovery stage is not DONE (in other words, the counts are compared while the recovery is still running).
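
Roughly, the comparison looks like this (a simplified sketch, not the exact test code; INDEX_NAME is a placeholder for the index used in the IT, and client() comes from the internal cluster test):

```java
// Simplified sketch of the comparison; not the exact test code. Assumes imports of
// java.util.List, org.opensearch.action.admin.cluster.health.ClusterHealthResponse,
// org.opensearch.action.admin.indices.recovery.RecoveryResponse and
// org.opensearch.indices.recovery.RecoveryState.
ClusterHealthResponse health = client().admin().cluster().prepareHealth().get();
int movingShards = health.getInitializingShards() + health.getRelocatingShards();

RecoveryResponse recoveries = client().admin().indices()
        .prepareRecoveries(INDEX_NAME)
        .setActiveOnly(true)
        .get();
long activeRecoveries = recoveries.shardRecoveryStates().values().stream()
        .flatMap(List::stream)
        .filter(state -> state.getStage() != RecoveryState.Stage.DONE)
        .count();

// The discrepancy described in this issue would show up as these two numbers disagreeing
// while a recovery is still running; so far they have matched in my runs.
assertEquals(movingShards, activeRecoveries);
```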

However, there is still another question I wanted to ask. Did you have anything specific in mind when you said:

The translog download step doesn't populate the translog recovery stats

Can you elaborate on this please?

I will push a modification of the test tomorrow so that you can see what I mean.

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Mar 20, 2024
This is WIP to drive the discussion further, do not merge it!

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@lukas-vlcek (Contributor)

@Bukhtawar

Please see #12792
I believe this is a very detailed attempt to reproduce the issue. Unfortunately, it does not currently reproduce it (the test passes, which means the issue does not materialize).

Can you think of any hints about what to change in order to recreate the issue?

For example, do you think the shard recovery stage has to be Stage.TRANSLOG? Notice that in the IT the stage is currently Stage.INDEX.
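
If it helps, here is a small hedged sketch of how the IT could log which stage the active recoveries are actually in when the counts are sampled (same client()/INDEX_NAME assumptions as above):

```java
// Hedged sketch: group active recoveries by stage to see whether the test ever samples a
// recovery while it is in Stage.TRANSLOG rather than Stage.INDEX. Assumes imports of
// java.util.List, java.util.Map, java.util.stream.Collectors and the recovery classes above.
RecoveryResponse response = client().admin().indices()
        .prepareRecoveries(INDEX_NAME)
        .setActiveOnly(true)
        .get();
Map<RecoveryState.Stage, Long> activeByStage = response.shardRecoveryStates().values().stream()
        .flatMap(List::stream)
        .collect(Collectors.groupingBy(RecoveryState::getStage, Collectors.counting()));
logger.info("active recoveries by stage: {}", activeByStage);
```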

@Bukhtawar (Collaborator, Author)

Adding @sachinpkale for his thoughts as well. Will take a look shortly

@shourya035 (Member)

@sachinpkale @Bukhtawar This PR is waiting on your inputs. Can you bring this to closure?
