FetchData changes for primaries and replicas #8865

Gaurav614 · 2023-07-25T11:22:08Z

Description

This pull request is part of improvement #5098.
It mainly focuses on fetching the data for PSA and RSA for eligible shards.

This PR depends on the following PRs:
#8742
#8218
#8356
#8746

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gaurav Chandani <chngau@amazon.com>

github-actions · 2023-07-25T11:32:25Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/20894/
CommitID: 61f8c61
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

shiv0408 · 2023-08-08T09:49:54Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+                shardToIgnoreNodes.put(shardId, allocation.getIgnoreNodes(shardId));
+            }
+            AsyncBatchShardFetch<? extends BaseNodeResponse> asyncFetcher = shardsBatch.getAsyncFetcher();
+            AsyncBatchShardFetch.FetchResult<? extends BaseNodeResponse> shardBatchState = asyncFetcher.fetchData(


Can you rename this variable to shardBatchStore to reflect that it contains the shard store address of the primary shard?

The Store suffix is used in conjunction with replicas in the codebase.

amkhar · 2023-08-09T09:35:13Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+                return new AsyncBatchShardFetch.FetchResult<>(null, Collections.emptyMap());
+            }
+
+            String batchId = startedShardBatchLookup.getOrDefault(shardRouting.shardId(), null);


If a shard was started or failed in between, we may get null here. Should we iterate over all eligible shards to get the batchId? Relying on the first one may be incorrect.

Not possible, since this is a single-threaded system.
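
For reference, the reviewer's alternative of scanning every eligible shard could look roughly like the sketch below. It is only an illustration: a plain `Map<String, String>` stands in for `startedShardBatchLookup`, strings stand in for `ShardId`, and `firstKnownBatchId` is a hypothetical helper that does not exist in the PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

final class BatchIdLookupSketch {
    // Hypothetical helper: returns the batch id of the first eligible shard
    // that is still present in the lookup, or empty if none of them is.
    static Optional<String> firstKnownBatchId(List<String> eligibleShardIds, Map<String, String> lookup) {
        return eligibleShardIds.stream()
            .map(lookup::get)          // same result as getOrDefault(key, null)
            .filter(Objects::nonNull)  // skip shards that have left the lookup
            .findFirst();
    }

    public static void main(String[] args) {
        Map<String, String> lookup = Map.of("shard-1", "batch-A");
        System.out.println(firstKnownBatchId(List.of("shard-0", "shard-1"), lookup));
        System.out.println(firstKnownBatchId(List.of("shard-0"), lookup));
    }
}
```

Returning an `Optional` makes the "no batch found" case explicit at the call site instead of relying on a null check.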

shiv0408 · 2023-08-21T05:06:21Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

@@ -335,10 +396,54 @@ protected AsyncShardFetch.FetchResult<TransportNodesListShardStoreMetadata.NodeS
            }
            return shardStores;
        }
+    }
+
+    class InternalReplicaBatchShardAllocator extends ReplicaShardBatchAllocator {


Can you implement the hasInitiatedFetching function in this class, or should I pick up this whole Internal class in my PR?

Signed-off-by: Gaurav Chandani <chngau@amazon.com>

github-actions · 2023-09-06T09:06:13Z

Compatibility status:

Checks if related components are compatible with change b7e2119

Incompatible components

Skipped components

Compatible components

github-actions · 2023-09-06T09:14:28Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/24592/
CommitID: b7e2119
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

opensearch-trigger-bot · 2023-10-06T15:20:59Z

This PR is stalled because it has been open for 30 days with no activity.

khushbr

Please add UTs with the next revision of this PR.

khushbr · 2023-12-05T06:16:25Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

@@ -55,9 +55,15 @@
 import org.opensearch.common.util.set.Sets;
 import org.opensearch.index.shard.ShardId;
 import org.opensearch.indices.store.TransportNodesListShardStoreMetadata;
+import org.opensearch.indices.store.TransportNodesListShardStoreMetadata;
+import org.opensearch.indices.store.TransportNodesListShardStoreMetadataBatch;
+import org.opensearch.indices.store.TransportNodesListShardStoreMetadataBatch.NodeStoreFilesMetadataBatch;


+ typo at the beginning of line.

khushbr · 2023-12-05T06:18:04Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+    private final PrimaryShardBatchAllocator primaryBatchShardAllocator;
+    private final ReplicaShardBatchAllocator replicaBatchShardAllocator;


Let us stay consistent in our naming. The 'Batch' and 'Shard' in class name and variable name are inverted.
I prefer ShardBatch.

ack, will update this later once PRs/tasks for Allocators are merged/approved to avoid any back and forth

khushbr · 2023-12-05T06:20:08Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java


    private final ConcurrentMap<
        ShardId,
        AsyncShardFetch<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards>> asyncFetchStarted = ConcurrentCollections
-            .newConcurrentMap();
+        .newConcurrentMap();


nit: Fix the syntax. Add back the tab spacing.

khushbr · 2023-12-05T06:26:22Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

@@ -303,6 +313,59 @@ protected AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShards.Nod
        }
    }

+
+    class InternalPrimaryBatchShardAllocator extends PrimaryShardBatchAllocator {


Let us fix the naming here as well.

ack, same comment as above

khushbr · 2023-12-05T06:33:35Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+
+        @Override
+        @SuppressWarnings("unchecked")
+        protected AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShardsBatch.NodeGatewayStartedShardsBatch> fetchData(Set<ShardRouting> shardsEligibleForFetch,


Rename to eligibleShards and inEligibleShards?

khushbr · 2023-12-05T07:18:06Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+
+            if (shardsBatch.getBatchedShards().isEmpty() && shardsEligibleForFetch.isEmpty()) {
+                logger.debug("Batch {} is empty", batchId);
+                return new AsyncShardFetch.FetchResult<>(null, Collections.emptyMap());


Same as above, use DiscoveryNodes.EMPTY_NODES instead of null value for DiscoveryNodes param.

replied same as above
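
The suggestion to pass DiscoveryNodes.EMPTY_NODES rather than null follows the usual empty-sentinel pattern. Below is a minimal sketch of that pattern, assuming hypothetical stand-in classes `Nodes` and `FetchResult` in place of the real OpenSearch types; their fields and methods are illustrative only.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class EmptyNodesSketch {
    // Hypothetical stand-in for DiscoveryNodes with a shared empty instance.
    static final class Nodes {
        static final Nodes EMPTY_NODES = new Nodes(Collections.emptyList());
        private final List<String> nodeIds;

        Nodes(List<String> nodeIds) {
            this.nodeIds = nodeIds;
        }

        int getSize() {
            return nodeIds.size();
        }
    }

    // Hypothetical stand-in for the fetch result returned for an empty batch.
    static final class FetchResult {
        final Nodes nodes;
        final Map<String, List<String>> ignoredNodes;

        FetchResult(Nodes nodes, Map<String, List<String>> ignoredNodes) {
            this.nodes = nodes;
            this.ignoredNodes = ignoredNodes;
        }
    }

    static FetchResult emptyResult() {
        // An empty sentinel instead of null, so callers need no null check.
        return new FetchResult(Nodes.EMPTY_NODES, Collections.emptyMap());
    }

    public static void main(String[] args) {
        FetchResult result = emptyResult();
        System.out.println(result.nodes.getSize()); // safe to call, no NullPointerException
    }
}
```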

khushbr · 2023-12-05T07:23:01Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+            Map<ShardId, Set<String>> shardToIgnoreNodes = new HashMap<>();
+
+            for (ShardId shardId : shardsBatch.asyncBatch.shardToCustomDataPath.keySet()) {
+                shardToIgnoreNodes.put(shardId, allocation.getIgnoreNodes(shardId));


Can the shardToIgnoreNodes map have empty (set) values ? Can we ignore adding the entry in such cases?

Ref:

OpenSearch/server/src/main/java/org/opensearch/cluster/routing/allocation/RoutingAllocation.java

Lines 241 to 248 in f7f3500

    public Set<String> getIgnoreNodes(ShardId shardId) {
        if (ignoredShardToNodes == null) {
            return emptySet();
        }
        Set<String> ignore = ignoredShardToNodes.get(shardId);
        if (ignore == null) {
            return emptySet();
        }

Even if we skip it, the entry will later be created by the AsyncShardFetch object for completeness' sake.

Can you link the code where we are adding the entries with an empty set?

Is there scope to optimize here, by avoiding the creation of empty sets that serve no purpose?
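
One way to picture the optimization being asked about is the sketch below. It assumes string shard ids and a hypothetical `getIgnoreNodes` stand-in that, like the referenced method, returns an empty set rather than null. Entries are only added when there is something to ignore, and readers fall back to an empty set.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IgnoreNodesSketch {
    // Hypothetical stand-in for RoutingAllocation#getIgnoreNodes: never returns null.
    static Set<String> getIgnoreNodes(Map<String, Set<String>> ignored, String shardId) {
        return ignored.getOrDefault(shardId, Collections.emptySet());
    }

    // Builds the per-shard ignore map, skipping shards with nothing to ignore.
    static Map<String, Set<String>> buildShardToIgnoreNodes(List<String> shardIds, Map<String, Set<String>> ignored) {
        Map<String, Set<String>> shardToIgnoreNodes = new HashMap<>();
        for (String shardId : shardIds) {
            Set<String> ignoreNodes = getIgnoreNodes(ignored, shardId);
            if (!ignoreNodes.isEmpty()) {
                shardToIgnoreNodes.put(shardId, ignoreNodes);
            }
        }
        return shardToIgnoreNodes;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> ignored = Map.of("shard-1", Set.of("node-a"));
        Map<String, Set<String>> result = buildShardToIgnoreNodes(List.of("shard-0", "shard-1"), ignored);
        System.out.println(result.keySet()); // only shard-1 has an entry
        // Readers fall back to an empty set for shards without an entry.
        System.out.println(result.getOrDefault("shard-0", Collections.emptySet()).isEmpty());
    }
}
```

Note that `java.util.Collections.emptySet()` returns a shared immutable instance, so if that is what `emptySet()` resolves to in the referenced code, the saving would be in map entries rather than in set allocations.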

khushbr · 2023-12-05T07:26:12Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+                shardToIgnoreNodes
+            );
+
+            if (shardBatchState.hasData()) {


In what scenario will shardBatchState not have data? Should we add a log statement for it?

While the fetch is still in progress or has failed.

What do you want to log?

Should we add a log statement for it?

It'll start creating too many logs, let's avoid that.

khushbr · 2023-12-05T07:43:07Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+
+        @Override
+        @SuppressWarnings("unchecked")
+        protected AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShardsBatch.NodeGatewayStartedShardsBatch> fetchData(Set<ShardRouting> shardsEligibleForFetch,


I am trying to understand: is the method fetchData(Set<ShardRouting> shardsEligibleForFetch, Set<ShardRouting> inEligibleShards, RoutingAllocation allocation) fetching the response for a single batch or across all batches?

I am assuming it is the former, and that the whole convoluted logic of shardId -> shardRouting -> batchID -> shardsBatch -> shardBatchState exists for the method override. If this is true, can we:

Rename shardsEligibleForFetch to eligibleShardsInBatch and inEligibleShards to ineligibleShardsInBatch

Split this into two methods, fetchDataShardsBatch and fetchData:

    AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShardsBatch.NodeGatewayStartedShardsBatch> fetchDataShardsBatch(
        Set<ShardRouting> eligibleShardsInBatch,
        Set<ShardRouting> ineligibleShardsInBatch,
        RoutingAllocation allocation
    ) {
        ...
        ShardsBatch shardsBatch = ...
        return fetchData(shardsBatch, allocation);
    }

    AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShardsBatch.NodeGatewayStartedShardsBatch> fetchData(
        ShardsBatch shardsBatch,
        RoutingAllocation allocation
    ) {
        ...
    }

The renaming part is fine.

On the second point, that would be a good suggestion only if the two functions catered to fetchData for both primaries and replicas, since both fetchData implementations in the original code do exactly the same thing. Otherwise, if we follow the approach you suggested, don't you think it would be overkill? It would lead to four different methods (two each for replicas and primaries) that are each used in a single place and have no reuse.

So, extending your thought process, if we can do this then we can avoid some code duplication:

    @Override
    @SuppressWarnings("unchecked")
    protected AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedBatchShards.NodeGatewayStartedShardsBatch> fetchData(
        Set<ShardRouting> shardsEligibleForFetch,
        Set<ShardRouting> inEligibleShards,
        RoutingAllocation allocation
    )
        ShardsBatch shardsbatch=fetchDataShardBatch(shardsEligibleForFetch, inEligibleShards)
        fetchDataForShardBatch(shardsbatch, shardsbatch.primary())

Same as above for replicas

Based on that, we will implement two more methods: fetchDataShardBatch, to get the batch for a set of shards, and fetchDataForShardBatch, which returns a generic response that is later type-cast by the respective fetchData() call of primaries/replicas.
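
The shape of that proposal can be sketched as below. This is an illustration under stated assumptions, not the PR's code: `ShardsBatch` is a hypothetical stand-in, shards are plain strings, and the sketch swaps the type cast described above for a generic type parameter plus a fetch function, so that one shared helper can serve both primaries and replicas.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class FetchDataSplitSketch {
    // Hypothetical stand-in for the PR's ShardsBatch.
    static final class ShardsBatch {
        final String batchId;
        final Set<String> shardIds;

        ShardsBatch(String batchId, Set<String> shardIds) {
            this.batchId = batchId;
            this.shardIds = shardIds;
        }
    }

    // Step 1 (shared): resolve shard -> batchId -> ShardsBatch; null when no shard is batched.
    static ShardsBatch findBatch(Set<String> eligibleShards, Map<String, String> shardToBatchId, Map<String, ShardsBatch> batches) {
        for (String shardId : eligibleShards) {
            String batchId = shardToBatchId.get(shardId);
            if (batchId != null) {
                return batches.get(batchId);
            }
        }
        return null;
    }

    // Step 2 (shared): fetch one response per shard in the batch. The type
    // parameter takes the place of a cast in each fetchData override.
    static <T> Map<String, T> fetchDataForShardBatch(ShardsBatch batch, Function<String, T> fetchOne) {
        Map<String, T> responses = new HashMap<>();
        for (String shardId : batch.shardIds) {
            responses.put(shardId, fetchOne.apply(shardId));
        }
        return responses;
    }

    public static void main(String[] args) {
        ShardsBatch batch = new ShardsBatch("batch-A", Set.of("shard-0", "shard-1"));
        Map<String, String> shardToBatchId = Map.of("shard-0", "batch-A", "shard-1", "batch-A");
        Map<String, ShardsBatch> batches = Map.of("batch-A", batch);

        ShardsBatch found = findBatch(Set.of("shard-1"), shardToBatchId, batches);
        // A primary caller and a replica caller would pass different fetchers.
        Map<String, String> primaryData = fetchDataForShardBatch(found, id -> "started:" + id);
        Map<String, Integer> replicaData = fetchDataForShardBatch(found, String::length);
        System.out.println(primaryData.get("shard-0"));
        System.out.println(replicaData.get("shard-1"));
    }
}
```

With this split, each typed fetchData override would reduce to a batch lookup followed by one call to the shared helper.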

khushbr · 2023-12-05T07:44:17Z

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java

+            String batchId = getBatchId(shard, shard.primary());
+            return batchId!=null;


merge into getBatchId(shard, shard.primary()) != null; ?

ticheng-aws · 2024-01-05T23:50:22Z

Hi @Gaurav614, is this being worked on? Please feel free to reach out to the maintainers for further reviews.

Gaurav614 · 2024-02-12T08:18:44Z

The changes in this PR are no longer needed, since we have refactored them into this PR: https://github.com/opensearch-project/OpenSearch/pull/8746/files

FetchData changes for primaries and replicas

61f8c61

Signed-off-by: Gaurav Chandani <chngau@amazon.com>

shiv0408 reviewed Aug 8, 2023

amkhar reviewed Aug 9, 2023

shiv0408 reviewed Aug 21, 2023

Incorporated changes from dependent PRs

b7e2119

Signed-off-by: Gaurav Chandani <chngau@amazon.com>

shiv0408 mentioned this pull request Oct 4, 2023

Fixed Allocation Explain API in batch mode #10348

Closed

7 tasks

opensearch-trigger-bot bot added the stalled Issues that have stalled label Oct 6, 2023

khushbr reviewed Dec 5, 2023

Gaurav614 marked this pull request as ready for review December 6, 2023 09:24

Gaurav614 requested review from abbashus, adnapibar, anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, dbwiddis, dreamer-89, gbbafna, kartg, kotwanikunal, mch2 and msfroh as code owners December 6, 2023 09:24

Gaurav614 requested review from nknize, owaiskazi19, peternied, reta, Rishikesh1159, ryanbogan, sachinpkale, saratvemulapalli, setiah, shwetathareja, sohami, tlfeng and VachaShah as code owners December 6, 2023 09:24

ticheng-aws added the enhancement Enhancement or improvement to existing feature or request label Jan 5, 2024

opensearch-trigger-bot bot removed the stalled Issues that have stalled label Jan 12, 2024

Gaurav614 closed this Feb 12, 2024
