Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17381 SolrJ fix to fetch entire ClusterState if asked #2853

Merged
merged 10 commits into from
Dec 9, 2024

Conversation

aparnasuresh85
Copy link
Contributor

Description

Address a test failure in ClusterStateProviderTest.testGetClusterStatus when ClusterStateProvider (CSP) chosen is Http2ClusterStateProvider. More importantly, while ClusterStateProvider.getClusterStatus() is highly discouraged, it should still be functional since there are codepaths using it.

Solution

If a call to get CLUSTERSTATUS intending to get collections data specifies a null value for collection parameter, proceed to fetch entire cluster status since fetching the remaining (live_nodes, cluster properties, roles etc) should not be time consuming anyway.

Tests

Ensure ClusterStateProviderTest.testGetClusterState() passes when chosen CSP is Http2ClusterStateProvider. (Currently the test fails with a NPE)

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: These params.set calls are passing "true" | "false" but there's a nice boolean (i.e. a "sugar" method) for the primitive. Could you change that please? It's a minor matter I know but you wrote this code not long ago.

I understand how this change fixes the bug but it feels like the solution optimizes for the smallest possible adequate fix. I look at the method's caller, fetchClusterState. By its name, it should be fetching the cluster's state, but it's kind of working in two modes -- everything or one collection. Can we have fetchClusterState do only the whole cluster (and pass null to submitClusterStateRequest for the request type to signify everything) and have a different/new fetchCollectionState method that only gets the state of one collection, returning a DocCollection? That will be the common code path, and this recommendation I propose would clarify what little work is pertinent to that code path. If this seems like too much, feel free to push back.

Also, I observe that the clusterProperties param is unused (always given null, I mean). Unsure if this is a recent bug/oversight.

@dsmiley dsmiley changed the title Allow the possibility to fetch entire cluster state SOLR-17381 fix to fetch entire cluster state if asked Nov 7, 2024
@dsmiley dsmiley changed the title SOLR-17381 fix to fetch entire cluster state if asked SOLR-17381 SolrJ fix to fetch entire ClusterState if asked Nov 7, 2024
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want my feedback to seen as blocking this fix; maybe this should be merged and we move on.

(notwithstanding "tidy" of course)

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much better!

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent

String collectionName = e.getKey();
Map<String, Object> collStateMap = (Map<String, Object>) e.getValue();
cs =
cs.copyWith(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling copyWith in a loop of collections is O(N^2) (N=#collections); right? Not a problem introduced in this PR but I just want to confer with you on this problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is indeed O(N^2).

* Use GenericSolrRequest not QueryRequest
* Removed unused collection property loading
List<String> liveNodesList = (List<String>) cluster.get("live_nodes");
if (liveNodesList != null) {
Set<String> liveNodes = new HashSet<>(liveNodesList);
this.liveNodes = liveNodes;
this.liveNodes = Set.copyOf(liveNodesList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small optimization

Map<String, DocCollection> collStateByName = new LinkedHashMap<>(collectionsNl.size());
for (Entry<String, NamedList<?>> entry : collectionsNl) {
final var collStateMap = entry.getValue().asMap(10);
final int zNodeVersion = (int) collStateMap.get("znodeVersion");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed a critical bug in your last version. We must use the ZK version for the DocCollection, not a bogus -1!

Comment on lines 144 to 146
var collectionsNl = (NamedList<NamedList<?>>) cluster.get("collections");

Map<String, DocCollection> collStateByName = new LinkedHashMap<>(collectionsNl.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer using ClusterState.copyWith, solving a perf problem

Comment on lines -176 to -180
if (clusterProperties != null) {
Map<String, Object> properties = (Map<String, Object>) cluster.get("properties");
if (properties != null) {
clusterProperties.putAll(properties);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was dead code; nobody calls this method with clusterProperties anymore

@github-actions github-actions bot added the tests label Dec 8, 2024
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ready, assuming tests pass. @aparnasuresh85 happy to review with you.


@SuppressWarnings({"rawtypes", "unchecked"})
private DocCollection fillPrs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inlined fillPrs into getDocCollectionFromObjects and added a bit more responsibility here so that the callers needn't handle zk version.

params.set("action", "CLUSTERSTATUS");

params.set("includeAll", false); // will flip flor CLUSTER_STATE
switch (requestType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch is more idiomatic

@@ -117,4 +149,39 @@ private Instant getCreationTimeFromClusterStatus(String collectionName)
Map<String, Object> collection = (Map<String, Object>) collections.get(collectionName);
return Instant.ofEpochMilli((long) collection.get("creationTimeMillis"));
}

@Test
public void testClusterStateProvider() throws SolrServerException, IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the process of writing this test, I discovered some keys like "health" in the base ZkNodeProps that prevented equivalency with the ZK one. So I enhanced HttpCSP to remove these keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this test to a parameterized test to guarantee we test with both impls.

return new ZkClientClusterStateProvider(cluster.getZkStateReader());
}

private final Supplier<ClusterStateProvider> cspSupplier;

public ClusterStateProviderTest(String method) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by using a String arg that refers to a method, the printed test execution is clear as to what is being invoked. This uses reflection allows deferred execution of construction of the CSP, which is useful.

case FETCH_NODE_ROLES -> params.set("roles", true);
}
params.set("prs", true);
var request = new GenericSolrRequest(METHOD.GET, "/admin/collections", params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formerly was QueryRequest which was wrong even though it apparently worked.

@aparnasuresh85
Copy link
Contributor Author

Should be ready, assuming tests pass. @aparnasuresh85 happy to review with you.

LGTM - could be merged when all test pass

@dsmiley
Copy link
Contributor

dsmiley commented Dec 9, 2024

The test failure is known flaky
I'll merge.

@dsmiley dsmiley merged commit d2045f6 into apache:main Dec 9, 2024
3 of 4 checks passed
dsmiley pushed a commit that referenced this pull request Dec 9, 2024
#2853)

When using the HTTP ClusterStateProvider (not ZK), getClusterState() wasn't working correctly; a regression from the first PR for this JIRA issue (not released).
Also,
* Optimization: fix O(N^2) algorithm to be O(N) for the number of collections when calling getClusterState
* liveNodes is now immutable, and probably a non-sorted ordering
* removed "health" and some other keys from DocCollection that aren't present when using ZK CSP

Minor details:
* clearly differentiate internal ClusterState processing from getting one DocCollection
* Use GenericSolrRequest not QueryRequest
* test the both CSPs better
---------

Co-authored-by: David Smiley <dsmiley@salesforce.com>

(cherry picked from commit d2045f6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants