SOLR-16257: `ZkStateReader` changes to avoid race condition between `collectionWatches` and `watchedCollectionStates` #909

patsonluk · 2022-06-17T01:53:38Z

https://issues.apache.org/jira/browse/SOLR-16257

Description

As described in the Jira issue, ZkStateReader fields collectionWatches and watchedCollectionStates can run into race conditions.

And such condition might produce stale clusterState for some collection which can no longer be updated (until a new watch is registered for such collection)

The existing design is intended to synchronize the 2 map fields watchedCollectionStates and collectionWatches (ie if a collection is no longer watched in collectionWatches, then no entry for that collection should exist in watchedCollectionStates). However, it is hard to guarantee such "synchronization", one way to break it is demonstrated in "Steps to reproduce a race condition" within the Jira issue

Take note that even though in most cases watchedCollectionStates and collectionWatches should have the same keys, there seems to one valid case which if a collection to be created could be in collectionWatches but not yet in watchedCollectionStates.

Solution

Perhaps it's easier to simply eliminate watchedCollectionStates and add such state (as DocCollection) to the CollectionWatch itself (now added a new class DocCollectionWatch with contains a ref to the DocCollection) , such that we no longer need to worry about entry removed in collectionWatches somehow still remains in watchedCollectionStates.

Take note that there were 2 cases which used watchedCollectionStates.keySet(), that requires some extra consideration:

In forciblyRefreshAllClusterStateSlow, which we are proposing to just replace watchedCollectionStates.keySet() with collectionWatches.keySet(), the reasoning is that we should forcibly update every collection that is registered in collectionWatches, even if previous fetch on such collection returned null (ie watchedCollectionStates would have no entry on it). This is a minor change of behavior but I think it is more "correct"? Would really need more experienced Solr dev to share their thoughts here 🙏🏼
In getCurrentCollections, which we are proposing to add ONLY collections in collectionWatches with DocCollection currentDoc that is non null (ie, only collection that does exist from last update)

Tests

Added testForciblyRefreshAllClusterState, testGetCurrentCollections and testWatchRaceCondition to ZkStateReaderTest.

The first 2 are to verify the refactoring maintain the same behaviors for those 2 methods. And the last one is to test specifically race condition mentioned (hard to reproduce tho).

Since I am very new to this repo, I don't know the convention for junit testing on private methods (or is it a nono to test on them?). I usually would just make them package private for other projects but I see that ZkStateReaderTest is not within the same package of ZkStateReader hence I cannot do that. For now I changed those 2 methods to public, which feels wrong 😅

Please kindly advise the common approach for testing private methods 🙇🏼

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

Remarks

The ./gradlew check failed once on test case TestTlogReplica with warning of

WARN  (ZkTestServer Run Thread) [] o.a.s.c.ZkTestServer Watch limit violations:
...

Re-running ./gradlew check, the test no longer fails. Going to spend a bit more time to see whether the failure is related to this change.

so far no luck reproducing it with gradlew :solr:core:test --tests "org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=C004E5CD6B2D226F -Ptests.file.encoding=US-ASCII

…nd watchedCollectionStates

patsonluk · 2022-06-17T20:33:10Z

~~The details of one test failure that I cannot reproduce:~~ Probably a red herring, this is likely just a warning that get printed but is expected for our case. Since we repeatedly register/remove watches

  2> 1444230 INFO  (SUITE-TestTlogReplica-seed#[C004E5CD6B2D226F]-worker) [n:127.0.0.1:56946_solr] o.a.s.c.ZkTestServer Shutting down ZkTestServer.
  2> 1444438 WARN  (ZkTestServer Run Thread) [] o.a.s.c.ZkTestServer Watch limit violations: 
  2> Maximum concurrent create/delete watches above limit:
  2> 
  2> 	33	/solr/collections/tlog_replica_test_create_delete/terms/shard1
  2> 	33	/solr/configs/conf
  2> 	31	/solr/collections/tlog_replica_test_create_delete/terms/shard2
  2> 	20	/solr/collections/tlog_replica_test_remove_leader/terms/shard1
  2> 	20	/solr/collections/tlog_replica_test_create_delete/state.json
  2> 	19	/solr/collections/tlog_replica_test_recovery/terms/shard1
  2> 	18	/solr/collections/tlog_replica_test_real_time_get/terms/shard1
  2> 	17	/solr/collections/tlog_replica_test_basic_leader_election/terms/shard1
  2> 	17	/solr/collections/tlog_replica_test_kill_leader/terms/shard1
  2> 	15	/solr/collections/tlog_replica_test_add_docs/terms/shard1
  2> 	11	/solr/collections/tlog_replica_test_delete_by_id/terms/shard1
  2> 	11	/solr/collections/tlog_replica_test_only_leader_indexes/terms/shard1
  2> 	10	/solr/collections/tlog_replica_test_rebalance_leaders/terms/shard1
  2> 	9	/solr/collections/tlog_replica_test_out_of_order_db_qwith_in_place_updates/terms/shard1
  2> 	9	/solr/aliases.json
  2> 	9	/solr/clusterprops.json
  2> 	8	/solr/packages.json
  2> 	8	/solr/security.json
  2> 	8	/solr/collections/tlog_replica_test_add_remove_tlog_replica/terms/shard1
  2> 	8	/solr/collections/tlog_replica_test_add_remove_tlog_replica/terms/shard2
  2> 	8	/solr/collections/tlog_replica_test_real_time_get/state.json
  2> 	7	/solr/collections/tlog_replica_test_add_docs/state.json
  2> 	7	/solr/collections/tlog_replica_test_remove_leader/state.json
  2> 	7	/solr/collections/tlog_replica_test_add_remove_tlog_replica/state.json
  2> 	6	/solr/collections/tlog_replica_test_basic_leader_election/state.json
  2> 	6	/solr/collections/tlog_replica_test_only_leader_indexes/state.json
  2> 	5	/solr/collections/tlog_replica_test_out_of_order_db_qwith_in_place_updates/state.json
  2> 	5	/solr/collections/tlog_replica_test_kill_leader/state.json
  2> 	5	/solr/collections/tlog_replica_test_recovery/collectionprops.json
  2> 	5	/solr/collections/tlog_replica_test_rebalance_leaders/state.json
  2> 	5	/solr/collections/tlog_replica_test_delete_by_id/state.json
  2> 	4	/solr/collections/tlog_replica_test_recovery/state.json
  2> 	3	/solr/collections/tlog_replica_test_rebalance_leaders/collectionprops.json
  2> 	3	/solr/collections/tlog_replica_test_kill_leader/collectionprops.json
  2> 	3	/solr/collections/tlog_replica_test_basic_leader_election/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_delete_by_id/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_add_remove_tlog_replica/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_out_of_order_db_qwith_in_place_updates/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_add_docs/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_remove_leader/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_only_leader_indexes/collectionprops.json
  2> 	2	/solr/collections/tlog_replica_test_real_time_get/collectionprops.json
  2> 
  2> Maximum concurrent data watches above limit:
  2> 
  2> 	97	/solr/collections/tlog_replica_test_recovery/state.json
  2> 	79	/solr/collections/tlog_replica_test_create_delete/state.json
  2> 	51	/solr/collections/tlog_replica_test_kill_leader/state.json
  2> 	49	/solr/collections/tlog_replica_test_out_of_order_db_qwith_in_place_updates/state.json
  2> 	48	/solr/collections/tlog_replica_test_basic_leader_election/state.json
  2> 	44	/solr/collections/tlog_replica_test_remove_leader/state.json
  2> 	37	/solr/collections/tlog_replica_test_add_remove_tlog_replica/state.json
  2> 	21	/solr/collections/tlog_replica_test_rebalance_leaders/state.json
  2> 	20	/solr/collections/tlog_replica_test_real_time_get/state.json
  2> 	19	/solr/collections/tlog_replica_test_add_docs/state.json
  2> 	15	/solr/collections/tlog_replica_test_only_leader_indexes/state.json
  2> 	15	/solr/collections/tlog_replica_test_delete_by_id/state.json
  2> 	4	/solr/collections/tlog_replica_test_recovery/leader_elect/shard1/election/72082096014163974-core_node4-n_0000000000
  2> 	4	/solr/overseer_elect/election/72082096014163974-127.0.0.1:56945_solr-n_0000000001
  2> 	3	/solr/collections/tlog_replica_test_create_delete/leader_elect/shard1/election/72082096014163984-core_node11-n_0000000000
  2> 	2	/solr/collections/tlog_replica_test_create_delete/leader_elect/shard1/election/72082096014163984-core_node12-n_0000000000
  2> 	2	/solr/overseer_elect/election/72082096014163984-127.0.0.1:56946_solr-n_0000000005
  2> 	2	/solr/collections/tlog_replica_test_create_delete/leader_elect/shard2/election/72082096014163984-core_node9-n_0000000001
  2> 	2	/solr/collections/tlog_replica_test_create_delete/leader_elect/shard1/election/72082096014163988-core_node14-n_0000000001
  2> 
  2> Maximum concurrent children watches above limit:
  2> 
  2> 	87	/solr/collections
  2> 	52	/solr/live_nodes
  2> 	48	/solr/collections/tlog_replica_test_create_delete/state.json
  2> 	32	/solr/collections/tlog_replica_test_recovery/state.json
  2> 	22	/solr/collections/tlog_replica_test_kill_leader/state.json
  2> 	21	/solr/collections/tlog_replica_test_out_of_order_db_qwith_in_place_updates/state.json
  2> 	20	/solr/collections/tlog_replica_test_basic_leader_election/state.json
  2> 	20	/solr/collections/tlog_replica_test_remove_leader/state.json
  2> 	14	/solr/collections/tlog_replica_test_add_docs/state.json
  2> 	13	/solr/collections/tlog_replica_test_add_remove_tlog_replica/state.json
  2> 	13	/solr/collections/tlog_replica_test_real_time_get/state.json
  2> 	12	/solr/collections/tlog_replica_test_rebalance_leaders/state.json
  2> 	12	/solr/collections/tlog_replica_test_only_leader_indexes/state.json
  2> 	12	/solr/collections/tlog_replica_test_delete_by_id/state.json
  2> 	3	/solr/overseer/queue
  2> 	3	/solr/overseer/collection-queue-work
  2> 	3	/solr/overseer/queue-work
  2>

…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`

…tches instead of watched collection states, as newly created collection might not yet be reflected in watched collection states and should be refreshed.

…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`

Forbidden method invocation: java.util.concurrent.Executors#newSingleThreadExecutor() [Spawns threads with vague names; use a custom thread factory (Lucene's NamedThreadFactory, Solr's SolrNamedThreadFactory) and name threads so that you can tell (by its name) which executor it is associated with] in org.apache.solr.cloud.overseer.ZkStateReaderTest (ZkStateReaderTest.java:359)

Forbidden method invocation: java.util.concurrent.Executors#newSingleThreadExecutor(java.util.concurrent.ThreadFactory) [Spawns threads without MDC logging context; use ExecutorUtil.newMDCAwareSingleThreadExecutor instead]

HoustonPutman

Thanks for tackling this Patson!

I haven't gone over everything, but took a first-pass look. I think we should probably keep the helper methods instead of sub-classing ConcurrentHashMap. But that's a separate issue than what you are trying to fix really.

solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

patsonluk · 2022-06-21T22:51:02Z

The original junit error was

 2> 1393351 INFO  (TEST-TestTlogReplica.testRebalanceLeaders-seed#[C004E5CD6B2D226F]) [] o.a.s.SolrTestCaseJ4 ###Ending testRebalanceLeaders
   >     java.lang.AssertionError: Can not find doc 1 in http://127.0.0.1:56945/solr
   >         at __randomizedtesting.SeedInfo.seed([C004E5CD6B2D226F:DEEBD73D53866E1F]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.assertTrue(Assert.java:42)
   >         at org.junit.Assert.assertNotNull(Assert.java:713)
   >         at org.apache.solr.cloud.TestTlogReplica.checkRTG(TestTlogReplica.java:1138)
   >         at org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders(TestTlogReplica.java:839)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

~~Seems to not happening anymore after the additional commits~~ . Still happens occasionally with ./gradlew :solr:core:beast -Ptests.dups=10 --tests "org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=C004E5CD6B2D226F -Ptests.file.encoding=US-ASCII

Not sure if it's related to this change. This failures seems to happen occasionally http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders . Not sure if my mac book makes a diff tho 😓 (1 out of 3 X 10 tests.dups)

Fixed formatting

patsonluk · 2022-06-22T00:04:31Z

Thanks for tackling this Patson!

I haven't gone over everything, but took a first-pass look. I think we should probably keep the helper methods instead of sub-classing ConcurrentHashMap. But that's a separate issue than what you are trying to fix really.

Yes we can discuss about that, I don't have any strong opinion here.

My design of putting it into a separate class is based on the reasoning that updating the doc collection is tightly tied with the state of the of the watcher, especially that the collectionsWatched contains the DocCollection state now 😄 . Therefore I feel that might be reasonable encapsulation.

With the helper approach, it might be slightly harder to figure out the helper modifies the field by inspecting the method signature alone. It's more of a personal preference that I found methods with side effect a bit hard to track sometimes 😅 (even my current approach it still modifies the state ie has side effects too! But I think it's more obvious that the call/change is applied to the collectionsWatched object 😊 )

Thank you for the helpful code review!

patsonluk · 2022-06-29T00:29:55Z

Friendly ping @HoustonPutman . Wondering if you have any extra thoughts about this PR? 😊

HoustonPutman

Not sure if it's related to this change. This failures seems to happen occasionally http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders . Not sure if my mac book makes a diff tho 😓 (1 out of 3 X 10 tests.dups)

Yeah if it's on that list you generally don't need to worry about your likely unrelated PR making it worse.

My design of putting it into a separate class is based on the reasoning that updating the doc collection is tightly tied with the state of the of the watcher, especially that the collectionsWatched contains the DocCollection state now 😄 . Therefore I feel that might be reasonable encapsulation.

Ok I think I'm on your side now. But I do think there is some confusion between having a null value in collectionsWatches and having a null currentDoc in the value. Can you explain that difference?

Also can we rename currentDoc to currentState? Also DocCollectionWatch would be better named StatefulCollectionWatch, because the important characteristic is that its saving the current state of the collection.

Thanks again for the work here!

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java

HoustonPutman · 2022-07-14T18:16:54Z

I have a commit to fix the return signature, could you check-off the Allow maintainers to commit option?

cpoerschke · 2022-07-14T18:30:57Z

... could you check-off the Allow maintainers to commit option?

It seems that the PR might be from an organisation repo and possibly the isaacs/github#1681 might still apply then, just FYI in case it does.

HoustonPutman · 2022-07-14T18:33:15Z

Ahh thanks for reminding me about that @cpoerschke!

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

…ng 10k times

….java Co-authored-by: Houston Putman <houstonputman@gmail.com>

patsonluk · 2022-07-14T23:22:35Z

@madrob @HoustonPutman I added 2 commits to better reproduce the race condition (copied the test changes to main branch and confirmed that it does trigger the failure). https://github.com/apache/solr/compare/24a05df012f1354ab210fa8e9a593c249f8254ad..0feb4e61bd02afdfff282b4495bf2bf8cf5050d3

Seems like we have addressed all the review comments? If so, what are the next steps? Many thanks again! 😊

madrob · 2022-07-15T15:04:26Z

solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java

+    currentCollections =
+        reader.getCurrentCollections(); // should detect both collections (c1 watched, c2 lazy
+    // loaded)


nit: this is strange wrapping, separate the comment onto its own line?

Thanks! I will try again, I ran both ./gradlew :solr:solrj:spotlessApply and ./gradlew :solr:core:spotlessApply and sometimes it produces some weird wrapping...

madrob · 2022-07-15T15:06:19Z

solr/solrj/src/java/org/apache/solr/common/util/CommonTestInjection.java

+      try {
+        log.info("Start: artificial delay for {}ms", delay);
+        Thread.sleep(delay);
+        log.info("Finish: artificial delay for {}ms", delay);


nit: move this to a finally block so that we still log even if we were interrupted.

Good point !

madrob · 2022-07-15T15:08:42Z

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

            lazyCollectionStates.put(collection, new LazyCollectionRef(collection));
            reconstructState.set(true);
+            CommonTestInjection.injectDelay(); // To unit test race condition


our common pattern is to hide this behind a java assert. See also https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/query/SolrRangeQuery.java#L183

O missed that one. Thanks for catching!

madrob · 2022-07-15T15:11:23Z

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

    if (log.isDebugEnabled()) {
      log.debug(
          "clusterStateSet: interesting [{}] watched [{}] lazy [{}] total [{}]",
-          collectionWatches.keySet().size(),
-          watchedCollectionStates.keySet().size(),
+          collectionWatches.watchedCollections().size(),
+          collectionWatches.activeCollectionCount(),
          lazyCollectionStates.keySet().size(),
          clusterState.getCollectionStates().size());
    }

    if (log.isTraceEnabled()) {
      log.trace(
          "clusterStateSet: interesting [{}] watched [{}] lazy [{}] total [{}]",
-          collectionWatches.keySet(),
-          watchedCollectionStates.keySet(),
+          collectionWatches.watchedCollections(),
+          collectionWatches.activeCollections(),
          lazyCollectionStates.keySet(),
          clusterState.getCollectionStates());
    }


Is there still technically a race condition here where the watched collections and active collections could change between the successive calls?

I believe it could change but it should be okay.

As far as I understand, we only need a strong guarantee/synchronization on the write ops. As for read op, it appears none of the read operation requires such guarantee, ie even if the value read has changed or is inconsistent with the watchedCollectionStates, it should not cause any serious issues. (it might print value that does not reflect the latest set or fetch/not fetch something - but these are not new behaviors)

The core of this PR is to remove the race conditions between write operations of 2 separate maps, which can cause stale state. And now since all write ops are on the same map with sync, we should be safe 😊 . As for read ops, it's pretty much the same as before, which does not need to have such strong guarantee I suppose?

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

madrob · 2022-07-15T15:18:11Z

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java

+              int oldCVersion =
+                  oldState.getPerReplicaStates() == null
+                      ? -1
+                      : oldState.getPerReplicaStates().cversion;
+              int newCVersion =
+                  newState.getPerReplicaStates() == null
+                      ? -1
+                      : newState.getPerReplicaStates().cversion;
+              if (oldState.getZNodeVersion() < newState.getZNodeVersion()
+                  || oldCVersion < newCVersion) {


Can we add a test that creates c1, gets the ref, deletes c1, creates c1 again, and then checks that the new ref is correct? The node version for both instances of c1 would be 0, but the cversion should be larger, so I think your logic still works, but I'd like to see a unit test for it. Or maybe this is already covered and I missed where it is in the tests? Thanks! Similar to what we saw in SOLR-16143.

@madrob Yes I think a test case to verify different version handling makes sense!

Added a new test case for that. And since we are testing cversion, i added PRS into the unit test (otherwise it will always be -1).

Does this address your concern?
https://github.com/apache/solr/pull/909/files#diff-c92a116c89a724226fb6efbea4c4f90ab70b40a779e1c50155c8eca646520358R227-R342

patsonluk · 2022-07-18T16:50:50Z

@HoustonPutman Thanks for triggering the workflow (I assume it was you? 😅 ), Now that all the checks have passed, what are the next steps? (trying to read up on https://cwiki.apache.org/confluence/display/SOLR/Git+commit+process and https://cwiki.apache.org/confluence/display/solr/HowToContribute#HowToContribute-Contributingyourwork)

There's no rush for this PR, but just want to make sure I am not holding off the process because of any missing items from me.

Big thank you again @HoustonPutman and @madrob !

HoustonPutman · 2022-07-20T15:24:35Z

@patsonluk I think this is good to go! I'll add an entry in the changelog, then commit and backport this! (Gonna run the full test suite first though)

HoustonPutman · 2022-07-20T15:27:30Z

Actually I can't push to your PR.

Can you add this at the end of the 9.1.0 "Improvements" section:

* SOLR-16257: Improve ZkStateReader to avoid race condition between collectionWatches and watchedCollectionStates
  (Patson Luk, Houston Putman, Mike Drob)

HoustonPutman · 2022-07-20T16:00:05Z

Tests pass, this is good to go when we get the Changelog entry

patsonluk · 2022-07-20T17:14:34Z

@HoustonPutman done! (hopefully this is the right place 😅 https://github.com/apache/solr/pull/909/files#diff-cdd1ae1b418ea74ca689c6e68e61f4e5b4c6f77f6408a4e6cf76ed8421b24d09)

HoustonPutman · 2022-07-20T19:01:54Z

Thanks so much for the work here @patsonluk , it was a long one, but I think very beneficial to the ZkStateReader!

Race condition existed between collectionWatches and watchedCollectionStates. (cherry picked from commit 97458a5)

patsonluk · 2022-07-20T22:16:39Z

Thank you so much for all the great reviews and code suggestions! @HoustonPutman @madrob 🎉

Hopefully such fix can eliminate occasional issue with Solr cluster state having the stale state, we do run into it once in a while in our prod and just once earlier this morning 😓

risdenk · 2022-08-08T13:34:40Z

FYI this has been causing NPE in ZkStateReaderTest - #966

patsonluk · 2022-08-17T23:54:06Z

FYI this has been causing NPE in ZkStateReaderTest - #966

Thanks for fixing it! 🙇🏼

…n states Downstream changes from apache/solr#909 and apache/solr#966

…d collection states (#176) * Changes to avoid race condition in ZkStateReader on watched collection states Downstream changes from apache/solr#909 and apache/solr#966 * Resolve merge conflict

patsonluk added 5 commits June 16, 2022 17:38

ZkStateReader changes to avoid race condition for collectionWatches a…

ac85088

…nd watchedCollectionStates

Fixed missing javadoc values

98813b2

/gradlew :solr:solrj:spotlessApply

a14a53a

Addressed :solr:solrj:validateLogCalls violations

548417c

/gradlew :solr:solrj:spotlessApply

2189b3a

patsonluk added 16 commits June 17, 2022 14:46

Fixed DocCollectionWatches to keep a set of active collections to mai…

aa026d6

…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`

Code cleanup

07611a1

forciblyRefreshAllClusterStateSlow should use key set in collectionWa…

4a6d6a2

…tches instead of watched collection states, as newly created collection might not yet be reflected in watched collection states and should be refreshed.

Fixed DocCollectionWatches to keep a set of active collections to mai…

7a8f960

…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`

/gradlew :solr:solrj:spotlessApply

eaa0211

Added unit test cases

027436b

./gradlew :solr:solrj:spotlessApply

d94a38b

Fixed gradle check

174b5c4

Forbidden method invocation: java.util.concurrent.Executors#newSingleThreadExecutor(java.util.concurrent.ThreadFactory) [Spawns threads without MDC logging context; use ExecutorUtil.newMDCAwareSingleThreadExecutor instead]

Code cleanup

8c32e86

./gradlew :solr:solrj:spotlessApply

1bfedb7

Reproduce race condition more consistently in ZkStateReaderTest

44243f6

./gradlew :solr:solrj:spotlessApply

22aa937

Code cleanup

c74c86b

Code cleanup

1861dba

Fixed unit test cases

479227d

HoustonPutman reviewed Jun 21, 2022

View reviewed changes

Do not use * for imports

d7b85e4

Fixed formatting

./gradlew :solr:solrj:spotlessApply

fbc91ff

HoustonPutman reviewed Jun 29, 2022

View reviewed changes

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java Outdated Show resolved Hide resolved

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java Show resolved Hide resolved

madrob reviewed Jun 29, 2022

View reviewed changes

HoustonPutman reviewed Jul 14, 2022

View reviewed changes

solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java Outdated Show resolved Hide resolved

patsonluk and others added 3 commits July 14, 2022 13:04

Use delay injection in race condition unit test case instead of runni…

c39e2a9

…ng 10k times

Further improved the unit test case

0feb4e6

Update solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader…

45181fb

….java Co-authored-by: Houston Putman <houstonputman@gmail.com>

madrob reviewed Jul 15, 2022

View reviewed changes

patsonluk added 5 commits July 15, 2022 10:23

Added comment for concurrent access concerns

c0dfcf8

Changes after code review

14afab4

Added extra unit test case on zk version handling.

7bad924

./gradlew :solr:core:spotlessApply

2ab50d4

More precise check condition

d3e6688

Update CHANGES.txt

e300041

HoustonPutman merged commit 97458a5 into apache:main Jul 20, 2022

HoustonPutman pushed a commit that referenced this pull request Jul 20, 2022

SOLR-16257: Improve ZkStateReader to avoid race condition (#909)

f1510b1

Race condition existed between collectionWatches and watchedCollectionStates. (cherry picked from commit 97458a5)

risdenk mentioned this pull request Aug 8, 2022

SOLR-16257: Fix ZkStateReaderTest NPE when checking node version #966

Merged

patsonluk added a commit to fullstorydev/lucene-solr that referenced this pull request Aug 17, 2022

Changes to avoid race condition in ZkStateReader on watched collectio…

44b45f0

…n states Downstream changes from apache/solr#909 and apache/solr#966

patsonluk mentioned this pull request Aug 18, 2022

Downstream changes to avoid race condition in ZkStateReader on watched collection states fullstorydev/lucene-solr#176

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-16257: `ZkStateReader` changes to avoid race condition between `collectionWatches` and `watchedCollectionStates` #909

SOLR-16257: `ZkStateReader` changes to avoid race condition between `collectionWatches` and `watchedCollectionStates` #909

patsonluk commented Jun 17, 2022 •

edited

Loading

patsonluk commented Jun 17, 2022 •

edited

Loading

HoustonPutman left a comment

patsonluk commented Jun 21, 2022 •

edited

Loading

patsonluk commented Jun 22, 2022 •

edited

Loading

patsonluk commented Jun 29, 2022

HoustonPutman left a comment

HoustonPutman commented Jul 14, 2022

cpoerschke commented Jul 14, 2022

HoustonPutman commented Jul 14, 2022

patsonluk commented Jul 14, 2022

madrob Jul 15, 2022

patsonluk Jul 15, 2022

madrob Jul 15, 2022

patsonluk Jul 15, 2022

madrob Jul 15, 2022

patsonluk Jul 15, 2022

madrob Jul 15, 2022

patsonluk Jul 15, 2022 •

edited

Loading

madrob Jul 15, 2022

patsonluk Jul 15, 2022

patsonluk commented Jul 18, 2022 •

edited

Loading

HoustonPutman commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

patsonluk commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

patsonluk commented Jul 20, 2022 •

edited

Loading

risdenk commented Aug 8, 2022

patsonluk commented Aug 17, 2022

SOLR-16257: ZkStateReader changes to avoid race condition between collectionWatches and watchedCollectionStates #909

SOLR-16257: ZkStateReader changes to avoid race condition between collectionWatches and watchedCollectionStates #909

Conversation

patsonluk commented Jun 17, 2022 • edited Loading

Description

Solution

Tests

Checklist

Remarks

patsonluk commented Jun 17, 2022 • edited Loading

HoustonPutman left a comment

Choose a reason for hiding this comment

patsonluk commented Jun 21, 2022 • edited Loading

patsonluk commented Jun 22, 2022 • edited Loading

patsonluk commented Jun 29, 2022

HoustonPutman left a comment

Choose a reason for hiding this comment

HoustonPutman commented Jul 14, 2022

cpoerschke commented Jul 14, 2022

HoustonPutman commented Jul 14, 2022

patsonluk commented Jul 14, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

patsonluk Jul 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

patsonluk commented Jul 18, 2022 • edited Loading

HoustonPutman commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

patsonluk commented Jul 20, 2022

HoustonPutman commented Jul 20, 2022

patsonluk commented Jul 20, 2022 • edited Loading

risdenk commented Aug 8, 2022

patsonluk commented Aug 17, 2022

SOLR-16257: `ZkStateReader` changes to avoid race condition between `collectionWatches` and `watchedCollectionStates` #909

SOLR-16257: `ZkStateReader` changes to avoid race condition between `collectionWatches` and `watchedCollectionStates` #909

patsonluk commented Jun 17, 2022 •

edited

Loading

patsonluk commented Jun 17, 2022 •

edited

Loading

patsonluk commented Jun 21, 2022 •

edited

Loading

patsonluk commented Jun 22, 2022 •

edited

Loading

patsonluk Jul 15, 2022 •

edited

Loading

patsonluk commented Jul 18, 2022 •

edited

Loading

patsonluk commented Jul 20, 2022 •

edited

Loading