Force Refresh Listeners when Acquiring all Operation Permits #36835

original-brownbear · 2018-12-19T12:56:12Z

Fixes the issue reproduced in the added tests:
- When a shard has open index requests that are waiting for a refresh, relocating that shard
  is blocked until that refresh happens (which may be never, as in the test scenario).
Fixed by:
- Before trying to acquire all permits for relocation, refresh if there are outstanding operations.

PS: I ran the added tests for a few thousand runs without trouble.
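
To make the failure mode concrete, here is a minimal sketch of the blocking pattern. It assumes a simplified model in which every in-flight write holds one permit until a refresh occurs; `PermitBlockSketch` and its members are hypothetical names for illustration, not Elasticsearch classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified, hypothetical model of the problem; none of these names are Elasticsearch APIs.
final class PermitBlockSketch {
    static final int TOTAL_PERMITS = 4;

    /** Returns {allPermitsAcquiredBeforeRefresh, allPermitsAcquiredAfterRefresh}. */
    static boolean[] run() throws InterruptedException {
        Semaphore permits = new Semaphore(TOTAL_PERMITS);
        CountDownLatch refreshed = new CountDownLatch(1);
        CountDownLatch writeHoldsPermit = new CountDownLatch(1);

        // An in-flight write that keeps its permit until a refresh happens.
        Thread write = new Thread(() -> {
            try {
                permits.acquire();
                try {
                    writeHoldsPermit.countDown();
                    refreshed.await();
                } finally {
                    permits.release();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        write.start();
        writeHoldsPermit.await();

        // Relocation needs every permit, but one is stuck behind the missing refresh.
        boolean before = permits.tryAcquire(TOTAL_PERMITS, 100, TimeUnit.MILLISECONDS);

        // Forcing the refresh lets the write finish and return its permit.
        refreshed.countDown();
        write.join();
        boolean after = permits.tryAcquire(TOTAL_PERMITS, 1, TimeUnit.SECONDS);
        return new boolean[] {before, after};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run();
        System.out.println("before refresh: " + result[0] + ", after refresh: " + result[1]);
    }
}
```

In this model the first attempt to take all permits times out, and the second succeeds once the refresh has been forced. Without that forced refresh the waiting write would hold its permit indefinitely.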

elasticmachine · 2018-12-19T12:56:14Z

Pinging @elastic/es-distributed

original-brownbear · 2018-12-19T12:58:49Z

server/src/main/java/org/elasticsearch/index/shard/IndexShardOperationPermits.java

     * @throws InterruptedException      if calling thread is interrupted
     * @throws TimeoutException          if timed out waiting for in-flight operations to finish
     * @throws IndexShardClosedException if operation permit has been closed
     */
    <E extends Exception> void blockOperations(
            final long timeout,
            final TimeUnit timeUnit,
+            final CheckedRunnable<E> onActiveOperations,


It seems the only production use case for this method is in relocation, so I admit it's a little noisy to add this kind of general callback here, but it still seems like the smallest possible change to get a hook to run the refresh conditionally here (after preventing new operations from piling on more waits concurrently).

ywelsch

This still has a race, I think. The issue is that there's no guarantee that the refresh will happen after all pending requests have registered a refresh listener. To ensure this, we need multiple steps: first, ensure that no new listeners are registered (this can be achieved by setting getMaxRefreshListeners to 0), and then do a manual refresh to free all existing listeners. I don't think there is a need to inline all this into blockOperations; it can be done before calling this method.
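
The ordering described above (stop accepting listeners first, then refresh) could be sketched roughly as follows. `ListenerRegistrySketch` is a hypothetical stand-in, not the real RefreshListeners class, and running a rejected listener immediately is an assumption made here to stand in for refreshing on the caller's behalf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a refresh-listener registry; not the real RefreshListeners class.
final class ListenerRegistrySketch {
    private final Object lock = new Object();
    private List<Runnable> pending = new ArrayList<>();
    private boolean addAllowed = true;

    /** Queues the listener and returns true, or runs it right away and returns false when adding is blocked. */
    boolean addOrNotify(Runnable listener) {
        synchronized (lock) {
            if (addAllowed) {
                pending.add(listener);
                return true;
            }
        }
        listener.run(); // assumption: stands in for an immediate refresh on behalf of the caller
        return false;
    }

    /** Step 1: stop accepting listeners. Step 2: "refresh", which completes everything already queued. */
    void blockAndForceRefresh() {
        List<Runnable> toFire;
        synchronized (lock) {
            addAllowed = false;
            toFire = pending;
            pending = new ArrayList<>();
        }
        toFire.forEach(Runnable::run);
    }

    /** Re-allows listener registration once the permits have been dealt with. */
    void allowAdd() {
        synchronized (lock) {
            addAllowed = true;
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pending.size();
        }
    }
}
```

Because the flag is flipped under the same lock that guards registration, a listener either lands in the queue before the drain or is never queued at all, which is the property the race needs.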

original-brownbear · 2018-12-20T10:03:52Z

@ywelsch alright, yours is a much better plan :) => reverted my approach and implemented that in f40c6a6 (sorry for accidental rebase)

server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java

ywelsch

Thanks @original-brownbear. The concurrency looks better. I think we need to extend this to all actions that can possibly acquire all operation permits. In particular, I think this might also cause problems on replicas, e.g. when a replica learns of a new primary and tries to bump its term (see IndexShard#bumpPrimaryTerm). If it then has a refresh=wait_for op waiting (from the old primary), it will run into the same issue, and indefinitely stop accepting any writes from the new primary.

server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

ywelsch · 2018-12-27T12:29:17Z

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

        try {
+            if (refreshListeners.refreshNeeded()) {
+                refresh("relocated");


As we always want to do the refresh after calling disallowAdd, I wonder if we should combine both into one method.

I couldn't find a neat way of doing that, since we have both the async case and the blocking case of acquiring all the permits, and we now want to enforce try-finally semantics for re-allowing the listeners in both. I'm not sure it actually makes things more readable if we hide the handling of exceptions from refresh(...) in some other method. I can try finding a nice way though :)
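
One possible shape for those try-finally semantics, covering both the blocking and the async case, is sketched below. All names (`ReleaseHandle`, `ForceRefreshScopeSketch`, `withAllPermitsBlocking`, `withAllPermitsAsync`) are hypothetical, and the forced refresh itself is left out so that only the release logic is shown.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical handle type: close() re-allows listeners and throws no checked exception.
interface ReleaseHandle extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical sketch; these names are illustrative and not Elasticsearch APIs.
final class ForceRefreshScopeSketch {
    // Callers currently forcing refreshes; listeners may be added only while this is 0.
    private final AtomicInteger forcers = new AtomicInteger();

    boolean listenersAllowed() {
        return forcers.get() == 0;
    }

    /** Blocks new listeners (the forced refresh is omitted here) and returns the undo handle. */
    ReleaseHandle forceRefreshes() {
        forcers.incrementAndGet();
        return forcers::decrementAndGet;
    }

    /** Blocking case: try-with-resources provides the try/finally semantics directly. */
    <T> T withAllPermitsBlocking(Supplier<T> whileHoldingAllPermits) {
        try (ReleaseHandle ignored = forceRefreshes()) {
            return whileHoldingAllPermits.get();
        }
    }

    /** Async case: the handle is released when the future completes, normally or exceptionally. */
    <T> CompletableFuture<T> withAllPermitsAsync(Supplier<CompletableFuture<T>> acquireAndRun) {
        ReleaseHandle release = forceRefreshes();
        try {
            return acquireAndRun.get().whenComplete((result, error) -> release.close());
        } catch (RuntimeException e) {
            release.close(); // starting the async work failed, so release here instead
            throw e;
        }
    }
}
```

Returning a handle keeps the "disallow, refresh, re-allow" sequence in one place while letting each call site decide when the release happens.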

server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java

original-brownbear · 2018-12-27T15:09:13Z

@ywelsch all points addressed I think => should be good for another review.

I think we need to extend this to all actions that can possibly acquire all operation permits.

Done, I wrapped all cases of acquiring all permits I could find. It seems, though, that IndexShard#bumpPrimaryTerm was the only production code use case.
The other two methods where I added the logic seem to be called only from tests.

ywelsch

I've pushed daa9fc7, which simplifies the code imho. We will also need unit tests for the RefreshListeners class.

original-brownbear · 2018-12-28T11:53:52Z

Ok thanks, I'll add some tests :)

original-brownbear · 2018-12-28T12:41:10Z

@ywelsch ok fixed :) Added test in 71bef56. Should be good for review now.

ywelsch

LGTM

ywelsch · 2018-12-28T13:34:16Z

server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java

+                throw e;
+            }
+        }
+        return () -> runOnce.run();


Maybe we could assert before this line that refreshListeners == null?
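
For context, a method ending in a fragment like the one above might look roughly like this. It is a hedged sketch only: `ForceRefreshHandleSketch`, its `refresh` hook, and the `forcers()` accessor are assumptions for illustration, and the body is not taken from the actual RefreshListeners code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a "force refreshes" method; not the actual RefreshListeners code.
final class ForceRefreshHandleSketch {
    private final Runnable refresh;   // assumed hook that performs the refresh
    private int refreshForcers = 0;   // callers currently blocking new listeners

    ForceRefreshHandleSketch(Runnable refresh) {
        this.refresh = refresh;
    }

    synchronized int forcers() {
        return refreshForcers;
    }

    /** Blocks new listeners, refreshes once, and returns a release action that is safe to run repeatedly. */
    Runnable forceRefreshes() {
        synchronized (this) {
            refreshForcers += 1;
        }
        AtomicBoolean released = new AtomicBoolean();
        Runnable releaseOnce = () -> {
            // Only the first invocation decrements the counter.
            if (released.compareAndSet(false, true)) {
                synchronized (this) {
                    refreshForcers -= 1;
                }
            }
        };
        try {
            refresh.run();
        } catch (RuntimeException e) {
            releaseOnce.run(); // do not leave listeners blocked if the refresh failed
            throw e;
        }
        // A post-condition check such as "no listeners remain" would fit here.
        return releaseOnce;
    }
}
```

Counting forcers rather than toggling a flag lets several concurrent callers block listeners at once, and the run-once guard keeps a double release from unblocking listeners too early.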

ywelsch · 2018-12-28T13:37:13Z

Please adapt PR title to make it clear this is not only for relocations. For example: "Force refresh listeners when acquiring all operation permits"

tlrx · 2019-01-07T09:12:27Z

Thanks for fixing this @original-brownbear !

original-brownbear added >bug v7.0.0 :Distributed Indexing/Distributed v6.6.0 labels Dec 19, 2018

jasontedor added v6.7.0 and removed v6.6.0 labels Dec 19, 2018

original-brownbear requested a review from ywelsch December 19, 2018 14:52

ywelsch suggested changes Dec 20, 2018

original-brownbear added 2 commits December 20, 2018 11:00

CR: different fix

f40c6a6

original-brownbear force-pushed the relocation-refresh-fix branch from edba8bf to f40c6a6 Compare December 20, 2018 10:01

original-brownbear requested a review from ywelsch December 20, 2018 10:03

original-brownbear added 4 commits December 20, 2018 12:39

Merge remote-tracking branch 'elastic/master' into relocation-refresh-fix

70d2a24

Merge remote-tracking branch 'elastic/master' into relocation-refresh-fix

b426bd2

Merge remote-tracking branch 'elastic/master' into relocation-refresh-fix

a822245

CR: Prevent new listeners more reliably

c0893c4

ywelsch suggested changes Dec 27, 2018

original-brownbear added 2 commits December 27, 2018 14:22

Merge remote-tracking branch 'elastic/master' into relocation-refresh-fix

ca07cb3

CR: handle multiple concurrent force refresh, cover other cases of acquiring all replicas

f7a5171

original-brownbear requested a review from ywelsch December 27, 2018 15:40

Refactor

daa9fc7

ywelsch reviewed Dec 28, 2018

original-brownbear requested review from ywelsch and removed request for ywelsch December 28, 2018 12:22

CR: add test

71bef56

original-brownbear force-pushed the relocation-refresh-fix branch from 0e1b6db to 71bef56 Compare December 28, 2018 12:40

original-brownbear requested a review from ywelsch December 28, 2018 12:40

ywelsch approved these changes Dec 28, 2018

Merge remote-tracking branch 'elastic/master' into relocation-refresh-fix

4a5483d

original-brownbear changed the title ~~RELOCATION:Fix Indef. Block when Wait on Refresh~~ Force Refresh Listeners when Acquiring all Operation Permits Dec 28, 2018

CR: add assert

bf73f54

original-brownbear merged commit 4ac8fc6 into elastic:master Dec 28, 2018

original-brownbear deleted the relocation-refresh-fix branch December 28, 2018 15:42

original-brownbear added backport pending and removed backport pending labels Dec 28, 2018

original-brownbear mentioned this pull request Dec 28, 2018

Force Refresh Listeners when Acquiring all Operation Permits #37025

Merged

tlrx mentioned this pull request Jan 3, 2019

Replicate closed indices #33888

Closed

50 tasks

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
