-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Force Refresh Listeners when Acquiring all Operation Permits #36835
Force Refresh Listeners when Acquiring all Operation Permits #36835
Conversation
Pinging @elastic/es-distributed |
* @throws InterruptedException if calling thread is interrupted | ||
* @throws TimeoutException if timed out waiting for in-flight operations to finish | ||
* @throws IndexShardClosedException if operation permit has been closed | ||
*/ | ||
<E extends Exception> void blockOperations( | ||
final long timeout, | ||
final TimeUnit timeUnit, | ||
final CheckedRunnable<E> onActiveOperations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the only production use case for this method is in relocation so admit it's a little noisy to add this kind of general callback here, but it still seems like the smallest possible change to get a hook to run the refresh conditionally here (after preventing new operations from piling on more waits concurrently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has still a race I think. The issue is that there's no guarantee that the refresh will happen after all pending requests have registered a refresh listener. To ensure this, we need multiple steps. First, ensure that no new listeners are registered (this can be achieved by setting getMaxRefreshListeners
to 0), and then doing a manual refresh to free all existing listeners. There is no need I think to inline all this into blockOperations, it can be done before calling this method.
* Fixes the issue reproduced in the added tests: * When having open index requests on a shard that are waiting for a refresh, relocating that shard becomes blocked until that refresh happens (which could be never as in the test scenario). * Fixed by: * Before trying to aquire all permits for relocation, refresh if there are outstanding operations
edba8bf
to
f40c6a6
Compare
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @original-brownbear. The concurrency looks better. I think we need to extend this to all actions that can possibly acquire all operation permits. In particular, I think this might also cause problems on replicas, e.g. when a replica learns of a new primary and tries to bump its term (see IndexShard#bumpPrimaryTerm
). If it then has a refresh=wait_for
op waiting (from the old primary), it will run into the same issue, and indefinitely stop accepting any writes from the new primary.
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
try { | ||
if (refreshListeners.refreshNeeded()) { | ||
refresh("relocated"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we always want to do the refresh after calling disallowAdd
, I wonder if we should combine both into one method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a neat way of doing that since we have the async case and the blocking case of acquiring all the permits and we want to enforce some try-finally
semantics for allowing the listeners again for both now. I'm not sure it actually makes things more readable if we hide the handling of exceptions from refresh(...)
in some other method. I can try finding a nice way though :)
server/src/main/java/org/elasticsearch/index/shard/RefreshListeners.java
Outdated
Show resolved
Hide resolved
@ywelsch all points addressed I think => should be good for another review.
Done, I wrapped all cases of acquiring all permits I could find. It seems though, that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed
daa9fc7 which simplifies the code imho. We will also need unit tests for the RefreshListeners
class.
Ok thanks, I'll add some tests :) |
0e1b6db
to
71bef56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
throw e; | ||
} | ||
} | ||
return () -> runOnce.run(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could assert before this line here that assert refreshListeners == null
?
Please adapt PR title to make it clear this is not only for relocations. For example: "Force refresh listeners when acquiring all operation permits" |
…#36835) * Fixes the issue reproduced in the added tests: * When having open index requests on a shard that are waiting for a refresh, relocating that shard becomes blocked until that refresh happens (which could be never as in the test scenario).
* Force Refresh Listeners when Acquiring all Operation Permits (#36835) * Fixes the issue reproduced in the added tests: * When having open index requests on a shard that are waiting for a refresh, relocating that shard becomes blocked until that refresh happens (which could be never as in the test scenario).
Thanks for fixing this @original-brownbear ! |
becomes blocked until that refresh happens (which could be never as in the test scenario).
PS: I ran the added tests for a few thousand runs without trouble.