-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify primary mode usage with assertions #32667
Verify primary mode usage with assertions #32667
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Left some questions I was curious about. Shouldn't block the change.
@@ -2259,7 +2254,6 @@ public void acquireReplicaOperationPermit(final long opPrimaryTerm, final long g | |||
final ActionListener<Releasable> onPermitAcquired, final String executorOnDelay, | |||
final Object debugInfo) { | |||
verifyNotClosed(); | |||
verifyReplicationTarget(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity - what made you switch this to be checked under a permit? (I agree it's good but was wondering if some test failed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, multiple tests failed. The typical situation is that we're promoting a replica to primary but are still receiving in-flight requests from the previous primary.
if (indexShard.routingEntry().isRelocationTarget() == false) { | ||
try { | ||
final PlainActionFuture<Releasable> permitAcquiredFuture = new PlainActionFuture<>(); | ||
indexShard.acquireReplicaOperationPermit(primaryTerm, indexShard.getGlobalCheckpoint(), permitAcquiredFuture, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we replace this with an expect assertion block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately does not work. The assertion is executed on another thread and the AssertionError
can therefore not be caught here, instead it ends up in the uncaught exception handler, failing the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can suggest making it an evil test (no security manager) and attaching your own uncaught exception handler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another look at this test to see where a delay would come from (otherwise the exception should bubble up) and I think it's from the promotion clause above. I wonder if we should wait for it to complete it's background work to avoid a noisy test (and also enable this part of the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the test so that it waits for the promotion to complete and reenabled the check in 7ad3d25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Primary terms were introduced as part of the sequence-number effort (#10708) and added in ES 5.0. Subsequent work introduced the replication tracker which lets the primary own its replication group (#25692) to coordinate recovery and replication. The replication tracker explicitly exposes whether it is operating in primary mode or replica mode, independent of the ShardRouting object that's associated with a shard. During a primary relocation, for example, the primary mode is transferred between the primary relocation source and the primary relocation target. After transferring this so-called primary context, the old primary becomes a replication target and the new primary the replication source, reflected in the replication tracker on both nodes. With the most recent PR in this area (#32442), we finally have a clean transition between a shard that's operating as a primary and issuing sequence numbers and a shard that's serving as a replication target. The transition from one state to the other is enforced through the operation-permit system, where we block permit acquisition during such changes and perform the transition under this operation block, ensuring that there are no operations in progress while the transition is being performed. This finally allows us to turn the best-effort checks that were put in place to prevent shards from being used in the wrong way (i.e. primary as replica, or replica as primary) into hard assertions, making it easier to catch any bugs in this area.
Primary terms were introduced as part of the sequence-number effort (#10708) and added in ES 5.0. Subsequent work introduced the replication tracker which lets the primary own its replication group (#25692) to coordinate recovery and replication. The replication tracker explicitly exposes whether it is operating in primary mode or replica mode, independent of the
ShardRouting
object that's associated with a shard. During a primary relocation, for example, the primary mode is transferred between the primary relocation source and the primary relocation target. After transferring this so-called primary context, the old primary becomes a replication target and the new primary the replication source, reflected in the replication tracker on both nodes. With the most recent PR in this area (#32442), we finally have a clean transition between a shard that's operating as a primary and issuing sequence numbers and a shard that's serving as a replication target. The transition from one state to the other is enforced through the operation-permit system, where we block permit acquisition during such changes and perform the transition under this operation block, ensuring that there are no operations in progress while the transition is being performed. This finally allows us to turn the best-effort checks that were put in place to prevent shards from being used in the wrong way (i.e. primary as replica, or replica as primary) into hard assertions, making it easier to catch any bugs in this area.