-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Illegal shard failure requests #16275
Conversation
@bleskes Note that this does not yet add support to |
@bleskes I've pushed a new commit to use allocation IDs as identity instead of node IDs. I think this is ready for a formal review? |
Currently shard failure tasks are classified into two sets, tasks that correspond to shards that exist, and tasks that correspond to shards that do not exist. Tasks that correspond to shards that do not exist are immediately marked as successful (there is nothing to do), and tasks that correspond to shards that do exist are sent to the allocation service for handling the failure. However, more granuality is needed here to also classify tasks that correspond to shards for which the node that requested the failure is neither the shard's home nor is that node the primary node for the shard. This commit adds support for this additional classification but currently leaves this classification as unused.
Today, shard failure requests are blindly handled on the master without any validation that the request is a legal request. A legal request is a shard failure request for which the shard requesting the failure is either the local allocation or the primary allocation. This commit adds validation for illegal shard failure requests and communicates their illegality via a new exception IllegalShardFailureException. This exception can be used by shard failure listeners to discover when they've sent a shard failure request that they were not allowed to send (e.g., if they are no longer the primary allocation for the shard).
@@ -125,15 +127,15 @@ private static boolean isMasterChannelException(TransportException exp) { | |||
return ExceptionsHelper.unwrap(exp, MASTER_CHANNEL_EXCEPTIONS) != null; | |||
} | |||
|
|||
public void shardFailed(final ShardRouting shardRouting, final String indexUUID, final String message, @Nullable final Throwable failure, Listener listener) { | |||
public void shardFailed(final ShardRouting shardRouting, ShardRouting identity, final String indexUUID, final String message, @Nullable final Throwable failure, Listener listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think identity is a confusing name. Maybe "sourceShard"? in any case we need serous java docs... Also IndexUUID can go :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in cf46ce0.
Thanks @jasontedor . Left some comments |
This commit cleans up some minor issues in o.e.c.a.s.ShardStateAction: - rename the source shard variable in sending shard failure requests from "identity" to "sourceShardRouting" - remove index UUID parameter that is no longer needed - drop ShardStateAction#resendShardFailed method - add Javadocs to ShardStateAction#shardFailed
This commit simplifies ShardFailedClusterStateTaskExecutor by using Map#getOrDefault in several places instead of guarding Map#get with Map#containsKey checks.
This commit renames ShardStateAction.IllegalShardFailureException to ShardStateAction.NoLongerPrimaryShardException to clarify the intended usage of this exception.
This commit renames shard failure task classifications in o.e.c.a.s.ShardStateAction to greater clarify their intent.
This commit gives precedence to the source shard check when validating shard failure tasks. The thinking is that we should detect if a stale primary shard tries to fail a shard that is no longer there because in this case we want to fail the stale primary.
Holding a reference to the cluster state can potentially cause a large amount of heap to be retained. Instead, we should just hold references to the minimal necessary objects.
@bleskes I've pushed more commits addressing all of your comments. |
private final AtomicBoolean finished = new AtomicBoolean(); | ||
private final AtomicInteger success = new AtomicInteger(1); // We already wrote into the primary shard | ||
private final ConcurrentMap<String, Throwable> shardReplicaFailures = ConcurrentCollections.newConcurrentMap(); | ||
private final AtomicInteger pending; | ||
private final int totalShards; | ||
private final Releasable indexShardReference; | ||
private final ShardRouting primaryShard; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that #15900 is in, we can remove this and get the shardrouting of the primary from the indexShardReference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged master into this branch and pushed 16c3862.
private static ShardRouting randomInvalidIdentity(ClusterState currentState, ShardRouting shardRouting) { | ||
Set<ShardRouting> identities = new HashSet<>(); | ||
ShardRouting primaryShard = primaryShard(currentState, shardRouting); | ||
for (ShardRouting shard : currentState.routingTable().allShards()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use java8 streams and two filters? will be cleaner. Also note that shardRouting has isSameAllocation which is saves on the Objects.equals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestions. I pushed 49baecc.
left a bunch of minor improvement suggestions... |
* master: (48 commits) Fix JVM GC monitor missing settings test Add random positive time value convenience method Added versions 2.1.3-SNAPSHOT and 2.2.1-SNAPSHOT, and bwc indices for 2.1.2 and 2.2.0 Fix compilation in TransportReplicationActionTests Revert "Make GeoDistanceSortBuilder serializable" Revert "Remove deprecation for geohash setter" Revert "Indentation fix for messy SimpleSortTest" Use allocation ids to prevent repeated recovery of failed shards Prevent TransportReplicationAction to route request based on stale local routing table Indentation fix for messy SimpleSortTest Remove deprecation for geohash setter Fix IndexShardTests.testStressRelocated Remove `discovery.zen.rejoin_on_master_gone` BWC: Added 1.7 version constants and bwc indices Make GeoDistanceSortBuilder serializable Mute IndexShardTests.testStressRelocated Add proper handoff between old and new copy of relocating primary shard Add operation counter for IndexShard Start to break lines at 140 characters CliTool: Allow unexpected exceptions to propagate ...
This commit uses the index shard reference held in the ReplicationPhase to get a shard routing to the primary shard.
Thanks @bleskes, I pushed more commits so this is ready for another round. |
LGTM. Thanks @jasontedor |
Today, shard failure requests are blindly handled on the master without
any validation that the request is a legal request. A legal request is a
shard failure request for which the shard requesting the failure is
either the local allocation or the primary allocation. This is because
shard failure requests are classified into only two sets: requests that
correspond to shards that exist, and requests that correspond to shards
that do not exist. Requests that correspond to shards that do not exist
are immediately marked as successful (there is nothing to do), and
requests that correspond to shards that do exist are sent to the
allocation service for handling the failure.
This pull request adds a third classification for shard failure requests
to separate out illegal shard failure requests and enables the master to
validate shard failure requests. The master communicates the illegality
of a shard failure request via a new exception:
IllegalShardFailureException
. This exception can be used by shardfailure listeners to discover when they've sent a shard failure request
that they were not allowed to send (e.g., if they are no longer the
primary allocation for the shard).