Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illegal shard failure requests #16275

Closed
wants to merge 21 commits into from
Closed

Illegal shard failure requests #16275

wants to merge 21 commits into from

Conversation

jasontedor
Copy link
Member

Today, shard failure requests are blindly handled on the master without
any validation that the request is a legal request. A legal request is a
shard failure request for which the shard requesting the failure is
either the local allocation or the primary allocation. This is because
shard failure requests are classified into only two sets: requests that
correspond to shards that exist, and requests that correspond to shards
that do not exist. Requests that correspond to shards that do not exist
are immediately marked as successful (there is nothing to do), and
requests that correspond to shards that do exist are sent to the
allocation service for handling the failure.

This pull request adds a third classification for shard failure requests
to separate out illegal shard failure requests and enables the master to
validate shard failure requests. The master communicates the illegality
of a shard failure request via a new exception:
IllegalShardFailureException. This exception can be used by shard
failure listeners to discover when they've sent a shard failure request
that they were not allowed to send (e.g., if they are no longer the
primary allocation for the shard).

@jasontedor
Copy link
Member Author

@bleskes Note that this does not yet add support to TransportReplicationAction to handle situations where a node is acting as if it's a primary, tries to fail a replica but by the time the request arrives on the master the node is no longer the primary; the handling of such a situation is unchanged by this pull request. This is merely getting the plumbing in place to enable handling on the TransportReplicationAction side which will come in a later pull request.

@jasontedor
Copy link
Member Author

@bleskes I've pushed a new commit to use allocation IDs as identity instead of node IDs. I think this is ready for a formal review?

Currently shard failure tasks are classified into two sets, tasks that
correspond to shards that exist, and tasks that correspond to shards
that do not exist. Tasks that correspond to shards that do not exist are
immediately marked as successful (there is nothing to do), and tasks
that correspond to shards that do exist are sent to the allocation
service for handling the failure. However, more granuality is needed
here to also classify tasks that correspond to shards for which the node
that requested the failure is neither the shard's home nor is that node
the primary node for the shard. This commit adds support for this
additional classification but currently leaves this classification as
unused.
Today, shard failure requests are blindly handled on the master without
any validation that the request is a legal request. A legal request is a
shard failure request for which the shard requesting the failure is
either the local allocation or the primary allocation. This commit adds
validation for illegal shard failure requests and communicates their
illegality via a new exception IllegalShardFailureException. This
exception can be used by shard failure listeners to discover when
they've sent a shard failure request that they were not allowed to send
(e.g., if they are no longer the primary allocation for the shard).
@@ -125,15 +127,15 @@ private static boolean isMasterChannelException(TransportException exp) {
return ExceptionsHelper.unwrap(exp, MASTER_CHANNEL_EXCEPTIONS) != null;
}

public void shardFailed(final ShardRouting shardRouting, final String indexUUID, final String message, @Nullable final Throwable failure, Listener listener) {
public void shardFailed(final ShardRouting shardRouting, ShardRouting identity, final String indexUUID, final String message, @Nullable final Throwable failure, Listener listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think identity is a confusing name. Maybe "sourceShard"? in any case we need serous java docs... Also IndexUUID can go :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cf46ce0.

@bleskes
Copy link
Contributor

bleskes commented Feb 1, 2016

Thanks @jasontedor . Left some comments

This commit cleans up some minor issues in o.e.c.a.s.ShardStateAction:
 - rename the source shard variable in sending shard failure requests
   from "identity" to "sourceShardRouting"
 - remove index UUID parameter that is no longer needed
 - drop ShardStateAction#resendShardFailed method
 - add Javadocs to ShardStateAction#shardFailed
This commit simplifies ShardFailedClusterStateTaskExecutor by using
Map#getOrDefault in several places instead of guarding Map#get with
Map#containsKey checks.
This commit renames ShardStateAction.IllegalShardFailureException to
ShardStateAction.NoLongerPrimaryShardException to clarify the intended
usage of this exception.
This commit renames shard failure task classifications in
o.e.c.a.s.ShardStateAction to greater clarify their intent.
This commit gives precedence to the source shard check when validating
shard failure tasks. The thinking is that we should detect if a stale
primary shard tries to fail a shard that is no longer there because in
this case we want to fail the stale primary.
Holding a reference to the cluster state can potentially cause a large
amount of heap to be retained. Instead, we should just hold references
to the minimal necessary objects.
@jasontedor
Copy link
Member Author

@bleskes I've pushed more commits addressing all of your comments.

private final AtomicBoolean finished = new AtomicBoolean();
private final AtomicInteger success = new AtomicInteger(1); // We already wrote into the primary shard
private final ConcurrentMap<String, Throwable> shardReplicaFailures = ConcurrentCollections.newConcurrentMap();
private final AtomicInteger pending;
private final int totalShards;
private final Releasable indexShardReference;
private final ShardRouting primaryShard;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that #15900 is in, we can remove this and get the shardrouting of the primary from the indexShardReference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged master into this branch and pushed 16c3862.

private static ShardRouting randomInvalidIdentity(ClusterState currentState, ShardRouting shardRouting) {
Set<ShardRouting> identities = new HashSet<>();
ShardRouting primaryShard = primaryShard(currentState, shardRouting);
for (ShardRouting shard : currentState.routingTable().allShards()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use java8 streams and two filters? will be cleaner. Also note that shardRouting has isSameAllocation which is saves on the Objects.equals

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions. I pushed 49baecc.

@bleskes
Copy link
Contributor

bleskes commented Feb 2, 2016

left a bunch of minor improvement suggestions...

* master: (48 commits)
  Fix JVM GC monitor missing settings test
  Add random positive time value convenience method
  Added versions 2.1.3-SNAPSHOT and 2.2.1-SNAPSHOT, and bwc indices for 2.1.2 and 2.2.0
  Fix compilation in TransportReplicationActionTests
  Revert "Make GeoDistanceSortBuilder serializable"
  Revert "Remove deprecation for geohash setter"
  Revert "Indentation fix for messy SimpleSortTest"
  Use allocation ids to prevent repeated recovery of failed shards
  Prevent TransportReplicationAction to route request based on stale local routing table
  Indentation fix for messy SimpleSortTest
  Remove deprecation for geohash setter
  Fix IndexShardTests.testStressRelocated
  Remove `discovery.zen.rejoin_on_master_gone`
  BWC: Added 1.7 version constants and bwc indices
  Make GeoDistanceSortBuilder serializable
  Mute IndexShardTests.testStressRelocated
  Add proper handoff between old and new copy of relocating primary shard
  Add operation counter for IndexShard
  Start to break lines at 140 characters
  CliTool: Allow unexpected exceptions to propagate
  ...
This commit uses the index shard reference held in the ReplicationPhase
to get a shard routing to the primary shard.
* master:
  [TEST] Fail test if dummy doc is not found
  Test awaits Lucene snapshot upgrade
  Merge pull request #16371 from clintongormley/deprecate-multicast
  Merge pull request #16345 from lbrito1/patch-1
  Fix minor typo in migrate_3_0.asciidoc
@jasontedor
Copy link
Member Author

left a bunch of minor improvement suggestions...

Thanks @bleskes, I pushed more commits so this is ready for another round.

@bleskes
Copy link
Contributor

bleskes commented Feb 2, 2016

LGTM. Thanks @jasontedor

@jasontedor jasontedor closed this in a3a49a1 Feb 2, 2016
@jasontedor jasontedor mentioned this pull request Feb 3, 2016
9 tasks
@jasontedor jasontedor deleted the not-the-primary branch February 5, 2016 10:38
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >enhancement v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants