Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize RareClusterStateIT #38671

Merged
merged 1 commit into from
Feb 14, 2019

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Feb 10, 2019

@original-brownbear original-brownbear added >bug :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. v8.0.0 v7.2.0 labels Feb 10, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests WIP labels Feb 10, 2019
@original-brownbear
Copy link
Member Author

Jenkins test this

@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/1
Jenkins run elasticsearch-ci/2

@original-brownbear original-brownbear changed the title Fix Race in Wait for Dynamic Mapping Update Increase Timeout in RareClusterStateIT Feb 10, 2019
@ywelsch
Copy link
Contributor

ywelsch commented Feb 12, 2019

do we know why it takes 15 seconds? What is it waiting on?

@original-brownbear
Copy link
Member Author

@ywelsch it's waiting on the next cluster state from the dynamic mapping upgrade here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java#L139.

I tried fixing this by creating the future before initiating the mapping update, but that randomly lead to even longer wait times (probably because some other new state without the mapping update hit the observer) and then we try writing again and it again fails and then we have to start over. As far as I understand it the deeper problem seems to be the indiscriminate waiting for just the next state in https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java#L139 combined with the fact that we add the observer only after we initiate the state update so it might just miss the update.

* Use actual master node, not just a master elligible node when trying to cancel publication. This only works on the master and for unlucky seeds we never try the master within the 10s that the busy assert runs.
* Closes elastic#36813
@original-brownbear original-brownbear changed the title Increase Timeout in RareClusterStateIT Stabilize RareClusterStateIT Feb 13, 2019
@original-brownbear
Copy link
Member Author

@ywelsch urgh :) this turned out to be really trivial. It was exactly 15s waits here because for the tested/failing seed from the issue it would use master nodes in the order not-master, actual-master, not-master, not-master, ..., actual-master within 15 seconds when randomly selecting master nodes.
So mostly if you ran that in a loop, the first retry on the busy assert would pass, but if you got unlucky you'd have to wait 15s to get the next try that hits actual-master :)

@original-brownbear original-brownbear merged commit c8224e3 into elastic:master Feb 14, 2019
@original-brownbear original-brownbear deleted the 36813 branch February 14, 2019 14:15
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 14, 2019
* elastic/master:
  Remove immediate operation retry after mapping update (elastic#38873)
  Remove mentioning of types from bulk API docs (elastic#38896)
  SQL: change JDBC setup URL in the documentation (elastic#38564)
  Skip BWC tests in checkPart1 and checkPart2 (elastic#38730)
  Enable silent FollowersCheckerTest (elastic#38851)
  Update TESTING.asciidoc with platform specific instructions (elastic#38802)
  Use consistent view of realms for authentication (elastic#38815)
  Stabilize RareClusterState (elastic#38671)
  Increase Timeout in UnicastZenPingTests (elastic#38893)
  Do not recommend installing vagrant-winrm elastic#38887
  _cat/indices with Security, hide names when wildcard (elastic#38824)
  SQL: fall back to using the field name for column label (elastic#38842)
  Fix LocalIndexFollowingIT#testRemoveRemoteConnection() test (elastic#38709)
  Remove joda time mentions in documentation (elastic#38720)
  Add enabled status for token and api key service (elastic#38687)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 27, 2019
* Use actual master node, not just a master elligible node when trying to cancel publication. This only works on the master and for unlucky seeds we never try the master within the 10s that the busy assert runs.
* Closes elastic#36813
original-brownbear added a commit that referenced this pull request Feb 28, 2019
* Use actual master node, not just a master elligible node when trying to cancel publication. This only works on the master and for unlucky seeds we never try the master within the 10s that the busy assert runs.
* Closes #36813
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] Failure in org.elasticsearch.indices.state.RareClusterStateIT.testDelayedMappingPropagationOnReplica
4 participants