[CI] org.elasticsearch.index.reindex.CancelTests #26758

Closed
andyb-elastic opened this issue Sep 22, 2017 · 6 comments
Labels
:Distributed Indexing/CRUD (A catch all label for issues around indexing, updating and getting a doc by id. Not search.)
>test (Issues or PRs that are addressing/adding tests)
>test-failure (Triaged test failures from CI)

Comments

@andyb-elastic
Contributor

It looks like the reindex response times out because some slice tasks don't complete in time, even though they all appear to acknowledge the cancellation. I couldn't find a cause. Neither of these failures reproduces, and both occur very rarely.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.5+multijob-unix-compatibility/os=fedora/141

ERROR   32.1s J0 | CancelTests.testUpdateByQueryCancelWithWorkers <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: Exception while waiting for the response. Running tasks: {"tasks":{"ip0U-jljTBy-m2SBtaJb1g:2794":{"node":"ip0U-jljTBy-m2SBtaJb1g","id":2794,"type":"transport","action":"indices:data/write/update/byquery","status":{"slice_id":3,"total":102,"updated":1,"created":0,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":0.0,"canceled":"by user request","throttled_until_millis":9223372006812},"description":"update-by-query [reindex-cancel-index]","start_time_in_millis":1505837598023,"running_time_in_nanos":30191506121,"cancellable":true,"parent_task_id":"ip0U-jljTBy-m2SBtaJb1g:2754"},"ip0U-jljTBy-m2SBtaJb1g:2796":{"node":"ip0U-jljTBy-m2SBtaJb1g","id":2796,"type":"transport","action":"indices:data/write/update/byquery","status":{"slice_id":4,"total":62,"updated":1,"created":0,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":0.0,"canceled":"by user request","throttled_until_millis":9223372006813},"description":"update-by-query [reindex-cancel-index]","start_time_in_millis":1505837598024,"running_time_in_nanos":30190750346,"cancellable":true,"parent_task_id":"ip0U-jljTBy-m2SBtaJb1g:2754"}}}
   > 	at __randomizedtesting.SeedInfo.seed([75824B2D7BD47309:BA7D75A63797A55B]:0)
   > 	at org.elasticsearch.index.reindex.CancelTests.testCancel(CancelTests.java:168)
   > 	at org.elasticsearch.index.reindex.CancelTests.testUpdateByQueryCancelWithWorkers(CancelTests.java:249)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:232)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:67)
   > 	at org.elasticsearch.index.reindex.CancelTests.testCancel(CancelTests.java:164)
   > 	... 38 more
  2> REPRODUCE WITH: gradle :modules:reindex:test -Dtests.seed=75824B2D7BD47309 -Dtests.class=org.elasticsearch.index.reindex.CancelTests -Dtests.method="testUpdateByQueryCancelWithWorkers" -Dtests.security.manager=true -Dtests.locale=vi -Dtests.timezone=Africa/Addis_Ababa

build-141-CancelTests.txt

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.5+multijob-unix-compatibility/os=fedora/147

ERROR   30.8s J0 | CancelTests.testDeleteByQueryCancelWithWorkers <<< FAILURES!
   > Throwable #1: java.lang.RuntimeException: Exception while waiting for the response. Running tasks: {"tasks":{"hyCyfCVLTeezK3EWp56Kkg:5514":{"node":"hyCyfCVLTeezK3EWp56Kkg","id":5514,"type":"transport","action":"indices:data/write/delete/byquery","status":{"slice_id":4,"total":22,"updated":0,"created":0,"deleted":10,"batches":10,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":0.0,"canceled":"by user request","throttled_until_millis":9223372006847},"description":"delete-by-query [reindex-cancel-index]","start_time_in_millis":1506007215407,"running_time_in_nanos":30092761872,"cancellable":true,"parent_task_id":"hyCyfCVLTeezK3EWp56Kkg:5491"},"hyCyfCVLTeezK3EWp56Kkg:5492":{"node":"hyCyfCVLTeezK3EWp56Kkg","id":5492,"type":"transport","action":"indices:data/write/delete/byquery","status":{"slice_id":0,"total":20,"updated":0,"created":0,"deleted":13,"batches":13,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":0.0,"canceled":"by user request","throttled_until_millis":9223372006847},"description":"delete-by-query [reindex-cancel-index]","start_time_in_millis":1506007215384,"running_time_in_nanos":30116660285,"cancellable":true,"parent_task_id":"hyCyfCVLTeezK3EWp56Kkg:5491"}}}
   > 	at __randomizedtesting.SeedInfo.seed([C3AA7D4DC8ACFF46:8CDCB30F1668EBA1]:0)
   > 	at org.elasticsearch.index.reindex.CancelTests.testCancel(CancelTests.java:168)
   > 	at org.elasticsearch.index.reindex.CancelTests.testDeleteByQueryCancelWithWorkers(CancelTests.java:259)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:232)
   > 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:67)
   > 	at org.elasticsearch.index.reindex.CancelTests.testCancel(CancelTests.java:164)
   > 	... 38 more
  2> REPRODUCE WITH: gradle :modules:reindex:test -Dtests.seed=C3AA7D4DC8ACFF46 -Dtests.class=org.elasticsearch.index.reindex.CancelTests -Dtests.method="testDeleteByQueryCancelWithWorkers" -Dtests.security.manager=true -Dtests.locale=sl-SI -Dtests.timezone=Europe/Zurich

build-147-CancelTests.txt
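
The stack traces above suggest that the test waits on the response future with a timeout and, when the timeout expires, rethrows with a snapshot of the running tasks attached. A minimal sketch of that pattern follows. The names `TimedWaitSketch` and `getOrDescribe` are hypothetical, and it uses plain `CompletableFuture` rather than Elasticsearch's own future classes, so it is an illustration and not the actual `CancelTests` code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class TimedWaitSketch {

    /**
     * Waits for the future to complete. If the wait times out or the future
     * fails, rethrows with a diagnostic snapshot (here: a task listing)
     * so the failure message shows what was still running.
     */
    public static <T> T getOrDescribe(CompletableFuture<T> future, long timeout, TimeUnit unit,
                                      Supplier<String> runningTasks) {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException | ExecutionException e) {
            // The snapshot is only collected on failure, so the happy path stays cheap.
            throw new RuntimeException(
                "Exception while waiting for the response. Running tasks: " + runningTasks.get(), e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag before propagating.
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for the response", e);
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a slice task that never finishes.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        try {
            getOrDescribe(neverCompletes, 100, TimeUnit.MILLISECONDS, () -> "{\"tasks\":{}}");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
            System.out.println("cause: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Attaching the task listing to the exception is what makes the `canceled` and `throttled_until_millis` fields visible in the CI output, even though the failure itself is only a timeout.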

@andyb-elastic andyb-elastic added :Reindex API >test Issues or PRs that are addressing/adding tests labels Sep 22, 2017
@andyb-elastic andyb-elastic self-assigned this Sep 22, 2017
@clintongormley clintongormley added the >test-failure Triaged test failures from CI label Jan 11, 2018
@nik9000
Member

nik9000 commented Jan 23, 2018

I'm going to snag this and see if I can reproduce it.

@nik9000 nik9000 assigned nik9000 and unassigned andyb-elastic Jan 23, 2018
nik9000 added a commit that referenced this issue Jan 23, 2018
The test failure tracked by #26758 occurs when we cancel a running reindex
request that has been sliced into many children. The main reindex
response *looks* canceled but none of the children look canceled. This
is super strange because for the main request to look canceled for any
length of time one of the children has to be canceled.

This change adds additional logging to the test so we have more to go on
to debug this the next time it fails.
@nik9000
Member

nik9000 commented Jan 23, 2018

Of course I can't reproduce it. But I've pushed some code that'll give us more information the next time it fails...

@nik9000
Member

nik9000 commented Jan 30, 2018

I pushed 6f64e97 to see if that'll track down the cause. Then it failed locally for me while I was testing the backport, tripping the new assertion! Maybe I found it!

nik9000 added a commit that referenced this issue Jan 30, 2018
This gives the test longer to block its updates. Now that we're checking
whether the updates actually blocked, we saw that they may not do so within
the normal 10 seconds on a highly loaded system. And our Jenkins machines
often behave like highly loaded systems. Maybe this fixes #26758!
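
The idea in this commit message, waiting longer for a condition before giving up, can be sketched with a small polling helper. This is only an illustration: `BusyWaitSketch` and `waitUntil` are hypothetical names and not the Elasticsearch test framework's own helpers, and the timeouts used in `main` are arbitrary stand-ins, not the values from the commit:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class BusyWaitSketch {

    /**
     * Polls the condition with exponential backoff until it holds or the
     * deadline passes. Returns true if the condition was observed in time.
     */
    public static boolean waitUntil(BooleanSupplier condition, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            // Back off, but cap the interval so a late success is noticed promptly.
            sleepMillis = Math.min(sleepMillis * 2, 500);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean blocked = new AtomicBoolean(false);
        // Simulate a slow, loaded machine: the condition only becomes true after 300 ms.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            blocked.set(true);
        });
        slowWorker.start();

        // A short budget can give up too early; a generous one tolerates the delay.
        System.out.println("short wait succeeded: " + waitUntil(blocked::get, 50, TimeUnit.MILLISECONDS));
        System.out.println("long wait succeeded: " + waitUntil(blocked::get, 5, TimeUnit.SECONDS));
    }
}
```

A longer budget costs nothing when the condition is met quickly, since the helper returns as soon as the check passes; it only lengthens the run when the test was going to fail anyway.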
@nik9000
Member

nik9000 commented Jan 30, 2018

I'm going to let this one stay closed. If it occurs again I'll reopen.

@lcawl lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
cbuescher pushed a commit that referenced this issue Feb 15, 2018
This gives the test longer to block its updates. Now that we're checking
whether the updates actually blocked, we saw that they may not do so within
the normal 10 seconds on a highly loaded system. And our Jenkins machines
often behave like highly loaded systems. Maybe this fixes #26758!

4 participants