[CI] Deleting data frame analytics jobs can throw all shards failed #60462
Pinging @elastic/ml-core (:ml)
Same thing in https://gradle-enterprise.elastic.co/s/ouvoly2gv5m4a
This happens in the cleanup code after the test completes, in
Another occurrence: https://gradle-enterprise.elastic.co/s/ouvoly2gv5m4a
The server side log for that last failure shows this:
Therefore it seems likely that the problem happens when deleting the job stats while the stats index isn't yet ready. This also ties up with
I think the problem may be coming from the async nature of the Suppose a test does something that triggers the The cleanup in between tests will delete all the indices. In the most recent failure mentioned in this issue the failed test is Looking at the run up to the failure,
We have had situations before where the test harness and the server code fight each other to delete and recreate indices and/or index templates. I am not sure what the solution is in this case. @benwtrent is there a way we can cancel the
@droberts195 would better integration with node lifecycle fix it? Or are these nodes the same between tests?
It's a single node for the whole test suite. If you download https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2012-r2/build/132.tar.bz2, expand it and then look in It would be too slow if a separate JVM had to be spun up for every YAML test.
@droberts195 what if delete analytics job didn't throw when there was a problem deleting from the stats index? I suppose at that point there is the worry of leftover stats documents :/ But we would have this issue in production anyways if a user ran into this.
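To make the proposal above concrete, here is a minimal sketch of a delete that treats stats cleanup as best-effort. It is not the Elasticsearch implementation: `StatsIndexUnavailable`, `delete_job`, the three callables, and the order of the steps are all hypothetical names and choices made for illustration.

```python
import logging

logger = logging.getLogger("analytics_delete_sketch")


class StatsIndexUnavailable(Exception):
    """Hypothetical stand-in for an 'all shards failed' error on the stats index."""


def delete_job(job_id, delete_stats, delete_state, delete_config):
    """Delete a job without failing when its stats cannot be removed.

    delete_stats, delete_state and delete_config are hypothetical callables
    that each take the job id. Returns the list of warnings collected.
    """
    warnings = []
    try:
        delete_stats(job_id)
    except StatsIndexUnavailable as exc:
        # Stats documents may be left behind; record it instead of aborting.
        message = f"stats for {job_id} were not deleted: {exc}"
        warnings.append(message)
        logger.warning(message)
    # The remaining steps still run, so the job itself is removed.
    delete_state(job_id)
    delete_config(job_id)
    return warnings
```

The trade-off is the one raised above: the delete succeeds, but stats documents can be orphaned, so something else has to remove them later.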
YAML tests require that all interactions go through a REST API. So, if we added something like this, it would have to be an API call. There is also no way to wait for the stats service to become idle (other than through a separate API call). Ingest pipeline deletion is a cluster state removal action. Other things are then triggered off of that, so there is no way to put a
Another failure https://gradle-enterprise.elastic.co/s/2p4o7efd3lmrk
We discussed this in the team meeting and agreed to do this. Plus we'll also clean up orphaned stats documents in the nightly maintenance like we clean up orphaned state documents.
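As a rough illustration of the orphan cleanup idea, the sketch below picks out stats documents whose job no longer exists. The document shape (`_id` and `job_id` keys) and the function name `find_orphaned_stats` are assumptions for the example, not the actual stats index mapping or maintenance code.

```python
def find_orphaned_stats(stats_docs, existing_job_ids):
    """Return the ids of stats documents that belong to no existing job.

    stats_docs is an iterable of dicts with hypothetical '_id' and 'job_id'
    keys; existing_job_ids is an iterable of ids of jobs that still exist.
    """
    existing = set(existing_job_ids)
    return [doc["_id"] for doc in stats_docs if doc["job_id"] not in existing]
```

A nightly task could pass the returned ids to a bulk delete, which would remove whatever a tolerant job deletion left behind.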
Two tests failed for the same reason during deletion of a data frame analytics job:
p0=ml/start_data_frame_analytics/Test start with compatible fields but no data
Build scan: https://gradle-enterprise.elastic.co/s/3vyzs4yrmork2
Repro line:
Failure:
yaml=ml/data_frame_analytics_crud/Test update given all updatable settings
Build scan: https://gradle-enterprise.elastic.co/s/zaqfxhphcg72g
Repro line:
./gradlew ':x-pack:plugin:ml:qa:ml-with-security:integTestRunner' --tests "org.elasticsearch.smoketest.MlWithSecurityIT.test {yaml=ml/data_frame_analytics_crud/Test update given all updatable settings}" -Dtests.seed=C15EB5D34B41D230
Failure: