[Detection Engine] Fixing ML FTR tests #182183
Conversation
Flaky run with only ML FTR tests running (25x): https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5819. Let's see if we can't reproduce the failure in isolation.
These don't appear to fail in isolation, so let's see if another test is what's causing the failure.
None of the isolated tests failed, so I've added debugging code and I'm now running 50x tests not in isolation: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5821
We got a single failure in 50 executions, but there was no useful debug info. Trying again with a broader pattern since that might tell us more.
We got a failure in the above run, but the debugging info did not provide anything useful. I've tried broadening the debugging data we're collecting, and ran it again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5824
My theory is that this dynamic template is being used in a rare situation where data is being inserted before the index mappings have been applied. If true, removing this will (at least) cause a different error to be produced in the same situation, if not fix the issue.
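To make that hypothesis concrete, here is a minimal sketch (not the actual test code) of how one could check whether the anomalies index already has explicit mappings before documents are written; the client setup and index name are assumptions for illustration:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // assumed local test cluster

// If documents are indexed before the archive's mappings are applied, the index
// ends up with only dynamically-mapped fields (or none at all).
async function hasExplicitMappings(index: string): Promise<boolean> {
  const response = await es.indices.getMapping({ index });
  const properties = response[index]?.mappings?.properties ?? {};
  return Object.keys(properties).length > 0;
}

hasExplicitMappings('.ml-anomalies-custom-v3_linux_anomalous_network_activity')
  .then((ok) => console.log(ok ? 'explicit mappings present' : 'no explicit mappings yet'))
  .catch(console.error);
```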
While there was a failure in the previous round, many of the runs were cancelled, so I ran again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5825. The results were mostly the same, although I did observe the tests fail when no anomaly data/mappings were present. That all but eliminates a "dirty environment" as the potential cause here, leaving "race condition" as the likely explanation. I've triggered another run without the dynamic template.
Interesting development following the previous 60x run without the dynamic template; I've triggered a 200x build: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5832
This reverts commit fde0334. The error still occurred when this was absent, meaning it's not involved.
Something's happening within es_archiver, and I'm trying to figure out what.
I added some more verbose debugging in bdab2be; running another 60x now: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5848
* Debug concurrency setting (might reduce this to 1 to eliminate that as an issue)
* Debug order of file streams (to ensure mappings are being picked up first)
* Debug index creation (to see how/whether the ML index is being created)
* Debug index creation response (to see if there's some non-fatal error/warning; a rough sketch of this kind of instrumentation follows below)
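As an illustration of the last item above (not the actual es_archiver change), one could wrap index creation and log the full response so that any non-fatal warning shows up in the test output; the helper name and client setup here are assumptions:

```ts
import { Client } from '@elastic/elasticsearch';
import type { MappingTypeMapping } from '@elastic/elasticsearch/lib/api/types';

const es = new Client({ node: 'http://localhost:9200' }); // assumed local test cluster

// Hypothetical debug wrapper: create the index with its explicit mappings and
// log the raw response (status code included) so nothing is silently swallowed.
async function createIndexWithDebug(index: string, mappings: MappingTypeMapping) {
  const result = await es.indices.create({ index, mappings }, { meta: true });
  console.log(`[debug] create ${index}: status=${result.statusCode}`, JSON.stringify(result.body));
  return result.body;
}
```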
The previous run had two failures. The debug info showed es_archiver receiving the documents as they're written in the archive. Since we consistently see that the anomalies index has no mappings in these failures, I'm adding more debugging around the creation of the index and mappings, and triggering another run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5860
The last run was unusably verbose, as I neglected to limit debugging to just the tests/calls I cared about. New run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5861
No failures in the previous 60 runs; it's possible that the act of logging the actions before taking them gives enough time for the race condition to resolve consistently. Going to try another 60 to see. https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5882
I was investigating a bug in ML suppression and, in the course of doing so, ended up digging into this archive. I tracked down the last change to it to #133510, which notably modified the data but not the mappings for this archive. Checking out and inspecting the data from before and after that commit, we can see what was changed:

```
git diff --no-index --word-diff=porcelain --word-diff-regex=. <(head -n1 data_old.json) <(head -n1 data.json)
```

```diff
diff --git a/dev/fd/63 b/dev/fd/62
--- a/dev/fd/63
+++ b/dev/fd/62
@@ -1 +1 @@
{"type":"doc","value":{"id":"
+v3_
linux_anomalous_network_activity_
-ecs_
record_1586274300000_900_0_-96106189301704594950079884115725560577_5","index":".ml-anomalies-custom-
+v3_
linux_anomalous_network_activity
-_ecs
","source":{"actual":[1],"bucket_span":900,"by_field_name":"process.name","by_field_value":"store","detector_index":0,"function":"rare","function_description":"rare","host.name":["mothra"],"influencers":[{"influencer_field_name":"user.name","influencer_field_values":["root"]},{"influencer_field_name":"process.name","influencer_field_values":["store"]},{"influencer_field_name":"host.name","influencer_field_values":["mothra"]}],"initial_record_score":33.36147565024334,"is_interim":false,"job_id":"
+v3_
linux_anomalous_network_activity
-_ecs
","multi_bucket_impact":0,"probability":0.007820139656036713,"process.name":["store"],"record_score":33.36147565024334,"result_type":"record","timestamp":1605567488000,"typical":[0.007820139656036711],"user.name":["root"]}}}
```

This is just the first line, but you can see that the document ids, index names, and `job_id` values in the data were changed to use a `v3_` prefix instead of the `_ecs` suffix, while the archive's mappings were left untouched.
These had previously diverged, and I suspect that divergence is causing the sporadic failures.
I updated the mappings to match the data in 03f6073, and have triggered a new 100x build on that: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5933. Edit: and another 100x, since the former was flaky: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5934
All of the failures in the previous build were test timeouts (3/100), which raises confidence that the error was due to the data/mapping mismatch. Merging in the latest changes next.
Flaky Test Runner Stats: 🟠 Some tests failed. (kibana-flaky-test-suite-runner#5975)
[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed.
With the mappings changes the error has changed, but we're still occasionally getting test failures due to timeouts. I _suspect_ that the debug logging itself may occasionally be interrupting/blocking some other process, so I'm removing it and running this again to see how it behaves.
Well, the mapping change has certainly changed the error, but now we're just getting random timeouts on our test runs. I'm going to see if removing the debugging output addresses that at all: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6021
Flaky Test Runner Stats: 🟠 Some tests failed. (kibana-flaky-test-suite-runner#6021)
[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 116/200 tests passed.
Update: with ML Suppression now merged, most of the data-related fixes are in. Those tests are now skipped in `main`. When we last checked in, the silent failures seemed to be due to no alerts being generated by the rule. Silent failure aside (which I'm also investigating), it's not yet clear why the alerts aren't being generated.
New build to see where we're at: build
Flaky Test Runner Stats🟠 Some tests failed. - kibana-flaky-test-suite-runner#6502[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed. |
The big idea with this commit is to log both the actual ML API results in the rule and a "pretty close" raw ES call. This should allow us to determine whether the ML API is somehow misbehaving.
Since the tests seem to be hanging waiting for a successful rule run, let's see how it's failing.
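Here is a rough sketch of that comparison: querying the ML results API for the job's anomaly records alongside a "pretty close" raw search of the shared anomalies indices. The client setup, job id, and query shape are assumptions for illustration, not the rule's actual implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' }); // assumed local test cluster
const jobId = 'v3_linux_anomalous_network_activity'; // one of the jobs used by the tests

async function compareAnomalyQueries() {
  // What the ML results API reports for the job...
  const viaMlApi = await es.ml.getRecords({
    job_id: jobId,
    record_score: 0, // no score threshold; return everything
  });

  // ...versus a "pretty close" raw search against the shared anomalies indices.
  const viaRawSearch = await es.search({
    index: '.ml-anomalies-*',
    size: 100,
    query: {
      bool: {
        filter: [{ term: { result_type: 'record' } }, { term: { job_id: jobId } }],
      },
    },
  });

  console.log('[debug] ml.getRecords count:', viaMlApi.count);
  console.log('[debug] raw search hits:', viaRawSearch.hits.hits.length);
}

compareAnomalyQueries().catch(console.error);
```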
Running the latest changes here. Lots more output; hopefully we can see whether the ML API is behaving as we expect. If nothing else, we'll see how the rule is failing in a way that causes the timeout.
Flaky Test Runner Stats: 🎉 All tests passed! (kibana-flaky-test-suite-runner#6509)
[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 100/100 tests passed.
No failures in the last run of 100.
Flaky Test Runner Stats: 🟠 Some tests failed. (kibana-flaky-test-suite-runner#6516)
[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 123/150 tests passed.
Alright, I think I finally found the cause of these failures in the response to our "setup modules" request to ML. Attaching here for posterity:

**Setup Modules Failure Response**

```json
{
"jobs": [
{ "id": "v3_linux_anomalous_network_port_activity", "success": true },
{
"id": "v3_linux_anomalous_network_activity",
"success": false,
"error": {
"error": {
"root_cause": [
{
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_network_configuration_discovery",
"node": "dKzpvp06ScO0OxqHilETEA",
"reason": {
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
}
]
},
"status": 503
}
}
],
"datafeeds": [
{
"id": "datafeed-v3_linux_anomalous_network_port_activity",
"success": true,
"started": false,
"awaitingMlNodeAllocation": false
},
{
"id": "datafeed-v3_linux_anomalous_network_activity",
"success": false,
"started": false,
"awaitingMlNodeAllocation": false,
"error": {
"error": {
"root_cause": [
{
"type": "resource_not_found_exception",
"reason": "No known job with id 'v3_linux_anomalous_network_activity'"
}
],
"type": "resource_not_found_exception",
"reason": "No known job with id 'v3_linux_anomalous_network_activity'"
},
"status": 404
}
}
],
"kibana": {}
}
```

I'm still investigating what the error means, but we can see in the most recent build that all of the failures have that same errant response (while the green runs do not), so this looks very promising. Beyond the error itself, it appears that multiple jobs fail to be set up because of a single job index being unavailable, as can be observed in this run:

**Multiple Job Failures due to (reportedly) a single job index**

```json
{
"jobs": [
{ "id": "v3_linux_anomalous_network_port_activity", "success": true }, // NB: JOB WAS SUCCESSFUL
{ "id": "v3_linux_rare_metadata_process", "success": true },
{
"id": "v3_linux_rare_metadata_user",
"success": false,
"error": {
"error": {
"root_cause": [
{
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity", // NB: FAILURE DUE TO OTHER INDEX
"node": "OiEtZdepT-ep8cToYLs-7w",
"reason": {
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
}
]
},
"status": 503
}
},
{
"id": "v3_rare_process_by_host_linux",
"success": false,
"error": {
"error": {
"root_cause": [
{
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
},
{ "type": "no_shard_available_action_exception", "reason": null }
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
"node": "OiEtZdepT-ep8cToYLs-7w",
"reason": {
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
},
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
"node": null,
"reason": {
"type": "no_shard_available_action_exception",
"reason": null
}
}
]
},
"status": 503
}
},
{
"id": "v3_linux_anomalous_network_activity",
"success": false,
"error": {
"error": {
"root_cause": [
{
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
},
{ "type": "no_shard_available_action_exception", "reason": null }
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
"node": "OiEtZdepT-ep8cToYLs-7w",
"reason": {
"type": "no_shard_available_action_exception",
"reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
}
},
{
"shard": 0,
"index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
"node": null,
"reason": {
"type": "no_shard_available_action_exception",
"reason": null
}
}
]
},
"status": 503
}
}
]
}
```
As a quick fix for the error we occasionally encounter, it might be as simple as trying the call again! I'll run these changes in the flaky test runner and see if the sporadic issues resolve themselves.
@yctercero pointed out that the simplest solution here might be to retry that setup call until all the jobs have been installed. d8334cb accomplishes that, and here is the accompanying 150x flaky run. 🤞
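For reference, a minimal sketch of that retry approach (the real change lives in d8334cb; the `setupModules` callback and its response shape here are hypothetical, loosely modeled on the responses shown above):

```ts
interface SetupModuleJobResult {
  id: string;
  success: boolean;
  error?: unknown;
}

interface SetupModuleResponse {
  jobs: SetupModuleJobResult[];
}

// Hypothetical: whatever helper performs the "setup modules" request to ML.
type SetupModules = () => Promise<SetupModuleResponse>;

// Retry module setup until every job reports success (or we run out of
// attempts), since the shard-unavailable errors above are transient.
async function setupModulesWithRetry(
  setupModules: SetupModules,
  maxAttempts = 5,
  delayMs = 2000
): Promise<SetupModuleResponse> {
  let lastResponse: SetupModuleResponse | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResponse = await setupModules();
    if (lastResponse.jobs.every((job) => job.success)) {
      return lastResponse;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  throw new Error(
    `ML module setup did not succeed after ${maxAttempts} attempts: ${JSON.stringify(lastResponse)}`
  );
}
```

In practice, jobs that were already installed on an earlier attempt can come back as 4xx errors on the retry, which the real logic treats as acceptable (as noted in the summary further down).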
Flaky Test Runner Stats: 🎉 All tests passed! (kibana-flaky-test-suite-runner#6517)
[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 150/150 tests passed.
The previous 150x run with the retry logic is green; that most likely means we have a solution! 🎉 However, since the failure rate was so low, I'm running another 200x to see whether anything pops up: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6525
Flaky Test Runner Stats: 🎉 All tests passed! (kibana-flaky-test-suite-runner#6525)
[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 200/200 tests passed.
The flakiness here ends up being caused by sporadic unavailability of shards during module setup. The underlying cause of that unavailability is likely a race condition between ML, ES, and/or FTR, but luckily we don't need to worry about that because simply retrying the API call causes it to eventually succeed. In those cases, some of the jobs will report a 4xx status, but that's expected. This is the result of a lot of prodding and CPU cycles on CI; see elastic#182183 for the full details.
Alright, I'm happy with the last 350 tests being green. I'm running another batch of flaky runs on #188155, but I'm going to close this for now.
This call was found to be sporadically failing in elastic#182183. This applies the same changes made in elastic#188155, but for Cypress tests instead of FTR.
## Summary

The full chronicle of this endeavor can be found [here](#182183), but [this comment](#182183 (comment)) summarizes the identified issue:

> I [finally found](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368) the cause of these failures in the response to our "setup modules" request to ML. Attaching here for posterity:
>
> <details>
> <summary>Setup Modules Failure Response</summary>
>
> ```json
> {
>   "jobs": [
>     { "id": "v3_linux_anomalous_network_port_activity", "success": true },
>     {
>       "id": "v3_linux_anomalous_network_activity",
>       "success": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "no_shard_available_action_exception",
>               "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>             }
>           ],
>           "type": "search_phase_execution_exception",
>           "reason": "all shards failed",
>           "phase": "query",
>           "grouped": true,
>           "failed_shards": [
>             {
>               "shard": 0,
>               "index": ".ml-anomalies-custom-v3_linux_network_configuration_discovery",
>               "node": "dKzpvp06ScO0OxqHilETEA",
>               "reason": {
>                 "type": "no_shard_available_action_exception",
>                 "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>               }
>             }
>           ]
>         },
>         "status": 503
>       }
>     }
>   ],
>   "datafeeds": [
>     {
>       "id": "datafeed-v3_linux_anomalous_network_port_activity",
>       "success": true,
>       "started": false,
>       "awaitingMlNodeAllocation": false
>     },
>     {
>       "id": "datafeed-v3_linux_anomalous_network_activity",
>       "success": false,
>       "started": false,
>       "awaitingMlNodeAllocation": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "resource_not_found_exception",
>               "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>             }
>           ],
>           "type": "resource_not_found_exception",
>           "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>         },
>         "status": 404
>       }
>     }
>   ],
>   "kibana": {}
> }
> ```
>
> </details>

This branch, then, fixes said issue by (relatively simply) retrying the failed API call until it succeeds.

### Related Issues

Addresses:
- #171426
- #187478
- #187614
- #182009

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [x] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
  - [x] [ESS Rule Execution FTR x 200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)
  - [x] [Serverless Rule Execution FTR x 200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
… (#188259)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Addresses Flakiness in ML FTR tests (#188155)](#188155)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)
This API call was found to be sporadically failing in #182183. This applies the same changes made in #188155, but for Cypress tests instead of FTR.

Since none of the cypress tests are currently skipped, this PR just serves to add robustness to the suite, which performs nearly identical setup to that of the FTR tests. I think the biggest difference is how often these tests are run vs FTRs. Combined with the low failure rate for the underlying issue, cypress's auto-retrying may smooth over many of these failures when they occur.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
  - [ ] [Detection Engine Cypress - ESS x 200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)
  - [ ] [Detection Engine Cypress - Serverless x 200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)
… (#188483)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Fix flake in ML Rule Cypress tests (#188164)](#188164)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ryland Herrick <ryalnd@gmail.com>
## Summary

🚧 🚧