[Fleet] cancel tasks when 3rd retry failed #147190

Merged: juliaElastic merged 3 commits into elastic:main on Dec 8, 2022

@juliaElastic juliaElastic commented Dec 7, 2022

Summary

Related to #144161

When a bulk update tags task failed, it did not stop after 3 retries (which should take less than a minute); instead, the retries kept happening for 2 hours.
This change removes the retry task once 3 retries are reached.
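
The intended behavior can be sketched roughly as follows. This is a minimal illustration and not the Fleet plugin's actual code: `TaskStore`, `InMemoryTaskStore`, `onRetryFailed`, and `MAX_RETRIES` are hypothetical names, and the task ID layout only mirrors the IDs visible in the log excerpt in this description.

```typescript
// Hypothetical sketch: none of these names come from the Fleet plugin.
const MAX_RETRIES = 3;

interface TaskStore {
  schedule(taskId: string, retryCount: number): void;
  remove(taskId: string): void;
}

// Minimal in-memory store so the sketch runs without Kibana.
class InMemoryTaskStore implements TaskStore {
  readonly tasks: Record<string, number> = {};

  schedule(taskId: string, retryCount: number): void {
    this.tasks[taskId] = retryCount;
  }

  remove(taskId: string): void {
    delete this.tasks[taskId];
  }
}

type RetryOutcome = 'rescheduled' | 'stopped';

// Called when retry number `retryCount` has just failed.
function onRetryFailed(
  store: TaskStore,
  taskType: string,
  actionId: string,
  retryCount: number
): RetryOutcome {
  const retryTaskId = `${taskType}:retry:${actionId}`;
  const checkTaskId = `${taskType}:retry:check:${actionId}`;

  if (retryCount >= MAX_RETRIES) {
    // Retry budget is spent: remove both tasks so nothing keeps running.
    store.remove(checkTaskId);
    store.remove(retryTaskId);
    return 'stopped';
  }

  // Budget remains: schedule the next attempt with an incremented counter.
  store.schedule(retryTaskId, retryCount + 1);
  return 'rescheduled';
}
```

The point of the sketch is that the stop condition removes the check task as well as the retry task, so no scheduled task is left behind to restart the cycle.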

I am also testing in a cloud deployment to see whether the tags error can still be reproduced with this fix.
I could reproduce the reported error locally, and it goes away with this fix.

To verify:

  • Add at least 50k agents with the create_agents script in the kibana repo
  • Open Kibana, select the 50k agents, and open Actions / Add tags
  • Within a few seconds, add 2 new tags and remove one of them
  • Wait about 30s; the agents should reflect the changes
  • Check the logs to see that the tasks are removed once the 3rd retry is reached or the task succeeds
  • Check that there are no more running tasks. Any running task can be found in Kibana Console by running this query: GET .kibana_task_manager/_search?q=task.taskType:"fleet:update_agent_tags:retry"
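
The last check can also be expressed as a small helper. This is only a sketch: `buildTaskQuery`, `countMatchingTasks`, and `SearchResponseLike` are hypothetical names, and only the index name and task type are taken from the steps above. It assumes the usual Elasticsearch search response shape, where `hits.total` is an object with a `value` field, or a plain number when totals are requested as integers.

```typescript
// Hypothetical helpers; only the index name and task type come from the PR.
const TASK_INDEX = '.kibana_task_manager';

// Builds the Kibana Console request line for a given task type.
function buildTaskQuery(taskType: string): string {
  return `GET ${TASK_INDEX}/_search?q=task.taskType:"${taskType}"`;
}

// Subset of a search response: hits.total is either a number
// or an object of the form { value, relation }.
interface SearchResponseLike {
  hits: { total: number | { value: number; relation?: string } };
}

// Returns how many task documents matched; 0 means no tasks are left.
function countMatchingTasks(response: SearchResponseLike): number {
  const total = response.hits.total;
  return typeof total === 'number' ? total : total.value;
}
```

A count of 0 for `fleet:update_agent_tags:retry` would indicate that the retry tasks were cleaned up.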

I simulated an error locally to test that the retry (and check) tasks are removed:

[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet] Retry #3 of task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b failed: failing task
[2022-12-07T15:52:16.416+01:00][WARN ][plugins.fleet] Stopping after 3rd retry. Error: failing task
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
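
For anyone who wants to check such log output programmatically, a rough sketch follows. `removedTaskIds` and `bothTasksRemoved` are hypothetical helpers, and the only assumption about the log format is the "Removing task <id>" wording shown above.

```typescript
// Hypothetical helpers for scanning Fleet log output.

// Collects the task IDs from every "Removing task <id>" line.
function removedTaskIds(logText: string): string[] {
  const ids: string[] = [];
  for (const line of logText.split('\n')) {
    const match = /Removing task (\S+)/.exec(line);
    if (match && match[1]) {
      ids.push(match[1]);
    }
  }
  return ids;
}

// True when both the retry task and its check task were removed.
function bothTasksRemoved(logText: string, taskType: string, actionId: string): boolean {
  const ids = removedTaskIds(logText);
  return (
    ids.indexOf(`${taskType}:retry:${actionId}`) !== -1 &&
    ids.indexOf(`${taskType}:retry:check:${actionId}`) !== -1
  );
}
```

Applied to the four log lines above with task type `fleet:update_agent_tags` and ID `848984ab-c11d-4ebe-8d1f-606143dd656b`, `bothTasksRemoved` returns `true`.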

@juliaElastic juliaElastic added release_note:skip Skip the PR/issue when compiling release notes ci:cloud-deploy Create or update a Cloud deployment v8.7.0 v8.6.1 labels Dec 7, 2022
@juliaElastic juliaElastic self-assigned this Dec 7, 2022
@juliaElastic juliaElastic marked this pull request as ready for review December 7, 2022 15:45
@juliaElastic juliaElastic requested a review from a team as a code owner December 7, 2022 15:45
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Dec 7, 2022
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@kpollich kpollich changed the title cancel tasks when 3rd retry failed [Fleet] cancel tasks when 3rd retry failed Dec 7, 2022
@juliaElastic juliaElastic removed the ci:cloud-deploy Create or update a Cloud deployment label Dec 7, 2022
@kibana-ci

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 445 | 451 | +6 |
| total | | | +20 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 68 | 74 | +6 |
| osquery | 110 | 117 | +7 |
| securitySolution | 521 | 527 | +6 |
| total | | | +21 |

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 431c32b into elastic:main Dec 8, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 8, 2022
(cherry picked from commit 431c32b)
@kibanamachine

💚 All backports created successfully

Status Branch Result
8.6

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Dec 8, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] cancel tasks when 3rd retry failed
(#147190)](#147190)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v8.6.0 v8.6.1 v8.7.0
5 participants