[Ingest Manager] Retryable downloads of beats #19102
Pinging @elastic/ingest-management (Team:Ingest Management)
I really like this; it's great to see the cleanup and the retry.
I think we should really cover this path. Could you add a unit test?
Thanks for adding the test, looks great!
I left a comment on what needs to be updated for it to land. It's just a small change from the PR I merged with the GRPC flip.
// examples:
// - Start does not need to run if process is running
// - Fetch does not need to run if package is already present
func (o *retryableOperations) Check() (bool, error) {
Check() has become Check(app Application). Need to update this.
[Ingest Manager] Retryable downloads of beats (elastic#19102)
What does this PR do?
Background:
When the agent downloads an artifact and the checksum does not match, it reports a failure. However, when the download is triggered again (because of a new config or for any other reason), it may be skipped because the artifact is already present: either the earlier download completed without error, or the packed artifacts are invalid.
The agent cleans up a downloaded artifact only when the download itself returns an error. If the download does not return an error but the artifact is corrupted, we can end up in a loop: the agent tries to verify the artifact, finds that it is invalid, fails, and then repeats the same steps.
This PR changes this behavior a bit.
If Verify fails, it cleans up the downloaded artifacts (artifact + hash).
It also introduces a retryable block within the operation flow.
We know that download + verify can be error-prone, so we retry them when a failure happens (only if retry.enabled == true).
For the agent, this means that when it tries to install from a corrupted artifact, it removes the artifact during Verify and downloads it again.
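The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the names step, retryConfig, runRetryable, and fakeStore, the maxAttempts and delay fields, and the .sha512 suffix for the hash file are all assumptions made for the example. Only the idea (Verify removes artifact + hash on mismatch, and the fetch + verify block is retried when retry is enabled) comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// step is one stage of a hypothetical operation flow (for example fetch or verify).
type step struct {
	name string
	run  func() error
}

// retryConfig loosely mirrors the retry.enabled setting mentioned above.
// maxAttempts and delay are assumptions made for this sketch.
type retryConfig struct {
	enabled     bool
	maxAttempts int
	delay       time.Duration
}

// runAll runs the steps in order and stops at the first failure.
func runAll(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// runRetryable runs the whole block of steps. When retries are enabled and a
// step fails, the block is restarted from its first step after a delay.
func runRetryable(cfg retryConfig, steps ...step) error {
	attempts := 1
	if cfg.enabled && cfg.maxAttempts > 1 {
		attempts = cfg.maxAttempts
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(cfg.delay)
		}
		if lastErr = runAll(steps); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempt(s): %w", attempts, lastErr)
}

var errChecksum = errors.New("checksum mismatch")

// fakeStore stands in for the local download directory (path -> content).
type fakeStore struct {
	files     map[string]string
	downloads int
}

// fetch skips the download when the artifact is already present.
func (s *fakeStore) fetch(path string) error {
	if _, ok := s.files[path]; ok {
		return nil
	}
	s.downloads++
	s.files[path] = "good"
	s.files[path+".sha512"] = "good" // hash file name is an assumption
	return nil
}

// verify removes the artifact and its hash on mismatch, so that the next
// fetch downloads a fresh copy instead of skipping.
func (s *fakeStore) verify(path string) error {
	if s.files[path] != s.files[path+".sha512"] {
		delete(s.files, path)
		delete(s.files, path+".sha512")
		return errChecksum
	}
	return nil
}

func main() {
	// Start with a corrupted artifact already on disk.
	store := &fakeStore{files: map[string]string{
		"beat.tar.gz":        "corrupted",
		"beat.tar.gz.sha512": "good",
	}}
	cfg := retryConfig{enabled: true, maxAttempts: 3, delay: 10 * time.Millisecond}
	err := runRetryable(cfg,
		step{name: "fetch", run: func() error { return store.fetch("beat.tar.gz") }},
		step{name: "verify", run: func() error { return store.verify("beat.tar.gz") }},
	)
	fmt.Println("error:", err, "downloads:", store.downloads)
}
```

In this sketch the first attempt skips the download (the file is already present), fails in verify, and removes both files; the second attempt then downloads a fresh copy and verifies it. Without the cleanup in verify, every retry would skip the download and fail the same way.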
Why is it important?
Makes the download scenario more robust and the repair loop faster.
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test
retry.enabled: true
Observe that it fails with the packed artifact, waits 30s, and then downloads the artifact from the web.