Fix input reload under Elastic-Agent #35250
This commit fixes the input reload issue under Elastic-Agent by adding infinite retry logic to the ManagerV2. This is exactly the same logic used by the configuration reload on a standalone Beat. Now, when reloading inputs, if there is at least one `common.ErrInputNotFinished`, a `forceReload` flag is set to true and the debounce timer is started. This process repeats until no `common.ErrInputNotFinished` is returned. The `changeDebounce` period is increased to 1s and the `forceReloadDebounce` period is set to `10 x changeDebounce`.
90202aa
to
68df822
Compare
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
Use the `errors` package API instead of type conversion.
Address all lint issues
The tests implemented by this commit ensure that `ManagerV2` can recover from `common.ErrInputNotFinished` given that the underlying implementations have not changed. See the comments on the code for more details. Other E2E tests will be implemented to avoid regressions.
59ecf63
to
2ba009a
Compare
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Let's get the test in #35178 (comment) implemented before we merge this to make sure we have truly solved the problem with a real instance of Filebeat. The code changed here has a history of unexpected bugs, so let's get as much automated testing added here as we can.
Co-authored-by: Denis <denis@rdner.de>
Looks good!
return
}
waiting = true
when = time.Now().Add(10 * time.Second)
Why 10 seconds? Can this be faster? Will this test be flaky?
The code on libbeat will wait for 10s in case of a failure to start the input. The main idea is to wait for all events from the file that caused the "error" to be published. Even if we wait less than 10s here, Filebeat will not respond any quicker.
Is there some state or event update we can poll on instead? Is there a way to watch for the event you are waiting on directly?
Unfortunately not. On my machine the tests take 40s to run. That's not very long for an integration test. Even if I reduce this constant to 500ms it still takes 40s (the first run is the current 10s wait, the second is a 500ms wait):
tiago@millennium-falcon beats/x-pack/filebeat/tests/integration v1.19.6 fix-input-reload-under-agent [$!] % go test -tags=integration -v ./...
=== RUN TestInputReloadUnderElasticAgent
input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672
input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672' removed
--- PASS: TestInputReloadUnderElasticAgent (40.37s)
PASS
ok github.com/elastic/beats/v7/x-pack/filebeat/tests/integration 40.565s
tiago@millennium-falcon beats/x-pack/filebeat/tests/integration v1.19.6 fix-input-reload-under-agent [$] % go test -tags=integration -v ./...
=== RUN TestInputReloadUnderElasticAgent
input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799
input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799' removed
--- PASS: TestInputReloadUnderElasticAgent (40.37s)
PASS
ok github.com/elastic/beats/v7/x-pack/filebeat/tests/integration 40.588s
tiago@millennium-falcon beats/x-pack/filebeat/tests/integration v1.19.6 fix-input-reload-under-agent [$!] %
Was able to successfully run the new integration test 👍
@rdner regarding #35250 (comment). All the code I'm testing is under |
A couple of more typos.
Co-authored-by: Denis <denis@rdner.de>
…5250) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de>
@belimawr let's backport to 8.8 so this can be released in v8.8.1.
…5250) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de> (cherry picked from commit 137bc81)
…5250) (#35641) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de> (cherry picked from commit 137bc81) Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
What does this PR do?
This PR fixes the input reload issue under Elastic-Agent by adding infinite retry logic to the ManagerV2. This is exactly the same logic used by the configuration reload on a standalone Beat.
Now, when reloading inputs, if there is at least one `common.ErrInputNotFinished`, a `forceReload` flag is set to true and the debounce timer is started. This process repeats until no `common.ErrInputNotFinished` is returned.
The `changeDebounce` period is increased to 1s and the `forceReloadDebounce` period is set to `10 x changeDebounce`.
Why is it important?
Fixes #33653
Closes #35178
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Author's Checklist
Use the `errors` package API instead of type conversion
How to test this PR locally
1. Create the policies
You will need two policies. They can be either policies managed by Fleet or policies used with a standalone Elastic-Agent.
/tmp/flog.log
Now you have two identical policies that will ingest the same file. For a standalone Elastic-Agent setup, you can just copy those policies. This effectively mimics the behaviour of having two different policies with the System integration collecting logs.
Here is an example of a simple policy that will also send the Elastic-Agent logs to ES. If using standalone mode, make sure to have 3 files: two different policies (you only need to change the IDs) and the `elastic-agent.yml` that is the running configuration for the Elastic-Agent.
elastic-agent.yml
2. Deploy the Elastic-Agent
Deploy the Elastic-Agent on your test system (VM is highly recommended).
The easiest way to do this is to follow the steps on Fleet to add a new Agent and, instead of running `sudo ./elastic-agent install...`, run `./elastic-agent enroll...`, like this:
Then you can start the Elastic-Agent in the background by running:
If using Fleet, go to the Agent logs and set the log level to `debug`.
3. Create a log file and keep adding data to it
My preferred way to do so is to use flog. Make sure to run it in the background or on another terminal/session.
4. Validate that you're getting both Elastic-Agent logs
Go to Kibana and make sure both the log file `/tmp/flog.log` and the Elastic-Agent logs are being ingested.
5. Change the policy
If using Fleet, change the policy in the Fleet UI. If using a standalone Elastic-Agent, copy the other policy over the `elastic-agent.yml`; the Elastic-Agent will automatically detect the change and propagate the new policy.
You will have to change the policy a few times until the issue happens; in my tests it usually happens on the first try.
6. Make sure the Agent is always healthy
If this PR works, the Elastic-Agent will not go to an unhealthy state, and you will see logs like those:
Also you must not see logs like those:
`elastic-agent.yml`, it reloads the file automatically
Related issues