Fix input reload under Elastic-Agent #35250

belimawr · 2023-04-27T15:54:51Z

What does this PR do?

This PR fixes the input reload issue under Elastic-Agent by creating an infinity retry logic in the ManagerV2. This is exactly the same logic used by the configuration reload on a standalone Beat.

Now if when reloading inputs, there is at least one common.ErrInputNotFinished a forceReload flag is set to true and the debounce timer is started. This process will repeat untill no common.ErrInputNotFinished is returned.

The changeDebounce period is increased to 1s and the forceReloadDebounce period is set to 10 x changeDebounce.

Why is it important?

Fixes #33653
Closes #35178

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~- [ ] I have made corresponding changes to the documentation~~
~~- [ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

Manual testing under Elastic-Agent on Kubernetes
Use the errors package API instead of type conversion
Improve "How to test this PR locally" section

How to test this PR locally

1. Create the policies

You will need two polices, they can either be a policy managed by Fleet or policies used with a standalone Elastic-Agent.

Create a policy with a single integration, the Custom Logs integration and configure it to read a single log file, e.g: /tmp/flog.log
Duplicate this integration on Fleet UI

Now you have two identical policies that will ingest the same file. For a standalone Elastic-Agent setup, you can just copy those policies. This effectively mimics the behaviour of having two different policies with the System integration collecting logs.

Here is an example of a simple policy that will also send the Elastic-Agent logs to ES. If using standalone mode make sure to have 3 files: two different policies (you only need to change the IDs and the elastic-agent.yml that is the running configuration for the Elastic-Agent.

elastic-agent.yml

id: 63ec76c0-e355-11ed-9305-056caabbcc7c
revision: 42
outputs:
  default:
    type: elasticsearch
    hosts:
      - >-
        https://my-cloud.gcp.elastic-cloud.com:443
agent:
  download:
    sourceURI: 'https://artifacts.elastic.co/downloads/'
  monitoring:
    enabled: true
    use_output: default
    namespace: default
    logs: true
    metrics: false
inputs:
  - id: logfile-logs-76aab6f0-ed55-11ed-9305-056c5b7bdf7c
    name: log-1
    revision: 1
    type: logfile
    use_output: default
    meta:
      package:
        name: log
        version: 2.0.0
    data_stream:
      namespace: default
    package_policy_id: 76aab6f0-ed55-11ed-9305-056c5b7bdf7c
    streams:
      - id: logfile-log.logs-76aab6f0-ed55-11ed-9305-056c5b7bdf7c
        data_stream:
          dataset: generic
        paths:
          - /tmp/flog.log

2. Deploy the Elastic-Agent

Deploy the Elastic-Agent on your test system (VM is highly recommended).
The easiest way to do this is to follow the steps on Feet to add a new Agent and instead of running sudo ./elastic-agent install..., you can run ./elastic-agent enroll..., like this:

./elastic-agent install --url=https://my-cloud.fleet.gcp.elastic-cloud.com:443 --enrollment-token=<my enrollment token>

Then you can start the Elastic-Agent in the background by running:

./elastic-agent &

If using Fleet, go to the Agent logs and set the log level to debug.

3. Create a log file and keep adding data to it

My preferred way to do so is to use flog. Make sure to run it in the background or on another terminal/session.

flog -d 1 -s 1 -l > /tmp/flog.log &

4. Validate that you're getting both Elastic-Agent logs

Go to Kibana and make sure you have both log file /tmp/flog.log and Elastic-Agent logs are being ingested.

5. Change the policy

If using Fleet, then change the policy on Fleet UI, if using a standalone Elastic-Agent copy the other policy over the elastic-agent.yml, the Elastic-Agent will automatically detect the change and propagate the new policy.

You will have to change the policy a few times until the issue happens, on my tests it usually happens on the first try.

6. Make sure the Agent is always health

If this PR works, the Elastic-Agent will not go to a unhealthy state, and you will see logs like those:

{"log.level":"debug","@timestamp":"2023-05-02T17:48:38.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":693},"message":"file '/tmp/foo.log' is not finished, will retry starting the input soon","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:38.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":695},"message":"ForceReload set to TRUE","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":608},"message":"Skipped reloading output; configuration didn't change","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":668},"message":"Reloading Beats inputs because forceReload is true. Set log level to debug to get more information about which inputs are causing this.","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":710},"message":"ForceReload set to FALSE","ecs.version":"1.6.0"}

Also you must not see logs like those:

{"log.level":"info","@timestamp":"2022-11-09T18:55:33.554+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":162},"message":"Status change to Failed: 1 error occurred:\n\t* 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::149908-33, Finished: false, Fileinfo: &{flog.log 7722175 420 {729849109 63803613311 0x55d878467f20} {33 149908 1 33188 1000 1000 0 0 7722175 4096 15088 {1668016405 809851333} {1668016511 729849109} {1668016511 729849109} [0 0 0]}}, Source: /tmp/flog.log, Offset: 7724092, Timestamp: 2022-11-09 18:55:32.698690878 +0100 CET m=+20.865936562, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 149908-33}\n\n","service.name":"filebeat","ecs.version":"1.6.0"}

Create an Elastic-Agent policy with a single 'Custom Logs' (log input)
Duplicate this policy. Either on Fleet UI or if using standalone Agent, copy the file and change all IDs (policy, input, etc)
Deploy the Elastic-Agent
Keep changing between the policies (for standalone keep copying the policies over the elastic-agent.yml, it reloads the file automatically
The Elastic-Agent should never be unhealthy.

Related issues

Closes Input reload not working as expected under Elastic-Agent #33653
Will be tested by [Testing] Input reload not working as expected under Elastic-Agent #35178

~~## Use cases~~
~~## Screenshots~~
~~## Logs~~

This commit fixes the input reload issue under Elastic-Agent by creating an infinity retry logic in the ManagerV2. This is exactly the same logic used by the configuration reload on a standalone Beat. Now if when reloading inputs, there is at least one `common.ErrInputNotFinished` a `forceReload` flag is set to true and the debounce timer is started. This process will repeat untill no `common.ErrInputNotFinished` is returned. The `changeDebounce` period is increased to 1s and the `forceReloadDebounce` period is set to `10 x changeDebounce`.

mergify · 2023-04-27T15:55:25Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

x-pack/libbeat/management/managerV2.go

elasticmachine · 2023-04-27T17:17:29Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-05-31T10:14:03.135+0000
Duration: 87 min 23 sec

Test stats 🧪

Test	Results
Failed	0
Passed	26445
Skipped	1975
Total	28420

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Use the `errors` package API instead of type conversion.

Address all lint issues

The tests implemented by this commit ensure that `ManagerV2` can recover from `common.ErrInputNotFinished` given that the underlying implementations have not changed. See the comments on the code for more details. Other E2E testes will be implemented to avoid regressions.

elasticmachine · 2023-05-02T15:51:33Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz · 2023-05-02T17:21:39Z

Let's get the test in #35178 (comment) implemented before we merge this to make sure we have truly solved the problem with a real instance of Filebeat.

The code changed here has a history of unexpected bugs, so let's get as much automated testing added here as we can.

Co-authored-by: Denis <denis@rdner.de>

blakerouse

Looks good!

x-pack/filebeat/tests/integration/input_reload_test.go

cmacknz · 2023-05-23T19:53:15Z

x-pack/filebeat/tests/integration/input_reload_test.go

+			return
+		}
+		waiting = true
+		when = time.Now().Add(10 * time.Second)


Why 10 seconds? Can this be faster? Will this test be flaky?

The code on libbeat will wait for 10s in case of a failure to start the input. The main idea being to wait all events from the file who caused the "error" to be published. Even if we wait less than 10s here Filebeat will not respond quicker.

Is there some state or event update we can poll on instead? Is there a way to watch for the event you are waiting on directly?

Unfortunately not. On my machine the tests take 40s to run. It's not super long for an integration tests. Even if I reduce this constant to 500ms it still takes 40s (the first run is the current 10s wait, the second is a 500ms wait):

tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$!] % go test -tags=integration -v ./... === RUN TestInputReloadUnderElasticAgent input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672 input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672' removed --- PASS: TestInputReloadUnderElasticAgent (40.37s) PASS ok github.com/elastic/beats/v7/x-pack/filebeat/tests/integration 40.565s tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$] % go test -tags=integration -v ./... === RUN TestInputReloadUnderElasticAgent input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799 input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799' removed --- PASS: TestInputReloadUnderElasticAgent (40.37s) PASS ok github.com/elastic/beats/v7/x-pack/filebeat/tests/integration 40.588s tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$!] %

x-pack/filebeat/tests/integration/input_reload_test.go

x-pack/libbeat/management/managerV2.go

cmacknz · 2023-05-24T18:11:14Z

Was able to successfully run the new integration test 👍

belimawr · 2023-05-30T17:00:22Z

@rdner regarding #35250 (comment).

All the code I'm testing is under x-pack, so I just kept everything there there.

belimawr · 2023-05-30T17:04:17Z

@rdner, @cmacknz I addressed/responded all your comments, could you re-review?

rdner

A couple of more typos.

x-pack/filebeat/tests/integration/input_reload_test.go

x-pack/libbeat/management/managerV2.go

Co-authored-by: Denis <denis@rdner.de>

…5250) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de>

cmacknz · 2023-06-01T14:46:37Z

@belimawr let's backport to 8.8 so this can be released in v8.8.1.

…5250) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de> (cherry picked from commit 137bc81)

…5250) (#35641) This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat. When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered. Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value. --------- Co-authored-by: Blake Rouse <blake.rouse@elastic.co> Co-authored-by: Anderson Queiroz <me@andersonq.me> Co-authored-by: Denis <denis@rdner.de> (cherry picked from commit 137bc81) Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 27, 2023

belimawr force-pushed the fix-input-reload-under-agent branch from 90202aa to 68df822 Compare April 27, 2023 15:55

belimawr added the Team:Elastic-Agent Label for the Agent team label Apr 27, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 27, 2023

belimawr self-assigned this Apr 27, 2023

belimawr commented Apr 27, 2023

View reviewed changes

x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved

belimawr commented Apr 27, 2023

View reviewed changes

x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved

AndersonQ reviewed Apr 27, 2023

View reviewed changes

x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved

belimawr added 2 commits April 28, 2023 12:52

PR improvements

7fce900

Use the `errors` package API instead of type conversion.

Address lint issues

4d0732b

Address all lint issues

belimawr added the backport-v8.8.0 Automated backport with mergify label Apr 28, 2023

belimawr added 2 commits May 2, 2023 12:53

Remove test debug logs

2ba009a

belimawr force-pushed the fix-input-reload-under-agent branch from 59ecf63 to 2ba009a Compare May 2, 2023 11:01

belimawr added 3 commits May 2, 2023 14:14

fmt and add lisence headers

66176c8

make lint happy again

b31bd28

Add changelog

c902ab8

belimawr marked this pull request as ready for review May 2, 2023 15:51

belimawr requested a review from a team as a code owner May 2, 2023 15:51

belimawr requested review from ycombinator and rdner and removed request for a team May 2, 2023 15:51

belimawr requested review from blakerouse and fearful-symmetry May 2, 2023 15:56

improve log entries

dd0a47e

belimawr and others added 3 commits May 17, 2023 12:49

stop logging gorourine when test finishes

49f314d

Apply suggestions from code review

e9ce324

Co-authored-by: Denis <denis@rdner.de>

Update x-pack/libbeat/management/managerV2.go

d36f486

Co-authored-by: Denis <denis@rdner.de>

blakerouse approved these changes May 17, 2023

View reviewed changes

AndersonQ approved these changes May 23, 2023

View reviewed changes

cmacknz reviewed May 23, 2023

View reviewed changes

x-pack/filebeat/tests/integration/input_reload_test.go Outdated Show resolved Hide resolved

cmacknz reviewed May 23, 2023

View reviewed changes

x-pack/filebeat/tests/integration/input_reload_test.go Show resolved Hide resolved

cmacknz reviewed May 23, 2023

View reviewed changes

x-pack/libbeat/management/managerV2.go Show resolved Hide resolved

cmacknz reviewed May 23, 2023

View reviewed changes

x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved

belimawr added 2 commits May 30, 2023 18:04

use env vars with defaults for credentials

a98f0a9

replace uses of Cause by errors.Unwrap

8eece56

belimawr requested review from rdner and cmacknz May 30, 2023 17:04

rdner reviewed May 31, 2023

View reviewed changes

x-pack/filebeat/tests/integration/input_reload_test.go Outdated Show resolved Hide resolved

x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved

belimawr and others added 2 commits May 31, 2023 12:12

Update x-pack/filebeat/tests/integration/input_reload_test.go

844194a

Co-authored-by: Denis <denis@rdner.de>

Update x-pack/libbeat/management/managerV2.go

a37a275

Co-authored-by: Denis <denis@rdner.de>

rdner approved these changes May 31, 2023

View reviewed changes

belimawr merged commit 137bc81 into elastic:main May 31, 2023

belimawr deleted the fix-input-reload-under-agent branch May 31, 2023 13:50

cmacknz added the backport-v8.8.0 Automated backport with mergify label Jun 1, 2023

mergify bot mentioned this pull request Jun 1, 2023

[8.8](backport #35250) Fix input reload under Elastic-Agent #35641

Merged

reakaleek mentioned this pull request Jul 19, 2023

Fix ironbank validation in 8.8 #36115

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix input reload under Elastic-Agent #35250

Fix input reload under Elastic-Agent #35250

belimawr commented Apr 27, 2023 •

edited

Loading

mergify bot commented Apr 27, 2023

elasticmachine commented Apr 27, 2023 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

elasticmachine commented May 2, 2023

cmacknz commented May 2, 2023 •

edited

Loading

blakerouse left a comment

cmacknz May 23, 2023

belimawr May 25, 2023

cmacknz May 25, 2023

belimawr May 30, 2023

cmacknz commented May 24, 2023

belimawr commented May 30, 2023

belimawr commented May 30, 2023

rdner left a comment

cmacknz commented Jun 1, 2023

Fix input reload under Elastic-Agent #35250

Fix input reload under Elastic-Agent #35250

Conversation

belimawr commented Apr 27, 2023 • edited Loading

What does this PR do?

Why is it important?

Checklist

Author's Checklist

How to test this PR locally

1. Create the policies

2. Deploy the Elastic-Agent

3. Create a log file and keep adding data to it

4. Validate that you're getting both Elastic-Agent logs

5. Change the policy

6. Make sure the Agent is always health

Related issues

mergify bot commented Apr 27, 2023

elasticmachine commented Apr 27, 2023 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

elasticmachine commented May 2, 2023

cmacknz commented May 2, 2023 • edited Loading

blakerouse left a comment

Choose a reason for hiding this comment

cmacknz May 23, 2023

Choose a reason for hiding this comment

belimawr May 25, 2023

Choose a reason for hiding this comment

cmacknz May 25, 2023

Choose a reason for hiding this comment

belimawr May 30, 2023

Choose a reason for hiding this comment

cmacknz commented May 24, 2023

belimawr commented May 30, 2023

belimawr commented May 30, 2023

rdner left a comment

Choose a reason for hiding this comment

cmacknz commented Jun 1, 2023

belimawr commented Apr 27, 2023 •

edited

Loading

elasticmachine commented Apr 27, 2023 •

edited by jenkins-beats-ci bot

Loading

cmacknz commented May 2, 2023 •

edited

Loading