Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix input reload under Elastic-Agent #35250

Merged
merged 47 commits into from
May 31, 2023

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Apr 27, 2023

What does this PR do?

This PR fixes the input reload issue under Elastic-Agent by creating an infinity retry logic in the ManagerV2. This is exactly the same logic used by the configuration reload on a standalone Beat.

Now if when reloading inputs, there is at least one common.ErrInputNotFinished a forceReload flag is set to true and the debounce timer is started. This process will repeat untill no common.ErrInputNotFinished is returned.

The changeDebounce period is increased to 1s and the forceReloadDebounce period is set to 10 x changeDebounce.

Why is it important?

Fixes #33653
Closes #35178

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Manual testing under Elastic-Agent on Kubernetes
  • Use the errors package API instead of type conversion
  • Improve "How to test this PR locally" section

How to test this PR locally

1. Create the policies

You will need two polices, they can either be a policy managed by Fleet or policies used with a standalone Elastic-Agent.

  1. Create a policy with a single integration, the Custom Logs integration and configure it to read a single log file, e.g: /tmp/flog.log
  2. Duplicate this integration on Fleet UI

Now you have two identical policies that will ingest the same file. For a standalone Elastic-Agent setup, you can just copy those policies. This effectively mimics the behaviour of having two different policies with the System integration collecting logs.

Here is an example of a simple policy that will also send the Elastic-Agent logs to ES. If using standalone mode make sure to have 3 files: two different policies (you only need to change the IDs and the elastic-agent.yml that is the running configuration for the Elastic-Agent.

elastic-agent.yml
id: 63ec76c0-e355-11ed-9305-056caabbcc7c
revision: 42
outputs:
  default:
    type: elasticsearch
    hosts:
      - >-
        https://my-cloud.gcp.elastic-cloud.com:443
agent:
  download:
    sourceURI: 'https://artifacts.elastic.co/downloads/'
  monitoring:
    enabled: true
    use_output: default
    namespace: default
    logs: true
    metrics: false
inputs:
  - id: logfile-logs-76aab6f0-ed55-11ed-9305-056c5b7bdf7c
    name: log-1
    revision: 1
    type: logfile
    use_output: default
    meta:
      package:
        name: log
        version: 2.0.0
    data_stream:
      namespace: default
    package_policy_id: 76aab6f0-ed55-11ed-9305-056c5b7bdf7c
    streams:
      - id: logfile-log.logs-76aab6f0-ed55-11ed-9305-056c5b7bdf7c
        data_stream:
          dataset: generic
        paths:
          - /tmp/flog.log

2. Deploy the Elastic-Agent

Deploy the Elastic-Agent on your test system (VM is highly recommended).
The easiest way to do this is to follow the steps on Feet to add a new Agent and instead of running sudo ./elastic-agent install..., you can run ./elastic-agent enroll..., like this:

./elastic-agent install --url=https://my-cloud.fleet.gcp.elastic-cloud.com:443 --enrollment-token=<my enrollment token>

Then you can start the Elastic-Agent in the background by running:

./elastic-agent &

If using Fleet, go to the Agent logs and set the log level to debug.

3. Create a log file and keep adding data to it

My preferred way to do so is to use flog. Make sure to run it in the background or on another terminal/session.

flog -d 1 -s 1 -l > /tmp/flog.log &

4. Validate that you're getting both Elastic-Agent logs

Go to Kibana and make sure you have both log file /tmp/flog.log and Elastic-Agent logs are being ingested.

5. Change the policy

If using Fleet, then change the policy on Fleet UI, if using a standalone Elastic-Agent copy the other policy over the elastic-agent.yml, the Elastic-Agent will automatically detect the change and propagate the new policy.

You will have to change the policy a few times until the issue happens, on my tests it usually happens on the first try.

6. Make sure the Agent is always health

If this PR works, the Elastic-Agent will not go to a unhealthy state, and you will see logs like those:

{"log.level":"debug","@timestamp":"2023-05-02T17:48:38.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":693},"message":"file '/tmp/foo.log' is not finished, will retry starting the input soon","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:38.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":695},"message":"ForceReload set to TRUE","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":608},"message":"Skipped reloading output; configuration didn't change","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":668},"message":"Reloading Beats inputs because forceReload is true. Set log level to debug to get more information about which inputs are causing this.","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-05-02T17:48:48.926+0200","log.logger":"centralmgmt.V2-manager","log.origin":{"file.name":"management/managerV2.go","file.line":710},"message":"ForceReload set to FALSE","ecs.version":"1.6.0"}

Also you must not see logs like those:

{"log.level":"info","@timestamp":"2022-11-09T18:55:33.554+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":162},"message":"Status change to Failed: 1 error occurred:\n\t* 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::149908-33, Finished: false, Fileinfo: &{flog.log 7722175 420 {729849109 63803613311 0x55d878467f20} {33 149908 1 33188 1000 1000 0 0 7722175 4096 15088 {1668016405 809851333} {1668016511 729849109} {1668016511 729849109} [0 0 0]}}, Source: /tmp/flog.log, Offset: 7724092, Timestamp: 2022-11-09 18:55:32.698690878 +0100 CET m=+20.865936562, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 149908-33}\n\n","service.name":"filebeat","ecs.version":"1.6.0"}
  • Create an Elastic-Agent policy with a single 'Custom Logs' (log input)
  • Duplicate this policy. Either on Fleet UI or if using standalone Agent, copy the file and change all IDs (policy, input, etc)
  • Deploy the Elastic-Agent
  • Keep changing between the policies (for standalone keep copying the policies over the elastic-agent.yml, it reloads the file automatically
  • The Elastic-Agent should never be unhealthy.

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 27, 2023
This commit fixes the input reload issue under Elastic-Agent by
creating an infinity retry logic in the ManagerV2. This is exactly the
same logic used by the configuration reload on a standalone Beat.

Now if when reloading inputs, there is at least one
`common.ErrInputNotFinished` a `forceReload` flag is set to true and
the debounce timer is started. This process will repeat untill no
`common.ErrInputNotFinished` is returned.

The `changeDebounce` period is increased to 1s and the
`forceReloadDebounce` period is set to `10 x changeDebounce`.
@belimawr belimawr force-pushed the fix-input-reload-under-agent branch from 90202aa to 68df822 Compare April 27, 2023 15:55
@belimawr belimawr added the Team:Elastic-Agent Label for the Agent team label Apr 27, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 27, 2023
@belimawr belimawr self-assigned this Apr 27, 2023
@mergify
Copy link
Contributor

mergify bot commented Apr 27, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Apr 27, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-05-31T10:14:03.135+0000

  • Duration: 87 min 23 sec

Test stats 🧪

Test Results
Failed 0
Passed 26445
Skipped 1975
Total 28420

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Use the `errors` package API instead of type conversion.
Address all lint issues
@belimawr belimawr added the backport-v8.8.0 Automated backport with mergify label Apr 28, 2023
belimawr added 2 commits May 2, 2023 12:53
The tests implemented by this commit ensure that `ManagerV2` can
recover from `common.ErrInputNotFinished` given that the underlying
implementations have not changed. See the comments on the code for
more details.

Other E2E testes will be implemented to avoid regressions.
@belimawr belimawr force-pushed the fix-input-reload-under-agent branch from 59ecf63 to 2ba009a Compare May 2, 2023 11:01
@belimawr belimawr marked this pull request as ready for review May 2, 2023 15:51
@belimawr belimawr requested a review from a team as a code owner May 2, 2023 15:51
@belimawr belimawr requested review from ycombinator and rdner and removed request for a team May 2, 2023 15:51
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Copy link
Member

cmacknz commented May 2, 2023

Let's get the test in #35178 (comment) implemented before we merge this to make sure we have truly solved the problem with a real instance of Filebeat.

The code changed here has a history of unexpected bugs, so let's get as much automated testing added here as we can.

belimawr and others added 3 commits May 17, 2023 12:49
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

return
}
waiting = true
when = time.Now().Add(10 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10 seconds? Can this be faster? Will this test be flaky?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code on libbeat will wait for 10s in case of a failure to start the input. The main idea being to wait all events from the file who caused the "error" to be published. Even if we wait less than 10s here Filebeat will not respond quicker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some state or event update we can poll on instead? Is there a way to watch for the event you are waiting on directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. On my machine the tests take 40s to run. It's not super long for an integration tests. Even if I reduce this constant to 500ms it still takes 40s (the first run is the current 10s wait, the second is a 500ms wait):

tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$!] % go test -tags=integration -v ./...
=== RUN   TestInputReloadUnderElasticAgent                                                                      
    input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672
    input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462672' removed
--- PASS: TestInputReloadUnderElasticAgent (40.37s)
PASS
ok      github.com/elastic/beats/v7/x-pack/filebeat/tests/integration   40.565s
tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$] % go test -tags=integration -v ./...
=== RUN   TestInputReloadUnderElasticAgent
    input_reload_test.go:62: Temporary directory: /home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799
    input_reload_test.go:68: Temporary directory '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestInputReloadUnderElasticAgent-1685462799' removed
--- PASS: TestInputReloadUnderElasticAgent (40.37s)
PASS
ok      github.com/elastic/beats/v7/x-pack/filebeat/tests/integration   40.588s
tiago@millennium-falcon beats/x-pack/filebeat/tests/integration  v1.19.6  fix-input-reload-under-agent [$!] % 

@cmacknz
Copy link
Member

cmacknz commented May 24, 2023

Was able to successfully run the new integration test 👍

@belimawr
Copy link
Contributor Author

@rdner regarding #35250 (comment).

All the code I'm testing is under x-pack, so I just kept everything there there.

@belimawr
Copy link
Contributor Author

@rdner, @cmacknz I addressed/responded all your comments, could you re-review?

@belimawr belimawr requested review from rdner and cmacknz May 30, 2023 17:04
Copy link
Member

@rdner rdner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of more typos.

x-pack/filebeat/tests/integration/input_reload_test.go Outdated Show resolved Hide resolved
x-pack/libbeat/management/managerV2.go Outdated Show resolved Hide resolved
belimawr and others added 2 commits May 31, 2023 12:12
Co-authored-by: Denis <denis@rdner.de>
@belimawr belimawr merged commit 137bc81 into elastic:main May 31, 2023
@belimawr belimawr deleted the fix-input-reload-under-agent branch May 31, 2023 13:50
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…5250)

This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat.

When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered.

Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value.

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Anderson Queiroz <me@andersonq.me>
Co-authored-by: Denis <denis@rdner.de>
@cmacknz cmacknz added the backport-v8.8.0 Automated backport with mergify label Jun 1, 2023
@cmacknz
Copy link
Member

cmacknz commented Jun 1, 2023

@belimawr let's backport to 8.8 so this can be released in v8.8.1.

mergify bot pushed a commit that referenced this pull request Jun 1, 2023
…5250)

This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat.

When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered.

Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value.

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Anderson Queiroz <me@andersonq.me>
Co-authored-by: Denis <denis@rdner.de>
(cherry picked from commit 137bc81)
belimawr added a commit that referenced this pull request Jun 2, 2023
…5250) (#35641)

This commit addresses the input reload issue in Elastic-Agent by introducing an infinite retry logic in the ManagerV2. The implemented logic mirrors the configuration reload behavior of a standalone Beat.

When reloading inputs, if there is at least one occurrence of 'common.ErrInputNotFinished', the 'forceReload' flag is set to true, and the debounce timer is initiated. This process will repeat until no 'common.ErrInputNotFinished' error is encountered.

Additionally, the 'changeDebounce' period is extended to 1 second, and the 'forceReloadDebounce' period is set to 10 times the 'changeDebounce' value.

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Anderson Queiroz <me@andersonq.me>
Co-authored-by: Denis <denis@rdner.de>
(cherry picked from commit 137bc81)

Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
@reakaleek reakaleek mentioned this pull request Jul 19, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.8.0 Automated backport with mergify Team:Automation Label for the Observability productivity team Team:Elastic-Agent Label for the Agent team
Projects
None yet
8 participants