Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely start k8s watchers to avoid memory leak #35452

Merged
merged 1 commit into from
May 15, 2023

Conversation

ChrsMark
Copy link
Member

@ChrsMark ChrsMark commented May 12, 2023

What does this PR do?

Fixes #35439

Why is it important?

Background

At #35439, it has been reported that Metricbeat is facing a crucial memory leak that leads to multiple restarts each time Metricbeat Pod gets OOMKilled.
This is happening even in very basic clusters with even just the kubernetes module enabled with very few metricsets.

First we tried to relate it with #33307 and elastic/elastic-agent-autodiscover#31 however it didn't look to be the same since the related known "workaround"(add_resource_metadata.deployment: false) had no effect.

At the same time it was reported that adding the add_metadata: false was solving the issue but with the cost of losing all the k8s metadata for the kubernetes module.

Troubleshooting/investigation

After upgrading the version of k8s.io/client-go in 7.x to v0.23.4 we were getting W0512 20:42:30.555872 7 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed which made us realize that the watcher/informers are getting started every time the Start() (

func (m *enricher) Start() {
) method of the Enricher is called which actually means on every Fetch() call of the metricsets like at .

This leads us to consider that the problematic lines are at https://github.com/elastic/beats/pull/29133/files#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR361-R372 which in 8.x were fixed with 2f79c77#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR386 but the backport #33165 happened after that fix. So we never included the fix in 7.x.

Long story short the original PR introduced an issue which was fixed in main but then only the original PR was backported to 7.x without the fix. This lead us to having an 7.x with the introduced bug.

Indeed, the respective fix in k8s library that would protect us was only introduced in v1.23 as it looks like from kubernetes/client-go@f0bcda0. So @gizas that's why we don't notice the memory leak when we upgrade the library to v1.23 since this version would protect us from starting the informers accidentally many times.

Tests

Using
metricbeat-kubernetes.yaml.txt

  1. [18h-19h]: Deploying a very basic Metricbeat we are reproducing the with v7.17.9 (image: docker.elastic.co/beats/metricbeat:7.17.9)
  2. [19h-23h]: Upgrading the library seems to solve the issue but floods our logs with the Informer's starts errors. (image: chrismark/metricbeat-oom:v0.0.1-rc1)
  3. [23h-...]: Fixing the watcher/informer start process solves the issue permanently. (image: chrismark/metricbeat-oom:v0.0.2)

metricbeat-oom

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
@ChrsMark ChrsMark requested a review from gizas May 12, 2023 21:45
@ChrsMark ChrsMark requested a review from a team as a code owner May 12, 2023 21:45
@ChrsMark ChrsMark self-assigned this May 12, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 12, 2023
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-05-12T21:45:40.125+0000

  • Duration: 47 min 49 sec

Test stats 🧪

Test Results
Failed 0
Passed 2551
Skipped 558
Total 3109

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@ChrsMark ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label May 13, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 13, 2023
@ChrsMark ChrsMark requested review from a team and constanca-m and removed request for a team May 13, 2023 07:11
@gizas
Copy link
Contributor

gizas commented May 15, 2023

I have been testing with image chrismark/metricbeat-oom:v0.0.1-rc2 (that includes the fix) and can not reproduce the issue of Memory increase anymore.
Also warning for The sharedIndexInformer has started, run more than once is not allowed does not appear in the logs

@ChrsMark ChrsMark merged commit 7899ad9 into elastic:7.17 May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
7.17-candidate Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team v7.17.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants