Safely start k8s watchers to avoid memory leak #35452

ChrsMark · 2023-05-12T21:45:23Z

What does this PR do?

Why is it important?

Background

At #35439, it has been reported that Metricbeat is facing a crucial memory leak that leads to multiple restarts each time Metricbeat Pod gets OOMKilled.
This is happening even in very basic clusters with even just the kubernetes module enabled with very few metricsets.

First we tried to relate it with #33307 and elastic/elastic-agent-autodiscover#31 however it didn't look to be the same since the related known "workaround"(add_resource_metadata.deployment: false) had no effect.

At the same time it was reported that adding the add_metadata: false was solving the issue but with the cost of losing all the k8s metadata for the kubernetes module.

Troubleshooting/investigation

After upgrading the version of k8s.io/client-go in 7.x to v0.23.4 we were getting W0512 20:42:30.555872 7 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed which made us realize that the watcher/informers are getting started every time the Start() (

beats/metricbeat/module/kubernetes/util/kubernetes.go

Line 380 in 673233d

func (m *enricher) Start() {

) method of the Enricher is called which actually means on every Fetch() call of the metricsets like at

beats/metricbeat/module/kubernetes/pod/pod.go

Line 90 in 3fdb372

m.enricher.Start()

.

This leads us to consider that the problematic lines are at https://github.com/elastic/beats/pull/29133/files#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR361-R372 which in 8.x were fixed with 2f79c77#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR386 but the backport #33165 happened after that fix. So we never included the fix in 7.x.

Long story short the original PR introduced an issue which was fixed in main but then only the original PR was backported to 7.x without the fix. This lead us to having an 7.x with the introduced bug.

Indeed, the respective fix in k8s library that would protect us was only introduced in v1.23 as it looks like from kubernetes/client-go@f0bcda0. So @gizas that's why we don't notice the memory leak when we upgrade the library to v1.23 since this version would protect us from starting the informers accidentally many times.

Tests

Using
metricbeat-kubernetes.yaml.txt

[18h-19h]: Deploying a very basic Metricbeat we are reproducing the with v7.17.9 (image: docker.elastic.co/beats/metricbeat:7.17.9)
[19h-23h]: Upgrading the library seems to solve the issue but floods our logs with the Informer's starts errors. (image: chrismark/metricbeat-oom:v0.0.1-rc1)
[23h-...]: Fixing the watcher/informer start process solves the issue permanently. (image: chrismark/metricbeat-oom:v0.0.2)

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>

elasticmachine · 2023-05-12T22:32:37Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-05-12T21:45:40.125+0000
Duration: 47 min 49 sec

Test stats 🧪

Test	Results
Failed	0
Passed	2551
Skipped	558
Total	3109

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

gizas · 2023-05-15T08:33:46Z

I have been testing with image chrismark/metricbeat-oom:v0.0.1-rc2 (that includes the fix) and can not reproduce the issue of Memory increase anymore.
Also warning for The sharedIndexInformer has started, run more than once is not allowed does not appear in the logs

Safely start k8s watchers to avoid memory leak

39d4dda

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>

ChrsMark added v7.17.0 7.17-candidate labels May 12, 2023

ChrsMark requested a review from gizas May 12, 2023 21:45

ChrsMark requested a review from a team as a code owner May 12, 2023 21:45

ChrsMark self-assigned this May 12, 2023

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 12, 2023

ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label May 13, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 13, 2023

ChrsMark requested review from a team and constanca-m and removed request for a team May 13, 2023 07:11

MichaelKatsoulis approved these changes May 15, 2023

View reviewed changes

gizas approved these changes May 15, 2023

View reviewed changes

ChrsMark merged commit 7899ad9 into elastic:7.17 May 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Safely start k8s watchers to avoid memory leak #35452

Safely start k8s watchers to avoid memory leak #35452

ChrsMark commented May 12, 2023 •

edited

Loading

elasticmachine commented May 12, 2023

Build stats

Test stats 🧪

gizas commented May 15, 2023

Safely start k8s watchers to avoid memory leak #35452

Safely start k8s watchers to avoid memory leak #35452

Conversation

ChrsMark commented May 12, 2023 • edited Loading

What does this PR do?

Why is it important?

Background

Troubleshooting/investigation

Tests

elasticmachine commented May 12, 2023

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

gizas commented May 15, 2023

ChrsMark commented May 12, 2023 •

edited

Loading