Safely start k8s watchers to avoid memory leak #35452
Merged
What does this PR do?
Fixes #35439
Why is it important?
Background
In #35439 it was reported that Metricbeat is facing a severe memory leak that leads to repeated restarts, as the Metricbeat Pod keeps getting OOMKilled. This happens even in very basic clusters with just the `kubernetes` module enabled and only a few metricsets. At first we tried to relate it to #33307 and elastic/elastic-agent-autodiscover#31, but it did not look like the same issue, since the known "workaround" (`add_resource_metadata.deployment: false`) had no effect. At the same time it was reported that setting `add_metadata: false` resolved the issue, but at the cost of losing all the k8s metadata for the `kubernetes` module.
Troubleshooting/investigation
After upgrading the version of `k8s.io/client-go` in 7.x to `v0.23.4` we started getting
`W0512 20:42:30.555872 7 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed`
which made us realize that the watchers/informers are being started every time `Start()` of the `Enricher` is called (`beats/metricbeat/module/kubernetes/util/kubernetes.go`, line 380 at 673233d), which in practice means on every `Fetch()` call of the metricsets, for example in `beats/metricbeat/module/kubernetes/pod/pod.go`, line 90 at 3fdb372.
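To illustrate the mechanism, here is a minimal, hypothetical Go sketch of the pattern described above (the `enricher`/`Fetch` names are illustrative, not the actual beats code): `Start()` unconditionally launches a new watch goroutine, and since every `Fetch()` calls `Start()`, watchers and their cached metadata accumulate on every collection period.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type enricher struct {
	stop chan struct{}
}

// Start launches a watcher every time it is called -- nothing prevents
// it from running more than once.
func (e *enricher) Start() {
	go func() {
		for {
			select {
			case <-e.stop:
				return
			case <-time.After(time.Second):
				// pretend to receive watch events and cache metadata
			}
		}
	}()
}

// Fetch mimics a metricset Fetch() that starts the enricher on every call.
func Fetch(e *enricher) {
	e.Start()
	// ... collect metrics and enrich them with the cached metadata ...
}

func main() {
	e := &enricher{stop: make(chan struct{})}
	for i := 0; i < 5; i++ {
		Fetch(e) // one new watcher goroutine per Fetch call
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // grows with the number of Fetch calls
}
```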
This led us to conclude that the problematic lines are at https://github.com/elastic/beats/pull/29133/files#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR361-R372, which in `8.x` were fixed with 2f79c77#diff-1d7b17f1ef239e6177db5ee660f2234bfd69c82e93494d4e46ee97f8a8617cfdR386, but the backport #33165 happened after that fix, so the fix was never included in `7.x`.
Long story short: the original PR introduced an issue which was fixed in `main`, but then only the original PR was backported to 7.x without the fix. This left us with a `7.x` that still contains the introduced bug.
Indeed, the respective protection in the k8s client library was only introduced in v1.23, as it looks from kubernetes/client-go@f0bcda0. So @gizas, that's why we don't notice the memory leak when we upgrade the library to v1.23: that version protects us from accidentally starting the informers multiple times.
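For reference, a minimal sketch of the caller-side guard that this kind of fix boils down to (illustrative names, not the actual patch): the watcher start is made idempotent, so calling `Start()` on every `Fetch()` is harmless regardless of whether the underlying client-go version refuses to run an informer twice.

```go
package util

import "sync"

// safeEnricher guards the watcher start behind a mutex and a started flag, so
// calling Start() from every Fetch() is harmless: only the first call actually
// starts the watcher, and Stop() allows it to be started again later.
type safeEnricher struct {
	mu      sync.Mutex
	started bool
	stop    chan struct{}
}

func (e *safeEnricher) Start() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.started {
		return // already running, do not start a second watcher/informer
	}
	e.stop = make(chan struct{})
	e.started = true
	go e.watch(e.stop) // the real watcher/informer would be started here
}

func (e *safeEnricher) Stop() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.started {
		return
	}
	close(e.stop)
	e.started = false
}

func (e *safeEnricher) watch(stop <-chan struct{}) {
	<-stop // placeholder for the actual watch/enrich loop
}
```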
Tests
Using metricbeat-kubernetes.yaml.txt with the following images:
- docker.elastic.co/beats/metricbeat:7.17.9
- chrismark/metricbeat-oom:v0.0.1-rc1
- chrismark/metricbeat-oom:v0.0.2