add_resource_metadata.cronjob overloads the memory usage #31

Open · ChrsMark opened this issue Oct 25, 2022 · 11 comments

Labels: bug · enhancement · Team:Cloudnative-Monitoring

Comments

@ChrsMark (Member) commented Oct 25, 2022

As reported in elastic/beats#33307, the Kubernetes autodiscovery provider can lead to OOM kills for Beats Pods in clusters with specific types of workloads, i.e. CronJobs.

The purpose of this issue is the following:

  1. Consider making add_resource_metadata.cronjob: false the default, since we know it's an "expensive" feature.
  2. Document the nature of this setting and its implications.
  3. Consider improving the way we retrieve the objects, or derive the "owner" name by trimming the suffix of the object's name, e.g. hello (the name of the CronJob) out of hello-1234 (the name of the Job); a rough sketch follows this list.
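
For point 3, this is a minimal, untested sketch of what the trimming could look like. The assumption that CronJob-created Jobs carry a purely numeric scheduling suffix is a heuristic, not an API guarantee:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cronJobNameFromJobName strips a trailing "-<digits>" segment from a Job
// name, which is how the CronJob controller typically names the Jobs it
// creates (e.g. "hello-1234" -> "hello").
func cronJobNameFromJobName(jobName string) string {
	idx := strings.LastIndex(jobName, "-")
	if idx <= 0 {
		return jobName
	}
	for _, r := range jobName[idx+1:] {
		if !unicode.IsDigit(r) {
			// Not a CronJob-style suffix; keep the full name.
			return jobName
		}
	}
	return jobName[:idx]
}

func main() {
	fmt.Println(cronJobNameFromJobName("hello-1234")) // prints "hello"
}
```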

For full context and previous analysis see the summary at elastic/beats#33307 (comment).

cc: @eedugon @gizas @jsoriano

@ChrsMark added the bug, enhancement and Team:Cloudnative-Monitoring labels on Oct 25, 2022
@jsoriano (Member) commented Oct 25, 2022

@ChrsMark instead of making a Get request each time we need the metadata of an owner, maybe we can keep an informer for each kind of owner we are interested in. This will probably help when we frequently request the same objects, as happens when we query the name of a CronJob for every Pod it has generated. It may raise the average memory usage, since it keeps all these objects in memory, but it may reduce the peak memory usage and GC pressure (apart from reducing network traffic).
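
As a rough, untested sketch of that idea with client-go's shared informer factory (the resync period, namespace and object names below are illustrative assumptions), owner lookups would then be served from a local cache instead of one Get request per Pod event:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer per owner kind we care about (Jobs here).
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	jobLister := factory.Batch().V1().Jobs().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// When a Pod event arrives, the owning Job is read from the local cache
	// instead of issuing a Get request to the API server.
	if job, err := jobLister.Jobs("default").Get("hello-1234"); err == nil {
		// job.OwnerReferences points at the CronJob, if there is one.
		fmt.Println(job.Name, job.OwnerReferences)
	}
}
```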

@ChrsMark (Member Author) commented

Thanks @jsoriano! Yes, that would also work. I wouldn't expect much of an increase, considering that we already have a similar watcher/informer for Pods. So if there are many Jobs, we will have roughly the same number of events for their respective Pods, and I would expect about a 2x factor in usage, which is far better than what we observe with the current implementation.

At the same time, I'm not sure it's worth having a complete watcher/informer/store implementation just to retrieve a single field 🤔. But we can of course evaluate this, since it seems promising.

@jsoriano (Member) commented

> At the same time, I'm not sure it's worth having a complete watcher/informer/store implementation just to retrieve a single field 🤔. But we can of course evaluate this, since it seems promising.

Maybe a bit hacky, but the list/watch cache that we use for informers allows implementing the List and Watch functions so that they return arbitrary runtime.Objects; we could maybe return almost empty objects that only contain the single field we want.
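
A rough, untested sketch of what that could look like, trimming Jobs down before they reach the informer's store (exactly which fields to drop is an assumption here; newer client-go versions may also offer metadata-only informers or a SetTransform hook as a less hacky route):

```go
package example

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// trimJob drops everything except the metadata needed to resolve owners, so
// the informer store keeps "almost empty" Job objects.
func trimJob(j *batchv1.Job) {
	j.Spec = batchv1.JobSpec{}
	j.Status = batchv1.JobStatus{}
	j.ManagedFields = nil
	j.Annotations = nil
}

// newTrimmedJobInformer builds a shared informer whose List and Watch
// functions strip Jobs before they are cached.
func newTrimmedJobInformer(client kubernetes.Interface) cache.SharedIndexInformer {
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			list, err := client.BatchV1().Jobs(metav1.NamespaceAll).List(context.TODO(), options)
			if err != nil {
				return nil, err
			}
			for i := range list.Items {
				trimJob(&list.Items[i])
			}
			return list, nil
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			w, err := client.BatchV1().Jobs(metav1.NamespaceAll).Watch(context.TODO(), options)
			if err != nil {
				return nil, err
			}
			return watch.Filter(w, func(e watch.Event) (watch.Event, bool) {
				if j, ok := e.Object.(*batchv1.Job); ok {
					trimJob(j)
				}
				return e, true
			}), nil
		},
	}
	return cache.NewSharedIndexInformer(lw, &batchv1.Job{}, 10*time.Minute, cache.Indexers{})
}
```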

@eedugon (Contributor) commented Oct 26, 2022

@jsoriano, @ChrsMark: I feel I'm missing something here... could anyone explain in what way the problem we are facing is NOT a memory leak (because that's how it's perceived)?

I'm not a fan of actions like:

  • Set add_resource_metadata.cronjob: false by default because it's problematic --> that just hides the problem, it doesn't solve it.

  • Document the nature of this setting and its implications --> What's the implication? That our beat explodes and eats as much memory as the host can give?

I'm probably missing the details of what exactly is going on; I'd like to understand why a beat with 200 simple CronJobs that run every minute keeps eating memory until it takes the whole 8 GB of RAM that the host has (or reaches its limit).

Is there a technical explanation for considering this a huge memory need rather than a leak in the implementation?

In my example, 200 CronJobs is a static number. They will certainly create a lot of new Pods and Jobs per minute, but those Pods and Jobs will eventually be deleted (otherwise the error would be on the Kubernetes side), so the total number of existing objects should be stable, if I'm not mistaken. Hence I don't understand why the beat's memory needs increase over time when it's doing nothing more than checking what's running in the system via the autodiscover implementation.

Is it possible to get a clear explanation of the issue and why exactly the memory grows and grows?

@ChrsMark (Member Author) commented

@eedugon according to our observations, the problem lies inside the k8s library implementation, so we are not 100% sure what goes wrong here. From my observations using https://github.com/ChrsMark/k8sdiscovery, I see that after some point the memory usage stabilizes, although at a high level. I guess that in Beats this stable level is quite high, since we fetch the metadata from various places. So I would say it's not a leak unless we prove that it is :).

In any case, querying and fetching the Job objects from the k8s API every time a related Pod is added/updated/deleted is not a good idea.
I would be fine with just removing the feature, since we are only talking about a single field, cronjob.name, set when a Pod belongs to a Job.
@jsoriano's idea to use an informer instead would work better though, so if we actually want the feature and want to ensure its performance under any circumstances, we can go that way. Hope that helps.
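
To make the data flow concrete, here is an untested sketch of the lookup path when a Pod event arrives: Pod → owning Job (read from a local lister, as in the informer sketch above, not via a Get call) → owning CronJob name. Function and parameter names are illustrative:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	batchlisters "k8s.io/client-go/listers/batch/v1"
)

// cronJobNameForPod returns the CronJob name for Pods created by a CronJob,
// or "" if the Pod has no Job/CronJob ancestry.
func cronJobNameForPod(pod *corev1.Pod, jobs batchlisters.JobLister) string {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "Job" {
			continue
		}
		job, err := jobs.Jobs(pod.Namespace).Get(ref.Name)
		if err != nil {
			return ""
		}
		for _, jobRef := range job.OwnerReferences {
			if jobRef.Kind == "CronJob" {
				return jobRef.Name
			}
		}
	}
	return ""
}
```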

@eedugon (Contributor) commented Oct 28, 2022

@ChrsMark , thanks for the explanation, it really helps!

Losing the CronJob name on pods related to cron jobs won't be nice, because the CronJob name will be the only thing known to the users (Job names and Pod names are dynamic). IMHO we should try to avoid losing the CronJob name just because the library blows up memory usage in a way that we don't clearly understand.

Based on your description, the issue could lie in the k8s library. In that case it might make sense to raise an issue there with some proof of concept (maybe with your k8sdiscovery repo) to see whether they can determine if they have a memory leak, a tremendous and unneeded memory usage that could be improved, or whether the issue is somehow in the k8s API returning an unexpected amount of data in certain calls.

I will do some extra tests to see if I get the same behavior as you ("after some point the memory usage stabilizes, although at a high level").

Update: a simple GKE cluster with 100 CronJobs that create pods every minute makes 16 GB nodes crash due to OOM. I will now test with 32 GB nodes, but at the moment I don't see the stabilization you mention, and in my opinion the amount of metadata in the cluster should be stable, as pods are being created and deleted every minute.

With a 32 GB node, after 4 days of running, the beat's memory is still growing. @ChrsMark, I don't see high but stable utilization here; I see continuous growth, and even in the hypothetical case of the memory usage stabilizing, this amount of memory could be considered a leak in my opinion:

[screenshot: beat memory usage growing continuously over 4 days]

@ChrsMark (Member Author) commented Jan 3, 2023

We have another report in a Discuss thread: https://discuss.elastic.co/t/filebeat-memory-leak-via-filebeat-autodiscover-and-200-000-goroutines/.

@gizas I think it's about time to prioritize this :)

@ChrsMark (Member Author) commented

FYI @eedugon I'm working on a possible fix for this issue:

I'm testing by deploying on a 3-node GKE cluster with the following CronJob-related resources:

~ k get job -A | wc -l
966
~ k get cronjob -A | wc -l
201
~ k get pod -A | wc -l
771

The image I have built from the source PRs is chrismark/metricbeat-watchers:v0.0.2-rc4 (branched from the current main/8.9.0).

Below are some screenshots that indicate positive results:

[screenshots: mettricbeat-working-watchers-oom-3h, metricbeat-fixed-discover]

@eedugon do you maybe have an environment where you could also test it, so as to double-check that I'm actually solving the issue?

cc: @gizas @tommyers-elastic

@gizas (Contributor) commented Oct 3, 2023

I'm reviving the discussion here about the option to set add_resource_metadata.cronjob: false and add_resource_metadata.deployment: false by default (see code here).

On one hand, the enhancement with watchers has been part of 8.9.0 via elastic/elastic-agent#2711.
On the other hand, we have issues like https://github.com/elastic/sdh-beats/issues/3863. This shows that while we fix the OOMs caused by calls to the API, we increase the memory usage of the agent. Because this trade-off is difficult, I suggest again disabling the metadata enrichment for deployments and cronjobs by default.

The enrichment only adds the deployment name and cronjob name, so I think it is not a big loss to have those disabled. On the contrary, the addition of only two fields can possibly cause many problems at big scale.

Additionally, the extra names (the deployment for a replicaset and the cronjob/job for a pod) can be added by removing the suffixes (programmatically in code or with an ingest pipeline); see the sketch after the example below.

eg.

❯ k get jobs
NAME   COMPLETIONS   DURATION   AGE
pi     0/1           42s        42s
❯ k get pods -A
default              pi-hlc8c                                     0/1     Completed          0              52s

❯ k describe pod pi-hlc8c
Name:             pi-hlc8c
...
Controlled By:  Job/pi

So for Jobs: pod.name = <job.name>-<suffix> (pi-hlc8c → Job pi)

❯ k get deployments.apps -n kube-system
coredns   2/2     2            2           19d
❯ k get replicasets.apps -n kube-system
coredns-5d78c9869d   2         2         2       19d

❯ k describe replicasets.apps -n kube-system coredns-5d78c9869d
Name:           coredns-5d78c9869d
...
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/coredns

So for ReplicaSets: replicaset.name = <deployment.name>-<suffix> (coredns-5d78c9869d → Deployment coredns)
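
A minimal, untested sketch of that suffix trimming in code, assuming we simply strip the last dash-separated segment of the child's name (a heuristic, not an API guarantee):

```go
package example

import "strings"

// ownerNameFromChildName drops the last "-"-separated segment of a child
// object's name to recover the parent's name, e.g. "coredns-5d78c9869d" ->
// "coredns" for a ReplicaSet, or "pi-hlc8c" -> "pi" for a Job-owned Pod.
func ownerNameFromChildName(name string) string {
	if idx := strings.LastIndex(name, "-"); idx > 0 {
		return name[:idx]
	}
	return name
}
```

The same trimming could equally be done at ingest time, for example with a gsub or grok processor in an ingest pipeline.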

cc @tommyers-elastic

@ChrsMark (Member Author) commented Oct 3, 2023

> The enrichment only adds the deployment name and cronjob name, so I think it is not a big loss to have those disabled. On the contrary, the addition of only two fields can possibly cause many problems at big scale.

The trade-off between the added value of these 2 fields and the performance overhead we pay to retrieve them makes me think that we should:

  1. disable those by default
  2. additionally provide an ingest pipeline that could optionally extract the values from the existing replicaset/job names.

It's a fact that changing the default setting would be a breaking change, but in this case it is for a good reason, after we have investigated all possible technical solutions. Indeed, querying the API for just 2 fields is overkill to my mind, and it should be disabled by default.

@gizas (Contributor) commented Oct 3, 2023

Some more findings from looking at kube-state-metrics:

For pod:

curl http://kube-state-metrics:8080/metrics | grep -i kube_pod_owner

kube_pod_owner{namespace="default",pod="pi-hlc8c",uid="98a4eb90-e1e5-417e-8b9f-afcb22229009",owner_kind="Job",owner_name="pi",owner_is_controller="true"} 1

For replicaset:

curl http://kube-state-metrics:8080/metrics | grep -i kube_replicaset_owner
 
kube_replicaset_owner{namespace="kube-system",replicaset="coredns-5d78c9869d",owner_kind="Deployment",owner_name="coredns",owner_is_controller="true"} 1

So I can even open a new enhancement request to solve both.
Both kube_replicaset_owner and kube_pod_owner are STABLE metrics according to https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md

gizas added a commit that referenced this issue on Oct 11, 2023:
Disabling AddResourceMetadataConfig.Deployment: false and AddResourceMetadataConfig.Cronjob: false as per discussion here #31 (comment)