add_resource_metadata.cronjob overloads the memory usage #31

Open · ChrsMark opened this issue Oct 25, 2022 · 11 comments

Labels: bug · enhancement · Team:Cloudnative-Monitoring

Comments

@ChrsMark (Member) commented Oct 25, 2022

As reported in elastic/beats#33307, the Kubernetes autodiscovery provider can lead to OOM kills for Beats Pods in clusters with specific types of workloads, i.e. CronJobs.

The purpose of this issue is the following:

  1. Consider making add_resource_metadata.cronjob: false the default, since we know it's an "expensive" feature.
  2. Document the nature of this setting and its implications.
  3. Consider improving the way we retrieve the objects, or derive the "owner" name by trimming the suffix of the object's name, e.g. hello (the name of the CronJob) out of hello-1234 (the name of the Job); a rough sketch follows this list.
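
For point 3, this is a minimal, untested sketch of what the trimming could look like. The assumption that CronJob-created Jobs carry a purely numeric scheduling suffix is a heuristic, not an API guarantee:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cronJobNameFromJobName strips a trailing "-<digits>" segment from a Job
// name, which is how the CronJob controller typically names the Jobs it
// creates (e.g. "hello-1234" -> "hello").
func cronJobNameFromJobName(jobName string) string {
	idx := strings.LastIndex(jobName, "-")
	if idx <= 0 {
		return jobName
	}
	for _, r := range jobName[idx+1:] {
		if !unicode.IsDigit(r) {
			// Not a CronJob-style suffix; keep the full name.
			return jobName
		}
	}
	return jobName[:idx]
}

func main() {
	fmt.Println(cronJobNameFromJobName("hello-1234")) // prints "hello"
}
```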

For full context and previous analysis see the summary at elastic/beats#33307 (comment).

cc: @eedugon @gizas @jsoriano

@ChrsMark added the bug, enhancement and Team:Cloudnative-Monitoring labels on Oct 25, 2022
@jsoriano (Member) commented Oct 25, 2022

@ChrsMark instead of making a Get request each time we need the metadata of an owner, maybe we can keep an informer for each kind of owner we are interested in. This will probably help when we frequently request the same objects, as happens when we query the name of a CronJob for every Pod it has generated. It may raise the average memory usage, since it keeps all these objects in memory, but it may reduce the peak memory usage and GC pressure (apart from reducing network traffic).
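
As a rough, untested sketch of that idea with client-go's shared informer factory (the resync period, namespace and object names below are illustrative assumptions), owner lookups would then be served from a local cache instead of one Get request per Pod event:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer per owner kind we care about (Jobs here).
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	jobLister := factory.Batch().V1().Jobs().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// When a Pod event arrives, the owning Job is read from the local cache
	// instead of issuing a Get request to the API server.
	if job, err := jobLister.Jobs("default").Get("hello-1234"); err == nil {
		// job.OwnerReferences points at the CronJob, if there is one.
		fmt.Println(job.Name, job.OwnerReferences)
	}
}
```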

@ChrsMark (Member Author) commented

Thanks @jsoriano! Yes, that would also work. I wouldn't expect much of an increase, considering that we already have a similar watcher/informer for Pods. So if there are many Jobs, we will have roughly the same number of events for their respective Pods, and I would expect about a 2x factor in usage, which is far better than what we observe with the current implementation.

At the same time, I'm not sure it's worth having a complete watcher/informer/store implementation just to retrieve a single field 🤔. But we can of course evaluate this, since it seems promising.

@jsoriano (Member) commented

> At the same time, I'm not sure it's worth having a complete watcher/informer/store implementation just to retrieve a single field 🤔. But we can of course evaluate this, since it seems promising.

Maybe a bit hacky, but the list/watch cache that we use for informers allows implementing the List and Watch functions so that they return arbitrary runtime.Objects; we could maybe return almost empty objects that only contain the single field we want.
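
A rough, untested sketch of what that could look like, trimming Jobs down before they reach the informer's store (exactly which fields to drop is an assumption here; newer client-go versions may also offer metadata-only informers or a SetTransform hook as a less hacky route):

```go
package example

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// trimJob drops everything except the metadata needed to resolve owners, so
// the informer store keeps "almost empty" Job objects.
func trimJob(j *batchv1.Job) {
	j.Spec = batchv1.JobSpec{}
	j.Status = batchv1.JobStatus{}
	j.ManagedFields = nil
	j.Annotations = nil
}

// newTrimmedJobInformer builds a shared informer whose List and Watch
// functions strip Jobs before they are cached.
func newTrimmedJobInformer(client kubernetes.Interface) cache.SharedIndexInformer {
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			list, err := client.BatchV1().Jobs(metav1.NamespaceAll).List(context.TODO(), options)
			if err != nil {
				return nil, err
			}
			for i := range list.Items {
				trimJob(&list.Items[i])
			}
			return list, nil
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			w, err := client.BatchV1().Jobs(metav1.NamespaceAll).Watch(context.TODO(), options)
			if err != nil {
				return nil, err
			}
			return watch.Filter(w, func(e watch.Event) (watch.Event, bool) {
				if j, ok := e.Object.(*batchv1.Job); ok {
					trimJob(j)
				}
				return e, true
			}), nil
		},
	}
	return cache.NewSharedIndexInformer(lw, &batchv1.Job{}, 10*time.Minute, cache.Indexers{})
}
```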

@eedugon (Contributor) commented Oct 26, 2022

@jsoriano, @ChrsMark: I feel I'm missing something here... could anyone explain in what way the problem we are facing is NOT a memory leak (because that's how it's perceived)?

I'm not a fan of actions like:

  • Set add_resource_metadata.cronjob: false by default because it's problematic --> that just hides the problem, it doesn't solve it.

  • Document the nature of this setting and its implications --> What's the implication? That our beat explodes and eats as much memory as the host can give?

I'm probably missing the details of what exactly is going on; I'd like to understand why a beat with 200 simple CronJobs that run every minute keeps eating memory until it takes the whole 8 GB of RAM that the host has (or reaches its limit).

Is there a technical explanation for considering this a huge memory need rather than a leak in the implementation?

In my example, 200 CronJobs is a static number. They will certainly create a lot of new Pods and Jobs per minute, but those Pods and Jobs will eventually be deleted (otherwise the error would be on the Kubernetes side), so the total number of existing objects should be stable, if I'm not mistaken. Hence I don't understand why the beat's memory needs increase over time when it's doing nothing more than checking what's running in the system via the autodiscover implementation.

Is it possible to get a clear explanation of the issue and why exactly the memory grows and grows?

@ChrsMark (Member Author) commented

@eedugon according to our observations, the problem lies inside the k8s library implementation, so we are not 100% sure what goes wrong here. From my observations using https://github.com/ChrsMark/k8sdiscovery, I see that after some point the memory usage stabilizes, although at a high level. I guess that in Beats this stable level is quite high, since we fetch the metadata from various places. So I would say it's not a leak unless we prove that it is :).

In any case, querying and fetching the Job objects from the k8s API every time a related Pod is added/updated/deleted is not a good idea.
I would be fine with just removing the feature, since we are only talking about a single field, cronjob.name, set when a Pod belongs to a Job.
@jsoriano's idea to use an informer instead would work better though, so if we actually want the feature and want to ensure its performance under any circumstances, we can go that way. Hope that helps.
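
To make the data flow concrete, here is an untested sketch of the lookup path when a Pod event arrives: Pod → owning Job (read from a local lister, as in the informer sketch above, not via a Get call) → owning CronJob name. Function and parameter names are illustrative:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	batchlisters "k8s.io/client-go/listers/batch/v1"
)

// cronJobNameForPod returns the CronJob name for Pods created by a CronJob,
// or "" if the Pod has no Job/CronJob ancestry.
func cronJobNameForPod(pod *corev1.Pod, jobs batchlisters.JobLister) string {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "Job" {
			continue
		}
		job, err := jobs.Jobs(pod.Namespace).Get(ref.Name)
		if err != nil {
			return ""
		}
		for _, jobRef := range job.OwnerReferences {
			if jobRef.Kind == "CronJob" {
				return jobRef.Name
			}
		}
	}
	return ""
}
```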

@eedugon (Contributor) commented Oct 28, 2022

@ChrsMark , thanks for the explanation, it really helps!

Losing the CronJob name on pods related to cron jobs won't be nice, because the CronJob name will be the only thing known to the users (Job names and Pod names are dynamic). IMHO we should try to avoid losing the CronJob name just because the library blows up memory usage in a way that we don't clearly understand.

Based on your description, the issue could lie in the k8s library. In that case it might make sense to raise an issue there with some proof of concept (maybe with your k8sdiscovery repo) to see whether they can determine if they have a memory leak, a tremendous and unneeded memory usage that could be improved, or whether the issue is somehow in the k8s API returning an unexpected amount of data in certain calls.

I will do some extra tests to see if I get the same behavior as you ("after some point the memory usage stabilizes, although at a high level").

Update: a simple GKE cluster with 100 CronJobs that create pods every minute makes 16 GB nodes crash due to OOM. I will now test with 32 GB nodes, but at the moment I don't see the stabilization you mention, and in my opinion the amount of metadata in the cluster should be stable, as pods are being created and deleted every minute.

With a 32 GB node, after 4 days of running, the beat's memory is still growing. @ChrsMark, I don't see high but stable utilization here; I see continuous growth, and even in the hypothetical case of the memory usage stabilizing, this amount of memory could be considered a leak in my opinion:

[screenshot: beat memory usage growing continuously over 4 days]

@ChrsMark (Member Author) commented Jan 3, 2023

We have another report in a Discuss thread: https://discuss.elastic.co/t/filebeat-memory-leak-via-filebeat-autodiscover-and-200-000-goroutines/.

@gizas I think it's about time to prioritize this :)

@ChrsMark (Member Author) commented

FYI @eedugon I'm working on a possible fix for this issue:

I'm testing by deploying on a 3-node GKE cluster with the following CronJob-related resources:

~ k get job -A | wc -l
966
~ k get cronjob -A | wc -l
201
~ k get pod -A | wc -l
771

The image I have built from the source PRs is chrismark/metricbeat-watchers:v0.0.2-rc4 (branched from the current main/8.9.0).

Below are some screenshots that indicate positive results:

[screenshots: mettricbeat-working-watchers-oom-3h, metricbeat-fixed-discover]

@eedugon do you maybe have an environment where you could also test it, so as to double-check that I'm actually solving the issue?

cc: @gizas @tommyers-elastic

@gizas (Contributor) commented Oct 3, 2023

I'm reviving the discussion here about the option to set add_resource_metadata.cronjob: false and add_resource_metadata.deployment: false by default (see code here).

On one hand, the enhancement with watchers has been part of 8.9.0 via elastic/elastic-agent#2711.
On the other hand, we have issues like https://github.com/elastic/sdh-beats/issues/3863. This shows that while we fix the OOMs caused by calls to the API, we increase the memory usage of the agent. Because this trade-off is difficult, I suggest again disabling the metadata enrichment for deployments and cronjobs by default.

The enrichment only adds the deployment name and cronjob name, so I think it is not a big loss to have those disabled. On the contrary, the addition of only two fields can possibly cause many problems at big scale.

Additionally, the extra names (the deployment for a replicaset and the cronjob/job for a pod) can be added by removing the suffixes (programmatically in code or with an ingest pipeline); see the sketch after the example below.

eg.

❯ k get jobs
NAME   COMPLETIONS   DURATION   AGE
pi     0/1           42s        42s
❯ k get pods -A
default              pi-hlc8c                                     0/1     Completed          0              52s

❯ k describe pod pi-hlc8c
Name:             pi-hlc8c
...
Controlled By:  Job/pi

So for Jobs: pod.name = <job.name>-<suffix> (pi-hlc8c → Job pi)

❯ k get deployments.apps -n kube-system
coredns   2/2     2            2           19d
❯ k get replicasets.apps -n kube-system
coredns-5d78c9869d   2         2         2       19d

❯ k describe replicasets.apps -n kube-system coredns-5d78c9869d
Name:           coredns-5d78c9869d
...
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/coredns

So for ReplicaSets: replicaset.name = <deployment.name>-<suffix> (coredns-5d78c9869d → Deployment coredns)
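
A minimal, untested sketch of that suffix trimming in code, assuming we simply strip the last dash-separated segment of the child's name (a heuristic, not an API guarantee):

```go
package example

import "strings"

// ownerNameFromChildName drops the last "-"-separated segment of a child
// object's name to recover the parent's name, e.g. "coredns-5d78c9869d" ->
// "coredns" for a ReplicaSet, or "pi-hlc8c" -> "pi" for a Job-owned Pod.
func ownerNameFromChildName(name string) string {
	if idx := strings.LastIndex(name, "-"); idx > 0 {
		return name[:idx]
	}
	return name
}
```

The same trimming could equally be done at ingest time, for example with a gsub or grok processor in an ingest pipeline.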

cc @tommyers-elastic

@ChrsMark (Member Author) commented Oct 3, 2023

> The enrichment only adds the deployment name and cronjob name, so I think it is not a big loss to have those disabled. On the contrary, the addition of only two fields can possibly cause many problems at big scale.

The trade-off between the added value of these 2 fields and the performance overhead we pay to retrieve them makes me think that we should:

  1. disable those by default
  2. additionally provide an ingest pipeline that could optionally extract the values from the existing replicaset/job names.

It's a fact that changing the default setting would be a breaking change, but in this case it is for a good reason, after we have investigated all possible technical solutions. Indeed, querying the API for just 2 fields is overkill to my mind, and it should be disabled by default.

@gizas (Contributor) commented Oct 3, 2023

Some more findings from looking at kube-state-metrics:

For pod:

curl http://kube-state-metrics:8080/metrics | grep -i kube_pod_owner

kube_pod_owner{namespace="default",pod="pi-hlc8c",uid="98a4eb90-e1e5-417e-8b9f-afcb22229009",owner_kind="Job",owner_name="pi",owner_is_controller="true"} 1

For replicaset:

curl http://kube-state-metrics:8080/metrics | grep -i kube_replicaset_owner
 
kube_replicaset_owner{namespace="kube-system",replicaset="coredns-5d78c9869d",owner_kind="Deployment",owner_name="coredns",owner_is_controller="true"} 1

So I can even open a new enhancement request to solve both.
Both kube_replicaset_owner and kube_pod_owner are STABLE metrics according to https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md

gizas added a commit that referenced this issue on Oct 11, 2023:
Disabling AddResourceMetadataConfig.Deployment: false and AddResourceMetadataConfig.Cronjob: false as per discussion here #31 (comment)