High CPU usage by the olm operator #2874
Comments
Hey @mykhailo-b, that repo (https://github.com/openshift/operator-framework-olm) is where OCP and OKD releases live. Today, there isn't a relationship between specific versioned releases of this repository and the downstream OpenShift version: the openshift branches are generally a snapshot in time plus a set of curated commits that are pulled onto a given release branch. The fix you referenced is quite old and actually predates the inception of that downstream openshift repository, so it's definitely included. Keep in mind that that repo is not a fork, so there isn't commit matching, but you can search the commit message history after the initial inception (which happened around the end of 2020) for specific commits if you are interested. All that being said, I think it's unlikely that the commit you referenced is related to the performance problem you're having in OKD 4.11. Could you give us more information about the specific CPU issue? Are you seeing it on the olm-operator? The catalog-operator? What's the topology of your cluster? Any specific data you have about the CPU profile would be helpful. |
Hi @kevinrizza Thank you for your quick reply. Our context is the following:
Could you please guide us on what is wrong? We are not sure it is safe to go ahead with such a workaround, and we had to switch off the cluster-version-operator because it replaces our custom changes with the original ones. BR, |
Hello there, I appreciate your patience on this matter. I confirmed that the latest version of OLM was spamming the API server with operator CR status updates. I then created a branch of OLM from master and reverted the commit introduced in #2697, and the revert resolved the issue. #2697 was created to address an issue where the operator CR status didn't capture all resources associated with an operator. The fix will need to address those needs without spamming the API server with status updates. |
Hi @awgreene, thank you for the investigation! BR, |
I hope to create a PR fixing this issue later this week. In a worst case scenario where a suitable fix cannot be found, I will consider reverting #2697 to at least resolve the unacceptable CPU usage. I've applied the Best, Alex |
Hey @EugeneMospan, I took a look and found that the operator CR includes a list of related components in its status. The list of components was ordered by GVK, but entries with the same GVK weren't ordered by namespace/name, potentially causing OLM to spam the server. The changes in #2880 should address the issue you've hit. I suspect that it will take a few days to move the API changes out of the vendored dir and into github.com/operator-framework/api, but feel free to test the changes with this image: |
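To illustrate the ordering problem described above, here is a minimal Go sketch, not the actual change in #2880. It assumes a simplified, hypothetical `RelatedRef` type in place of OLM's real status component type, and the helper names `sortRefs` and `needsUpdate` are invented for this example. The idea is that once entries are sorted by GVK and then by namespace/name, the same set of components always produces the same list, so a controller can skip the status write when nothing has really changed.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// RelatedRef is a hypothetical, simplified stand-in for an entry in the
// operator CR's status component list; it is not OLM's actual type.
type RelatedRef struct {
	Group, Version, Kind string
	Namespace, Name      string
}

// sortRefs orders refs by GVK first, then by namespace and name, so the
// same set of components always yields the same slice order.
func sortRefs(refs []RelatedRef) {
	sort.Slice(refs, func(i, j int) bool {
		a, b := refs[i], refs[j]
		if a.Group != b.Group {
			return a.Group < b.Group
		}
		if a.Version != b.Version {
			return a.Version < b.Version
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
}

// needsUpdate reports whether desired differs from current once both are
// put into the deterministic order. It sorts copies, not the inputs.
func needsUpdate(current, desired []RelatedRef) bool {
	c := append([]RelatedRef(nil), current...)
	d := append([]RelatedRef(nil), desired...)
	sortRefs(c)
	sortRefs(d)
	return !reflect.DeepEqual(c, d)
}

func main() {
	// Same two components, listed in a different order within one GVK.
	current := []RelatedRef{
		{"apps", "v1", "Deployment", "ns1", "b"},
		{"apps", "v1", "Deployment", "ns1", "a"},
	}
	desired := []RelatedRef{
		{"apps", "v1", "Deployment", "ns1", "a"},
		{"apps", "v1", "Deployment", "ns1", "b"},
	}
	fmt.Println(needsUpdate(current, desired)) // prints false
}
```

Without the namespace/name tie-break, two reconciles over the same components could produce differently ordered lists, and a plain equality check would treat each one as a change worth writing to the API server.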
Thank you @awgreene, we will try it and come back to you. |
@awgreene I've applied the fix to one cluster; at first glance it is no longer spamming requests to update the Operator status. If the issue comes back, I will let you know. BR, |
Thanks @EugeneMospan! |
Hello, I'm hitting this issue on OCP 4.11.13 as well. I confirm that @awgreene's OLM image fixes the high CPU load. |
Also confirming this image fixed high cpu consumption (OCP v4.11.20)
|
I noticed the same high OLM CPU usage on a 4.10.47 cluster. I restarted the pod using the image @awgreene provided above. It did not change CPU consumption: the pod continuously consumes between ~400 and ~800 mCPU on my 4.10 cluster. |
I notice the same behaviour on 4.11.0-0.okd-2023-01-14-152430; it's not present on 4.12.0-0.okd-2023-02-04-212953. |
Hello, I don't think allowing this ticket to act as a generic tracker for "OLM CPU Utilization is High" is the best path forward. #2880 fixed a specific issue causing OLM to spam the API server. If you still see OLM using high CPU utilization, please create a new ticket and capture the exact steps to reproduce. |
Greetings. We have run into high CPU usage by the olm operator in OpenShift 4.11:
https://github.com/okd-project/okd/releases/download/4.11.0-0.okd-2022-08-20-022919/release.txt
We examined the source code and images of the operator
( https://quay.io/repository/openshift/okd-content/manifest/sha256:6ad02f2e27937f4ec449718c27dbbb0870b55c910b21f4a22f202ce1cfb56d6f )
and found out that the operator is built from this repository https://github.com/openshift/operator-framework-olm
It seems that this repository uses an outdated version of operator-lifecycle-manager
https://github.com/openshift/operator-framework-olm/tree/master/staging/operator-lifecycle-manager
https://github.com/openshift/operator-framework-olm/tree/master/vendor/github.com/operator-framework/operator-lifecycle-manager
Although it is indicated that this is version 0.19.0
https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/OLM_VERSION
but in reality it's not.
We were especially interested in the absence of this fix
b85df58
Can you comment on our findings?