
High CPU usage by the olm operator #2874

Closed
mykhailo-b opened this issue Oct 18, 2022 · 14 comments
Assignees
Labels
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@mykhailo-b

Greetings. We have run into a problem with high CPU usage by the olm-operator in OpenShift 4.11

https://github.com/okd-project/okd/releases/download/4.11.0-0.okd-2022-08-20-022919/release.txt

We examined the source code and images of the operator
( https://quay.io/repository/openshift/okd-content/manifest/sha256:6ad02f2e27937f4ec449718c27dbbb0870b55c910b21f4a22f202ce1cfb56d6f )
and found out that the operator is built from this repository https://github.com/openshift/operator-framework-olm

It seems that this repository uses an outdated version of operator-lifecycle-manager:
https://github.com/openshift/operator-framework-olm/tree/master/staging/operator-lifecycle-manager
https://github.com/openshift/operator-framework-olm/tree/master/vendor/github.com/operator-framework/operator-lifecycle-manager

The OLM_VERSION file indicates that this is version 0.19.0
https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/OLM_VERSION
but in reality it does not appear to be.

We were especially interested in the absence of this fix
b85df58

Can you comment on our findings?

@kevinrizza
Member

Hey @mykhailo-b ,

That repo (https://github.com/openshift/operator-framework-olm) is where the OCP and OKD releases live. Today, there isn't a one-to-one relationship between specific upstream versioned OLM releases and a given downstream OpenShift version -- the openshift branches are generally a snapshot in time plus a set of curated commits that are pulled onto a given release branch.

The fix you referenced is quite old and actually predates the inception of that downstream openshift repository, so it's definitely included -- keep in mind that that repo is not a fork, so there isn't commit matching, but you can search the commit message history after the initial inception (which happened around the end of 2020) for specific commits if you are interested.

So, all that being said, I think it's unlikely that the commit you referenced is related to a performance problem you're having in OKD 4.11. Could you give us any more information about the specific CPU issue? Are you seeing it on the olm-operator? The catalog-operator? What's the topology of your cluster? Any specific data you have about the CPU profile would be helpful.

@EugeneMospan

Hi @kevinrizza

Thank you for your quick reply.
Let me step in, since I have been working together with @mykhailo-b on this issue.

Our context is the following:

  1. We are using OKD 4.11
  2. We have OpenShift Container Storage installed in the cluster
  3. We see that the olm-operator continuously consumes about 700 millicores of CPU and continuously updates the status of a resource of kind: Operator, named ocs-operator.openshift-storage on our side (one way to observe this is sketched right after this list)
  4. We figured this out by setting the olm-operator log level to debug. You can see the logs in the screenshot below
    (screenshot: olm-operator debug log output, MicrosoftTeams-image (26))
  5. Then we started looking into the code and found this line of code for version 0.19
  6. We built an image including this change and deployed it to our cluster
  7. After this, the olm-operator stopped continuously reconciling the kind: Operator ocs-operator.openshift-storage and, as a result, stopped consuming CPU
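For completeness, here is a minimal sketch of one way to observe the constant status updates from outside the operator. The watcher below is our own illustration (client-go's dynamic client against the operators.coreos.com/v1 operators resource), not part of OLM; while the problem is present it prints a steady stream of MODIFIED events for ocs-operator.openshift-storage:

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (adjust the path as needed).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The Operator CR is cluster-scoped: group operators.coreos.com, version v1.
	gvr := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1",
		Resource: "operators",
	}
	w, err := client.Resource(gvr).Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=ocs-operator.openshift-storage",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// A quiet Operator CR produces almost no events; the behaviour described
	// above shows up as a continuous stream of MODIFIED events with ever
	// increasing resourceVersions.
	for ev := range w.ResultChan() {
		if obj, ok := ev.Object.(*unstructured.Unstructured); ok {
			fmt.Printf("%s %s resourceVersion=%s\n", ev.Type, obj.GetName(), obj.GetResourceVersion())
		}
	}
}
```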

Could you please guide us on what is wrong? We are not sure it is safe to go ahead with such a workaround, and we had to switch off the cluster-version-operator because it replaces our custom changes with the original ones.

BR,
Eugene

@awgreene
Member

Hello there,

I appreciate your patience on this matter. I confirmed that the latest version of OLM was spamming the API server with Operator CR status updates. I then created a branch of OLM from master and reverted the commit introduced in #2697, which resolved the issue.

#2697 was created to address an issue where the Operator CR status didn't capture all of the resources associated with an operator. The fix will need to address those needs without spamming the API server with status updates.
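In practical terms, the reconciler needs a guard that skips the write whenever the recomputed status is equal to what is already on the cluster. Here is a minimal sketch of that idea (the operatorStatus type below is a hypothetical stand-in, not OLM's actual API, which lives under github.com/operator-framework/api):

```go
package main

import (
	"fmt"
	"reflect"
)

// operatorStatus is a hypothetical, simplified stand-in for the Operator CR
// status used only for this illustration.
type operatorStatus struct {
	Components []string
}

// needsStatusUpdate reports whether a write to the API server is required.
// Skipping the write when nothing changed is what keeps a frequently
// resynced reconciler from flooding the API server with no-op updates.
func needsStatusUpdate(current, desired operatorStatus) bool {
	return !reflect.DeepEqual(current, desired)
}

func main() {
	current := operatorStatus{Components: []string{"Deployment/ocs-operator"}}
	desired := operatorStatus{Components: []string{"Deployment/ocs-operator"}}
	// Prints false: identical statuses must not trigger another update.
	fmt.Println(needsStatusUpdate(current, desired))
}
```

The catch is that an equality check like this only helps when the recomputed status is deterministic; a component list that comes back in a different order still looks like a change, which is what the follow-up below describes.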

@EugeneMospan

Hi @awgreene ,

Thank you for the investigation!
Could you please guide us on when the fix will be introduced into OKD itself? At the moment, what we are doing to avoid the CPU consumption is not optimal ...

BR,
Eugene

@awgreene
Member

awgreene commented Oct 25, 2022

@EugeneMospan,

I hope to create a PR fixing this issue later this week. In a worst case scenario where a suitable fix cannot be found, I will consider reverting #2697 to at least resolve the unacceptable CPU usage.

I've applied the priority/critical-urgent label to convey the severity of this issue.

Best,

Alex

awgreene self-assigned this Oct 25, 2022
awgreene added the priority/critical-urgent label (Highest priority. Must be actively worked on as someone's top priority right now.) Oct 25, 2022
@awgreene
Member

awgreene commented Oct 27, 2022

Hey @EugeneMospan,

I took a look and found that the Operator CR includes a list of related components in its status. The list of components was ordered by GVK, but entries of the same GVK weren't ordered by namespace/name, potentially causing OLM to spam the API server. The changes in #2880 should address the issue you've hit.
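To illustrate the ordering point, here is a hypothetical sketch (the componentRef type and sort are not the actual diff in #2880): sorting the component references by GVK and then by namespace/name makes the serialized status deterministic, so an unchanged set of components no longer looks like a change on every reconcile.

```go
package main

import (
	"fmt"
	"sort"
)

// componentRef is a hypothetical stand-in for the component references kept
// in the Operator CR status.
type componentRef struct {
	Group, Kind, Namespace, Name string
}

// sortComponents orders references by GVK first and then by namespace/name.
// Without the secondary key, two reconciles can produce the same set of
// components in different orders, an equality check then sees a "change",
// and another status update is sent to the API server.
func sortComponents(refs []componentRef) {
	sort.Slice(refs, func(i, j int) bool {
		if refs[i].Group != refs[j].Group {
			return refs[i].Group < refs[j].Group
		}
		if refs[i].Kind != refs[j].Kind {
			return refs[i].Kind < refs[j].Kind
		}
		if refs[i].Namespace != refs[j].Namespace {
			return refs[i].Namespace < refs[j].Namespace
		}
		return refs[i].Name < refs[j].Name
	})
}

func main() {
	refs := []componentRef{
		{Group: "apps", Kind: "Deployment", Namespace: "openshift-storage", Name: "ocs-operator"},
		{Group: "apps", Kind: "Deployment", Namespace: "openshift-storage", Name: "noobaa-operator"},
	}
	sortComponents(refs)
	fmt.Println(refs) // noobaa-operator now reliably sorts before ocs-operator
}
```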

I suspect that it will take a few days to move the API changes out of the vendored dir and into github.com/operator-framework/api, but feel free to test the changes with this image: quay.io/agreene/olm:operator-api-spam

@EugeneMospan

Thank you @awgreene, we will try it and come back to you

@EugeneMospan

@awgreene I've applied the fix to one cluster; at first glance it is no longer spamming requests to update the Operator status. If the issue comes back, I will let you know

BR,
Eugene

@awgreene
Member

Thanks @EugeneMospan!

@beelzetron

Hello, I'm hitting this issue on OCP 4.11.13 as well. I confirm that @awgreene's OLM image fixes the high CPU load.

@kcalmond

kcalmond commented Jan 11, 2023

Also confirming this image fixed the high CPU consumption (OCP v4.11.20)

imageID: >-
  quay.io/agreene/olm@sha256:2a7a8754e1bbf3e96e27cbfd35aed8811e4d32338a751818f054ee213da1a95d
image: 'quay.io/agreene/olm:operator-api-spam'

@kcalmond

kcalmond commented Jan 22, 2023

I noticed the same high OLM CPU usage on a 4.10.47 cluster. I restarted the pod using the image @awgreene provided above, but it did not change the CPU consumption; it continuously consumes between ~400-800 mCPU on my 4.10 cluster.

@sfritze

sfritze commented Feb 8, 2023

I notice the same behaviour as @kcalmond above on 4.11.0-0.okd-2023-01-14-152430; it's not present on 4.12.0-0.okd-2023-02-04-212953.

@awgreene
Member

awgreene commented Feb 8, 2023

Hello, I don't think allowing this ticket to act as a generic tracker for "OLM CPU utilization is high" is the best path forward. #2880 fixed a specific issue causing OLM to spam the API server. If you still see high OLM CPU utilization, please create a new ticket and capture the exact steps to reproduce it.

awgreene closed this as completed Feb 8, 2023