
High CPU usage by the olm operator #2874

Closed
mykhailo-b opened this issue Oct 18, 2022 · 14 comments
Assignees
Labels
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@mykhailo-b

Greetings. We have run into a problem with high CPU usage by the olm-operator in OpenShift 4.11

https://github.com/okd-project/okd/releases/download/4.11.0-0.okd-2022-08-20-022919/release.txt

We examined the source code and images of the operator
( https://quay.io/repository/openshift/okd-content/manifest/sha256:6ad02f2e27937f4ec449718c27dbbb0870b55c910b21f4a22f202ce1cfb56d6f )
and found out that the operator is built from this repository https://github.com/openshift/operator-framework-olm

It seems that this repository uses an outdated version of operator-lifecycle-manager:
https://github.com/openshift/operator-framework-olm/tree/master/staging/operator-lifecycle-manager
https://github.com/openshift/operator-framework-olm/tree/master/vendor/github.com/operator-framework/operator-lifecycle-manager

The OLM_VERSION file indicates that this is version 0.19.0
https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/OLM_VERSION
but in reality it does not appear to be.

We were especially interested in the absence of this fix
b85df58

Can you comment on our findings?

@kevinrizza
Member

Hey @mykhailo-b ,

That repo (https://github.com/openshift/operator-framework-olm) is where the OCP and OKD releases live. Today, there isn't a one-to-one relationship between specific upstream versioned OLM releases and a given downstream OpenShift version -- the openshift branches are generally a snapshot in time plus a set of curated commits that are pulled onto a given release branch.

The fix you referenced is quite old and actually predates the inception of that downstream openshift repository, so it's definitely included -- keep in mind that that repo is not a fork, so there isn't commit matching, but you can search the commit message history after the initial inception (which happened around the end of 2020) for specific commits if you are interested.

So, all that being said, I think it's unlikely that the commit you referenced is related to a performance problem you're having in OKD 4.11. Could you give us any more information about the specific CPU issue? Are you seeing it on the olm-operator? The catalog-operator? What's the topology of your cluster? Any specific data you have about the CPU profile would be helpful.

@EugeneMospan

Hi @kevinrizza

Thank you for your quick reply.
Let me step in, since I have been working together with @mykhailo-b on this issue.

Our context is the following:

  1. We are using OKD 4.11
  2. We have OpenShift Container Storage installed in the cluster
  3. We see that the olm-operator continuously consumes about 700 millicores of CPU and continuously updates the status of a resource of kind: Operator, named ocs-operator.openshift-storage on our side (one way to observe this is sketched right after this list)
  4. We figured this out by setting the olm-operator log level to debug. You can see the logs in the screenshot below
    (screenshot: olm-operator debug log output, MicrosoftTeams-image (26))
  5. Then we started looking into the code and found this line of code for version 0.19
  6. We built an image including this change and deployed it to our cluster
  7. After this, the olm-operator stopped continuously reconciling the kind: Operator ocs-operator.openshift-storage and, as a result, stopped consuming CPU
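For completeness, here is a minimal sketch of one way to observe the constant status updates from outside the operator. The watcher below is our own illustration (client-go's dynamic client against the operators.coreos.com/v1 operators resource), not part of OLM; while the problem is present it prints a steady stream of MODIFIED events for ocs-operator.openshift-storage:

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (adjust the path as needed).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The Operator CR is cluster-scoped: group operators.coreos.com, version v1.
	gvr := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1",
		Resource: "operators",
	}
	w, err := client.Resource(gvr).Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=ocs-operator.openshift-storage",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// A quiet Operator CR produces almost no events; the behaviour described
	// above shows up as a continuous stream of MODIFIED events with ever
	// increasing resourceVersions.
	for ev := range w.ResultChan() {
		if obj, ok := ev.Object.(*unstructured.Unstructured); ok {
			fmt.Printf("%s %s resourceVersion=%s\n", ev.Type, obj.GetName(), obj.GetResourceVersion())
		}
	}
}
```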

Could you please guide us on what is wrong? We are not sure it is safe to go ahead with such a workaround, and we had to switch off the cluster-version-operator because it replaces our custom changes with the original ones.

BR,
Eugene

@awgreene
Member

Hello there,

I appreciate your patience on this matter. I confirmed that the latest version of OLM was spamming the API server with Operator CR status updates. I then created a branch of OLM from master and reverted the commit introduced in #2697, which resolved the issue.

#2697 was created to address an issue where the Operator CR status didn't capture all of the resources associated with an operator. The fix will need to address those needs without spamming the API server with status updates.
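In practical terms, the reconciler needs a guard that skips the write whenever the recomputed status is equal to what is already on the cluster. Here is a minimal sketch of that idea (the operatorStatus type below is a hypothetical stand-in, not OLM's actual API, which lives under github.com/operator-framework/api):

```go
package main

import (
	"fmt"
	"reflect"
)

// operatorStatus is a hypothetical, simplified stand-in for the Operator CR
// status used only for this illustration.
type operatorStatus struct {
	Components []string
}

// needsStatusUpdate reports whether a write to the API server is required.
// Skipping the write when nothing changed is what keeps a frequently
// resynced reconciler from flooding the API server with no-op updates.
func needsStatusUpdate(current, desired operatorStatus) bool {
	return !reflect.DeepEqual(current, desired)
}

func main() {
	current := operatorStatus{Components: []string{"Deployment/ocs-operator"}}
	desired := operatorStatus{Components: []string{"Deployment/ocs-operator"}}
	// Prints false: identical statuses must not trigger another update.
	fmt.Println(needsStatusUpdate(current, desired))
}
```

The catch is that an equality check like this only helps when the recomputed status is deterministic; a component list that comes back in a different order still looks like a change, which is what the follow-up below describes.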

@EugeneMospan

Hi @awgreene ,

Thank you for the investigation!
Could you please guide us on when the fix will be introduced into OKD itself? At the moment, what we are doing to avoid the CPU consumption is not optimal ...

BR,
Eugene

@awgreene
Member

awgreene commented Oct 25, 2022

@EugeneMospan,

I hope to create a PR fixing this issue later this week. In a worst case scenario where a suitable fix cannot be found, I will consider reverting #2697 to at least resolve the unacceptable CPU usage.

I've applied the priority/critical-urgent label to convey the severity of this issue.

Best,

Alex

awgreene self-assigned this Oct 25, 2022
awgreene added the priority/critical-urgent label (Highest priority. Must be actively worked on as someone's top priority right now.) Oct 25, 2022
@awgreene
Member

awgreene commented Oct 27, 2022

Hey @EugeneMospan,

I took a look and found that the Operator CR includes a list of related components in its status. The list of components was ordered by GVK, but entries of the same GVK weren't ordered by namespace/name, potentially causing OLM to spam the API server. The changes in #2880 should address the issue you've hit.
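To illustrate the ordering point, here is a hypothetical sketch (the componentRef type and sort are not the actual diff in #2880): sorting the component references by GVK and then by namespace/name makes the serialized status deterministic, so an unchanged set of components no longer looks like a change on every reconcile.

```go
package main

import (
	"fmt"
	"sort"
)

// componentRef is a hypothetical stand-in for the component references kept
// in the Operator CR status.
type componentRef struct {
	Group, Kind, Namespace, Name string
}

// sortComponents orders references by GVK first and then by namespace/name.
// Without the secondary key, two reconciles can produce the same set of
// components in different orders, an equality check then sees a "change",
// and another status update is sent to the API server.
func sortComponents(refs []componentRef) {
	sort.Slice(refs, func(i, j int) bool {
		if refs[i].Group != refs[j].Group {
			return refs[i].Group < refs[j].Group
		}
		if refs[i].Kind != refs[j].Kind {
			return refs[i].Kind < refs[j].Kind
		}
		if refs[i].Namespace != refs[j].Namespace {
			return refs[i].Namespace < refs[j].Namespace
		}
		return refs[i].Name < refs[j].Name
	})
}

func main() {
	refs := []componentRef{
		{Group: "apps", Kind: "Deployment", Namespace: "openshift-storage", Name: "ocs-operator"},
		{Group: "apps", Kind: "Deployment", Namespace: "openshift-storage", Name: "noobaa-operator"},
	}
	sortComponents(refs)
	fmt.Println(refs) // noobaa-operator now reliably sorts before ocs-operator
}
```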

I suspect that it will take a few days to move the API changes out of the vendored dir and into github.com/operator-framework/api, but feel free to test the changes with this image: quay.io/agreene/olm:operator-api-spam

@EugeneMospan

Thank you @awgreene, we will try it and come back to you

@EugeneMospan

@awgreene I've applied the fix to one cluster; at first glance it is no longer spamming requests to update the Operator status. If the issue comes back, I will let you know

BR,
Eugene

@awgreene
Member

Thanks @EugeneMospan!

@beelzetron

Hello, I'm hitting this issue on OCP 4.11.13 as well. I confirm that @awgreene's OLM image fixes the high CPU load.

@kcalmond

kcalmond commented Jan 11, 2023

Also confirming this image fixed the high CPU consumption (OCP v4.11.20)

imageID: >-
  quay.io/agreene/olm@sha256:2a7a8754e1bbf3e96e27cbfd35aed8811e4d32338a751818f054ee213da1a95d
image: 'quay.io/agreene/olm:operator-api-spam'

@kcalmond

kcalmond commented Jan 22, 2023

I noticed the same high OLM CPU usage on a 4.10.47 cluster. I restarted the pod using the image @awgreene provided above, but it did not change the CPU consumption; it continuously consumes between ~400-800 mCPU on my 4.10 cluster.

@sfritze

sfritze commented Feb 8, 2023

I notice the same behaviour as @kcalmond above on 4.11.0-0.okd-2023-01-14-152430; it's not present on 4.12.0-0.okd-2023-02-04-212953.

@awgreene
Member

awgreene commented Feb 8, 2023

Hello, I don't think allowing this ticket to act as a generic tracker for "OLM CPU utilization is high" is the best path forward. #2880 fixed a specific issue causing OLM to spam the API server. If you still see high OLM CPU utilization, please create a new ticket and capture the exact steps to reproduce it.

awgreene closed this as completed Feb 8, 2023