Reduce CPU utilization #304

Open
Tracked by #724
amshuman-kr opened this issue Aug 7, 2019 · 11 comments
Labels
area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/4 Priority (lower number equals higher priority) size/m Size of pull request is medium (see gardener-robot robot/bots/size.py)

Comments

@amshuman-kr

Issue

During performance tests, MCM's CPU utilisation was comparable to that of kube-apiserver and even higher than that of etcd. This is surprising, as MCM handles at least two orders of magnitude fewer resources than the other control-plane components.

Reducing the CPU utilisation will help improve the scalability of MCM as well as gardener.

Solution

Profile and optimize the CPU utilization of MCM.
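As a starting point for the profiling step, here is a minimal sketch of capturing a CPU profile with Go's standard `runtime/pprof` package. The helper name `profileCPU` and the busy loop are hypothetical and only for illustration; this is not how MCM itself is instrumented.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// profileCPU is a hypothetical helper: it records a CPU profile while work
// runs and returns the raw profile bytes.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	// StopCPUProfile flushes all pending profile data into buf.
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() {
		// Placeholder workload standing in for controller logic.
		sum := 0
		for i := 0; i < 50_000_000; i++ {
			sum += i % 7
		}
		_ = sum
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile data\n", len(data))
}
```

If the bytes are written to a file instead of an in-memory buffer, the result can be inspected with `go tool pprof`.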

@PadmaB PadmaB added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/scalability labels Aug 7, 2019
@prashanth26
Contributor

I think most of the CPU utilization is due to the reconciliation of machine objects. They get reconciled on every node object update, and a node object is updated every 10-30s because of condition checks such as memory, disk, network, and readiness. I think we could reduce this reconciliation interval either in MCM or at the node level. We can take a call while fixing this issue.

@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 7, 2019
@amshuman-kr
Author

I think we should also look at the number of goroutines and retry logic. But let's profile first.

@majst01

majst01 commented Dec 2, 2019

Hi,
we see constantly increasing CPU usage of MCM in our environment, with only 3 nodes in the cluster.
pprof shows a vast number of parked goroutines:

runtime.gopark
/usr/local/go/src/runtime/proc.go

  Total:       29300      29300 (flat, cum)   100%
runtime.selectgo
/usr/local/go/src/runtime/select.go

  Total:           2      29113 (flat, cum) 99.35%

But I must admit that I have no idea how this could happen, to be honest.
Any ideas?

@amshuman-kr
Author

@majst01 #341 addressed the constant increase in CPU usage. Does it not work for you?

We have kept the current issue open to track the optimisation of the baseline CPU usage.

@majst01

majst01 commented Dec 2, 2019

We already have #341. I will try the most recent version as well and report back.

@amshuman-kr
Author

> We already have #341

Thanks. We saw improvement with #341. But this is good information for us. We will also check from our end.

@majst01

majst01 commented Dec 2, 2019

I doubt that the upstream changes since #341 change any behavior here. I will also look for unclosed channels et al.

@amshuman-kr
Author

> I doubt that the changes upstream since #341 change any behavior here.

Yes. If you already have #341, there are no further relevant changes that might help.

@majst01

majst01 commented Dec 2, 2019

I tried to run https://github.com/golangci/golangci-lint on the code base, but it failed with:

WARN [runner] Can't run linter goanalysis_metalinter: assign: failed prerequisites: inspect@github.com/gardener/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion 
WARN [runner] Can't run linter unused: buildssa: analysis skipped: errors in package: [/home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine_client.go:10:6: MachineInterface redeclared in this block /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:21:6:       other declaration of MachineInterface /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:112:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:97:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:85:20: machine.Machine undefined (type *machine.Machine has no field or method Machine)]

We lint all of our code in CI to prevent obvious bugs, but this kind of problem never occurred.

@hardikdr
Member

hardikdr commented Dec 2, 2019

We run the lint check here: https://github.com/gardener/machine-controller-manager/blob/master/.ci/check#L61, in case it helps.

Also, what version of MCM were you rebasing/using?

@majst01

majst01 commented Dec 2, 2019

We are using master

@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 1, 2020
@ghost ghost added component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) and removed component/machine-controller-manager labels Mar 7, 2020
@ghost ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 7, 2020
@gardener-robot gardener-robot added area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related and removed area/scalability labels Jun 5, 2020
@prashanth26 prashanth26 removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Aug 13, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 13, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 13, 2020
@himanshu-kun himanshu-kun added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 21, 2023
@himanshu-kun himanshu-kun added priority/4 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Feb 21, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 31, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 9, 2024
8 participants