Reduce CPU utilization #304

amshuman-kr · 2019-08-07T12:06:41Z

Issue

During performance tests, of all the control-plane components, the CPU utilisation of MCM was comparable to that of kube-apiserver and even higher than that of etcd. This is surprising, as MCM handles at least two orders of magnitude fewer resources than the other control-plane components.

Reducing the CPU utilisation will help improve the scalability of MCM as well as Gardener.

Solution

Profile and optimize the CPU utilization of MCM.

prashanth26 · 2019-08-08T05:24:13Z

I think most of the CPU utilization is probably due to the reconciliation of machine objects. They are reconciled on every node object update, and a node object is updated every 10-30s because of condition checks such as memory, disk, network, and readiness. I think we could reduce this reconciliation frequency either in MCM or at the node level. We can take a call while fixing this issue.

amshuman-kr · 2019-11-22T06:44:35Z

I think we should also look at the number of goroutines and retry logic. But let's profile first.

majst01 · 2019-12-02T08:21:04Z

Hi,
we see constantly increasing CPU usage of MCM in our environment, with only 3 nodes in the cluster.
pprof shows a vast number of parked goroutines:

runtime.gopark
/usr/local/go/src/runtime/proc.go

  Total:       29300      29300 (flat, cum)   100%
runtime.selectgo
/usr/local/go/src/runtime/select.go

  Total:           2      29113 (flat, cum) 99.35%

But I must admit that I have no idea how this could happen, tbh.
Ideas?

amshuman-kr · 2019-12-02T08:30:33Z

@majst01 #341 addressed the constant increase in CPU usage. Does it not work for you?

We have kept the current issue open to track the optimisation of the baseline CPU usage.

majst01 · 2019-12-02T09:07:05Z

We already have #341. I will try the most recent version as well and report back.

amshuman-kr · 2019-12-02T09:15:13Z

> We already have #341

Thanks. We saw improvement with #341. But this is good information for us. We will also check from our end.

majst01 · 2019-12-02T09:21:28Z

I doubt that the upstream changes since #341 change any behavior here. I will also look for unclosed channels et al.

amshuman-kr · 2019-12-02T09:24:44Z

> I doubt that the changes upstream since #341 change any behavior here.

Yes. If you already have #341, there are no further relevant changes that might help.

majst01 · 2019-12-02T09:28:28Z

I tried to run https://github.com/golangci/golangci-lint on the code base, but it failed with:

WARN [runner] Can't run linter goanalysis_metalinter: assign: failed prerequisites: inspect@github.com/gardener/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion 
WARN [runner] Can't run linter unused: buildssa: analysis skipped: errors in package: [/home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine_client.go:10:6: MachineInterface redeclared in this block /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:21:6:       other declaration of MachineInterface /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:112:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:97:20: machine.Machine undefined (type *machine.Machine has no field or method Machine) /home/stefan/dev/devops/cloud-native/metal/metal-pod/machine-controller-manager/pkg/client/clientset/internalversion/typed/machine/internalversion/machine.go:85:20: machine.Machine undefined (type *machine.Machine has no field or method Machine)]

We lint all of our code in CI to prevent obvious bugs, and we have never run into this kind of problem.

hardikdr · 2019-12-02T09:40:14Z

We do the lint check here: https://github.com/gardener/machine-controller-manager/blob/master/.ci/check#L61, in case it helps.

Also, what version of MCM were you rebasing/using?

majst01 · 2019-12-02T09:43:45Z

We are using master

PadmaB added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/scalability labels Aug 7, 2019

gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 7, 2019

ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 1, 2020

ghost added component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) and removed component/machine-controller-manager labels Mar 7, 2020

ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 7, 2020

gardener-robot added area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related and removed area/scalability labels Jun 5, 2020

prashanth26 removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Aug 13, 2020

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 13, 2020

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 13, 2020

himanshu-kun mentioned this issue Feb 21, 2023

Refactor MCM to leverage controller-runtime #724

Open

49 tasks

himanshu-kun added size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 21, 2023

himanshu-kun added priority/4 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Feb 21, 2023

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 31, 2023

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 9, 2024
