
Enable NAP in 'aaa' cluster #1604

Merged: 1 commit merged into kubernetes:main on May 4, 2021
Conversation

@mborsz (Member) commented Feb 3, 2021

Context:

Currently perf-dash has its CPU request/limit set to 3 cores and that is still insufficient:
[screenshot omitted]

It is also being continuously OOMKilled with its memory limit set to 8 GiB.

The node pool added in #659 for perf-dash uses n1-standard-4, which provides 4 cores and 12 GiB of allocatable memory.

I would like to increase perf-dash's requests to at least 6 CPUs and 16 GiB (to give some room for growth), but I'm not able to do that with the machine type currently in use.

Instead of adding a 'pool-3' node pool with a larger machine type, I propose enabling Node auto-provisioning (NAP) on this cluster, which will add such node pools as needed and reduce the maintenance burden. A sketch of what this could look like in Terraform is shown below.
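
For context, here is a minimal sketch of what enabling NAP could look like in the cluster's Terraform definition; the resource name and the CPU/memory ceilings below are illustrative assumptions, not the exact values in this PR:

```hcl
resource "google_container_cluster" "cluster" {
  # ... existing cluster configuration ...

  # Node auto-provisioning: GKE creates and removes node pools on demand,
  # staying within the cluster-wide ceilings declared in resource_limits.
  cluster_autoscaling {
    enabled = true

    # One resource_limits block per resource type (illustrative ceilings).
    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 100
    }

    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 400
    }
  }
}
```

With ceilings like these in place, NAP can provision a node pool large enough for a 6-CPU / 16 GiB perf-dash pod without anyone adding a pool-3 by hand.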

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. wg/k8s-infra labels Feb 3, 2021
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 3, 2021
@mborsz (author) commented Feb 3, 2021

I would like to hear feedback on this change. I understand that there may be reasons to avoid enabling this, but I would like to be sure it is a conscious decision if we do not enable it here and instead continue manually adding larger and larger node pools.

Base automatically changed from master to main February 9, 2021 00:35
@spiffxp (Member) commented Feb 11, 2021

/uncc @mikedanese @nikhita
/cc @spiffxp @BenTheElder
/assign @thockin @dims @ameukam
I'd like to hear your opinions. I have very little experience managing aaa specifically, or clusters using this feature.

@k8s-ci-robot k8s-ci-robot requested review from BenTheElder and spiffxp and removed request for mikedanese and nikhita February 11, 2021 07:17
@dims (Member) commented Feb 11, 2021

we forgot @cblecker :)

@ameukam (Member) commented Feb 11, 2021

@mborsz Did you consider the possibility of changing the instance type to n1-standard-8 (or another instance type) for the pool-2 node pool instead of enabling auto-scaling at the cluster level?

@thockin (Member) commented Feb 11, 2021

I'm skeptical of why perf-dash needs that much memory - the trajectory is problematic. That said, NAP seems like the rightest answer. This is sort of exactly what it is for. So I am +1

Isn't there an overall cluster limit on resources, too?

@mborsz (author) commented Feb 12, 2021

> I'm skeptical of why perf-dash needs that much memory - the trajectory is problematic. That said, NAP seems like the rightest answer. This is sort of exactly what it is for. So I am +1

Regarding perf-dash resource usage: perf-dash keeps all of its data in memory and periodically refreshes its state from GCS. We are continuously adding new tests or changing existing ones (e.g. kubernetes/test-infra#20448), so memory growth is expected.

Most of the CPU and memory is used when perf-dash initializes or refreshes state, most likely for network buffers and JSON deserialization. Most of the time, memory usage is around ~700 MiB, and CPU usage jumps between 0 and 3 cores (the current limit) when it refreshes state.

I think we can optimize perf-dash to reduce its resource usage, but I'm not sure this is a good area for our team to invest time in. For comparison: perf-dash currently consumes 3 cores and 10 GiB of memory, while e.g. our 5000-node test consumes ~5000 cores and ~20 TiB of memory. I think optimizing the 5000-node test has a better ROI, and we already have some results there (e.g. kubernetes/perf-tests#561).

@mborsz (author) commented Feb 12, 2021

> @mborsz Did you consider the possibility of changing the instance type to n1-standard-8 (or another instance type) for the pool-2 node pool instead of enabling auto-scaling at the cluster level?

Yes. It's not possible to change the instance type of an existing node pool; we would need to create a new node pool with a larger machine type. Technically we can do that, but I'm afraid that some day we will need to add an even larger node pool. This is why I suggest starting to use NAP.

@BenTheElder (Member) commented:

I have no experience with NAP.

If we review the configuration for requested resources, though, then this seems fine. Instead of debating how big the node pool config should be, we can debate how big the pod requests should be 🙃

@thockin (Member) commented Feb 12, 2021 via email

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2021
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2021
@thockin thockin removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2021
@thockin (Member) commented Feb 12, 2021

I didn't mean to unilaterally approve - looking for consensus :)

// Enable NAP
cluster_autoscaling {
  enabled = true
  resource_limits = [
A reviewer (Member) commented on this diff:

I may be wrong, but I think resource_limits is a sub-block of cluster_autoscaling. You'll need to split the resource types.

@mborsz (author) replied:

Fixed, thanks!
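
To make the review comment above concrete: in the Terraform google provider, resource_limits is a repeatable nested block inside cluster_autoscaling rather than a single list attribute, so each resource type gets its own block. A short sketch of the corrected shape (ceilings illustrative, as in the earlier sketch):

```hcl
cluster_autoscaling {
  enabled = true

  # Not `resource_limits = [ ... ]` as in the original diff;
  # instead, one nested block per resource type:
  resource_limits {
    resource_type = "cpu"
    maximum       = 100
  }

  resource_limits {
    resource_type = "memory"
    maximum       = 400
  }
}
```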

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 23, 2021
@ameukam (Member) commented Feb 23, 2021

/lgtm
/hold
In case others want to comment.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 23, 2021
@mm4tt (Contributor) commented Apr 16, 2021

Can we merge this? Or are we waiting for someone else's approval?

mm4tt added a commit to mm4tt/perf-tests that referenced this pull request Apr 16, 2021
Version 2.33 requires more than 10GB of memory and we cannot bump the mem limit without kubernetes/k8s.io#1604
@thockin (Member) commented Apr 16, 2021

I think we can enable this, but someone needs to run terraform on it ASAP, so whoever has time should remove the hold and do it. Not me, at least not today.

@mborsz (author) commented Apr 29, 2021

@thockin can you suggest anyone who will be able to run terraform on this in the next couple of days?

@mborsz (author) commented Apr 30, 2021

@BenTheElder @ameukam @spiffxp can you run terraform on this?

@ameukam (Member) commented Apr 30, 2021

@mborsz Will run this Monday if nobody is faster than me.

@spiffxp NAP will create instances of the E2 family for any new workload.

@spiffxp (Member) commented May 4, 2021

I have no objections, but I can't commit to running this just now; if @ameukam wants to take it, go for it.

@ameukam (Member) commented May 4, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit 04ddf8d into kubernetes:main May 4, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 4, 2021
@ameukam (Member) commented May 4, 2021

Running `terraform apply -auto-approve -target google_container_cluster.cluster`

@ameukam (Member) commented May 4, 2021

Change applied successfully:

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
