Enable NAP in 'aaa' cluster #1604
Conversation
I would like to hear feedback on this change. I understand that there may be reasons why we may want to avoid enabling this, but I would like to be sure that it is a conscious decision that we do not enable this here and instead continue manually adding larger and larger node pools. |
/uncc @mikedanese @nikhita |
we forgot @cblecker :) |
@mborsz Did you consider the possibility that we could change the instance type to |
I'm skeptical of why perf-dash needs that much memory - the trajectory is problematic. That said, NAP seems like the rightest answer. This is sort of exactly what it is for. So I am +1. Isn't there an overall cluster limit on resources, too? |
perf-dash resource usage: perf-dash keeps all the data in memory and periodically refreshes its state from GCS. We are continuously adding new tests or changing existing ones (e.g. kubernetes/test-infra#20448), so memory usage is sure to keep growing. Most of the cpu and memory is used when perf-dash either initializes or refreshes state, most likely for network buffers and JSON deserialization. Most of the time, memory usage is around ~700MiB, and cpu usage jumps between 0 and 3 cores (the current limit) when it refreshes state. I think we can optimize perf-dash to reduce resource usage, but I'm not sure it is a good area to invest our team's time. For comparison: perf-dash currently consumes 3 cores and 10Gi of memory, while e.g. our 5000-node test consumes ~5000 cores and ~20TiB of memory. I think that optimizing our 5000-node test has better ROI, and we already have some achievements there (e.g. kubernetes/perf-tests#561). |
Yes, it's not possible to change the instance type; we need to create a new node pool with a larger machine type. Technically we can do that, but I'm afraid that some day we will need to add an even larger node pool. This is why I suggest starting to use NAP. |
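For context, a minimal, hypothetical Terraform sketch of what the manual alternative would look like. The resource names (aaa, pool_3) and the machine type are assumptions for illustration only, not the repo's actual config:

// Hypothetical sketch: manually adding yet another, larger node pool.
// Names and machine type below are assumed for illustration.
resource "google_container_node_pool" "pool_3" {
  name     = "pool-3"
  cluster  = google_container_cluster.aaa.name
  location = google_container_cluster.aaa.location

  node_count = 1

  node_config {
    machine_type = "n1-standard-8" // one size up from the current n1-standard-4
  }
}

Every future resource bump would mean another block like this, which is exactly the maintenance burden NAP is meant to remove.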
I have no experience with NAP. If we review the configuration for requested resources though then this seems fine. Instead of debating how big the node pool config should be set to we can debate how big the pod requests should be 🙃 |
+1 Ben
This is the way.
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, thockin

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing |
I didn't mean to unilaterally approve - looking for consensus :) |
// Enable NAP
cluster_autoscaling {
  enabled = true
  resource_limits = [
I may be wrong, but I think resource_limits is a sub-block of cluster_autoscaling. You'll need to split the resource types.
Fixed, thanks!
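For illustration, a minimal sketch of the corrected shape, with resource_limits as repeated sub-blocks inside cluster_autoscaling rather than a list. The minimum/maximum values below are placeholders, not the values merged in this PR:

// Enable NAP
cluster_autoscaling {
  enabled = true

  // One resource_limits sub-block per resource type;
  // the limit values here are illustrative placeholders.
  resource_limits {
    resource_type = "cpu"
    minimum       = 1
    maximum       = 100
  }

  resource_limits {
    resource_type = "memory"
    minimum       = 1
    maximum       = 1000
  }
}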
/lgtm |
Can we merge this? Or are we waiting for someone else's approval? |
Version 2.33 requires more than 10GB of memory and we cannot bump the mem limit without kubernetes/k8s.io#1604
I think we can enable this but someone needs to run terraform on it ASAP, so whoever has time should remove the hold and do it. Not me, at least not today. |
@thockin can you suggest anyone who will be able to run terraform on this in the next couple of days? |
@BenTheElder @ameukam @spiffxp can you run terraform on this? |
I have no objections but can't commit to running this just now; if @ameukam wants to take it, go for it |
/hold cancel |
Running |
Change applied successfully: |
Context:
Currently perf-dash has its cpu request/limit set to 3 cores and it is still insufficient:
It is also being continuously OOMKilled with the memory limit set to 8GiB.
The node pool added in #659 for perf-dash uses n1-standard-4, which provides 4 cores and 12 GiB of allocatable memory.
I would like to increase perf-dash's requests to at least 6 cpus and 16 GiB (to give some room for growth), but I'm not able to do that with the current machine type.
Instead of adding a 'pool-3' node pool with a larger machine type, I propose enabling Node auto-provisioning (NAP) on this cluster, which will add such node pools as needed, reducing the maintenance burden.