-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3334
Conversation
arpitsardhana
commented
Jun 2, 2022
•
edited
Loading
edited
- One-line PR description: Adding new KEP
- Issue link: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3327
|
Welcome @arpitsardhana! |
Hi @arpitsardhana. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
The KEP needs minor updates related to feature gates but overall looks promising. The feature seems useful as it provides us with another knob to tweak the CPU Management behavior.
We should carefully consider how this change would work on systems with multiple NUMA nodes per socket (e.g. DualSocketMultiNumaPerSocketHT) as well as systems with multiple NUMA sockets per NUMA (e.g. DualNumaMultiSocketPerNumaHT). Would be nice to see that captured as part of the KEP.
``` | ||
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now). | ||
|
||
In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true for systems with multiple NUMA nodes per Socket but not for the ones with multiple sockets per NUMA. In the latter case with TopologyManager “single-numa-node” policy, we could potentially get CPUs that span multiple sockets and with “align-by-socket” policy option enabled the goal should be to try to align those by socket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something, i will need to think more!
This KEP covers cases of DualSocketMultiNUMAPerSocketHT
Let me think on situation where a single NUMA can span multiple sockets (ex DualNumaMultiSocketPerNumaHT) and update once i have some definitive answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please have few examples of architectures we are targeting where a single NUMA can span multiple sockets?
I'm aware such architecture existed in the past, but I'm not sure the hardware is heading in this direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do one of 2 things here:
- Change the flag to have semantics of
align-by-socket-or-numa-whichever-wraps-the-other
(bad name, but you get the point). - Error out if
align-by-socket
is set and the machine has more than one socket per numa node.
I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the KEP, i am leaning towards erring out in case there are more than one socket per NUMA
We can address this use case if we see real world architecture leaning in that direction.
I am open to suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on erroring out for now. It is a rare case, but might still exist for VMs. Some large VM might span multiple sockets using vNUMA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments. In general, the change is fairly straightforward and easy to reason about. The main open questions are around the edge cases and how this new option will interact with existing options.
That said, the text of the KEP could still use some polishing, with two specific general improvements that could be made:
- Split paragraphs into more than one line so that we can comment on specific sentences instead of a whole paragraph.
- Put backticks around all of the things that should have backticks (e.g.
CPU'Manager
,DeviceManager
, etc.)
``` | ||
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now). | ||
|
||
In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do one of 2 things here:
- Change the flag to have semantics of
align-by-socket-or-numa-whichever-wraps-the-other
(bad name, but you get the point). - Error out if
align-by-socket
is set and the machine has more than one socket per numa node.
I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.
Thanks for comments! I have addressed majority of review comments. |
@johnbelamaric can we get a PRR review on this please? |
/assign @johnbelamaric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-binding lgtm
non-blocking comments about possible improvements and clarifications
ping me anytime to get a final review (and so the proper LGTM, likely)
``` | ||
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now). | ||
|
||
In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.
###### How can someone using this feature know that it is working for their instance? | ||
|
||
In order to verify this feature is working, one should: | ||
Pick a node with at least 2 Sockets and 8 NUMA nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking, asking for my understanding. Would a system with 2 sockets and 4 NUMA nodes be OK as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 2 socket and 4 NUMA is ok as well.
Above line is fixed in latest version.
|
||
Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option. | ||
|
||
###### How can someone using this feature know that it is working for their instance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, but I'll obviously defer the autorithative answer to PRRs, that this is meant to ask for more readily observable options, like metrics, or object statuses, or system properties. This (detailed) tests works of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out! There was not no relevant metric for CPUManager policy option, so put down the test case taking a cue from other policy options KEP.
|
||
To verify the list of CPUs allocated to the container, one can either: | ||
- `exec` into uthe container and run `taskset -cp 1` (assuming this command is available in the container). | ||
- Call the `GetCPUS()` method of the `CPUProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is sufficient to connect to the podresources API and send a List
call; inspecting the input should provide all the needed informations.
I know of a few command line podresources clients, but not an recommended tool (no crictl
equivalent to say).
/assign @derekwaynecarr @wojtek-t |
Thanks for putting this enhancement together @arpitsardhana. In general, I think allowing alignment at boundaries other than a single NUMA node is a useful feature to have. This change, in particular, is designed to increase the scope of alignment for CPUs to the socket boundary on architectures where there are more than one NUMA node per socket. In this way, we are able to treat all CPUs on a socket as perfectly aligned with other resources attached to any individual NUMA node on that socket. I can see how this would be useful given that the ratio of requested CPUs to other requested resources by any given pod tends to be quite high (e.g. 30 CPUs for every 1 GPU). Since the number of CPUs per NUMA node decreases as the number of NUMA nodes per socket increases, it can be hard to achieve perfect alignment unless the scope for CPU alignment is increased in some way. My only (non-blocking) concerns with the proposed approach are the following:
/lgtm |
/lgtm |
|
||
| Risk | Impact | Mitigation | | ||
| -------------------------------------------------| -------| ---------- | | ||
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run but CPU allocations can spread across socket in cases when allocation could have been within same socket | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any feedback / status back to the users if the kubelet meet its CPU alignment request or not?
|
||
### Proposed Change | ||
|
||
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s function `GetTopologyHints` will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal only defines the behavior with the newly introduced option is in addition to the default CPUManager behavior.
It is unclear to me what behavior would be the user configs the follow cases:
- align-by-socket=true & full-pcups-only=true
- align-by-socket=true & distribute-cpus-across-numa=true
- align-by-socket=true & full-pcups-only=true & distribute-cpus-across-numa=true
Valid vs. invalid? Additive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't thought extensively to all the cases, but I expect full-pcpus-only
to be fully compatible with this new option, like it is with distribute-cpus-across-numa
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @fromanirh , full-pcpu-only
& distribute-cpus-across-numa
can work in conjunction with align-by-socket
.
To put in simpler words, align-by-socket
influences how NUMA nodes are assigned(try to ensure that selected NUMA nodes are aligned by socket) while other two policy options influence how cpu's are distributed across NUMA nodes once assignment(hint) is finalized
I plan to raise another PR to address all comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @fromanirh above that align-by-socket=true & full-pcups-only=true can treat as an additive. But we need to iron out more details on align-by-socket=true & distribute-cpus-across-numa=true. Also agreed with @arpitsardhana above on the simple interpret. Please ping me if you have follow PR to address the comments. Thanks!
Agreed with @klueska above at: #3334 (comment) Added some comments as the non blocker for alpha too. /lgtm |
Thanks a lot @dchen1107 @fromanirh @klueska for review and approval. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: arpitsardhana, dchen1107, johnbelamaric, klueska The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @klueska 1.25 Release Docs Shadow here. Does this enhancement work planned for 1.25 require any new docs or modification to existing docs? |
Thanks @cathchu Will look over documentation requirement |
@cathchu added Placeholder PR for website update: kubernetes/website#35569 |