Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3334

Merged
merged 4 commits into from
Jun 21, 2022

Conversation

arpitsardhana
Copy link
Contributor

@arpitsardhana arpitsardhana commented Jun 2, 2022

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 2, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: arpitsardhana / name: Arpit Singh (32b4019)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 2, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @arpitsardhana!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 2, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @arpitsardhana. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 2, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 2, 2022
@klueska
Copy link
Contributor

klueska commented Jun 2, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2022
Copy link
Contributor

@swatisehgal swatisehgal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

The KEP needs minor updates related to feature gates but overall looks promising. The feature seems useful as it provides us with another knob to tweak the CPU Management behavior.

We should carefully consider how this change would work on systems with multiple NUMA nodes per socket (e.g. DualSocketMultiNumaPerSocketHT) as well as systems with multiple NUMA sockets per NUMA (e.g. DualNumaMultiSocketPerNumaHT). Would be nice to see that captured as part of the KEP.

keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/kep.yaml Outdated Show resolved Hide resolved
```
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).

In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true for systems with multiple NUMA nodes per Socket but not for the ones with multiple sockets per NUMA. In the latter case with TopologyManager “single-numa-node” policy, we could potentially get CPUs that span multiple sockets and with “align-by-socket” policy option enabled the goal should be to try to align those by socket?

Copy link
Contributor Author

@arpitsardhana arpitsardhana Jun 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something, i will need to think more!
This KEP covers cases of DualSocketMultiNUMAPerSocketHT
Let me think on situation where a single NUMA can span multiple sockets (ex DualNumaMultiSocketPerNumaHT) and update once i have some definitive answer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please have few examples of architectures we are targeting where a single NUMA can span multiple sockets?
I'm aware such architecture existed in the past, but I'm not sure the hardware is heading in this direction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do one of 2 things here:

  1. Change the flag to have semantics of align-by-socket-or-numa-whichever-wraps-the-other(bad name, but you get the point).
  2. Error out if align-by-socket is set and the machine has more than one socket per numa node.

I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the KEP, i am leaning towards erring out in case there are more than one socket per NUMA
We can address this use case if we see real world architecture leaning in that direction.
I am open to suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on erroring out for now. It is a rare case, but might still exist for VMs. Some large VM might span multiple sockets using vNUMA.

Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments. In general, the change is fairly straightforward and easy to reason about. The main open questions are around the edge cases and how this new option will interact with existing options.

That said, the text of the KEP could still use some polishing, with two specific general improvements that could be made:

  1. Split paragraphs into more than one line so that we can comment on specific sentences instead of a whole paragraph.
  2. Put backticks around all of the things that should have backticks (e.g. CPU'Manager, DeviceManager, etc.)

keps/prod-readiness/sig-node/3327.yaml Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Outdated Show resolved Hide resolved
keps/sig-node/3327-align-by-socket/README.md Show resolved Hide resolved
```
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).

In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do one of 2 things here:

  1. Change the flag to have semantics of align-by-socket-or-numa-whichever-wraps-the-other(bad name, but you get the point).
  2. Error out if align-by-socket is set and the machine has more than one socket per numa node.

I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 8, 2022
@arpitsardhana
Copy link
Contributor Author

Thanks for comments! I have addressed majority of review comments.
However, few of things that may require further updates are 1) Addressing corner cases 2) more explanation on allocateCPU() changes 3) May be some more polishing around framing of KEP

@dchen1107 dchen1107 added this to the v1.25 milestone Jun 9, 2022
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 13, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: arpitsardhana / name: Arpit Singh (32b4019, baa7ccd)
  • ✅ login: sanjaychatterjee / name: Sanjay Chatterjee (3cef13f)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 13, 2022
@klueska
Copy link
Contributor

klueska commented Jun 15, 2022

@johnbelamaric can we get a PRR review on this please?

@arpitsardhana
Copy link
Contributor Author

/assign @johnbelamaric

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-binding lgtm
non-blocking comments about possible improvements and clarifications
ping me anytime to get a final review (and so the proper LGTM, likely)

```
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).

In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.

###### How can someone using this feature know that it is working for their instance?

In order to verify this feature is working, one should:
Pick a node with at least 2 Sockets and 8 NUMA nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking, asking for my understanding. Would a system with 2 sockets and 4 NUMA nodes be OK as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 2 socket and 4 NUMA is ok as well.
Above line is fixed in latest version.


Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

###### How can someone using this feature know that it is working for their instance?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, but I'll obviously defer the autorithative answer to PRRs, that this is meant to ask for more readily observable options, like metrics, or object statuses, or system properties. This (detailed) tests works of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out! There was not no relevant metric for CPUManager policy option, so put down the test case taking a cue from other policy options KEP.


To verify the list of CPUs allocated to the container, one can either:
- `exec` into uthe container and run `taskset -cp 1` (assuming this command is available in the container).
- Call the `GetCPUS()` method of the `CPUProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is sufficient to connect to the podresources API and send a List call; inspecting the input should provide all the needed informations.
I know of a few command line podresources clients, but not an recommended tool (no crictl equivalent to say).

@arpitsardhana
Copy link
Contributor Author

/assign @derekwaynecarr @wojtek-t

@klueska
Copy link
Contributor

klueska commented Jun 21, 2022

Thanks for putting this enhancement together @arpitsardhana.

In general, I think allowing alignment at boundaries other than a single NUMA node is a useful feature to have. This change, in particular, is designed to increase the scope of alignment for CPUs to the socket boundary on architectures where there are more than one NUMA node per socket. In this way, we are able to treat all CPUs on a socket as perfectly aligned with other resources attached to any individual NUMA node on that socket.

I can see how this would be useful given that the ratio of requested CPUs to other requested resources by any given pod tends to be quite high (e.g. 30 CPUs for every 1 GPU). Since the number of CPUs per NUMA node decreases as the number of NUMA nodes per socket increases, it can be hard to achieve perfect alignment unless the scope for CPU alignment is increased in some way.

My only (non-blocking) concerns with the proposed approach are the following:

  1. Is it too limiting to force alignment to the socket boundary? Are there cases where we would want a different boundary (e.g. if there are 4 NUMA nodes per socket, do we ever want to reduce the scope of alignment to just 2 NUMA nodes instead of all 4?). I don't think this concern should block this proposal for alpha, but we may want to consider a redesign of the flagname / API for beta.
  2. Does it really make sense to apply this new alignment to just the CPUManager? It seems as if we could have introduced a new TopologyManager policy called single-socket (mimicking the existing single-numa-node policy) that maybe would have achieved the same thing. Again, I think it's fine to move forward with the proposed approach as an alpha implementation, but we should consider this alternative a bit more closely before moving to beta.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2022
@ffromani
Copy link
Contributor

/lgtm
I agree with the approach suggested by @klueska


| Risk | Impact | Mitigation |
| -------------------------------------------------| -------| ---------- |
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run but CPU allocations can spread across socket in cases when allocation could have been within same socket |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any feedback / status back to the users if the kubelet meet its CPU alignment request or not?


### Proposed Change

When `align-by-socket` is enabled as a policy option, the `CPUManager`’s function `GetTopologyHints` will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal only defines the behavior with the newly introduced option is in addition to the default CPUManager behavior.

It is unclear to me what behavior would be the user configs the follow cases:

  1. align-by-socket=true & full-pcups-only=true
  2. align-by-socket=true & distribute-cpus-across-numa=true
  3. align-by-socket=true & full-pcups-only=true & distribute-cpus-across-numa=true

Valid vs. invalid? Additive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't thought extensively to all the cases, but I expect full-pcpus-only to be fully compatible with this new option, like it is with distribute-cpus-across-numa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @fromanirh , full-pcpu-only & distribute-cpus-across-numa can work in conjunction with align-by-socket.
To put in simpler words, align-by-socket influences how NUMA nodes are assigned(try to ensure that selected NUMA nodes are aligned by socket) while other two policy options influence how cpu's are distributed across NUMA nodes once assignment(hint) is finalized

I plan to raise another PR to address all comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @fromanirh above that align-by-socket=true & full-pcups-only=true can treat as an additive. But we need to iron out more details on align-by-socket=true & distribute-cpus-across-numa=true. Also agreed with @arpitsardhana above on the simple interpret. Please ping me if you have follow PR to address the comments. Thanks!

@dchen1107
Copy link
Member

Agreed with @klueska above at: #3334 (comment)

Added some comments as the non blocker for alpha too.

/lgtm
/approve

@arpitsardhana
Copy link
Contributor Author

Thanks a lot @dchen1107 @fromanirh @klueska for review and approval.
I will work on comments and update the KEP.

@johnbelamaric
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arpitsardhana, dchen1107, johnbelamaric, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2022
@k8s-ci-robot k8s-ci-robot merged commit fb10b2b into kubernetes:master Jun 21, 2022
@cathchu
Copy link

cathchu commented Jul 27, 2022

Hi @klueska

1.25 Release Docs Shadow here. Does this enhancement work planned for 1.25 require any new docs or modification to existing docs?
If so, please follows the steps here to open a PR against dev-1.25 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before August 4.
Also, take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release.
Thank you!

@arpitsardhana
Copy link
Contributor Author

Thanks @cathchu Will look over documentation requirement

@arpitsardhana
Copy link
Contributor Author

@cathchu added Placeholder PR for website update: kubernetes/website#35569

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.