Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update huge pages KEP for container isolation of huge pages #1199

Merged
merged 1 commit into from
Oct 22, 2019

Conversation

bg-chun
Copy link
Member

@bg-chun bg-chun commented Aug 5, 2019

Propose new enhancement for hugepage.

  • add example of CRI update
  • Support container isolation of huge pages

More information is available in below issue.
kubernetes/kubernetes#80716

@k8s-ci-robot
Copy link
Contributor

Welcome @bg-chun!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @bg-chun. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 5, 2019
@bg-chun bg-chun force-pushed the hugepage-updates branch 2 times, most recently from 74028e4 to 2c9ee00 Compare August 5, 2019 14:11
@bg-chun
Copy link
Member Author

bg-chun commented Sep 2, 2019

/ok-to-test

@k8s-ci-robot
Copy link
Contributor

@bg-chun: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

- [Phase 2](#Phase-2)
- [Support container isolation of huge pages](#support-container-isolation-of-huge-pages)
- [Enhance Node Allocatable feature to reserve huge pages for system](#enhance-node-allocatable-feature-to-reserve-huge-pages-for-system)
- [cAdviser changes(Phase 2)](#cAdviser-changes(Phase-2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/cAdviser/cAdvisor/g 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed :)

To support NUMA, `cAdviser` should discover and store pre-allocated huge pages per NUMA node. The `v3` version of `MachineInfo` will be introduced.

#### Enhance Node Allocatable feature to reserve huge pages for system
Some system services like `OVS-DPDK` comsume huge pages per NUMA node, to determine the allocatalbe number of huge pages in `kubelet`, `Node Allocatable feature` should support to reserve huge pages per NUMA node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand this correct; do you mean that huge pages per NUMA node should be node resources in k8s in the same way as huge pages (and cpu, memory, and empherial storage) is today? Or do you mean that the data should be stored internally, so that it can be utilized by the topology manager?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, the data which represents pre-allocated hugepages per NUMA node will be additional information for future usage(Memory Manager or maybe Topology Manager).
So, I mean the data should be stored internally.

And I believe that the node resources that you mentioned are tightly coupled with node scheduler.

@bg-chun
Copy link
Member Author

bg-chun commented Sep 3, 2019

@derekwaynecarr
May I ask you to trigger testing?

@odinuge
Copy link
Member

odinuge commented Sep 3, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 3, 2019
@bg-chun
Copy link
Member Author

bg-chun commented Sep 3, 2019

Oh... it seems that CI does not allow updating the table of contents of existing KEP.

Copy link
Contributor

@AlexeyPerevalov AlexeyPerevalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle memory restriction per NUMA node it's necessary feature for DPDK like applications. But not so clear how to guarantee in common case the memory allocation from specific NUMA node (e.g. in case of anonymous hugepages). The parent process of the container can bind NUMA node for memory allocation (e.g. with libnuma), but process itself can rebind it.


#### Support container isolation of huge pages

Container isolation of huge pages should be supported to avoid competition between containers to consume huge pages. Currently, `kubelet` sets the agregated huge pages limits on pod's cgroup of hugetlb subsystem. This should be enhanced to set limits on container's cgroup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of cgroup are you going to use? Current memory cgroup doesn't have support of limiting NUMA nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Container isolation of hugepages does not mean limiting NUMA nodes; it's a related issue but totally different.

At this point, Kubelet supports hugepages in pod isolation, not container isolation.
This is mentioned on an official website(see future section).
And this issue provides more details.

Supporting Pod isolation of hugepage means that the below situation that can happen.

  1. The pod has two containers
    (container A and B).
  2. Each of the containers requests 2 of 1GB-Hugepages as a resource.
    (container A: 2GB / container B : 2GB)
  3. But Kubelet set a limitation as 4GB on pod's cgroup of hugetlb subsystem.
  4. It means a single container in a pod can consume 4GB maximally.
  5. If a container consumes 2GB maximally, there will be no issue.
    But if a container consumes 4GB, another one cannot consume any hugepage.

Copy link
Member Author

@bg-chun bg-chun Sep 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of cgroup are you going to use?
=> To limiting NUMA nodes, I am going to use cpuset subsystem(cpuset.mems).
But this KEP and updates will not cover it, Memory Manager KEP will cover it.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 3, 2019
@bg-chun
Copy link
Member Author

bg-chun commented Sep 3, 2019

Below PR updates cAdvisor to discover and store pre-allocated huge pages per NUMA node as mentioned in KEP updates.
google/cadvisor#2304

@bg-chun
Copy link
Member Author

bg-chun commented Sep 4, 2019

It seems that to set a limit of hugepages over CRI message, LinuxContainerResources should be updated and the function WithResources in containerd should also be updated.

But, I'm not sure whether the additional update of containerd is required or not.
The additional update means updating the other part of containerd like a containerd-shim or runc.

This week, I will check it then I will leave the result of this work.

@bg-chun
Copy link
Member Author

bg-chun commented Oct 22, 2019

@derekwaynecarr , @bart0sh, @kad , @odinuge
Guys, I updated KEP to focus on only container isolation and spread the content to whole document.
May i ask you guys to review it again?

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update. merging the kep update, and we can discuss implementation separately.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 22, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bg-chun, derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2019
@derekwaynecarr
Copy link
Member

@bg-chun for hugetlbfs, the writing container will be charged for its usage (similar to memory). assuming more than one container wants to write to hugetlbfs, requests alone should be sufficient, as limits should have been limiting the local container.

@k8s-ci-robot k8s-ci-robot merged commit 2e968f3 into kubernetes:master Oct 22, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 22, 2019
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Dec 18, 2019
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Dec 19, 2019
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri-o that referenced this pull request Dec 19, 2019
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri-o that referenced this pull request Dec 21, 2019
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 4, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
bg-chun pushed a commit to bg-chun/cri that referenced this pull request Feb 5, 2020
update cri-api vendor to include hugepages changes
KEP: kubernetes/enhancements#1199
CRI: kubernetes/kubernetes#83614

Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.

The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests `hugepages-<hugepagesize>` resource is at
minimum in the `Burstable` QoS class.

If multiple containers consume huge pages in the same pod, the request must be
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would want to differentiate between init containers and regular containers, since the init containers are all guaranteed to be done when the regular containers start up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants