KEP: in-place update of pod resources #686
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@k8s-ci-robot fixed CLA email address to match commit email.
/check-cla
Is this moving a (partially) consensus-reached KEP to the new repo, or is this the request for deeper review? If the latter, please adjust the title to reflect that this IS the KEP.
@thockin Yes, this just moves the earlier draft KEP skeleton by @kgolab from k/community to k/enhancements in order to get the ball rolling per the new process. The only change I made is to set the owning sig to autoscaling, as I think it is the main driver for this feature. I'll append my previous comments on this once I bring it up for an initial look with sig-node and sig-scheduling and get their official sign-off. This will evolve further as we merge design ideas and fill in details with community consensus.
Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
  denoting **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes **actual**
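To make the spec/status split concrete, below is a hypothetical Go sketch of the shape being proposed. The field name ResourceAllocated and its spec-vs-status placement follow the excerpt above; everything else (package name, exact field layout) is an illustrative assumption, not the final API.

```go
// Hypothetical sketch of the desired-vs-actual split; not the real API types.
package v1sketch

import "k8s.io/apimachinery/pkg/api/resource"

// ResourceList maps a resource name (cpu, memory) to a quantity.
type ResourceList map[string]resource.Quantity

// Container keeps its resource requirements as a pure declaration of desired state.
type Container struct {
	Name     string
	Requests ResourceList // desired requests
	Limits   ResourceList // desired limits
}

// ContainerStatus gains ResourceAllocated, recording the resources the node
// has actually committed; during a resize it may lag behind the spec.
type ContainerStatus struct {
	Name              string
	ResourceAllocated ResourceList
}
```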
Is there a corresponding CRI change for review? That shouldn't block merging this KEP draft, but it is going to be important for implementers, as it would need to be reviewed for Linux+Windows compatibility and runtime compatibility (dockershim/kata/hyper-v).
cc @feiskyer
I'm not very confident we have reached agreement that it's the direction we will go. If yes, a CRI change should be included here.
@PatrickLang In our implementation, it was sufficient to make changes in kubelet to detect a resources-only container spec update and call the UpdateContainerResources CRI API, without any changes to the CRI itself. We have tested it with Docker; we are yet to try Kata.
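For context, a minimal sketch of the kubelet-side call being described, assuming the CRI RuntimeService interface of that era (UpdateContainerResources taking a container ID plus LinuxContainerResources); the helper name and import paths are illustrative assumptions, not kubelet code:

```go
// Sketch: push new limits to the runtime for a resources-only spec update,
// reusing the existing UpdateContainerResources CRI method (no CRI change).
package kubeletsketch

import (
	internalapi "k8s.io/cri-api/pkg/apis"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// applyResize is a hypothetical helper showing the call shape only.
func applyResize(rs internalapi.RuntimeService, containerID string, cpuQuota, cpuPeriod, memLimitBytes int64) error {
	return rs.UpdateContainerResources(containerID, &runtimeapi.LinuxContainerResources{
		CpuQuota:           cpuQuota,      // CFS quota in microseconds per period
		CpuPeriod:          cpuPeriod,     // CFS period in microseconds
		MemoryLimitInBytes: memLimitBytes, // new memory limit
	})
}
```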
@vinaykul Kata does not update "container" resources for now :-) Also, it's related to how the CRI shim is implemented; in the containerd shimv2 work, I remember we didn't handle this update as of at least a month ago.
That said, if we decide to go with the current narrative in this KEP, the CRI does need to be updated (new field: ResourceAllocated), and CRI shim & crictl maintainers should be notified about the incompatible change in the meaning of LinuxContainerResources.
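As a rough illustration of the change being discussed, here is a hypothetical sketch (written as the generated Go structs rather than the .proto) of adding a ResourceAllocated field on the CRI side; only the field name comes from the comment above, and its placement and everything else are assumptions:

```go
// Hypothetical CRI-side sketch only; not an actual cri-api change.
package crisketch

// LinuxContainerResources mirrors the existing CRI message (abbreviated).
type LinuxContainerResources struct {
	CpuPeriod          int64
	CpuQuota           int64
	CpuShares          int64
	MemoryLimitInBytes int64
}

// ContainerStatus sketch: a new ResourceAllocated field would report what the
// runtime has actually applied, so callers would no longer need to read
// desired limits back out of LinuxContainerResources with a changed meaning.
type ContainerStatus struct {
	Id                string
	ResourceAllocated *LinuxContainerResources // hypothetical new field
}
```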
@vinaykul Kata does not update "container" resources for now :-) Also, it's related to how the CRI shim is implemented; in the containerd shimv2 work, I remember we didn't handle this update as of at least a month ago.
Ah that's good to know. I last toyed with Kata at GA, and they were working on getting cpu/mem update working.
I tried out kata-runtime 1.4.1 earlier today and found that OCI mostly works. CPU/memory increases and decreases are reflected in the cgroup inside the Kata container and enforced, but VSZ/RSS isn't lowered when memory is lowered, and Get doesn't reflect the actual usage.
I'll try k8s-crio-kata tomorrow or Friday to see how well the crio-oci translation works and identify gaps. It probably won't work if the containerd shim doesn't handle it.
That said, if we decide to go with the current narrative in this KEP, the CRI does need to be updated (new field: ResourceAllocated), and CRI shim & crictl maintainers should be notified about the incompatible change in the meaning of LinuxContainerResources.
@resouer kata-runtime 1.4.1 seems to handle updating cpu/memory via CRI-O (below example)
Regarding CRI, kubelet would merely switch from using PodSpec.Container.ResourceRequirements to PodStatus.ContainerStatus.ResourceAllocated to get the limits when invoking the CRI API, for example in this function: https://github.com/Huawei-PaaS/kubernetes/blob/vert-scaling-cp-review/pkg/kubelet/kuberuntime/kuberuntime_container.go#L619
Did I miss something in thinking that a CRI update is not necessary?
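A minimal sketch of the switch being described: when building the runtime's resource config, prefer the status-side allocation over the spec. ResourceAllocated and LinuxContainerResources come from the discussion above; the function name and conversion details are assumptions, not kubelet code.

```go
// Sketch: prefer the allocated (actual) resources over the desired spec when
// generating the CRI resource config for UpdateContainerResources.
package kuberuntimesketch

import (
	v1 "k8s.io/api/core/v1"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func linuxResourcesFor(container *v1.Container, allocated v1.ResourceList) *runtimeapi.LinuxContainerResources {
	limits := container.Resources.Limits // desired limits from the spec
	if allocated != nil {
		// If the node has recorded an actual allocation, use it instead, so
		// the runtime cgroups match what was admitted, not what was requested.
		limits = allocated
	}
	res := &runtimeapi.LinuxContainerResources{}
	if cpu, ok := limits[v1.ResourceCPU]; ok {
		res.CpuPeriod = 100000                 // 100ms CFS period
		res.CpuQuota = cpu.MilliValue() * 100  // quota in microseconds per period
	}
	if mem, ok := limits[v1.ResourceMemory]; ok {
		res.MemoryLimitInBytes = mem.Value()
	}
	return res
}
```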
Thanks, I just checked the kata-runtime's API, and now every CRI shim could support container-level resource adjustment. In that case, no CRI change is required; we can simply use ResourceAllocated to generate containerResources.
Are you sure about "every CRI"? I'll try to look into that more deeply in the next few days, but it will almost surely break https://github.com/Mirantis/virtlet/
Are you sure about "every CRI"? I'll try to look into that more deeply in the next few days, but it will almost surely break https://github.com/Mirantis/virtlet/
@PatrickLang Over the past week I experimented with WinServer 2019 (for another project I'm planning), and got the chance to try a Windows cross-compiled kubelet built from my implementation and take a closer look at how to set the updated limits. Windows does create the container with the specified limits (perhaps using information from the ContainerConfig.WindowsContainerConfig.WindowsContainerResources struct).
For cleanliness, I do see that we should update the CRI API to specify ContainerResources instead of LinuxContainerResources (it would have pointers to LinuxContainerResources or WindowsContainerResources, similar to ContainerConfig). Do you think containerID + WindowsContainerResources is sufficient for Windows to successfully update the limits?
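A hypothetical sketch of the platform-neutral wrapper being suggested (again as Go structs rather than .proto); the ContainerResources name follows the comment above, while the field layout is assumed:

```go
// Hypothetical sketch of a platform-neutral resources message for
// UpdateContainerResources; not an actual cri-api definition.
package crisketch2

type LinuxContainerResources struct {
	CpuQuota           int64
	CpuPeriod          int64
	MemoryLimitInBytes int64
}

type WindowsContainerResources struct {
	CpuCount           int64
	CpuMaximum         int64
	MemoryLimitInBytes int64
}

// ContainerResources wraps the per-OS variants, mirroring how ContainerConfig
// already carries OS-specific sub-configs; the runtime uses whichever branch
// applies to the host.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}
```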
@jellonek I've not looked at virtlet. If you have had the chance, can you please check if its CRI shim is able to use LinuxContainerResources?
/assign @mwielgus
/assign
@vinaykul, thanks a lot for driving this forward and also simplifying the initial design. Overall, the KEP already looks good to me from the autoscaling perspective.
@kgolab Thanks for the review; please see my responses above. The goal has been to keep things simple for Kubelet, so we removed proposed changes that were not critical to the base resize operation. If possible, could you please join tomorrow's 10 am PST sig-node weekly meeting to review the above concerns? In today's sig-autoscaling meeting we discussed this and agreed to list @mwielgus as the final approver in the KEP; he plans to approve it after your LGTM.
/approve @vinaykul I approved your KEP from SIG Node to unblock the ongoing API review, since we are converging on the high-level design. There are some small implementation details that can be addressed later. Thanks for taking this project to SIG Node over the last couple of months and leading the design conversation at our weekly meetings. Thanks for your patience.
/assign @derekwaynecarr
/approve Approving for sig-scheduling. Our main concern is the race that could happen between the scheduler and an in-place update. We agreed that we don't have enough data points to conclude how disruptive that is going to be, and so we will re-evaluate after alpha.
To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
* NoRestart - the default value; resize Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
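A hypothetical Go sketch of the API shape this excerpt implies; the policy names come from the excerpt above, while the type names and field layout are illustrative assumptions:

```go
// Hypothetical sketch of the ResizePolicy extension; not the final API types.
package v1sketch2

type ResourceName string

const (
	ResourceCPU    ResourceName = "cpu"
	ResourceMemory ResourceName = "memory"
)

// ResizePolicy says how the kubelet should apply a new value for one resource.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"        // default: apply in place
	RestartContainer ResizePolicy = "RestartContainer" // restart the container to apply
)

// Container gains a per-resource policy map, e.g. memory -> RestartContainer
// for workloads that cannot safely shrink memory of a live process.
type Container struct {
	Name         string
	ResizePolicy map[ResourceName]ResizePolicy
}
```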
I can see a case for a third policy here, which as a strawman I will call SignalContainerWINCH. This would allow the container to attempt to adjust its language runtime to conform to the new limits - e.g. a programmer determines that calling runtime.GOMAXPROCS(math.Ceil(numCPUs) + 1) results in less scheduler thrashing.
However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block 1.17 implementation.
On Linux you can always read /sys/fs/cgroup, can't you?
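A minimal sketch of what a container process could do with such a signal, assuming cgroup v1 mounted at /sys/fs/cgroup and SIGWINCH as the (hypothetical) resize notification; none of this is part of the KEP:

```go
// Sketch: on SIGWINCH, re-read the cgroup v1 CPU quota and resize GOMAXPROCS.
// Paths and signal choice are illustrative assumptions only.
package main

import (
	"math"
	"os"
	"os/signal"
	"runtime"
	"strconv"
	"strings"
	"syscall"
)

func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func adjust() {
	quota, err1 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period, err2 := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return // no CPU limit set, or cgroup files unavailable
	}
	cpus := float64(quota) / float64(period)
	runtime.GOMAXPROCS(int(math.Ceil(cpus)) + 1) // mirrors the strawman above
}

func main() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGWINCH)
	adjust()
	for range ch {
		adjust() // re-read limits whenever the hypothetical resize signal arrives
	}
}
```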
/lgtm
/approve Approving for sig-autoscaling.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, dashpole, dchen1107, mwielgus, vinaykul
The full list of commands accepted by this bot can be found here. The pull request process is described here.
This PR moves Karol's initial proposal for in-place update of pod resources from k/community to k/enhancements as required by the new process.
This KEP intends to build upon the ideas in proposal for live and in-place vertical scaling and Vertical Resources Scaling in Kubernetes.
This PR also updates the owning-sig to sig-autoscaling, and adds an initial set of reviewers, @bsalamat, @derekwaynecarr, and @dchen1107, from sig-scheduling and sig-node, where the bulk of the anticipated code changes will happen.
The original pull request by Karol, and the associated discussion, is here.
CC: @kgolab @bskiba @schylek @bsalamat @dchen1107 @derekwaynecarr @karthickrajamani @YoungjaeLee @resouer