KEP: in-place update of pod resources #686
Merged
Commits (22)
cd94808 Move Karol Golab's draft KEP for In-place update of pod resources fro…
7fb66f1 Update owning-sig to sig-autoscaling, add initial set of reviewers.
b8c1f4e Flow Control and few other sections added
5d00f9f (kgolab) Merge pull request #1 from kgolab/master
9580642 (vinaykul) Update KEP filename per latest template guidelines, add non-goal item.
b8d814e Merge remote-tracking branch 'upstream/master'
df1c8f8 Update flow control, clarify items per review, identify risks.
17923eb Update policy name, clarify scheduler actions and policy precedence
e5052fc Add RetryPolicy API change, clarify transition of PodCondition fields…
1194243 Update control flow per review, add notes on Pod Overhead, emptyDir
bfab6a3 Update API and flow control to avoid storing state in PodCondition
69f9190 Rename PodSpec scheduler resource allocations & PodCondition, and cla…
199a008 Key changes:
574737c (vinaykul) Update design so that Kubelet, instead of Scheduler, evicts lower pri…
5bdcd57 (vinaykul) 1. Remove PreEmpting PodCondition.
bc9dc2b (vinaykul) Extend PodSpec to hold accepted resource resize values, add resourcea…
533c3c6 (vinaykul) Update ResourceAllocated as ResourceList, clarify details of Kubelet …
29a22b6 (vinaykul) Restate Kubelet fault handling to minimum guarantees, clarify Schedul…
20cbea6 (vinaykul) Details of LimitRanger, ResourceQuota enforcement during Pod resize.
0ed9505 (vinaykul) ResourceQuota with resize uses Containers[i].Resources
c745563 (vinaykul) Add note on VPA+HPA limitation for CPU, memory
55c8e56 (vinaykul) Add KEP approvers, minor clarifications
keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md (277 additions, 0 deletions)
---
title: In-place Update of Pod Resources
authors:
  - "@kgolab"
  - "@bskiba"
  - "@schylek"
owning-sig: sig-autoscaling
participating-sigs:
  - sig-node
  - sig-scheduling
reviewers:
  - "@bsalamat"
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - TBD
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [API Changes](#api-changes)
    * [Flow Control](#flow-control)
      * [Transitions of InPlaceResize condition](#transitions-of-inplaceresize-condition)
      * [Notes](#notes)
    * [Affected Components](#affected-components)
    * [Possible Extensions](#possible-extensions)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)
  * [Alternatives](#alternatives)

## Summary

This proposal aims at allowing Pod resource requests & limits to be updated
in-place, without a need to restart the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regard to
Resources, denoting **desired** resources.
Additionally, PodStatus is extended to provide information about **actual**
resource allocation.

This document builds upon [proposal for live and in-place vertical scaling][] and
[Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]: https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]: https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4/edit?ts=5b96bf40

## Motivation

Resources allocated to a Pod's Container can require a change for various reasons:
* load handled by the Pod has increased significantly and current resources are
  not enough to handle it,
* load has decreased significantly and currently allocated resources are unused
  and thus wasted,
* resources have simply been set improperly.

Currently, changing resource allocation requires the Pod to be recreated since
the PodSpec is immutable.

While many stateless workloads are designed to withstand such a disruption, some
are more sensitive, especially when using a low number of Pod replicas.

Moreover, for stateful or batch workloads, a Pod restart is a serious
disruption, resulting in lower availability or higher cost of running.

Allowing Resources to be changed without recreating a Pod or restarting a
Container addresses this issue directly.

### Goals

* Primary: allow changing a Pod's resource requests & limits without restarting its
  Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if an in-place resource update is not available.
* Secondary: allow users to specify which Pods and Containers can be updated
  without a restart.

### Non-Goals

The explicit non-goal of this KEP is controlling the full life-cycle of a
Pod which failed an in-place resource update. These cases should be handled by
the actors which initiated the update.

Other identified non-goals are:

* changing Pod QoS class without a restart,
* changing resources of Init Containers without a restart,
* updating extended resources or any other resource types besides CPU and memory.

## Proposal

### API Changes

PodSpec becomes mutable with regard to resource requests and limits.
Additionally, PodSpec becomes a Pod subresource to allow fine-grained access control.

PodStatus is extended with information about actually allocated resources.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
  denoting the **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes the **actual**
  state of the Pod resources, as sketched below.
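
The sketch below illustrates roughly how the status addition could be shaped. It is
illustrative only: the type and field names are assumptions made for this sketch, and
it wraps the existing core/v1 types rather than showing the final API change.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Illustrative only: the real change would extend the existing core/v1
// ContainerStatus; this standalone struct merely shows the proposed new
// field next to the fields that already exist.
type ContainerStatusWithAllocation struct {
	corev1.ContainerStatus

	// ResourceAllocated reports the resources actually applied to the
	// running Container. During an in-place resize it may differ from
	// the desired values in PodSpec.Containers[i].Resources.
	ResourceAllocated corev1.ResourceList `json:"resourceAllocated,omitempty"`
}
```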

To distinguish between possible states of the Pod resources,
a new PodCondition InPlaceResize is added, with the following states
(a sketch follows the list):

* (empty) - the default value; resource update awaits reconciliation
  (if ResourceRequirements differs from ResourceAllocated),
* Awaiting - awaiting resources to be freed (e.g. via pre-emption),
* Failed - resource update could not be performed in-place
  but might be possible if some conditions change,
* Rejected - resource update was rejected by any of the components involved.
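
For illustration, the condition and its states could be represented roughly as below;
the identifier names are assumptions for this sketch, not part of the existing core/v1 API.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Illustrative names for the proposed condition and its states; the empty
// string is the default ("awaiting reconciliation") state.
const (
	PodConditionInPlaceResize corev1.PodConditionType = "InPlaceResize"

	InPlaceResizeAwaiting = "Awaiting" // waiting for resources to be freed
	InPlaceResizeFailed   = "Failed"   // in-place update not possible right now
	InPlaceResizeRejected = "Rejected" // update rejected by an involved component
)
```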

To provide some fine-grained control to the user,
PodSpec.Container.ResourceRequirements is extended with a ResizingPolicy flag,
available for each resource request (CPU, memory):
* InPlace - the default value; allow in-place resize of the Container,
* RestartContainer - restart the Container to apply new resource values
  (e.g. a Java process needs to change its Xmx flag),
* RestartPod - restart the whole Pod to apply new resource values
  (e.g. the Pod requires its Init Containers to re-run).

By using the ResizingPolicy flag the user can mark Containers or Pods as safe
(or unsafe) for in-place resource updates.

This flag **may** be used by the actors starting the process to decide if
the process should be started at all (for example VPA might decide to
evict a Pod with the RestartPod policy).
This flag **must** be used by Kubelet to verify the actions needed.

Setting the flag separately for CPU & memory is due to the observation
that CPU can usually be added or removed without much problem, whereas
changes to available memory are more likely to require restarts.
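
A rough sketch of what such a per-resource policy could look like follows; the type and
field names here are assumptions made purely for illustration, not the proposed API.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// ResizingPolicy mirrors the three values described above.
type ResizingPolicy string

const (
	ResizeInPlace          ResizingPolicy = "InPlace"          // default: resize without a restart
	ResizeRestartContainer ResizingPolicy = "RestartContainer" // restart the Container to apply new values
	ResizeRestartPod       ResizingPolicy = "RestartPod"       // restart the whole Pod to apply new values
)

// ContainerResizePolicy shows how the policy could be set separately per
// resource, e.g. CPU resized in place while a memory change restarts the
// Container.
type ContainerResizePolicy struct {
	Resources corev1.ResourceRequirements

	// Keyed by resource name, e.g.
	// {"cpu": ResizeInPlace, "memory": ResizeRestartContainer}.
	ResizingPolicy map[corev1.ResourceName]ResizingPolicy
}
```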

### Flow Control

The following steps denote a positive flow of an in-place update,
for a Pod having ResizingPolicy set to InPlace for all its Containers.
Some alternative flows are given in indented steps; unless noted otherwise,
they abort the flow. A simplified sketch of the Kubelet-side fit check is
given after the list of steps.

1. The initiating actor updates ResourceRequirements using the PATCH verb.
1. API Server validates the new ResourceRequirements
   (e.g. limits are not below requested resources, QoS class does not change).
1. API Server calls all Admission Controllers to verify the Pod Update.
   1. If any of the controllers rejects the update,
      the InPlaceResize PodCondition is set to Rejected.
1. API Server updates the PodSpec object and clears the InPlaceResize condition.
1. Scheduler observes that ResourceRequirements and ResourceAllocated differ.
   It updates its resource cache to use max(ResourceRequirements, ResourceAllocated).
   1. If required, it pre-empts lower-priority Pods, setting
      the InPlaceResize PodCondition to Awaiting.
      Once the lower-priority Pods are evicted, Scheduler clears
      the InPlaceResize PodCondition and the flow continues.
1. Kubelet observes that ResourceRequirements and ResourceAllocated differ
   and the InPlaceResize condition is clear.
   This happens potentially prior to Scheduler pre-empting lower-priority Pods.
   1. If the new ResourceRequirements do not fit the Node's allocatable
      resources, Kubelet sets the InPlaceResize condition to Failed.
1. Kubelet applies the new resource values to cgroups, updates values
   in ResourceAllocated to match ResourceRequirements
   and clears the InPlaceResize condition.
1. Scheduler observes that ResourceAllocated has changed.
   It updates its resource cache to use the new value of ResourceAllocated
   for the given Pod.
1. The initiating actor observes that ResourceRequirements and
   ResourceAllocated match again, which signifies the completion of the update.
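
Below is a minimal sketch of the Kubelet-side fit check referenced in the steps above,
assuming the desired requests are compared against the Node's allocatable resources.
The function name is made up for illustration, and a real check would also account for
resources already granted to other Pods on the Node.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// canResizeInPlace is a simplified stand-in for the Kubelet-side check:
// every desired request must fit within the Node's allocatable resources.
// On failure the Kubelet would set InPlaceResize=Failed; on success it
// would apply the new cgroup limits, copy the desired values into
// ResourceAllocated and clear the condition.
func canResizeInPlace(desired, nodeAllocatable corev1.ResourceList) bool {
	for name, want := range desired {
		have, ok := nodeAllocatable[name]
		if !ok || want.Cmp(have) > 0 {
			return false
		}
	}
	return true
}
```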

#### Transitions of InPlaceResize condition

The following diagram shows possible transitions of the InPlaceResize condition;
the same transitions are listed in a small table after the legend.

```text
              +---------+
  +-----------+ (empty) +-----------+
  |  +-------->         <--------+  |
  |  |        +----+----+        |  |
 1|  |2           3|            4|  |5
+-v--+-----+       |       +-----+--v-+
|          |       |       |          |
| Awaiting |       |       |  Failed  |
|          |       |       |          |
+------+---+       |       +---+------+
      3|           |           |3
       |      +----v-----+     |
       +------> Rejected <-----+
              |          |
              +----------+
```

1. Scheduler, on pre-emption.
1. Scheduler, after pre-emption finishes.
1. Any Controller, on permanent issue.
1. Kubelet, on successful retry.
1. Kubelet, if not enough space on Node.
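
As a cross-check of the diagram, the allowed transitions can also be written out as a
small table. This is illustrative only ("" stands for the empty/default state) and is
not part of any API or implementation.

```go
package sketch

// Allowed transitions of the InPlaceResize condition, mirroring the
// numbered arrows in the diagram above ("" is the empty/default state).
var inPlaceResizeTransitions = map[string][]string{
	"":         {"Awaiting", "Failed", "Rejected"}, // arrows 1, 5, 3
	"Awaiting": {"", "Rejected"},                   // arrows 2, 3
	"Failed":   {"", "Rejected"},                   // arrows 4, 3
}
```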

#### Notes

* In case no pre-emption is required, Kubelet and Scheduler
  will pick up the ResourceRequirements change in parallel.
* In case pre-emption is required, Kubelet and Scheduler might
  pick up the ResourceRequirements change in parallel;
  Kubelet will then set the InPlaceResize condition to Failed
  and Scheduler will clear it once pre-emption is done.
* Kubelet might also try to apply new resources if the InPlaceResize
  condition is set to Failed, as a normal retry mechanism.
* To avoid races and possible gamification, all components should use
  max(ResourceRequirements, ResourceAllocated) when computing resources
  used by a Pod (a sketch of this computation follows below). It is TBD
  whether this can be weakened when the InPlaceResize condition is set to
  Rejected, or whether the initiating actor should update
  ResourceRequirements back to reclaim resources.
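
The sketch below shows the per-resource maximum mentioned in the last note; the helper
name is made up for illustration and is not an existing Kubernetes function.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// maxResources returns, per resource name, the larger of the desired
// requests and the currently allocated values, i.e. the
// max(ResourceRequirements, ResourceAllocated) rule suggested above for
// Scheduler and quota accounting.
func maxResources(desired, allocated corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for name, q := range desired {
		out[name] = q
	}
	for name, q := range allocated {
		if cur, ok := out[name]; !ok || q.Cmp(cur) > 0 {
			out[name] = q
		}
	}
	return out
}
```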

### Affected Components

Pod v1 core API:
* extended model,
* added validation.

Admission Controllers: LimitRanger and ResourceQuota need to support Pod updates:
* for ResourceQuota it should be enough to change the podEvaluator.Handler
  implementation to allow Pod updates; max(ResourceRequirements, ResourceAllocated)
  should be used to be in line with current ResourceQuota behaviour
  which blocks resources before they are used (e.g. for Pending Pods),
* for LimitRanger: TBD.

Kubelet:
* support in-place resource management,
* set ResourceRequirements on placing the Pod on Node.

Scheduler:
* update its caches with proper resources, depending on the InPlaceResize condition.

Other components:
* check how the change of meaning of resource requests influences other Kubernetes components.

### Possible Extensions

1. Allow resource limits to be updated too.
1. Allow ResizingPolicy to be set at the Pod level, acting as a default if
   (some of) the Containers do not have it set on their own.
1. Extend the ResizingPolicy flag to separately control resource increase and decrease
   (e.g. a Container can be given more memory in-place but
   decreasing memory requires a Container restart).

### Risks and Mitigations

TODO

## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended

## Alternatives

TODO
Please identify the component owners (for the autoscaling/node/scheduling areas) that will approve this KEP (and get approvals from them). That helps ensure there's agreement on the goals and overall approach before entering the API review process.
@liggitt Thanks for pointing this out. I've identified the approvers for the stakeholder SIGs, and SIG-node and SIG-scheduling have approved the KEP.
@mwielgus is going to follow up with @kgolab to see if there are any concerns, and if not we should get lgtm and approval from SIG-autoscaling.
Please let us know what our next steps are for API review.
Thanks,
Thanks. I'd suggest:
@liggitt Thanks for the guidance. I've resolved many of the issues and comments that were either addressed or have become stale.
I'm tracking the remaining outstanding questions in #1287
I'll give folks a few days to re-open any that they feel were not resolved or were resolved in error.
Then @dashpole and I will ping @thockin to set up a time for API review.