-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Proposal] deferContainers - Pod Termination Semantics #483
Conversation
@kubernetes/sig-apps-feature-requests @kubernetes/sig-node-feature-requests |
@kubernetes/sig-node-api-reviews would need to comment on this too since it would be implemented in the node. |
## Caviate | ||
This Design primarily focuses on handling graceful termination cases. If the Node running a deferContainer configured pod | ||
crashes abruptly, then this design does not guarantee that cleanup was performed gracefully. This still requires community | ||
feedback on how such scenarios are handled and how important it is for deferContainers to handle that situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is quite crucial. In case of initContainers you have a guarantee if container is running => Init container was executed and it executed successfully. In case of deferContainers we have no guarantees about that. Not to mention fetching logs from deleted pods (no pod => no pod name in case of controllers => no way to get deferContainer logs).
Probably all of that has been already mentioned somewhere else, but I wanted to raise it also here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, we could not think of an elegant way to solve this problem, we really need to get communities feedback on how this is done today but we believe even without this deferContainers would be useful in many cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding this without any idea how to solve that problem is bad idea, it may create some illusion that some functionality is provided, while on production it may work not as "expected". ( compared with initContainers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Termination constructs are important but providing absolute guarantee like their initialization counterparts are difficult to achieve, especially when there are ungraceful terminations like node crash.
In programming languages C++'s destructor, golang's defer(), atexit in C. None of this is guaranteed to work if the process crashes (ungraceful termination). This does not mean they are not useful. Their initialization counterpart offers higher level of guarantee.
We think the current design is a good starting point for providing better support for graceful POD termination. We might be able to improve it along the way as the users try it and provide more feedback. Initial implementation will be alpha status so should caution community from trying this in Production without enough tests.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about renaming it an eventContainer, and then passing in event kind as an env var?
something like: POD_EVENT=stop, POD_EVENT=recover, etc. We only need to implement the first, stop, for now, but can reuse the machinery later for things like, maybe informing a statefulset pod it will be shutting down but will be replaced. so like, in cephs case, the statefulset might want to set no-out during the upgrade as it will come right back. but not set no-out during a delete as it might not come back and the cluster should recover. Kind of like how rpm spec install hooks work during remove/upgrade.
|
||
## Use Cases | ||
|
||
### Cleanup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to handle situation:
- you want to delete pod because it's broken or whatever
- it has deferContainer that will hung because of "some reasons"
How am I supposed to not wait for deferContainer? Are we going to add kubectl delete .... --ignore-defer-containers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was thinking we should probably force the delete in that case we wont need to run deferContainers
`kubectl delete pod foo --grace-period=0 --force
But this new option is cool too, it seperates out the funtionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't backwards compatibility mean that if someone previously ran delete pod foo --grace-period=0
to immediately terminate without waiting, that should skip deferContainers as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then it might force --grace-period
to all the other deferContainers, predicting the running-time of a certain cleaup/termination activity might be difficult. Or may be we could have --grace-period=-1
for unlimited termination time for a POD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to solve "force shutdown" "force shutdown + ignore defer"? Pls, don't get me wrong but it seems to be a bit inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I now see the inconsistency, to keep it more predictable we should stick with,
kubectl delete pod foo --grace-period=0 --force
as originally proposed. Even now this is the way to avoid running PreStop hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd have to come up with some other signal. --grace-period=1 means "delete fast" and --grace-period=0 means "give up strong consistency and risk split brain". You don't want to do the latter.
I'm not sure there is a backward compatible way to allow defer containers to run without being bounded in time by the grace-period. I'm also not sure we want defer containers to run unbounded, since that was a key part of the graceful deletion design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smarterclayton I have added some design thoughts on how we handle time boundness for deferContainers, this was something i wanted to bring up in the SIG-APPs meeting yesterday. PTAL.
Also this particular discussion is about deleting the POD without running deferContainers or preStopHooks, an equivalent of kill -9 PID
. Could you please elaborate why should we come up with a different signal for this?
|
||
## Limitation with the current system | ||
Container pre-stop hooks are not sufficient for all termination cases: | ||
* They cannot easily coordinate complex conditions across containers in multi-container pods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a specific use case where the current Pod hooks are not sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container hooks are not generic to POD itself they are specific to containers, imagine a multi-container pod with 1 App Container and 1 Side Car
SideCar : Membership/Gossip Container
AppContainer: A Stateful Sharded workload
Termination Sequence: First one needs to re-shard/rebalance and then come out of membership
Below code suggests that both the preStop hooks (if configured for both the containers) will be started in parllel
for _, container := range runningPod.Containers {
go func(container *kubecontainer.Container) {
defer utilruntime.HandleCrash()
defer wg.Done()
killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
if err := m.killContainer(pod, container.ID, container.Name, "Need to kill Pod", gracePeriodOverride); err != nil {
killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
}
containerResults <- killContainerResult
}(container)
}
- hard to co-ordinate if both have different preStop hooks,
- hard consolidate this logic to one preStop script if both are from different images. Individually upgrading the images becomes a problem
- If only one preStop command failed only that should be retried and not the whole script(currently preStop hooks restarts are not available user shoudl include them in the script)
## Limitation with the current system | ||
Container pre-stop hooks are not sufficient for all termination cases: | ||
* They cannot easily coordinate complex conditions across containers in multi-container pods | ||
* They can only function with code in the image or code in a shared volume, which would have to be statically linked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would I want to pull down new code to turn down a storage system. If the primary use cases are those expressed above, isn't all the code in my container already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The objective is not always to pull a new image, the objective is to seperate out/de-couple Termination sequence. Most often we wont have to pull a new image all togather. I can also imagine simple cleanups like 'rm -rf /tmp/dir1' wrapped around bash or busyboxy images which are lighter.
If your pod is part of a CI engine and running 1000s of testcases, it should be easy to configure a deferContainer (from a different image/source) simply upload (curl POST) currently processed test results to a different server when the POD is getting evicted out of that node.
If curl is not available in your primary container you dont have to re-package and create a new image just to run curl at the termination - which will be the case for preStop hooks.
* They cannot easily coordinate complex conditions across containers in multi-container pods | ||
* They can only function with code in the image or code in a shared volume, which would have to be statically linked | ||
(not a common pattern in wide use) | ||
* They cannot be implemented with the current Docker implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please elaborate on what is meant by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, Docker does not allow us to configure any termination hooks. I think its better to re-phrase or remove it to avoid confusion.
If I understand your design, it seems you want to move to a model where we run deferContainers during terminationGracePeriod seconds to provide a more eloquent abstraction than handling cleanup from the containers entry point.
|
1. Other than being "more eloquent," isn't this functionally equivalent with what we can do now with the exception that you could potentially pull down new code in your deferContainer? Yes de-couple the termination logic, handle complex termination sequence, upgrade/change the primary container images without affecting initialization or termination scripts. More predictable termination behaviour 2. When would you need to pull down a new image to cleanup an existing image? I think its a little dangerous to not package your cleanup code in the same container as the application. If you have an image pull failure for your defer container, you will not be able to cleanup successfully. It is not always necessary to pull a new image for cleanup, Most cleanups might use the same image but just that now it is more predictable. Modularize the termination sequence and ability to restart only failed steps. Currently, if preStop hook failed the (only way to trigger cleanup) nothing much can be done. Imagine in a master-slave type workload a terminating master is promoting a slave and if that command failed, it is important to retry this command. The user has to implement logic to handle this functionality. Below code doesnt attempt to retry preStop hooks if failed. // executePreStopHook runs the pre-stop lifecycle hooks if applicable and returns the duration it takes.
func (m *kubeGenericRuntimeManager) executePreStopHook(pod *v1.Pod, containerID kubecontainer.ContainerID, containerSpec *v1.Container, gracePeriod int64) int64 {
glog.V(3).Infof("Running preStop hook for container %q", containerID.String())
start := metav1.Now()
done := make(chan struct{})
go func() {
defer close(done)
defer utilruntime.HandleCrash()
if msg, err := m.runner.Run(containerID, pod, containerSpec, containerSpec.Lifecycle.PreStop); err != nil {
glog.Errorf("preStop hook for container %q failed: %v", containerSpec.Name, err)
m.generateContainerEvent(containerID, v1.EventTypeWarning, events.FailedPreStopHook, msg)
}
}() 3. I would think running commands in the same container to clean up a stateful application would be a more beneficial abstraction than pulling down a new image. deferContainers would allow us to run an arbitrary set of containers during termination (preStop) for the entire pod. If one develops an operator for a particular TRP workload which has reasonable termination sequence then one can use the primary image like redis or oracle directly from the source instead of repackaging that image with the termination scripts, without affecting each other the terimnation scripts/sequences can be improved and the primary image can be upgraded. 4. deferContainers will make it more difficult to reason about terminationGracePeriod's. You will now have to budget image pull time into grace period. This can be highly variable based on the environment. I was thinking we should add a seperateflag to handle such rare cases where one would want to use a different image which is very heavy. prePullDeferImages: true such that as soon as the POD is started kublet can download and pre-populate these heavy images in the node? |
Can't we fix pre-stop hooks to be retriable? |
yes in my opinion that would make preStop hook more functional but still the termination script is tighgtly coupled with the AppImage and are container specific hooks. |
With this proposal we definitely make termination logic more descriptive and extensive. I do like the syntactic ideas as they are similar to init containers for startup logic. The main issue I have with this proposal is yet another way to handle something. If deferContainers are accepted, we should think about "deprecating" or at least advise against preStop hooks. |
deferContainers wouldn't obviate the need for preStop hooks - for instance,
anything that has to manage per container shutdown state can't be done from
outside the container.
I'll have more time to review this later this week, but in general
termination guarantees are a lot harder than initialization guarantees (and
it's taken a while to ensure that initialization happens correctly). It's
best to focus on use cases that can't be accomplished any other way.
…On Fri, Apr 7, 2017 at 11:33 PM, Michael Grosser ***@***.***> wrote:
With this proposal we definitely make termination logic more descriptive
and extensive. I do like the syntactic ideas as they are similar to init
containers for startup logic.
The main issue I have with this proposal is yet another way to handle
something. If deferContainers are accepted, we should think about
"deprecating" or at least advise against preStop hooks.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#483 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_pxdS6SEJ2OpSD2kROxDT5ZOZ8FvWks5rtquYgaJpZM4MpOGx>
.
|
Update design with more info and remove irrelevant portions.
Just joining the discussio to add a use case: kubernetes/kubernetes#62117 For big-data analysis scenarios (such as high-energy physics) the data management concerns and the actual data-processing concerns are tended to by different teams and separating them is very useful. The sequence we would like to implement as part of a k8s
crucially, this would allow the data management teams to independently develop their stage-in/stage-out containers including authn/authz while keeping the ability to keep the main workload container as lean as possible (in an extreme case a statically compiled binary packaged as |
Hi, I discussed this isse with a couple of people at KubeCon, how can we move this forward? |
/cc |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen I think this is still an interesting use case |
/reopen I'm not sure if we need to translate this proposal into a KEP to move it forward. |
@idealhack: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@dhilipkumars: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
I’m not convinced we have enough reason to implement this at this time. It
increases pod complexity in a direction we have historically resisted
moving pods. I would prefer to focus on the other smaller proposals around
sidecars and pod restart logic in the near term, then come back to this.
On Nov 21, 2018, at 3:13 AM, k8s-ci-robot <notifications@github.com> wrote:
@dhilipkumars <https://github.com/dhilipkumars>: The following test *failed*,
say /retest to rerun them all:
Test name Commit Details Rerun command
pull-community-verify 9e81de3
<9e81de3>
link
<https://gubernator.k8s.io/build/kubernetes-jenkins/pr-logs/pull/community/483/pull-community-verify/4051/>
/test
pull-community-verify
Full PR test history <https://gubernator.k8s.io/pr/community/483>. Your PR
dashboard <https://gubernator.k8s.io/pr/dhilipkumars>. Please help us cut
down on flakes by linking to
<https://git.k8s.io/community/contributors/devel/flaky-tests.md#filing-issues-for-flaky-tests>
an open issue
<https://github.com/kubernetes/community/issues?q=is:issue+is:open> when
you hit one in your PR.
Instructions for interacting with me using PR comments are available here
<https://git.k8s.io/community/contributors/guide/pull-requests.md>. If you
have questions or suggestions related to my behavior, please file an issue
against the kubernetes/test-infra
<https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:>
repository. I understand the commands that are listed here
<https://go.k8s.io/bot-commands>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#483 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABG_pwR2mf9afCfn-2hU92k1HEmmZTwRks5uxQsRgaJpZM4MpOGx>
.
|
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Dear @fejta-bot can you please reopen? |
@furkanmustafa: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Proposal for deferContainers a complimenting feature to init-containers that enable graceful pod termination.
@quinton-hoole @deepak-vij @shashidharatd @irfanurrehman @kevin-wangzefeng
@smarterclayton @Kargakis @erictune @kow3ns @hectorj2f @davidopp
It wold be great to get your feedback on this, also could someone tag sig-apps please.