[WIP] Add kubeadm upgrades proposal #825
Conversation
What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which now is free, and will succeed. A self-hosted control plane has been created!
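As an illustration of the PodSpec modification described above, here is a minimal, hypothetical sketch of replacing a hard-coded API server address with a value fetched from the Downward API; the image tag and flag values are assumptions, not the exact manifest kubeadm generates.

```yaml
# Hypothetical excerpt of a self-hosted kube-apiserver PodSpec.
# The node IP is injected via the Downward API instead of being hard-coded.
apiVersion: v1
kind: Pod
metadata:
  name: self-hosted-kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/kube-apiserver-amd64:v1.8.0
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP   # Downward API: the IP of the node this Pod landed on
    command:
    - kube-apiserver
    - --advertise-address=$(HOST_IP)           # expanded from the env var above
    - --etcd-servers=http://127.0.0.1:2379
```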
Can you break all single-line comments into separate lines? Will be better for reviews.
I'll do that
Notably, not etcd or kubelet for the moment.

Self-hosting will be a [phase](https://github.com/kubernetes/kubeadm/blob/master/docs/design/design.md) from the kubeadm perspective. As there are Static Pods on disk from earlier in the `kubeadm init` process, kubeadm can quite easily parse those files and extract the PodSpecs from them. Each PodSpec will be modified slightly to fit the self-hosting purposes and injected into a DaemonSet; one such DaemonSet will be created for each control plane component (API Server, Scheduler and Controller Manager). kubeadm will wait for the self-hosted control plane Pods to be running and then destroy the Static Pod-hosted control plane.
Assuming the DaemonSet controller continues to schedule DS pods, isn't it better for the controller manager to be a Deployment as opposed to a DaemonSet? If all control plane components run as DaemonSets, isn't the controller manager a single point of failure?
The kubelet will keep the controller manager running. You could get into a state where the controller manager is unrunnable due to a configuration or coding error, but I don't see how that would be any better with deployments.
I agree with @roberthbailey. Using deployments won't solve the problem. On the other hand, it'd be good to state clearly why DaemonSet is chosen to use here.
2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details on how to do the upgrade
CRD instead of TPR
if you want kubeadm to be a building block for widely varying clusters, be cautious about requiring a custom resource be part of the base install path.
In v1.7 and v1.8, etcd runs in a Static Pod on the master. In v1.7, the default etcd version for Kubernetes is v3.0.17, and in k8s v1.8 the recommended version will be something like v3.1.10. In the v1.7 -> v1.8 upgrade path, we could offer upgrading etcd as well as an opt-in*. *This only applies to minor versions and 100% backwards-compatible upgrades like v3.0.x -> v3.1.y, not backwards-incompatible upgrades like etcd v2 -> etcd v3.

The easiest way of achieving an etcd upgrade is probably to create a Job (with `.replicas=<masters>`, Pod Anti-Affinity and `.spec.parallelism=1`) of some kind that would upgrade the etcd Static Pod manifest on disk by writing a new manifest, waiting for it to restart cleanly or rolling back, etc.
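A hedged sketch of what such a Job could look like; the image name, completion count and the master label/taint keys are assumptions for illustration, not something kubeadm ships.

```yaml
# Hypothetical Job that runs one etcd-manifest-rewriting Pod per master, one at a time.
apiVersion: batch/v1
kind: Job
metadata:
  name: upgrade-etcd
  namespace: kube-system
spec:
  completions: 3        # one completion per master (".replicas=<masters>", assuming 3 masters)
  parallelism: 1        # upgrade one master's etcd at a time
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                job-name: upgrade-etcd   # never run two upgrade Pods on the same master
      containers:
      - name: upgrade-etcd
        image: example.com/etcd-upgrade:v3.1.10   # hypothetical image that rewrites the manifest
        volumeMounts:
        - name: manifests
          mountPath: /etc/kubernetes/manifests
      volumes:
      - name: manifests
        hostPath:
          path: /etc/kubernetes/manifests
```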
Would be nice to add an example in the kubeadm repo.
What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which now is free, and will succeed. A self-hosted control plane has been created!
The Pods get in the Running state...
I believe that the behavior described is only true for the apiserver. The KCM and scheduler should just run fine since they aren't trying to bind to host ports.
Both KCM and scheduler use host networking and bind ports, so I think this may be true.
On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.
May want to clarify what "running" means here.
These master components all have liveness checks. If they cannot bind to the ports for their healthz endpoints, kubelet may try to kill them repeatedly. You may be able to avoid this by extending the initial delay of the liveness checks though.
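To illustrate the suggestion above about stretching the initial delay, here is a sketch of a liveness probe for a self-hosted control plane container; the port and timing values are assumptions, not the probe kubeadm actually writes.

```yaml
# Hypothetical livenessProbe giving the pivot time to free the host port
# before the kubelet starts killing the new self-hosted container.
livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 180   # extended initial delay, as suggested above
  timeoutSeconds: 15
  failureThreshold: 8
```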
Are there any requirements for KCM and scheduler to use host networking? We've been successful running them both on the pod network & using service-account for apiserver access.
I believe the health check wouldn't come into play in the case of the apiserver, because it is just reaching out to the same host:port regardless of whether it's the static manifest or the self-hosted "replacement". It's a bit disingenuous in that the healthcheck is really only testing one of them.
One problematic piece here is that the apiserver can't actually bind on the port, so it will be in a restart loop. If it does this enough then the backoff period can make it seem like something went wrong (we start the pivot by removing static pod, then the backoff period on the replacement causes it to not be restarted for a while).
As long as the kubelet itself doesn't die during that process, it will recover. It's just somewhat less than ideal.
On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.
It shouldn't; they are set up to lock on endpoints by default.
Status as of v1.7: the self-hosted config file option exists but is not split out into a phase. The code is fragile and uses Deployments for the scheduler and controller-manager, which unfortunately [leads to a deadlock](https://github.com/kubernetes/kubernetes/issues/45717) at `kubeadm init` time. It does not leverage checkpointing, so the entire cluster burns to the ground if you reboot your computer(s). Long story short: self-hosting is not production-ready in 1.7.
The rest of the document assumes a self-hosted cluster. It will not be possible to use `kubeadm upgrade` on a cluster unless it's self-hosted. The sig-cluster-lifecycle team will still provide some manual, documented steps that a user can follow to upgrade a non-self-hosted cluster.
The rest of the document assumes a self-hosted cluster.
This applies to the whole doc, not just the rest, because the preceding section describes self-hosting. I think we need to state this at the top of the document rather than hiding it in this section. We should also provide a link at that point to the instructions for manually upgrading a non-self-hosted cluster.
Once this statement is at the top, then the section describing self-hosting can be seen as background. Right now this document is describing both how we implement self-hosting (do we have that described anywhere else?) and also how we do upgrades, nominally in a doc that is just supposed to explain upgrades.
Another option is to break out the self-hosting description into a different document.
Yeah, I think that breaking out the self-hosting parts is a good idea.
Also, self-hosting didn't turn out to be as important in this process as we had thought.
Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either work around removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask sig-apps for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful).
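For reference, a hedged sketch of what one of these self-hosted DaemonSets could look like, assuming the usual kubeadm master label and taint; the names, API version, image tag and flags are illustrative rather than the manifests kubeadm actually writes.

```yaml
# Hypothetical self-hosted scheduler DaemonSet pinned to masters.
apiVersion: apps/v1beta2             # assuming the apps/v1beta2 DaemonSet API available in v1.8
kind: DaemonSet
metadata:
  name: self-hosted-kube-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: self-hosted-kube-scheduler
  updateStrategy:
    type: RollingUpdate              # today this means "delete the old Pod first, then add the new one"
  template:
    metadata:
      labels:
        k8s-app: self-hosted-kube-scheduler
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # only schedule on masters
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule                   # tolerate the master taint
      hostNetwork: true
      containers:
      - name: kube-scheduler
        image: gcr.io/google_containers/kube-scheduler-amd64:v1.8.0
        command:
        - kube-scheduler
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
        volumeMounts:
        - name: kubeconfig
          mountPath: /etc/kubernetes/scheduler.conf
          readOnly: true
      volumes:
      - name: kubeconfig
        hostPath:
          path: /etc/kubernetes/scheduler.conf
```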
with node affinity (using `nodeSelector`) to masters.

Are we also going to use taints?
nit: the node affinity needs to be a hard requirement.
clarified that we're using the nodeSelector feature right now, not the "real" node affinity one
Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either work around removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask sig-apps for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful).
Since supporting “single masters” is a definite requirement,
remove "definite".
fixed
Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case, for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
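As a purely hypothetical sketch of what the opt-in could look like on a Pod (the annotation key below is made up for illustration; the real mechanism is defined by the kubelet checkpointing work referenced later in this thread):

```yaml
# Hypothetical: a self-hosted control plane Pod opting in to kubelet checkpointing.
apiVersion: v1
kind: Pod
metadata:
  name: self-hosted-kube-apiserver
  namespace: kube-system
  annotations:
    # Made-up annotation key, for illustration only.
    checkpointer.alpha.kubernetes.io/checkpoint: "true"
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/kube-apiserver-amd64:v1.8.0
```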
Basically the kubelet will write
Remove Basically from the beginning of the sentence.
fixed
Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case, for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Please add a link to kubernetes/kubernetes#49236
fixed
11. Has to define an API (CRD or the like) between the client and the Operator

**Decision**: Keep the logic inside of the kubeadm CLI (option 1) for the implementation in v1.8.0.
This makes sense. I don't think it would be difficult to move the logic later (at least not more difficult than using an operator now).
One of the hardest parts of implementing the upgrade will be to respect customizations made by the user at `kubeadm init` time. The proposed solution is to store the kubeadm configuration given at `init` time in the API as a ConfigMap, retrieve that configuration at upgrade time, parse it using the API machinery and use it for the upgrade.

This highlights a very important point: **We have to get the kubeadm configuration API group to Beta (v1beta1) in time for v1.8.**
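Roughly the shape such a ConfigMap could take; the `kubeadm-config` name and `MasterConfiguration` key match what kubeadm later adopted, but the fields shown here are an abbreviated assumption.

```yaml
# Sketch of the init-time configuration stored in the API and read back at upgrade time.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeadm-config
  namespace: kube-system
data:
  MasterConfiguration: |
    apiVersion: kubeadm.k8s.io/v1alpha1
    kind: MasterConfiguration
    kubernetesVersion: v1.8.0
    networking:
      podSubnet: 10.244.0.0/16
    # ...any other customizations the user passed at `kubeadm init` time...
```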
Is this on track?
This is a stretch goal for v1.8, but not a strictly necessary feature.

Pre-pulling of images/ensuring the upgrade doesn’t take too long
This doesn't seem like a subsection of alternatives considered.
2. Make sure the cluster is healthy

1. Make sure the API Server’s `/healthz` endpoint returns `ok`
Should we look at /componentstatuses too?
1. Make sure the API Server’s `/healthz` endpoint returns `ok`

2. Make sure all Nodes return `Ready` status
Do you also check node conditions?
What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which now is free, and will succeed. A self-hosted control plane has been created!
Both KCM and scheduler use host networking and bind ports, so I think this may be true.
On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.
What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which now is free, and will succeed. A self-hosted control plane has been created!
May want to clarify what "running" means here.
These master components all have liveness checks. If they cannot bind to the ports for their healthz endpoints, kubelet may try to kill them repeatedly. You may be able to avoid this by extending the initial delay of the liveness checks though.
Notably, not etcd or kubelet for the moment.

Self-hosting will be a [phase](https://github.com/kubernetes/kubeadm/blob/master/docs/design/design.md) from the kubeadm perspective. As there are Static Pods on disk from earlier in the `kubeadm init` process, kubeadm can quite easily parse those files and extract the PodSpecs from them. Each PodSpec will be modified slightly to fit the self-hosting purposes and injected into a DaemonSet; one such DaemonSet will be created for each control plane component (API Server, Scheduler and Controller Manager). kubeadm will wait for the self-hosted control plane Pods to be running and then destroy the Static Pod-hosted control plane.
I agree with @roberthbailey. Using deployments won't solve the problem. On the other hand, it'd be good to state clearly why DaemonSet is chosen to use here.
Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either work around removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask sig-apps for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful).
nit: the node affinity needs to be a hard requirement.
Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either work around removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask sig-apps for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful).
Explain the "single masters" requirement.
s/The current way DaemonSet upgrades currently operate is "remove/Currently, DaemonSet performs upgrades by "removing
Defining decent upgrading and version skew policies is important before implementing.

Definitions
Make it a header for readability, e.g., ##Definitions
6. Example: running `kubeadm upgrade apply --version v1.9.0` against a v1.8.2 control plane will error out if the nodes are still on v1.7.x

4. This means that there are possibly two kinds of upgrades kubeadm can do:
Is this part of the upgrade policy, or what's derived from it? I feel like this section mixes policy, what kubeadm can do, and implementation specifics. I'd suggest splitting them if possible.
4. Example: The `system:nodes` ClusterRoleBinding had to lose its binding to the `system:nodes` Group when upgrading to v1.7; otherwise the Node Authorizer wouldn’t have had any effect.

5. Using kubeadm, you must upgrade the control plane atomically.
If the upgrade failed, would kubeadm roll back the changes?
Only *control plane components* will be in scope for self-hosting in v1.8.

Notably, not etcd or kubelet for the moment.
How do users of kubeadm upgrade kubelets today?
It's done via the package manager (yum/apt).
Perhaps it would be worth linking to the expected node upgrade process or having a brief description. I was wondering how the nodes get upgraded as well.
2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details on how to do the upgrade
What's TPR? Couldn't find it in this proposal...
It's CRD.
Thanks for the comments @yujuhong!
What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which now is free, and will succeed. A self-hosted control plane has been created!
It would be nice to have these steps in a numbered list
Will do
Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either work around removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask sig-apps for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful).
+1 add first, then delete. That sounds much nicer than hacking around it with temporary duplicates.
I implemented this the hacky way for now, as the UpdateStrategy didn't make v1.8
However, we definitely want "add first, then delete" longer-term.
Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case, for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Make sure that if the Pods are still running, the Kubelet doesn't create duplicates on restart. There are many reasons the Kubelet could restart without a full node reboot: OOM kill, dynamic config, etc.
@timothysc see comment on checkpointing ^
Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case, for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Yes, why wouldn't we just checkpoint all pods by default? How much disk space would this consume? Would it be small enough that we could require opt-out instead of opt-in?
In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case, for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.

This solves the chicken-and-egg problem that would otherwise occur when the kubelet comes back up: the kubelet tries to connect to the API server, but the API server hasn’t been started yet, because it is supposed to run as a Pod on that very kubelet, which isn’t aware of that yet.
Just to be clear, you're saying that checkpointing solves the problem of the Kubelet hosting itself in a Pod if it can't contact the API server and learn that it should?
Side question: If we have checkpoints, can we start provisioning initial checkpoints instead of static pods?
2. Upgrading to a higher minor release than the kubeadm CLI version will **not** be supported.

3. Example: kubeadm v1.8.3 can upgrade your v1.8.3 cluster to v1.8.6 if you specify `--force` at the time of the upgrade, but kubeadm can never upgrade your v1.8.3 cluster to v1.9.0
nit: be a little more specific - s/but kubeadm can never/but kubeadm v1.8.3 can never...
3. The control plane must be upgraded before the kubelets in the cluster

5. For kubeadm, the maximum amount of skew between the control plane and the kubelets is *one minor release*.
I think we should support two trailing versions for clients. API changes are supposed to be backwards-compatible, so this shouldn't be a problem for kubeadm. In theory.
I'm not sure whether we should try to support skipping a minor version during an upgrade though. I think people typically do this one version at a time.
@luxas - you mentioned on Friday that you were going to wait until Tuesday to collect feedback and then update the proposal. Friendly ping to let us know when the PR is ready to be re-reviewed.
@roberthbailey I changed my mind ;) Feel free to take it for a spin. I hope I can get to this by the end of next week at least. Finally, thank you everyone for commenting -- I'm sorry I'm overloaded with other things at the moment plus the upgrades impl., I'll answer your questions as soon as I can.
@luxas - should we clean this up and get it merged (now that kubeadm supports upgrade)?
Yeah, sorry, it's on my ever-growing backlog. I'll try to get to this in a week or two or so, lots of other tasks to complete still :(
@timothysc @roberthbailey Please read this doc up to the "Various other notes" section; I still need to finish the last sections there. Thanks!
# kubeadm upgrades proposal

Authors: Lucas Käldström & the SIG Cluster Lifecycle team

Last updated: October 2017
Add a newline before this so that there is a line break in the markdown.
fixed
## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
support upgrading (remove the 'a')
oops
## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
what different techniques? shouldn't there be a single technique that we plan to use for all kubeadm clusters?
removed the ambiguity; I mean that kubeadm can actually perform different tasks for different clusters under the hood (self-hosted vs static pod hosted), but that isn't relevant here
- Support for upgrading the API Server, the Controller Manager, the Scheduler, kube-dns & kube-proxy
- Support for performing necessary post-upgrade steps like upgrading Bootstrap Tokens that were alpha in v1.6 & v1.7 to beta ones in v1.8
- Automated e2e tests running.
- GA in v1.10:
note that this is the plan, since 1.10 isn't out yet and kubeadm itself it still beta.
yup, I left this as `in a future version` now instead
## Graduation requirements

- Beta in v1.8:
- Support for the `kubeadm upgrade plan` and `kubeadm upgrade apply` commands.
some bulleted lines end with periods and others don't which is strange. please make them consistent one way or the other
**Rewrite manifests completely from config object or mutate existing manifest:**

Instead of generating new manifests from a versioned configuration object, we could try to add "filters" to the existing manifests and apply different filters depending on what the upgrade looks like. This approach, modifying existing manifests in various ways depending on the version bump, has some pros (modularity, simple mutating functions), but the matrix of functions and different policies would grow just too big for this kind of system, so we voted against this alternative in favor of the solution above.
did we actually vote? maybe say decided instead?
right, we didn't actually vote. Thanks for the reword suggestion there.
- In kubeadm v1.4 to v1.8, this is the default way of setting up the control plane
- Running the control plane in Kubernetes-hosted containers as DaemonSets; aka. [Self-Hosting](#TODO)
- When creating a self-hosted cluster; kubeadm first creates a Static Pod-hosted cluster and then pivots to the self-hosted control plane.
- This is the default way to deploy the control plane since v1.9; but the user can opt out of it and stick with the Static Pod-hosted cluster.
s/since/beginning with/
also replace ; with ,
done
- Creates a **backup directory** with the prefix `/etc/kubernetes/tmp/kubeadm-backup-manifests*`.
- Q: Why `/etc/kubernetes/tmp`?
- A: Possibly not very likely, but we concluded that there may be an attack area for computers where `/tmp` is shared and writable by all users. We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is
replace mock with muck or mess
hehe, typo ;)
- A: Possibly not very likely, but we concluded that there may be an attack area for computers where `/tmp` is shared and writable by all users. We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is root-only owned.
- In a loop for all control plane components:
does the order of the control plane components matter?
added a comment that order shouldn't matter, but we do apiserver, ctlr-mgr, sched
For instance, if the scheduler doesn't come up cleanly, kubeadm will roll back the previously (successfully upgraded) API server and controller manager manifests as well as the scheduler manifest.

#### Self-hosted control plane
it looks like this section isn't finished so i'm going to stop reviewing here.
Yeah, thanks for the review so far!
Please poke me when the remainder of the doc is finished so I can hopefully just do one more pass.
@luxas @roberthbailey should this doc move inside https://github.com/kubernetes/community/tree/master/contributors/design-proposals/cluster-lifecycle?
@Kargakis yes indeed. I initially filed this PR before we did that, will update on my next round here.
This PR has been idle coming up on 2 years now, so I think it should be closed. /close
@roberthbailey: Closed this PR.
This proposal has so far been developed in this Google doc: https://docs.google.com/document/d/1PRrC2tvB-p7sotIA5rnHy5WAOGdJJOIXPPv23hUFGrY/edit
Features issue: kubernetes/enhancements#296
@kubernetes/sig-cluster-lifecycle-proposals @kubernetes/sig-onprem-proposals for general review/approval
@kubernetes/sig-architecture-proposals for review of Kubernetes upgrades with no external dependencies. The aim is to provide an easy way to do upgrades against any cluster given some basic requirements long-term.
@kubernetes/sig-api-machinery-proposals a heads up on the expected retry loop while trying to bind to a port
@kubernetes/sig-apps-proposals for being able to upgrade DaemonSets using an "add first, then delete" strategy
@kubernetes/sig-node-proposals for the expected checkpointing functionality
I'm not sure if the markdown converter I used could preserve styling well, if not, I'll update that in the coming days.