
[WIP] Add kubeadm upgrades proposal #825

Closed
luxas wants to merge 2 commits

Conversation

luxas
Member

@luxas luxas commented Jul 19, 2017

This proposal has so far been developed in this Google doc: https://docs.google.com/document/d/1PRrC2tvB-p7sotIA5rnHy5WAOGdJJOIXPPv23hUFGrY/edit

Features issue: kubernetes/enhancements#296

@kubernetes/sig-cluster-lifecycle-proposals
@kubernetes/sig-onprem-proposals
for general review/approval

@kubernetes/sig-architecture-proposals
for review of Kubernetes upgrades with no external dependencies.
The long-term aim is to provide an easy way to do upgrades against any cluster, given some basic requirements.

@kubernetes/sig-api-machinery-proposals a heads up on the expected retry loop while trying to bind to a port

@kubernetes/sig-apps-proposals for being able to upgrade DaemonSets using an "add first, then delete" strategy

@kubernetes/sig-node-proposals for the expected checkpointing functionality

I'm not sure if the markdown converter I used preserved styling well; if not, I'll update that in the coming days.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 19, 2017

What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
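For illustration, a minimal Go sketch of that pivot for a single component, assuming a configured client-go clientset; the function and its parameters (`podPrefix`, `manifestPath`) are hypothetical, not the actual kubeadm code:

```go
// Sketch of the Static Pod -> self-hosted pivot for one component.
package selfhosting

import (
	"context"
	"fmt"
	"os"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// pivotComponent waits for the self-hosted Pod (created by its DaemonSet) to
// reach Running, then removes the Static Pod manifest so the kubelet stops the
// old copy and frees the host port for the self-hosted one.
func pivotComponent(client kubernetes.Interface, podPrefix, manifestPath string) error {
	err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(metav1.NamespaceSystem).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and retry
		}
		for _, p := range pods.Items {
			// Running is enough: the Pod cannot bind the host port yet, but it exists.
			if strings.HasPrefix(p.Name, podPrefix) && p.Status.Phase == corev1.PodRunning {
				return true, nil
			}
		}
		return false, nil
	})
	if err != nil {
		return fmt.Errorf("self-hosted pod %q never reached Running: %v", podPrefix, err)
	}
	// Deleting the manifest makes the kubelet stop the Static Pod immediately.
	return os.Remove(manifestPath)
}
```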
Contributor

Can you break all single-line comments into separate lines? Will be better for reviews.

Member Author

I'll do that


Notably, not etcd or kubelet for the moment.

Self-hosting will be a [phase](https://github.com/kubernetes/kubeadm/blob/master/docs/design/design.md) in the kubeadm perspective. As there are Static Pods on disk from earlier in the `kubeadm init` process, kubeadm can pretty easily parse that file and extract the PodSpec from the file. This PodSpec will be modified a little to fit the self-hosting purposes and be injected into a DaemonSet which will be created for all control plane components (API Server, Scheduler and Controller Manager). kubeadm will wait for the self-hosted control plane Pods to be running and then destroy the Static Pod-hosted control plane.
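To make that concrete, a hedged sketch of the extraction and wrapping step using client-go API types; the function names and the master label are illustrative assumptions, not kubeadm's actual helpers:

```go
// Sketch: read a Static Pod manifest from disk, extract its PodSpec, and
// wrap it into a DaemonSet targeting master nodes.
package selfhosting

import (
	"fmt"
	"os"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/scheme"
)

// loadStaticPod decodes a Static Pod manifest (YAML or JSON) into a Pod object.
func loadStaticPod(manifestPath string) (*corev1.Pod, error) {
	data, err := os.ReadFile(manifestPath)
	if err != nil {
		return nil, err
	}
	obj, _, err := scheme.Codecs.UniversalDeserializer().Decode(data, nil, nil)
	if err != nil {
		return nil, err
	}
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return nil, fmt.Errorf("manifest %s is not a Pod", manifestPath)
	}
	return pod, nil
}

// buildDaemonSet wraps the (lightly mutated) PodSpec into a self-hosted DaemonSet.
func buildDaemonSet(component string, spec corev1.PodSpec) *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "self-hosted-" + component}
	// Pin the Pods to masters; the proposal uses the nodeSelector field for now.
	spec.NodeSelector = map[string]string{"node-role.kubernetes.io/master": ""}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "self-hosted-" + component,
			Namespace: metav1.NamespaceSystem,
			Labels:    labels,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec:       spec,
			},
		},
	}
}
```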
Contributor

Assuming the DaemonSet controller continues to schedule DS pods, isn't it better for the controller manager to be a Deployment as opposed to a DaemonSet? If all control plane components run as DaemonSets, isn't the controller manager a single point of failure?

Contributor

The kubelet will keep the controller manager running. You could get into a state where the controller manager is unrunnable due to a configuration or coding error, but I don't see how that would be any better with deployments.

Contributor

I agree with @roberthbailey. Using deployments won't solve the problem. On the other hand, it'd be good to state clearly why a DaemonSet is chosen here.


2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details on how to do the upgrade
Contributor

CRD instead of TPR

Member

if you want kubeadm to be a building block for widely varying clusters, be cautious about requiring a custom resource be part of the base install path.


In v1.7 and v1.8, etcd runs in a Static Pod on the master. In v1.7, the default etcd version for Kubernetes is v3.0.17, and in k8s v1.8 the recommended version will be something like v3.1.10. In the v1.7->v1.8 upgrade path, we could offer upgrading etcd as well, as an opt-in*. *This only applies to minor versions and 100% backwards-compatible upgrades like v3.0.x->v3.1.y, not backwards-incompatible upgrades like etcdv2 -> etcdv3.

The easiest way of achieving an etcd upgrade is probably to create a Job of some kind (with `.spec.completions=<number of masters>`, Pod Anti-Affinity and `.spec.parallelism=1`) that would upgrade the etcd Static Pod manifest on disk by writing a new manifest, waiting for it to restart cleanly or rolling back, etc.
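For illustration, a hedged sketch of such a Job object built with client-go types; the image, script and label names are hypothetical placeholders, not an actual kubeadm implementation:

```go
// Sketch: a Job that runs one pod per master (anti-affinity keeps them on
// distinct hosts, parallelism=1 upgrades one master at a time) and rewrites
// the etcd Static Pod manifest in place.
package upgrade

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func etcdUpgradeJob(masters int32) *batchv1.Job {
	labels := map[string]string{"k8s-app": "etcd-upgrade"}
	hostPathType := corev1.HostPathDirectory
	one := int32(1)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-upgrade", Namespace: metav1.NamespaceSystem},
		Spec: batchv1.JobSpec{
			Completions: &masters, // one successful run per master
			Parallelism: &one,     // upgrade masters one at a time
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					// Never run two upgrade pods on the same master.
					Affinity: &corev1.Affinity{
						PodAntiAffinity: &corev1.PodAntiAffinity{
							RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
								LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
								TopologyKey:   "kubernetes.io/hostname",
							}},
						},
					},
					Containers: []corev1.Container{{
						Name:  "etcd-upgrade",
						Image: "example.com/etcd-upgrader:latest", // hypothetical image
						// Hypothetical script: writes the new etcd manifest and waits
						// for the kubelet to restart etcd cleanly, rolling back on failure.
						Command: []string{"/upgrade-etcd-manifest.sh"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "manifests",
							MountPath: "/etc/kubernetes/manifests",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "manifests",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{
								Path: "/etc/kubernetes/manifests",
								Type: &hostPathType,
							},
						},
					}},
				},
			},
		},
	}
}
```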
Contributor

Would be nice to add an example in the kubeadm repo.


What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

The Pods get in the Running state...

I believe that the behavior described is only true for the apiserver. The KCM and scheduler should just run fine since they aren't trying to bind to host ports.

Contributor

Both KCM and scheduler use host networking and bind ports, so I think this may be true.

On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.

Contributor

May want to clarify what "running" means here.

These master components all have liveness checks. If they cannot bind to the ports for their healthz endpoints, kubelet may try to kill them repeatedly. You may be able to avoid this by extending the initial delay of the liveness checks though.
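As an illustration of that suggestion, a minimal sketch of a liveness probe with a longer initial delay, using client-go types; the port and numbers are placeholders, not kubeadm's actual settings:

```go
// Sketch: give the self-hosted API server generous slack before the kubelet
// starts counting failed liveness probes, so the bind-retry loop during the
// pivot doesn't get it killed prematurely.
package selfhosting

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func apiServerLivenessProbe() *corev1.Probe {
	probe := &corev1.Probe{
		InitialDelaySeconds: 180, // placeholder: long enough to ride out the port hand-off
		TimeoutSeconds:      15,
		FailureThreshold:    8,
	}
	// Assigned via the embedded handler so the sketch works across client-go versions.
	probe.HTTPGet = &corev1.HTTPGetAction{
		Host:   "127.0.0.1",
		Path:   "/healthz",
		Port:   intstr.FromInt(6443),
		Scheme: corev1.URISchemeHTTPS,
	}
	return probe
}
```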

Contributor

Are there any requirements for KCM and scheduler to use host networking? We've been successful running them both on the pod network & using service-account for apiserver access.

I believe the health check wouldn't come into play in the case of apiserver because it is just reaching out to the same host:port regardless if it's static manifest or the self-hosted "replacement". It's a bit disingenuous in that the healthcheck is really only testing one of them.

One problematic piece here is that the apiserver can't actually bind on the port, so it will be in a restart loop. If it does this enough then the backoff period can make it seem like something went wrong (we start the pivot by removing static pod, then the backoff period on the replacement causes it to not be restarted for a while).

As long as the kubelet itself doesn't die during that process, it will recover. It's just somewhat less than ideal.

Member

On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.

It shouldn't; they are set up to lock on endpoints by default.


Status as of v1.7: the self-hosted config file option exists, but it is not split out into a phase. The code is fragile and uses Deployments for the scheduler and controller-manager, which unfortunately [leads to a deadlock](https://github.com/kubernetes/kubernetes/issues/45717) at `kubeadm init` time. It does not leverage checkpointing, so the entire cluster burns to the ground if you reboot your computer(s). Long story short: self-hosting is not production-ready in 1.7.

The rest of the document assumes a self-hosted cluster. It will not be possible to use `kubeadm upgrade` on a cluster unless it's self-hosted. The sig-cluster-lifecycle team will still provide some manual, documented steps that a user can follow to upgrade a non-self-hosted cluster.
Contributor

The rest of the document assumes a self-hosted cluster.

This applies to the whole doc, not just the rest, because the preceding section describes self-hosting. I think we need to state this at the top of the document rather than hiding it in this section. We should also provide a link at that point to the instructions for manually upgrading a non-self-hosted cluster.

Once this statement is at the top, then the section describing self-hosting can be seen as background. Right now this document is describing both how we implement self-hosting (do we have that described anywhere else?) and also how we do upgrades, nominally in a doc that is just supposed to explain upgrades.

Another option is to break out the self-hosting description into a different document.

Member Author

Yeah, I think that breaking out the self-hosting parts is a good idea.
Also, self-hosting didn't turn out to be as important in this process as we had thought.


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
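For illustration, a hedged sketch of the temporary-DaemonSet workaround described above; the helper names (`waitForPodsRunning`, `waitForDaemonSetReady`) are hypothetical placeholders, not existing kubeadm functions:

```go
// Sketch: emulate "add first, then delete" with a temporary DaemonSet copy so
// a single master is never left without any control plane DaemonSet object.
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func upgradeSelfHostedComponent(client kubernetes.Interface, name, newImage string) error {
	ctx := context.TODO()
	dsClient := client.AppsV1().DaemonSets(metav1.NamespaceSystem)

	old, err := dsClient.Get(ctx, "self-hosted-"+name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// 1. Create a temporary standby copy with its own name and labels, so the
	//    two DaemonSets don't fight over the same Pods.
	tmp := old.DeepCopy()
	tmpLabels := map[string]string{"k8s-app": "temp-self-hosted-" + name}
	tmp.ObjectMeta = metav1.ObjectMeta{Name: "temp-self-hosted-" + name, Namespace: metav1.NamespaceSystem}
	tmp.Spec.Selector = &metav1.LabelSelector{MatchLabels: tmpLabels}
	tmp.Spec.Template.ObjectMeta.Labels = tmpLabels
	if _, err := dsClient.Create(ctx, tmp, metav1.CreateOptions{}); err != nil {
		return err
	}
	// The copy retries binding the host port until the old Pod goes away.
	if err := waitForPodsRunning(client, tmpLabels); err != nil {
		return err
	}

	// 2. Delete the old DaemonSet; its Pod disappears and the standby takes over.
	if err := dsClient.Delete(ctx, old.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// 3. Recreate the main DaemonSet at the new version, remove the standby so
	//    the new Pod can bind the port, and wait until it is healthy.
	fresh := old.DeepCopy()
	fresh.ObjectMeta = metav1.ObjectMeta{Name: old.Name, Namespace: metav1.NamespaceSystem}
	fresh.Spec.Template.Spec.Containers[0].Image = newImage
	if _, err := dsClient.Create(ctx, fresh, metav1.CreateOptions{}); err != nil {
		return err
	}
	if err := dsClient.Delete(ctx, tmp.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return waitForDaemonSetReady(client, fresh.Name)
}

// Both helpers would poll Pod/DaemonSet status; stubbed here for brevity.
func waitForPodsRunning(client kubernetes.Interface, labels map[string]string) error { return nil }
func waitForDaemonSetReady(client kubernetes.Interface, name string) error           { return nil }
```

An "add first, then delete" DaemonSet update strategy from sig-apps would remove the need for this duplication entirely.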
Contributor

with node affinity (using nodeSelector) to masters.

Are we also going to use taints?

Contributor

nit: the node affinity needs to be a hard requirement.

Member Author

clarified that we're using the nodeSelector feature right now, not the "real" node affinity one


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

Since supporting “single masters” is a definite requirement,

remove "definite".

Member Author

fixed


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
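Purely as an illustration of the idea (not the kubelet's actual checkpointing design), a minimal sketch of writing opted-in Pods to disk and reading them back on startup; the annotation key and directory are hypothetical placeholders:

```go
// Sketch: persist opted-in Pods to a local state directory so they can be
// recreated after a kubelet restart, before the API server is reachable.
package checkpoint

import (
	"encoding/json"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
)

const (
	checkpointDir        = "/var/lib/kubelet/checkpoints"                   // hypothetical location
	checkpointAnnotation = "checkpointer.alpha.kubernetes.io/checkpoint"    // hypothetical opt-in key
)

// WriteCheckpoint stores a copy of the Pod on disk if it opted in via annotation.
func WriteCheckpoint(pod *corev1.Pod) error {
	if pod.Annotations[checkpointAnnotation] != "true" {
		return nil // not opted in; nothing to do
	}
	data, err := json.Marshal(pod)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(checkpointDir, 0700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(checkpointDir, pod.Namespace+"_"+pod.Name+".json"), data, 0600)
}

// LoadCheckpoints reads back all checkpointed Pods; on restart the kubelet
// would start these locally until the API server is available again.
func LoadCheckpoints() ([]*corev1.Pod, error) {
	entries, err := os.ReadDir(checkpointDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, err
	}
	var pods []*corev1.Pod
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(checkpointDir, e.Name()))
		if err != nil {
			return nil, err
		}
		pod := &corev1.Pod{}
		if err := json.Unmarshal(data, pod); err != nil {
			return nil, err
		}
		pods = append(pods, pod)
	}
	return pods, nil
}
```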
Contributor

Basically the kubelet will write

Remove Basically from the beginning of the sentence.

Member Author

fixed


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Please add a link to kubernetes/kubernetes#49236

Member Author

fixed


11. Has to define an API (CRD or the like) between the client and the Operator

**Decision**: Keep the logic inside of the kubeadm CLI (option 1) for the implementation in v1.8.0.
Contributor

This makes sense. I don't think it would be difficult to move the logic later (at least not more difficult than using an operator now).


One of the hardest parts of implementing the upgrade will be respecting customizations made by the user at `kubeadm init` time. The proposed solution is to store the kubeadm configuration given at `init` time in the API as a ConfigMap, then retrieve that configuration at upgrade time, parse it using the API machinery and use it for the upgrade.

This highlights a very important point: **We have to get the kubeadm configuration API group to Beta (v1beta1) in time for v1.8.**
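A hedged sketch of that ConfigMap round-trip, assuming client-go; the ConfigMap name and data key are assumptions here, not necessarily what kubeadm ends up using:

```go
// Sketch: persist the kubeadm configuration in a ConfigMap at init time and
// read it back at upgrade time.
package config

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	configMapName = "kubeadm-config"      // assumed name
	configMapKey  = "MasterConfiguration" // assumed key holding the marshalled config
)

// StoreInitConfig is called at `kubeadm init` time with the config already
// marshalled to YAML by the kubeadm API machinery.
func StoreInitConfig(client kubernetes.Interface, marshalledCfg string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: configMapName, Namespace: metav1.NamespaceSystem},
		Data:       map[string]string{configMapKey: marshalledCfg},
	}
	_, err := client.CoreV1().ConfigMaps(metav1.NamespaceSystem).Create(context.TODO(), cm, metav1.CreateOptions{})
	return err
}

// FetchInitConfig is called at `kubeadm upgrade` time; the caller then decodes
// the YAML into the versioned kubeadm configuration type.
func FetchInitConfig(client kubernetes.Interface) (string, error) {
	cm, err := client.CoreV1().ConfigMaps(metav1.NamespaceSystem).Get(context.TODO(), configMapName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return cm.Data[configMapKey], nil
}
```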
Contributor

Is this on track?


This is a stretch goal for v1.8, but not a strictly necessary feature.

Pre-pulling of images/ensuring the upgrade doesn’t take too long
Contributor

This doesn't seem like a subsection of alternatives considered.


2. Make sure the cluster is healthy

1. Make sure the API Server’s `/healthz` endpoint returns `ok`
Contributor

Should we look at /componentstatuses too?


1. Make sure the API Server’s `/healthz` endpoint returns `ok`

2. Make sure all Nodes return `Ready` status
Contributor

Do you also check node conditions?
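For illustration, a hedged sketch of these two pre-flight checks (API server `/healthz` plus per-Node `Ready` conditions) using client-go; this is not the actual kubeadm preflight code:

```go
// Sketch: verify the API server answers /healthz with "ok" and that every
// Node reports a Ready condition before starting the upgrade.
package preflight

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func checkClusterHealth(client kubernetes.Interface) error {
	ctx := context.TODO()

	// 1. API server health: GET /healthz must return "ok".
	body, err := client.Discovery().RESTClient().Get().AbsPath("/healthz").DoRaw(ctx)
	if err != nil {
		return fmt.Errorf("API server /healthz check failed: %v", err)
	}
	if string(body) != "ok" {
		return fmt.Errorf("API server /healthz returned %q, expected \"ok\"", string(body))
	}

	// 2. Every Node must have the Ready condition set to True.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		if !ready {
			return fmt.Errorf("node %q is not Ready", node.Name)
		}
	}
	return nil
}
```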


What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

Both KCM and scheduler use host networking and bind ports, so I think this may be true.

On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.


What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

May want to clarify what "running" means here.

These master components all have liveness checks. If they cannot bind to the ports for their healthz endpoints, kubelet may try to kill them repeatedly. You may be able to avoid this by extending the initial delay of the liveness checks though.


Notably, not etcd or kubelet for the moment.

Self-hosting will be a [phase](https://github.com/kubernetes/kubeadm/blob/master/docs/design/design.md) from the kubeadm perspective. As there are Static Pods on disk from earlier in the `kubeadm init` process, kubeadm can easily parse those files and extract the PodSpec from each one. This PodSpec will be modified slightly to fit the self-hosting purposes and injected into a DaemonSet, which will be created for each control plane component (API Server, Scheduler and Controller Manager). kubeadm will wait for the self-hosted control plane Pods to be running and then destroy the Static Pod-hosted control plane.
Contributor

I agree with @roberthbailey. Using deployments won't solve the problem. On the other hand, it'd be good to state clearly why a DaemonSet is chosen here.


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

nit: the node affinity needs to be a hard requirement.


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

Explain the "single masters" requirement.

s/The current way DaemonSet upgrades currently operate is "remove/Currently, DaemonSet performs upgrades by "removing


Defining decent upgrade and version skew policies is important before implementing.

Definitions
Contributor

Make it a header for readability, e.g., ##Definitions


6. Example: running `kubeadm upgrade apply --version v1.9.0` against a v1.8.2 control plane will error out if the nodes are still on v1.7.x

4. This means that there are possibly two kinds of upgrades kubeadm can do:
Contributor

Is this part of the upgrade policy, or what's derived from it? I feel like this section is mixed with policy, what kubeadm can do, and implementation specifics. I'd suggest splitting them if possible.


4. Example: The `system:nodes` ClusterRoleBinding had to lose its binding to the `system:nodes` Group when upgrading to v1.7; otherwise the Node Authorizer wouldn’t have had any effect.

5. Using kubeadm, you must upgrade the control plane atomically.
Contributor

If the upgrade failed, would kubeadm roll back the changes?


Only *control plane components* will be in-scope for self-hosting for v1.8.

Notably, not etcd or kubelet for the moment.
Contributor

How do users of kubeadm upgrade kubelets today?

Member

It's done via the package manager (yum/apt).

Member

Perhaps it would be worth linking to the expected node upgrade process or having a brief description. I was wondering how the nodes get upgraded as well.


2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details on how to do the upgrade
Contributor

What's TPR? Couldn't find it in this proposal...

Member

It's CRD.

@roberthbailey
Contributor

Thanks for the comments @yujuhong!

@roberthbailey roberthbailey self-assigned this Jul 28, 2017

What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it’s OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

It would be nice to have these steps in a numbered list

Member Author

Will do


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

+1 add first, then delete. That sounds much nicer than hacking around it with temporary duplicates.

Member Author

I implemented this the hacky way for now, as the UpdateStrategy didn't make v1.8
However, we definitely want "add first, then delete" longer-term


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Make sure that if the Pods are still running, the Kubelet doesn't create duplicates on restart. There are many reasons the Kubelet could restart without a full node reboot: OOM kill, dynamic config, etc.

Member Author

@timothysc see comment on checkpointing ^


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Yes, why wouldn't we just checkpoint all pods by default? How much disk space would this consume? Would it be small enough that we could require opt-out instead of opt-in?


In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.

This solves the chicken-and-egg problem that would otherwise occur when the kubelet comes back up: the kubelet tries to connect to the API server, but the API server hasn’t been started yet; it should be running as a Pod on that very kubelet, which isn’t aware of that.
Contributor

Just to be clear, you're saying that checkpointing solves the problem of the Kubelet hosting itself in a Pod if it can't contact the API server and learn that it should?

Side question: If we have checkpoints, can we start provisioning initial checkpoints instead of static pods?


2. Upgrading to a higher minor release than the kubeadm CLI version will **not** be supported.

3. Example: kubeadm v1.8.3 can upgrade your v1.8.3 cluster to v1.8.6 if you specify `--force` at the time of the upgrade, but kubeadm can never upgrade your v1.8.3 cluster to v1.9.0
Contributor

nit: be a little more specific - s/but kubeadm can never/but kubeadm v1.8.3 can never...


3. The control plane must be upgraded before the kubelets in the cluster

5. For kubeadm, the maximum amount of skew between the control plane and the kubelets is *one minor release*.
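As an illustration of enforcing that policy, a small hedged sketch using the `k8s.io/apimachinery/pkg/util/version` parser; the function itself is hypothetical, not kubeadm's actual check:

```go
// Sketch: reject an upgrade if any kubelet is more than one minor release
// behind (or is newer than) the target control plane version.
package policy

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

func checkKubeletSkew(targetControlPlane string, kubeletVersions []string) error {
	target, err := version.ParseSemantic(targetControlPlane)
	if err != nil {
		return err
	}
	for _, kv := range kubeletVersions {
		kubelet, err := version.ParseSemantic(kv)
		if err != nil {
			return err
		}
		if kubelet.Major() != target.Major() {
			return fmt.Errorf("kubelet %s and control plane %s differ in major version", kv, targetControlPlane)
		}
		// Kubelets may never be newer than the control plane, and may trail
		// it by at most one minor release.
		if kubelet.Minor() > target.Minor() || target.Minor()-kubelet.Minor() > 1 {
			return fmt.Errorf("kubelet %s is too far from control plane %s (max skew: one minor release)", kv, targetControlPlane)
		}
	}
	return nil
}
```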
Contributor

I think we should support two trailing versions for clients. API changes are supposed to be backwards-compatible, so this shouldn't be a problem for kubeadm. In theory.
I'm not sure whether we should try to support skipping a minor version during an upgrade though. I think people typically do this one version at a time.

@roberthbailey
Contributor

@luxas - you mentioned on Friday that you were going to wait until Tuesday to collect feedback and then update the proposal.

Friendly ping to let us know when the PR is ready to be re-reviewed.

@luxas
Member Author

luxas commented Aug 9, 2017

@roberthbailey I changed my mind ;)
I found it more valuable to actually focus my hours on working on the code in this phase of the cycle and get dependent PRs merged.
The WIP is here: kubernetes/kubernetes#48899

Feel free to take it for a spin.
I'm gonna do the cleanup of this doc later when more of the actual dependent code is merged; otherwise it would be too tight. This proposal as-is had consensus in the SIG, and the remaining comments are minor wording or clarification issues, so it has lower priority for me than actually shipping the code.

I hope I can get to this by the end of next week at least.

Finally, thank you everyone for commenting -- I'm sorry I'm overloaded with other things at the moment plus the upgrades impl., I'll answer your questions as soon as I can.

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
@roberthbailey
Contributor

@luxas - should we clean this up and get it merged (now that kubeadm supports upgrade)?

@luxas
Member Author

luxas commented Oct 10, 2017 via email

@castrojo
Member

This change is Reviewable

@luxas luxas changed the title Add kubeadm upgrades proposal [WIP] Add kubeadm upgrades proposal Oct 25, 2017
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2017
@luxas
Member Author

luxas commented Oct 25, 2017

@timothysc @roberthbailey Please read this doc up to the "Various other notes" section; I still need to finish the last sections there. Thanks!

# kubeadm upgrades proposal

Authors: Lucas Käldström & the SIG Cluster Lifecycle team
Last updated: October 2017
Contributor

Add a newline before this so that there is a line break in the markdown.

Member Author

fixed


## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
Contributor

support upgrading (remove the 'a')

Member Author

oops


## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
Contributor

what different techniques? shouldn't there be a single technique that we plan to use for all kubeadm clusters?

Member Author

removed the ambiguity; I mean that kubeadm can actually perform different tasks for different clusters under the hood (self-hosted vs static pod hosted), but that isn't relevant here

- Support for upgrading the API Server, the Controller Manager, the Scheduler, kube-dns & kube-proxy
- Support for performing necessary post-upgrade steps like upgrading Bootstrap Tokens that were alpha in v1.6 & v1.7 to beta ones in v1.8
- Automated e2e tests running.
- GA in v1.10:
Contributor

note that this is the plan, since 1.10 isn't out yet and kubeadm itself is still beta.

Member Author

yup, I left this as in a future version now instead

## Graduation requirements

- Beta in v1.8:
- Support for the `kubeadm upgrade plan` and `kubeadm upgrade apply` commands.
Contributor

some bulleted lines end with periods and others don't which is strange. please make them consistent one way or the other


**Rewrite manifests completely from config object or mutate existing manifest:**

Instead of generating new manifests from a versioned configuration object, we could try to add "filters" to the existing manifests and apply different filters depending on what the upgrade looks like. This approach, modifying existing manifests in various ways depending on the version bump, has some pros (modularity, simple mutating functions), but the matrix of functions and different policies would grow too big for this kind of system, so we voted against this alternative in favor of the solution above.
Contributor

did we actually vote? maybe say decided instead?

Member Author

right, we didn't actually vote. Thanks for the reword suggestion there.

- In kubeadm v1.4 to v1.8, this is the default way of setting up the control plane
- Running the control plane in Kubernetes-hosted containers as DaemonSets; aka. [Self-Hosting](#TODO)
- When creating a self-hosted cluster; kubeadm first creates a Static Pod-hosted cluster and then pivots to the self-hosted control plane.
- This is the default way to deploy the control plane since v1.9; but the user can opt out of it and stick with the Static Pod-hosted cluster.
Contributor

s/since/beginning with/

also replace ; with ,

Member Author

done

- Creates a **backup directory** with the prefix `/etc/kubernetes/tmp/kubeadm-backup-manifests*`.
- Q: Why `/etc/kubernetes/tmp`?
- A: Possibly not very likely, but we concluded that there may be an attack surface on machines where `/tmp` is shared and writable by all users.
We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is
Contributor

replace mock with muck or mess

Member Author

hehe, typo ;)

- A: Possibly not very likely, but we concluded that there may be an attack surface on machines where `/tmp` is shared and writable by all users.
We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is
root-only owned.
- In a loop for all control plane components:
Contributor

does the order of the control plane components matter?

Member Author

added a comment that order shouldn't matter, but we do apiserver, ctlr-mgr, sched

For instance, if the scheduler doesn't come up cleanly, kubeadm will roll back the previously (successfully) upgraded API server and controller manager manifests, as well as the scheduler manifest.
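To make the apply-and-rollback flow concrete, a hedged sketch of the per-component loop; the helper name (`waitForComponentHealthy`) and file layout are illustrative assumptions, not kubeadm's actual code:

```go
// Sketch: for each control plane component, back up the current Static Pod
// manifest, write the upgraded one, and roll everything back if the new
// component does not come up healthy. The caller is assumed to have created
// backupDir under /etc/kubernetes/tmp.
package upgrade

import (
	"fmt"
	"os"
	"path/filepath"
)

const manifestDir = "/etc/kubernetes/manifests"

func upgradeStaticPods(backupDir string, newManifests map[string][]byte) error {
	// Order shouldn't matter; kubeadm happens to do apiserver, controller-manager, scheduler.
	components := []string{"kube-apiserver", "kube-controller-manager", "kube-scheduler"}
	upgraded := []string{}

	rollback := func() {
		// Restore every already-upgraded manifest from the backup directory.
		for _, c := range upgraded {
			backup, err := os.ReadFile(filepath.Join(backupDir, c+".yaml"))
			if err == nil {
				_ = os.WriteFile(filepath.Join(manifestDir, c+".yaml"), backup, 0600)
			}
		}
	}

	for _, c := range components {
		current := filepath.Join(manifestDir, c+".yaml")

		// 1. Back up the current manifest to the root-owned backup directory.
		old, err := os.ReadFile(current)
		if err != nil {
			rollback()
			return err
		}
		if err := os.WriteFile(filepath.Join(backupDir, c+".yaml"), old, 0600); err != nil {
			rollback()
			return err
		}

		// 2. Write the new manifest; the kubelet notices the change and restarts the Pod.
		if err := os.WriteFile(current, newManifests[c], 0600); err != nil {
			rollback()
			return err
		}
		upgraded = append(upgraded, c)

		// 3. Wait for the component to become healthy; otherwise roll everything back.
		if err := waitForComponentHealthy(c); err != nil {
			rollback()
			return fmt.Errorf("%s did not come up cleanly, rolled back: %v", c, err)
		}
	}
	return nil
}

// waitForComponentHealthy would poll the component's healthz endpoint or its
// mirror Pod status; stubbed here for brevity.
func waitForComponentHealthy(component string) error { return nil }
```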

#### Self-hosted control plane
Contributor

it looks like this section isn't finished so I'm going to stop reviewing here.

Member Author

Yeah, thanks for the review so far!

@roberthbailey
Contributor

Please poke when the remainder of the doc is finished so I can hopefully just do one more pass.

@0xmichalis
Contributor

@luxas @roberthbailey should this doc move inside https://github.com/kubernetes/community/tree/master/contributors/design-proposals/cluster-lifecycle?

@0xmichalis 0xmichalis added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Nov 2, 2017
@luxas
Member Author

luxas commented Nov 4, 2017

@Kargakis yes indeed. I initially filed this PR before we did that, will update on my next round here.

@fejta fejta added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. keep-open labels Dec 15, 2017
@kubernetes kubernetes deleted a comment from k8s-github-robot Dec 15, 2017
@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 6, 2018
@roberthbailey
Contributor

This PR has been idle coming up on 2 years now, so I think it should be closed.

/close

@k8s-ci-robot
Contributor

@roberthbailey: Closed this PR.

In response to this:

This PR has been idle coming up on 2 years now, so I think it should be closed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023