docs: add control plane conversion guide and 0.9 upgrade notes

These docs are critical to get 0.9.0-beta released.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
siderolabs · Mar 10, 2021 · ae8bedb
1 parent ed9673e
commit ae8bedb
Showing 3 changed files with 309 additions and 25 deletions.
diff --git a/website/content/docs/v0.9/Guides/converting-control-plane.md b/website/content/docs/v0.9/Guides/converting-control-plane.md
@@ -0,0 +1,255 @@
+---
+title: "Converting Control Plane"
+description: "How to convert a self-hosted Talos Kubernetes control plane (pre-0.9) to one based on static pods."
+---
+
+Talos version 0.9 runs the Kubernetes control plane in a new way: as static pods managed by Talos.
+Talos version 0.8 and below run a self-hosted control plane.
+After the Talos OS upgrade to version 0.9, the Kubernetes control plane should be converted to run as static pods.
+
+This guide describes the automated conversion script and also shows the detailed manual conversion process.
+
+## Automated Conversion
+
+First, make sure all nodes are updated to Talos 0.9:
+
+```bash
+$ kubectl get nodes -o wide
+NAME                     STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
+talos-default-master-1   Ready    control-plane,master   58m   v1.20.4   172.20.0.2    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
+talos-default-master-2   Ready    control-plane,master   58m   v1.20.4   172.20.0.3    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
+talos-default-master-3   Ready    control-plane,master   58m   v1.20.4   172.20.0.4    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
+talos-default-worker-1   Ready    <none>                 58m   v1.20.4   172.20.0.5    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
+```
+
+Start the conversion script:
+
+```bash
+$ talosctl -n <IP> convert-k8s
+discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
+current self-hosted status: true
+gathering control plane configuration
+aggregator CA key can't be recovered from bootkube-boostrapped control plane, generating new CA
+patching master node "172.20.0.2" configuration
+patching master node "172.20.0.3" configuration
+patching master node "172.20.0.4" configuration
+waiting for static pod definitions to be generated
+waiting for manifests to be generated
+Talos generated control plane static pod definitions and bootstrap manifests, please verify them with commands:
+    talosctl -n <master node IP> get StaticPods.kubernetes.talos.dev
+    talosctl -n <master node IP> get Manifests.kubernetes.talos.dev
+
+bootstrap manifests will only be applied for missing resources, existing resources will not be updated
+confirm disabling pod-checkpointer to proceed with control plane update [yes/no]:
+```
+
+The script stops at this point, waiting for confirmation.
+Talos still runs the self-hosted control plane, and static pods have not been rendered yet.
+
+As instructed by the script, please verify that static pod definitions are correct:
+
+```bash
+$ talosctl -n <IP> get staticpods -o yaml
+node: 172.20.0.2
+metadata:
+    namespace: controlplane
+    type: StaticPods.kubernetes.talos.dev
+    id: kube-apiserver
+    version: 1
+    phase: running
+spec:
+    apiVersion: v1
+    kind: Pod
+    metadata:
+        annotations:
+            talos.dev/config-version: "2"
+            talos.dev/secrets-version: "1"
+        creationTimestamp: null
+        labels:
+            k8s-app: kube-apiserver
+            tier: control-plane
+        name: kube-apiserver
+        namespace: kube-system
+    spec:
+        containers:
+            - command:
+...
+```
+
+Static pod definitions are generated from the machine configuration and should match the pod templates generated by Talos on bootstrap of the self-hosted control plane, unless manual changes were applied to the daemonset specs after bootstrap.
+Talos patches the machine configuration with the container image versions scraped from the daemonset definitions and fetches the service account key from Kubernetes secrets.
+
+The aggregator CA can't be recovered from the self-hosted control plane, so a new CA is generated.
+This is generally harmless and not visible from outside the cluster.
+The aggregator CA is _not_ the same CA as the one used by Talos or the standard Kubernetes API.
+It is a special PKI used for aggregating API extension services inside your cluster.
+If you have non-standard apiserver aggregations (fairly rare, and you should know if you do), then you may need to restart these services after the new CA is in place.
+
+Verify that bootstrap manifests are correct:
+
+```bash
+$ talosctl -n <IP> get manifests --namespace controlplane
+NODE         NAMESPACE      TYPE       ID                               VERSION
+172.20.0.2   controlplane   Manifest   00-kubelet-bootstrapping-token   1
+172.20.0.2   controlplane   Manifest   01-csr-approver-role-binding     1
+172.20.0.2   controlplane   Manifest   01-csr-node-bootstrap            1
+172.20.0.2   controlplane   Manifest   01-csr-renewal-role-binding      1
+172.20.0.2   controlplane   Manifest   02-kube-system-sa-role-binding   1
+172.20.0.2   controlplane   Manifest   03-default-pod-security-policy   1
+172.20.0.2   controlplane   Manifest   10-kube-proxy                    1
+172.20.0.2   controlplane   Manifest   11-core-dns                      1
+172.20.0.2   controlplane   Manifest   11-core-dns-svc                  1
+172.20.0.2   controlplane   Manifest   11-kube-config-in-cluster        1
+```
+
+```bash
+$ talosctl -n <IP> get manifests --namespace=extras
+NODE         NAMESPACE   TYPE       ID                                                        VERSION
+172.20.0.2   extras      Manifest   05-https://docs.projectcalico.org/manifests/calico.yaml   1
+```
+
+Make sure that manifests and static pods are correct across all control plane nodes, as each node reconciles
+control plane state on its own.
+For example, CNI configuration in machine config should be in sync across all the nodes.
+Talos nodes try to create any missing Kubernetes resources from the manifests, but they never
+update or delete existing resources.
+
+If something looks wrong, the script can be aborted and the machine configuration updated to fix the problem.
+Once configuration is updated, the script can be restarted.
+
+If the static pod definitions and manifests look good, confirm the next step to disable `pod-checkpointer`:
+
+```bash
+$ talosctl -n <IP> convert-k8s
+...
+confirm disabling pod-checkpointer to proceed with control plane update [yes/no]: yes
+disabling pod-checkpointer
+deleting daemonset "pod-checkpointer"
+checking for active pod checkpoints
+2021/03/09 23:37:25 retrying error: found 3 active pod checkpoints: [pod-checkpointer-655gc-talos-default-master-3 pod-checkpointer-pw6mv-talos-default-master-1 pod-checkpointer-zdw9z-talos-default-master-2]
+2021/03/09 23:42:25 retrying error: found 1 active pod checkpoints: [pod-checkpointer-pw6mv-talos-default-master-1]
+confirm applying static pod definitions and manifests [yes/no]:
+```
+
+The self-hosted control plane runs `pod-checkpointer` to work around issues with control plane availability.
+It should be disabled before conversion starts to allow the self-hosted control plane to be removed.
+It takes around 5 minutes for the `pod-checkpointer` to be fully disabled.
+The script verifies that all checkpoints are removed before proceeding.
+
+This last confirmation marks the point of no return; after it, the self-hosted control plane can no longer be kept running:
+static pods are released, bootstrap manifests are applied, and the self-hosted control plane is removed.
+
+```bash
+$ talosctl -n <IP> convert-k8s
+...
+confirm applying static pod definitions and manifests [yes/no]: yes
+removing self-hosted initialized key
+waiting for static pods for "kube-apiserver" to be present in the API server state
+waiting for static pods for "kube-controller-manager" to be present in the API server state
+waiting for static pods for "kube-scheduler" to be present in the API server state
+deleting daemonset "kube-apiserver"
+waiting for static pods for "kube-apiserver" to be present in the API server state
+deleting daemonset "kube-controller-manager"
+waiting for static pods for "kube-controller-manager" to be present in the API server state
+deleting daemonset "kube-scheduler"
+waiting for static pods for "kube-scheduler" to be present in the API server state
+conversion process completed successfully
+```
+
+As soon as the control plane static pods are rendered, the kubelet starts them.
+It is expected that the pods for `kube-apiserver` will crash initially.
+Only one `kube-apiserver` can be bound to the host `Node`'s port 6443 at a time.
+Eventually, the old `kube-apiserver` will be killed, and the new one will be able to start.
+This is all handled automatically.
+The script will continue by removing each self-hosted daemonset and verifying that static pods are ready and healthy.
+
+## Manual Conversion
+
+Check that Talos runs a self-hosted control plane:
+
+```bash
+$ talosctl -n <CONTROL_PLANE_IP> get bs
+NODE         NAMESPACE   TYPE              ID              VERSION   SELF HOSTED
+172.20.0.2   runtime     BootstrapStatus   control-plane   2         true
+```
+
+The Talos machine configuration needs to be updated to the 0.9 format; there are two new required machine configuration settings:
+
+* `.cluster.serviceAccount` is the service account PEM-encoded private key.
+* `.cluster.aggregatorCA` is the aggregator CA for `kube-apiserver` (certificate and private key).
+
+The current service account key can be fetched from the Kubernetes secrets:
+
+```bash
+$ kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}'
+LS0tLS1CRUdJTiBSU0EgUFJJVkFURS...
+```
+
+All control plane node machine configurations should be patched with the service account key:
+
+```bash
+$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "LS0tLS1CRUdJTiBSU0EgUFJJVkFURS..."}}]'
+patched mc at the node 172.20.0.2
+```
+
+The aggregator CA can be generated using OpenSSL or any other certificate generation tool: it is an RSA or ECDSA certificate with CN `front-proxy`, valid for 10 years.
+The PEM-encoded CA certificate and key should be base64-encoded and patched into the machine config at path `/cluster/aggregatorCA`:
+
+```bash
+$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "LS0tLS1CRUdJTiBDRVJUSUZJQ...", "key": "LS0tLS1CRUdJTiBFQy..."}}]'
+patched mc at the node 172.20.0.2
+```
+
+At this point, static pod definitions and bootstrap manifests should be rendered; see "Automated Conversion" for how to verify the generated objects.
+Feel free to continue to refine your machine configuration until the generated static pod definitions and bootstrap manifests look good.
+
+If static pod definitions are not generated, check logs with `talosctl -n <IP> logs controller-runtime`.
+
+Disable `pod-checkpointer` with:
+
+```bash
+$ kubectl -n kube-system delete ds pod-checkpointer
+daemonset.apps "pod-checkpointer" deleted
+```
+
+Wait for all pod checkpoints to be removed:
+
+```bash
+$ kubectl -n kube-system get pods
+NAME                                            READY   STATUS    RESTARTS   AGE
+...
+pod-checkpointer-8q2lh-talos-default-master-2   1/1     Running   0          3m34s
+pod-checkpointer-nnm5w-talos-default-master-3   1/1     Running   0          3m24s
+pod-checkpointer-qnmdt-talos-default-master-1   1/1     Running   0          2m21s
+```
+
+Pod checkpoints have annotation `checkpointer.alpha.coreos.com/checkpoint-of`.
+
+Once all the pod checkpoints are removed (this takes around 5 minutes), proceed by removing the self-hosted initialized key:
+
+```bash
+$ talosctl -n <CONTROL_PLANE_IP> convert-k8s --remove-initialized-key
+```
+
+Talos controllers will now render static pod definitions, and the kubelet will launch any resulting static pods.
+
+Once static pods are visible in `kubectl get pods -n kube-system` output, proceed by removing each of the self-hosted daemonsets:
+
+```bash
+$ kubectl -n kube-system delete daemonset kube-apiserver
+daemonset.apps "kube-apiserver" deleted
+```
+
+Make sure the static pods for `kube-apiserver` started successfully and are running and ready.
+
+Proceed by deleting the `kube-controller-manager` and `kube-scheduler` daemonsets, verifying that the static pods are running after each step:
+
+```bash
+$ kubectl -n kube-system delete daemonset kube-controller-manager
+daemonset.apps "kube-controller-manager" deleted
+```
+
+```bash
+$ kubectl -n kube-system delete daemonset kube-scheduler
+daemonset.apps "kube-scheduler" deleted
+```
diff --git a/website/content/docs/v0.9/Guides/upgrading-talos.md b/website/content/docs/v0.9/Guides/upgrading-talos.md
@@ -3,14 +3,54 @@ title: Upgrading Talos
 ---
 
 Talos upgrades are effected by an API call.
-The `talosctl` CLI utility will facilitate this, or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager).
+The `talosctl` CLI utility will facilitate this.
+<!-- , or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager) -->
 
 ## Video Walkthrough
 
 To see a live demo of this writeup, see the video below:
 
 <iframe width="560" height="315" src="https://www.youtube.com/embed/sw78qS8vBGc" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
 
+## Upgrading from Talos 0.8
+
+Talos 0.9 drops support for `bootkube` and the self-hosted control plane.
+
+Please make sure Talos is upgraded to the latest patch release of 0.8 first (0.8.4 at the time
+of this writing), then proceed with upgrading to the latest patch release of 0.9.
+
+### Before Upgrade to 0.9
+
+If the cluster was bootstrapped on a Talos version < 0.8.3, add checkpointer annotations to
+the `kube-scheduler` and `kube-controller-manager` daemonsets to improve the resiliency of
+the self-hosted control plane to reboots (this is critical for single control-plane node clusters):
+
+```bash
+$ kubectl -n kube-system patch daemonset kube-controller-manager --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
+daemonset.apps/kube-controller-manager patched
+$ kubectl -n kube-system patch daemonset kube-scheduler --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
+daemonset.apps/kube-scheduler patched
+```
+
+Talos 0.9 only supports Kubernetes versions 1.19.x and 1.20.x.
+If running 1.18.x, please upgrade Kubernetes before upgrading Talos.
+
+Make sure the cluster is running the latest patch release of Talos 0.8.
+
+Prepare by downloading the `talosctl` binary for Talos release 0.9.x.
+
+### After Upgrade to 0.9
+
+After the upgrade to 0.9, Talos will still be running the self-hosted control plane until the [conversion process](../converting-control-plane/) is run.
+
+> Note: Talos 0.9 doesn't include the bootkube recovery option (`talosctl recover`), so
+> it's not possible to recover the self-hosted control plane after upgrading to 0.9.
+
+As soon as all the nodes get upgraded to 0.9, run `talosctl convert-k8s` to convert the control plane
+to the new static pod format for 0.9.
+
+Once the conversion process is complete, Kubernetes can be upgraded.
+
 ## `talosctl` Upgrade
 
 To manually upgrade a Talos node, you will specify the node's IP address and the
@@ -29,6 +69,10 @@ There is an option to this command: `--preserve`, which can be used to explicitl
 In most cases, it is correct to just let Talos perform its default action.
 However, if you are running a single-node control-plane, you will want to make sure that `--preserve=true`.
 
+If Talos fails to run the upgrade, the `--staged` flag may be used to perform the upgrade after a reboot,
+which is then followed by another reboot into the upgraded version.
+
+<!--
 ## Talos Controller Manager
 
 The Talos Controller Manager can coordinate upgrades of your nodes
@@ -43,16 +87,17 @@ configured, take a look at the [GitHub page](https://github.com/talos-systems/ta
 Please note that the controller manager is still in fairly early development.
 More advanced features, such as time slot scheduling, will be coming in the
 future.
+-->
 
-## Changelog and Upgrade Notes
+## Machine Configuration Changes
 
-In an effort to create more production ready clusters, Talos will now taint control plane nodes as unschedulable.
-This means that any application you might have deployed must tolerate this taint if you intend on running the application on control plane nodes.
+Talos 0.9 introduces new required parameters in the machine configuration:
 
-Another feature you will notice is the automatic uncordoning of nodes that have been upgraded.
-Talos will now uncordon a node if the cordon was initiated by the upgrade process.
+* `.cluster.aggregatorCA`
+* `.cluster.serviceAccount`
 
-### Talosctl
+Talos supports both ECDSA and RSA certificates and keys for Kubernetes and etcd, with ECDSA being the default.
+Talos <= 0.8 supports only RSA keys and certificates.
 
-The `talosctl` CLI now requires an explicit set of nodes.
-This can be configured with `talos config nodes` or set on the fly with `talos --nodes`.
+By default, `talosctl gen config` generates config in the 0.9 format, which is not compatible with
+Talos 0.8; the old format can be generated with `talosctl gen config --talos-version=v0.8`.
diff --git a/website/content/docs/v0.9/Learn More/upgrades.md b/website/content/docs/v0.9/Learn More/upgrades.md
@@ -109,19 +109,3 @@ automatically?
 **A.** Yes.
 
 We provide the [Talos Controller Manager](https://github.com/talos-systems/talos-controller-manager) to perform this maintenance in a simple, controllable fashion.
-
-## Upgrade Notes for Talos 0.8
-
-Talos 0.8 comes with new [KSPP requirements](https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings) compliance check.
-
-Following kernel arguments are mandatory for Talos to boot successfully:
-
-- `init_on_alloc=1`: required by KSPP
-- `slab_nomerge`: required by KSPP
-- `pti=on`: required by KSPP
-
-Talos installer automatically injects those args while installing Talos, so this mostly is required when PXE booting Talos.
-
-## Kubernetes
-
-Kubernetes upgrades with Talos are covered in a [separate document](../../guides/upgrading-kubernetes/).
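
The manual conversion section of `converting-control-plane.md` says the aggregator CA can be generated with OpenSSL but does not show the commands. The following is a minimal sketch, assuming an ECDSA P-256 key, a scratch directory created with `mktemp`, and hypothetical file names (`aggregator-ca.key`, `aggregator-ca.crt`). Whether the resulting certificate carries CA extensions depends on the local OpenSSL configuration, so treat this as an illustration and not as the procedure Talos itself uses.

```bash
set -euo pipefail

# Hypothetical scratch location and file names; any paths would do.
workdir=$(mktemp -d)
ca_key="$workdir/aggregator-ca.key"
ca_crt="$workdir/aggregator-ca.crt"

# Generate an ECDSA P-256 private key in PEM format.
openssl ecparam -name prime256v1 -genkey -noout -out "$ca_key"

# Self-sign a certificate with CN front-proxy, valid for about 10 years (3650 days).
openssl req -x509 -new -key "$ca_key" -subj "/CN=front-proxy" -days 3650 -out "$ca_crt"

# Base64-encode each PEM file onto a single line (tr strips the line wrapping).
crt_b64=$(base64 < "$ca_crt" | tr -d '\n')
key_b64=$(base64 < "$ca_key" | tr -d '\n')

printf 'crt: %s\nkey: %s\n' "$crt_b64" "$key_b64"
```

The two printed values are meant for the `crt` and `key` fields of the `/cluster/aggregatorCA` patch shown in the guide.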