docs: add control plane conversion guide and 0.9 upgrade notes
These docs are critical to get 0.9.0-beta released.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
website/content/docs/v0.9/Guides/converting-control-plane.md

---
title: "Converting Control Plane"
description: "How to convert a Talos self-hosted Kubernetes control plane (pre-0.9) to one based on static pods."
---

Talos version 0.9 runs the Kubernetes control plane in a new way: as static pods managed by Talos.
Talos version 0.8 and below run a self-hosted control plane.
After upgrading Talos OS to version 0.9, the Kubernetes control plane should be converted to run as static pods.

This guide describes the automated conversion script and also shows the detailed manual conversion process.

## Automated Conversion

First, make sure all nodes are updated to Talos 0.9:

```bash
$ kubectl get nodes -o wide
NAME                     STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-default-master-1   Ready    control-plane,master   58m   v1.20.4   172.20.0.2    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-master-2   Ready    control-plane,master   58m   v1.20.4   172.20.0.3    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-master-3   Ready    control-plane,master   58m   v1.20.4   172.20.0.4    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
talos-default-worker-1   Ready    <none>                 58m   v1.20.4   172.20.0.5    <none>        Talos (v0.9.0)   5.10.19-talos    containerd://1.4.4
```
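
With many nodes, checking the `OS-IMAGE` column by eye is error-prone. The check can be scripted; this is a minimal sketch, assuming a POSIX shell with `local` (bash, dash). `all_nodes_on_talos_09` is a hypothetical helper, not part of `kubectl` or `talosctl`. It reads one OS image string per line and fails if any node does not report a `Talos (v0.9.x)` image:

```bash
# Hypothetical helper (not part of talosctl/kubectl): reads one OS image
# string per line on stdin and succeeds only if every node reports a
# Talos v0.9.x image, e.g. "Talos (v0.9.0)".
all_nodes_on_talos_09() {
  local bad=0 image
  while IFS= read -r image; do
    [ -z "$image" ] && continue          # skip blank lines
    case "$image" in
      "Talos (v0.9."*) ;;                # node already upgraded
      *) echo "not on Talos 0.9: $image" >&2; bad=1 ;;
    esac
  done
  return "$bad"
}

# Usage against a live cluster (one osImage per line):
# kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.osImage}{"\n"}{end}' \
#   | all_nodes_on_talos_09 && echo "all nodes run Talos 0.9"
```

Pre-release images such as `Talos (v0.9.0-beta.0)` also match the `v0.9.` prefix.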

Start the conversion script:

```bash
$ talosctl -n <IP> convert-k8s
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
current self-hosted status: true
gathering control plane configuration
aggregator CA key can't be recovered from bootkube-boostrapped control plane, generating new CA
patching master node "172.20.0.2" configuration
patching master node "172.20.0.3" configuration
patching master node "172.20.0.4" configuration
waiting for static pod definitions to be generated
waiting for manifests to be generated
Talos generated control plane static pod definitions and bootstrap manifests, please verify them with commands:
talosctl -n <master node IP> get StaticPods.kubernetes.talos.dev
talosctl -n <master node IP> get Manifests.kubernetes.talos.dev
bootstrap manifests will only be applied for missing resources, existing resources will not be updated
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]:
```

The script stops at this point, waiting for confirmation.
Talos still runs the self-hosted control plane, and the static pods have not been rendered yet.
As instructed by the script, verify that the static pod definitions are correct:

```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 1
    phase: running
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "2"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
    spec:
        containers:
            - command:
...
```

Static pod definitions are generated from the machine configuration and should match the pod template that Talos generated when it bootstrapped the self-hosted control plane, unless manual changes were applied to the daemonset specs after bootstrap.
Talos patches the machine configuration with the container image versions scraped from the daemonset definitions, and fetches the service account key from the Kubernetes secrets.

The aggregator CA can't be recovered from the self-hosted control plane, so a new CA is generated.
This is generally harmless and not visible from outside the cluster.
The aggregator CA is _not_ the same CA as the one used by Talos or by the standard Kubernetes API.
It is a special PKI used for aggregating API extension services inside your cluster.
If you have non-standard API server aggregations (fairly rare, and you should know if you do), you may need to restart those services after the new CA is in place.
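
To double-check that the image versions scraped from the daemonsets ended up in the static pod definitions, the `image:` fields can be extracted from the YAML and compared with the daemonset templates. A minimal sketch, assuming a POSIX shell with `sed` and `sort`; `static_pod_images` is a hypothetical helper, not part of `talosctl`:

```bash
# Hypothetical helper: prints the unique container images referenced in
# static pod YAML read on stdin. It matches both "image: x" and "- image: x"
# lines, but not other fields such as "imagePullPolicy:".
static_pod_images() {
  sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' | sort -u
}

# Usage (compare the two lists by eye or with diff):
# talosctl -n <IP> get staticpods -o yaml | static_pod_images
# kubectl -n kube-system get ds kube-apiserver kube-controller-manager kube-scheduler \
#   -o jsonpath='{range .items[*]}{.spec.template.spec.containers[*].image}{"\n"}{end}'
```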

Verify that the bootstrap manifests are correct:

```bash
$ talosctl -n <IP> get manifests --namespace controlplane
NODE         NAMESPACE      TYPE       ID                               VERSION
172.20.0.2   controlplane   Manifest   00-kubelet-bootstrapping-token   1
172.20.0.2   controlplane   Manifest   01-csr-approver-role-binding     1
172.20.0.2   controlplane   Manifest   01-csr-node-bootstrap            1
172.20.0.2   controlplane   Manifest   01-csr-renewal-role-binding      1
172.20.0.2   controlplane   Manifest   02-kube-system-sa-role-binding   1
172.20.0.2   controlplane   Manifest   03-default-pod-security-policy   1
172.20.0.2   controlplane   Manifest   10-kube-proxy                    1
172.20.0.2   controlplane   Manifest   11-core-dns                      1
172.20.0.2   controlplane   Manifest   11-core-dns-svc                  1
172.20.0.2   controlplane   Manifest   11-kube-config-in-cluster        1
```

```bash
$ talosctl -n <IP> get manifests --namespace=extras
NODE         NAMESPACE   TYPE       ID                                                        VERSION
172.20.0.2   extras      Manifest   05-https://docs.projectcalico.org/manifests/calico.yaml   1
```

Make sure that the manifests and static pods are correct across all control plane nodes, as each node reconciles the control plane state on its own.
For example, the CNI configuration in the machine config should be in sync across all nodes.
Talos nodes try to create any missing Kubernetes resources from the manifests, but they never update or delete existing resources.
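
Checking every control plane node by hand is tedious, so the comparison can be scripted. A minimal sketch, assuming a POSIX shell with `local`, `mktemp`, and `diff`; `static_pods_of` and `compare_static_pods` are hypothetical helpers, not part of `talosctl`. The `node: <IP>` line always differs between nodes and is dropped; other fields (for example resource versions) may legitimately differ too, so the diff is printed for manual review rather than treated as a hard error:

```bash
# Hypothetical helpers (not part of talosctl).
# Prints one node's static pod YAML without the per-node "node: <IP>" lines.
static_pods_of() {
  local yaml
  yaml=$(talosctl -n "$1" get staticpods -o yaml) || return 1
  printf '%s\n' "$yaml" | grep -v '^node: '
}

# Diffs every given node against a reference node and prints unified diffs
# for review. Returns non-zero if any node differs or cannot be read.
compare_static_pods() {
  local reference="$1" node rc=0 tmp
  shift
  tmp=$(mktemp -d) || return 1
  if ! static_pods_of "$reference" > "$tmp/reference.yaml"; then
    rm -rf "$tmp"
    return 1
  fi
  for node in "$@"; do
    if ! static_pods_of "$node" > "$tmp/node.yaml"; then
      echo "cannot read static pods from $node" >&2
      rc=1
      continue
    fi
    if ! diff -u "$tmp/reference.yaml" "$tmp/node.yaml"; then
      echo "static pods on $node differ from $reference" >&2
      rc=1
    fi
  done
  rm -rf "$tmp"
  return "$rc"
}

# Usage:
# compare_static_pods 172.20.0.2 172.20.0.3 172.20.0.4
```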

If something looks wrong, the script can be aborted and the machine configuration updated to fix the problem.
Once the configuration is updated, the script can be restarted.

If the static pod definitions and manifests look good, confirm the next step to disable `pod-checkpointer`:

```bash
$ talosctl -n <IP> convert-k8s
...
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]: yes
disabling pod-checkpointer
deleting daemonset "pod-checkpointer"
checking for active pod checkpoints
2021/03/09 23:37:25 retrying error: found 3 active pod checkpoints: [pod-checkpointer-655gc-talos-default-master-3 pod-checkpointer-pw6mv-talos-default-master-1 pod-checkpointer-zdw9z-talos-default-master-2]
2021/03/09 23:42:25 retrying error: found 1 active pod checkpoints: [pod-checkpointer-pw6mv-talos-default-master-1]
confirm applying static pod definitions and manifests [yes/no]:
```

The self-hosted control plane runs `pod-checkpointer` to work around control plane availability issues.
It should be disabled before the conversion starts so that the self-hosted control plane can be removed.
It takes around 5 minutes for `pod-checkpointer` to be fully disabled.
The script verifies that all checkpoints are removed before proceeding.

This last confirmation is the point of no return: after it, there is no way to keep running the self-hosted control plane.
Static pods are released, bootstrap manifests are applied, and the self-hosted control plane is removed.

```bash
$ talosctl -n <IP> convert-k8s
...
confirm applying static pod definitions and manifests [yes/no]: yes
removing self-hosted initialized key
waiting for static pods for "kube-apiserver" to be present in the API server state
waiting for static pods for "kube-controller-manager" to be present in the API server state
waiting for static pods for "kube-scheduler" to be present in the API server state
deleting daemonset "kube-apiserver"
waiting for static pods for "kube-apiserver" to be present in the API server state
deleting daemonset "kube-controller-manager"
waiting for static pods for "kube-controller-manager" to be present in the API server state
deleting daemonset "kube-scheduler"
waiting for static pods for "kube-scheduler" to be present in the API server state
conversion process completed successfully
```

As soon as the control plane static pods are rendered, the kubelet starts them.
It is expected that the `kube-apiserver` static pods will crash initially: only one `kube-apiserver` can be bound to port 6443 on the host `Node` at a time.
Eventually, the old `kube-apiserver` is killed, and the new one is able to start.
This is all handled automatically.
The script continues by removing each self-hosted daemonset and verifying that the static pods are ready and healthy.

## Manual Conversion

Check that Talos runs the self-hosted control plane:

```bash
$ talosctl -n <CONTROL_PLANE_IP> get bs
NODE         NAMESPACE   TYPE              ID              VERSION   SELF HOSTED
172.20.0.2   runtime     BootstrapStatus   control-plane   2         true
```

The Talos machine configuration needs to be updated to the 0.9 format; there are two new required machine configuration settings:

* `.cluster.serviceAccount` is the PEM-encoded service account private key.
* `.cluster.aggregatorCA` is the aggregator CA for `kube-apiserver` (certificate and private key).

The current service account key can be fetched from the Kubernetes secrets:

```bash
$ kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}'
LS0tLS1CRUdJTiBSU0EgUFJJVkFURS...
```
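
Before patching, it can help to confirm that the fetched value really is a base64-encoded PEM private key (for example, that the JSONPath query did not return an empty string). A minimal sketch, assuming `base64 -d` (GNU coreutils or BusyBox) is available; `is_pem_private_key` is a hypothetical helper:

```bash
# Hypothetical helper: succeeds if a base64 string (such as the value printed
# by the jsonpath query above) decodes to a PEM-encoded private key.
is_pem_private_key() {
  printf '%s' "$1" | base64 -d 2>/dev/null | head -n 1 |
    grep -q -e '-----BEGIN .*PRIVATE KEY-----'
}

# Usage:
# SA_KEY=$(kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}')
# is_pem_private_key "$SA_KEY" && echo "service account key looks valid"
```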

All control plane node machine configurations should be patched with the service account key:

```bash
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "LS0tLS1CRUdJTiBSU0EgUFJJVkFURS..."}}]'
patched mc at the node 172.20.0.2
```
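
Pasting the long key into the patch by hand is error-prone, so the patch string can be built from a shell variable instead. A minimal sketch; `service_account_patch` and `SA_KEY` are hypothetical names. Base64 output only contains `[A-Za-z0-9+/=]`, so no JSON escaping is needed:

```bash
# Hypothetical helper: builds the JSON patch shown above from the base64 key
# fetched from Kubernetes, so the key does not have to be pasted by hand.
service_account_patch() {
  printf '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "%s"}}]' "$1"
}

# Usage:
# SA_KEY=$(kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}')
# talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p "$(service_account_patch "$SA_KEY")"
```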

The aggregator CA can be generated using OpenSSL or any other certificate generation tool: an RSA or ECDSA CA certificate with CN `front-proxy`, valid for 10 years.
The PEM-encoded CA certificate and key should be base64-encoded and patched into the machine config at path `/cluster/aggregatorCA`:

```bash
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "LS0tLS1CRUdJTiBDRVJUSUZJQ...", "key": "LS0tLS1CRUdJTiBFQy..."}}]'
patched mc at the node 172.20.0.2
```
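
One way to produce the aggregator CA with OpenSSL is sketched below. This is a minimal sketch under stated assumptions: the ECDSA P-256 key type, the file names, and `-days 3650` for "10 years" are choices made here, and `generate_aggregator_ca`/`aggregator_ca_patch` are hypothetical helpers. It relies on OpenSSL's default `req -x509` behavior (stock `openssl.cnf` applies the `v3_ca` extensions that mark the certificate as a CA), so inspect the result with `openssl x509 -noout -text` before using it:

```bash
# Hypothetical helpers; key type (ECDSA P-256) and file names are assumptions.
# Generates a self-signed CA with CN "front-proxy", valid for ~10 years, in $1.
generate_aggregator_ca() {
  local dir="$1"
  openssl ecparam -name prime256v1 -genkey -noout -out "$dir/aggregator-ca.key" || return 1
  openssl req -x509 -new -key "$dir/aggregator-ca.key" -subj "/CN=front-proxy" \
    -days 3650 -out "$dir/aggregator-ca.crt" || return 1
}

# Prints the JSON patch for /cluster/aggregatorCA, with the PEM certificate
# and key from $1 base64-encoded on a single line each.
aggregator_ca_patch() {
  local dir="$1" crt key
  crt=$(base64 < "$dir/aggregator-ca.crt" | tr -d '\n')
  key=$(base64 < "$dir/aggregator-ca.key" | tr -d '\n')
  printf '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "%s", "key": "%s"}}]' "$crt" "$key"
}

# Usage:
# dir=$(mktemp -d) && generate_aggregator_ca "$dir"
# talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p "$(aggregator_ca_patch "$dir")"
```

The EC key written by `openssl ecparam -genkey -noout` starts with `-----BEGIN EC PRIVATE KEY-----`, which matches the `LS0tLS1CRUdJTiBFQy...` prefix in the example patch.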

At this point, the static pod definitions and bootstrap manifests should be rendered; see "Automated Conversion" for how to verify the generated objects.
Feel free to keep refining your machine configuration until the generated static pod definitions and bootstrap manifests look good.
If the static pod definitions are not generated, check the logs with `talosctl -n <IP> logs controller-runtime`.

Disable `pod-checkpointer` with:

```bash
$ kubectl -n kube-system delete ds pod-checkpointer
daemonset.apps "pod-checkpointer" deleted
```

Wait for all pod checkpoints to be removed:

```bash
$ kubectl -n kube-system get pods
NAME                                            READY   STATUS    RESTARTS   AGE
...
pod-checkpointer-8q2lh-talos-default-master-2   1/1     Running   0          3m34s
pod-checkpointer-nnm5w-talos-default-master-3   1/1     Running   0          3m24s
pod-checkpointer-qnmdt-talos-default-master-1   1/1     Running   0          2m21s
```

Pod checkpoints have the annotation `checkpointer.alpha.coreos.com/checkpoint-of`.
Once all pod checkpoints are removed (this takes around 5 minutes), proceed by removing the self-hosted initialized key:

```bash
$ talosctl -n <CONTROL_PLANE_IP> convert-k8s --remove-initialized-key
```
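
The wait for pod checkpoints can be scripted by polling for pods that still carry the `checkpointer.alpha.coreos.com/checkpoint-of` annotation, and running `--remove-initialized-key` only after the loop returns. A minimal sketch, assuming a POSIX shell with `local` and `awk`; `list_checkpoints` and `wait_for_checkpoints_removed` are hypothetical helpers, and the 30-second poll interval is arbitrary:

```bash
# Hypothetical helpers (not part of talosctl/kubectl).
# Reads "<pod name><TAB><checkpoint-of annotation>" lines on stdin and prints
# the names of pods that are still pod checkpoints (non-empty annotation).
list_checkpoints() {
  awk -F '\t' '$2 != "" { print $1 }'
}

# Polls kube-system every 30 seconds until no pod carries the
# checkpointer.alpha.coreos.com/checkpoint-of annotation.
wait_for_checkpoints_removed() {
  local pods remaining
  while :; do
    pods=$(kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.checkpointer\.alpha\.coreos\.com/checkpoint-of}{"\n"}{end}') || return 1
    remaining=$(printf '%s\n' "$pods" | list_checkpoints)
    [ -z "$remaining" ] && return 0
    echo "still waiting for pod checkpoints: $remaining" >&2
    sleep 30
  done
}

# Usage:
# wait_for_checkpoints_removed && talosctl -n <CONTROL_PLANE_IP> convert-k8s --remove-initialized-key
```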

Talos controllers will now render static pod definitions, and the kubelet will launch the resulting static pods.
Once the static pods are visible in the `kubectl get pods -n kube-system` output, proceed by removing each of the self-hosted daemonsets:

```bash
$ kubectl -n kube-system delete daemonset kube-apiserver
daemonset.apps "kube-apiserver" deleted
```

Make sure the static pods for `kube-apiserver` started successfully and that the pods are running and ready.
Proceed by deleting the `kube-controller-manager` and `kube-scheduler` daemonsets, verifying that the static pods are running between each step:

```bash
$ kubectl -n kube-system delete daemonset kube-controller-manager
daemonset.apps "kube-controller-manager" deleted
```

```bash
$ kubectl -n kube-system delete daemonset kube-scheduler
daemonset.apps "kube-scheduler" deleted
```
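
The "running and ready" check between the daemonset deletions can be scripted against the `kubectl get pods` table. A minimal sketch, assuming `awk` and that both the static pods and the old self-hosted pods are named `<component>-...` (so the check keeps failing while old pods are still terminating); `component_pods_ready` is a hypothetical helper:

```bash
# Hypothetical helper: reads `kubectl -n kube-system get pods --no-headers`
# output on stdin and succeeds only if at least one pod named "<component>-..."
# exists and every such pod is Running with all containers ready ("n/n").
component_pods_ready() {
  awk -v prefix="$1-" '
    index($1, prefix) == 1 {
      found = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1
        bad = 1
      }
    }
    END { if (found && !bad) exit 0; exit 1 }
  '
}

# Usage: wait for kube-apiserver before deleting the next daemonset.
# until kubectl -n kube-system get pods --no-headers | component_pods_ready kube-apiserver; do
#   sleep 10
# done
```

The same loop applies to `kube-controller-manager` and `kube-scheduler` after their daemonsets are deleted.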