docs: add control plane conversion guide and 0.9 upgrade notes #3278

Merged · 1 commit · Mar 10, 2021

255 changes: 255 additions & 0 deletions website/content/docs/v0.9/Guides/converting-control-plane.md
@@ -0,0 +1,255 @@
---
title: "Converting Control Plane"
description: "How to convert Talos self-hosted Kubernetes control plane (pre-0.9) to static pods based one."
---

Talos version 0.9 runs the Kubernetes control plane in a new way: static pods managed by Talos.
Talos version 0.8 and below runs a self-hosted control plane.
After upgrading Talos OS to version 0.9, the Kubernetes control plane should be converted to run as static pods.

This guide describes the automated conversion script and also shows the detailed manual conversion process.

## Automated Conversion

First, make sure all nodes are updated to Talos 0.9:

```bash
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-default-master-1 Ready control-plane,master 58m v1.20.4 172.20.0.2 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-master-2 Ready control-plane,master 58m v1.20.4 172.20.0.3 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-master-3 Ready control-plane,master 58m v1.20.4 172.20.0.4 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-worker-1 Ready <none> 58m v1.20.4 172.20.0.5 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
```

Start the conversion script:

```bash
$ talosctl -n <IP> convert-k8s
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
current self-hosted status: true
gathering control plane configuration
aggregator CA key can't be recovered from bootkube-boostrapped control plane, generating new CA
patching master node "172.20.0.2" configuration
patching master node "172.20.0.3" configuration
patching master node "172.20.0.4" configuration
waiting for static pod definitions to be generated
waiting for manifests to be generated
Talos generated control plane static pod definitions and bootstrap manifests, please verify them with commands:
talosctl -n <master node IP> get StaticPods.kubernetes.talos.dev
talosctl -n <master node IP> get Manifests.kubernetes.talos.dev

bootstrap manifests will only be applied for missing resources, existing resources will not be updated
```

Contributor:
What does this mean? Missing from where? What bootstrap manifests? Resources existing where? Is this something I (the user) need to worry about?

Member Author:
Not sure if we can omit that from the output completely, but the idea is pretty simple: once conversion is complete, Talos nodes will try to create any missing Kubernetes resources found in the bootstrap manifests (which are in turn generated from the Talos machine configuration).
So if someone modified some resource, that's fine, as Talos won't overwrite it; but if someone deleted some manifest, Talos will attempt to re-create it.

Member Author:
It is explained more around lines 108-112, but probably the script itself deserves a better message.

```bash
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]:
```

Contributor:
The user isn't likely to know anything about the pod checkpointer, so they cannot make this choice.
Instead, perhaps:

In order to upgrade components, we need to disable the Pod Checkpointer, because the Pod Checkpointer's job is to replace components that are supposed to run if they stop running.
Once you disable the Pod Checkpointer, you should avoid rebooting your control plane nodes until the entire conversion is complete.
Do you wish to continue now?

Member Author:
It looks too verbose to me, but let me take this to another PR, as this is script output and it should be fixed in the code first; I'll update the docs in a subsequent PR once I fix the code itself.

The script stops at this point, waiting for confirmation.
Talos still runs the self-hosted control plane, and static pods have not been rendered yet.

As instructed by the script, please verify that static pod definitions are correct:

```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 1
    phase: running
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "2"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
    spec:
        containers:
            - command:
              ...
```

Static pod definitions are generated from the machine configuration and should match the pod template generated by Talos when bootstrapping the self-hosted control plane, unless manual changes were applied to the daemonset specs after bootstrap.
Talos patches the machine configuration with the container image versions scraped from the daemonset definitions and fetches the service account key from Kubernetes secrets.

The Aggregator CA can't be recovered from the self-hosted control plane, so a new CA gets generated.
This is generally harmless and not visible from outside the cluster.
The Aggregator CA is _not_ the same CA as is used by Talos or Kubernetes standard API.
It is a special PKI used for aggregating API extension services inside your cluster.
If you have non-standard apiserver aggregations (fairly rare, and you should know if you do), then you may need to restart these services after the new CA is in place.
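If you want to check whether any extension API services are registered in your cluster, one quick way (an illustrative check, not part of the conversion script) is to list the `APIService` objects:

```bash
# APIService objects whose SERVICE column is not "Local" are backed by extension API servers
# and may need a restart after the new Aggregator CA is in place
kubectl get apiservices
```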

Contributor:
An additional note should be added here to tell the user to say NO to this for the first run, so that they can verify the manifests, as indicated below.

Member Author:
Technically they don't have to say no, as they can use another window to perform lookups.

Contributor:
My key point here is that we want to prevent people from blindly saying "yes" before they have inspected things. The pipelining is convenient, but if it requires branching to perform checks, we should make that apparent before they blindly say "yes".

Verify that the bootstrap manifests are correct:

```bash
$ talosctl -n <IP> get manifests --namespace controlplane
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
```

```bash
$ talosctl -n <IP> get manifests --namespace=extras
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 extras Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
```

Make sure that manifests and static pods are correct across all control plane nodes, as each node reconciles
control plane state on its own.
For example, the CNI configuration in the machine config should be in sync across all the nodes.
Talos nodes try to create any missing Kubernetes resources from the manifests, but they never
update or delete existing resources.
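For example, the comma-separated node list syntax used elsewhere in this guide can be used to fetch the rendered objects from every control plane node in one call and compare them (the node IPs below are from the example cluster above):

```bash
# Compare static pod definitions and bootstrap manifests across all control plane nodes
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 get staticpods
talosctl -n 172.20.0.2,172.20.0.3,172.20.0.4 get manifests --namespace controlplane
```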

If something looks wrong, the script can be aborted and the machine configuration updated to fix the problem.
Once the configuration is updated, the script can be restarted.

If the static pod definitions and manifests look good, confirm the next step to disable `pod-checkpointer`:

```bash
$ talosctl -n <IP> convert-k8s
...
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]: yes
disabling pod-checkpointer
deleting daemonset "pod-checkpointer"
checking for active pod checkpoints
2021/03/09 23:37:25 retrying error: found 3 active pod checkpoints: [pod-checkpointer-655gc-talos-default-master-3 pod-checkpointer-pw6mv-talos-default-master-1 pod-checkpointer-zdw9z-talos-default-master-2]
2021/03/09 23:42:25 retrying error: found 1 active pod checkpoints: [pod-checkpointer-pw6mv-talos-default-master-1]
confirm applying static pod definitions and manifests [yes/no]:
```

The self-hosted control plane runs `pod-checkpointer` to work around issues with control plane availability.
It should be disabled before the conversion starts to allow the self-hosted control plane to be removed.
It takes around 5 minutes for the `pod-checkpointer` to be fully disabled.
The script verifies that all checkpoints are removed before proceeding.
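While the script waits, the remaining checkpoints can be watched from another terminal (a simple illustrative check):

```bash
# Pod checkpoints keep copies of the control plane pods pinned until they expire
kubectl -n kube-system get pods | grep pod-checkpointer
```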

This last confirmation is the point of no return: once confirmed, there is no way to keep running the self-hosted control plane.
Static pods are released, bootstrap manifests are applied, and the self-hosted control plane is removed.

```bash
$ talosctl -n <IP> convert-k8s
...
confirm applying static pod definitions and manifests [yes/no]: yes
removing self-hosted initialized key
waiting for static pods for "kube-apiserver" to be present in the API server state
waiting for static pods for "kube-controller-manager" to be present in the API server state
waiting for static pods for "kube-scheduler" to be present in the API server state
deleting daemonset "kube-apiserver"
waiting for static pods for "kube-apiserver" to be present in the API server state
deleting daemonset "kube-controller-manager"
waiting for static pods for "kube-controller-manager" to be present in the API server state
deleting daemonset "kube-scheduler"
waiting for static pods for "kube-scheduler" to be present in the API server state
conversion process completed successfully
```

As soon as the control plane static pods are rendered, the kubelet starts them.
It is expected that the pods for `kube-apiserver` will crash initially.
Only one `kube-apiserver` can be bound to the host `Node`'s port 6443 at a time.
Eventually, the old `kube-apiserver` will be killed, and the new one will be able to start.
This is all handled automatically.
The script will continue by removing each self-hosted daemonset and verifying that static pods are ready and healthy.
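To follow along from another terminal, the control plane pods can be watched as the static pods appear and the self-hosted pods are removed (this sketch assumes the `tier: control-plane` label from the rendered static pod definitions above):

```bash
kubectl -n kube-system get pods -l tier=control-plane -o wide --watch
```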

## Manual Conversion

Check that Talos runs the self-hosted control plane:

```bash
$ talosctl -n <CONTROL_PLANE_IP> get bs
NODE NAMESPACE TYPE ID VERSION SELF HOSTED
172.20.0.2 runtime BootstrapStatus control-plane 2 true
```

The Talos machine configuration needs to be updated to the 0.9 format; there are two new required machine configuration settings:

* `.cluster.serviceAccount` is the service account PEM-encoded private key.
* `.cluster.aggregatorCA` is the aggregator CA for `kube-apiserver` (certificate and private key).

The current service account key can be fetched from the Kubernetes secrets:

```bash
$ kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}'
LS0tLS1CRUdJTiBSU0EgUFJJVkFURS...
```

All control plane node machine configurations should be patched with the service account key:

```bash
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "LS0tLS1CRUdJTiBSU0EgUFJJVkFURS..."}}]'
patched mc at the node 172.20.0.2
```

The Aggregator CA can be generated using OpenSSL or any other certificate generation tool: an RSA or ECDSA certificate with CN `front-proxy`, valid for 10 years.
The PEM-encoded CA certificate and key should be base64-encoded and patched into the machine config at the path `/cluster/aggregatorCA` (one way to generate such a CA is sketched after the patch command below):

```bash
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "S0tLS1CRUdJTiBDRVJUSUZJQ...", "key": "LS0tLS1CRUdJTiBFQy..."}}]'
patched mc at the node 172.20.0.2
```
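For reference, a minimal sketch of generating such a CA with OpenSSL (an illustrative example, not the only supported tooling; the file names are arbitrary and `base64 -w0` assumes GNU coreutils):

```bash
# ECDSA P-256 key and a self-signed certificate with CN=front-proxy, valid for 10 years
openssl ecparam -name prime256v1 -genkey -noout -out aggregator-ca.key
openssl req -x509 -new -key aggregator-ca.key -subj "/CN=front-proxy" -days 3650 -out aggregator-ca.crt

# Base64-encode the PEM files for the machine config patch above
base64 -w0 aggregator-ca.crt
base64 -w0 aggregator-ca.key
```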

At this point, static pod definitions and bootstrap manifests should be rendered; please see "Automated Conversion" above for how to verify the generated objects.
Feel free to continue to refine your machine configuration until the generated static pod definitions and bootstrap manifests look good.

If static pod definitions are not generated, check logs with `talosctl -n <IP> logs controller-runtime`.

Disable `pod-checkpointer` with:

```bash
$ kubectl -n kube-system delete ds pod-checkpointer
daemonset.apps "pod-checkpointer" deleted
```

Wait for all pod checkpoints to be removed:

```bash
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
pod-checkpointer-8q2lh-talos-default-master-2 1/1 Running 0 3m34s
pod-checkpointer-nnm5w-talos-default-master-3 1/1 Running 0 3m24s
pod-checkpointer-qnmdt-talos-default-master-1 1/1 Running 0 2m21s
```

Pod checkpoints carry the annotation `checkpointer.alpha.coreos.com/checkpoint-of`.
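One way to list only the remaining checkpoints by that annotation (an illustrative check that assumes `jq` is installed):

```bash
kubectl -n kube-system get pods -o json \
  | jq -r '.items[] | select(.metadata.annotations["checkpointer.alpha.coreos.com/checkpoint-of"] != null) | .metadata.name'
```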

Once all the pod checkpoints are removed (it takes around 5 minutes for the checkpoints to be removed), proceed by removing the self-hosted initialized key:

```bash
talosctl -n <CONTROL_PLANE_IP> convert-k8s --remove-initialized-key
```

Talos controllers will now render static pod definitions, and the kubelet will launch any resulting static pods.

Once static pods are visible in `kubectl get pods -n kube-system` output, proceed by removing each of the self-hosted daemonsets:

```bash
$ kubectl -n kube-system delete daemonset kube-apiserver
daemonset.apps "kube-apiserver" deleted
```

Make sure the static pods for `kube-apiserver` started successfully and the pods are running and ready.
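For example, using the `k8s-app` label from the rendered static pod definitions shown earlier in this guide:

```bash
# All listed pods should be Running and 1/1 Ready
kubectl -n kube-system get pods -l k8s-app=kube-apiserver -o wide
```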

Proceed by deleting the `kube-controller-manager` and `kube-scheduler` daemonsets, verifying that static pods are running and ready after each step:

```bash
$ kubectl -n kube-system delete daemonset kube-controller-manager
daemonset.apps "kube-controller-manager" deleted
```

```bash
$ kubectl -n kube-system delete daemonset kube-scheduler
daemonset.apps "kube-scheduler" deleted
```
63 changes: 54 additions & 9 deletions website/content/docs/v0.9/Guides/upgrading-talos.md
@@ -3,14 +3,54 @@ title: Upgrading Talos
---

Talos upgrades are effected by an API call.
The `talosctl` CLI utility will facilitate this, or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager).
The `talosctl` CLI utility will facilitate this.
<!-- , or you can use the automatic upgrade features provided by the [talos controller manager](https://github.com/talos-systems/talos-controller-manager) -->

## Video Walkthrough

To see a live demo of this writeup, see the video below:

<iframe width="560" height="315" src="https://www.youtube.com/embed/sw78qS8vBGc" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Upgrading from Talos 0.8

Talos 0.9 drops support for `bootkube` and the self-hosted control plane.

Please make sure Talos is upgraded to the latest minor release of 0.8 first (0.8.4 at the moment
of this writing), then proceed with upgrading to the latest minor release of 0.9.

### Before Upgrade to 0.9

If the cluster was bootstrapped on a Talos version earlier than 0.8.3, add checkpointer annotations to
the `kube-scheduler` and `kube-controller-manager` daemonsets to improve the resiliency of the
self-hosted control plane to reboots (this is critical for single control-plane-node clusters):

```bash
$ kubectl -n kube-system patch daemonset kube-controller-manager --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
daemonset.apps/kube-controller-manager patched
$ kubectl -n kube-system patch daemonset kube-scheduler --type json -p '[{"op": "add", "path":"/spec/template/metadata/annotations", "value": {"checkpointer.alpha.coreos.com/checkpoint": "true"}}]'
daemonset.apps/kube-scheduler patched
```

Talos 0.9 only supports Kubernetes versions 1.19.x and 1.20.x.
If running 1.18.x, please upgrade Kubernetes before upgrading Talos.
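One quick way to check the running Kubernetes version (the server version and the per-node kubelet versions, respectively):

```bash
kubectl version --short
kubectl get nodes -o wide
```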

Make sure the cluster is running the latest minor release of Talos 0.8.

Prepare by downloading the `talosctl` binary for Talos release 0.9.x.

### After Upgrade to 0.9

After the upgrade to 0.9, Talos will still be running the self-hosted control plane until the [conversion process](../converting-control-plane/) is run.

> Note: Talos 0.9 doesn't include the bootkube recovery option (`talosctl recover`), so
> it's not possible to recover the self-hosted control plane after upgrading to 0.9.

As soon as all the nodes are upgraded to 0.9, run `talosctl convert-k8s` to convert the control plane
to the new static pod format for 0.9.

Once the conversion process is complete, Kubernetes can be upgraded.

## `talosctl` Upgrade

To manually upgrade a Talos node, you will specify the node's IP address and the
@@ -29,6 +69,10 @@ There is an option to this command: `--preserve`, which can be used to explicitly
In most cases, it is correct to just let Talos perform its default action.
However, if you are running a single-node control-plane, you will want to make sure that `--preserve=true`.

If Talos fails to run the upgrade, the `--staged` flag may be used to stage the upgrade: it is applied after a reboot,
which is followed by another reboot into the upgraded version.

<!--
## Talos Controller Manager

The Talos Controller Manager can coordinate upgrades of your nodes
@@ -43,16 +87,17 @@ configured, take a look at the [GitHub page](https://github.com/talos-systems/ta
Please note that the controller manager is still in fairly early development.
More advanced features, such as time slot scheduling, will be coming in the
future.
-->

## Changelog and Upgrade Notes
## Machine Configuration Changes

In an effort to create more production ready clusters, Talos will now taint control plane nodes as unschedulable.
This means that any application you might have deployed must tolerate this taint if you intend on running the application on control plane nodes.
Talos 0.9 introduces new required parameters in machine configuration:

Another feature you will notice is the automatic uncordoning of nodes that have been upgraded.
Talos will now uncordon a node if the cordon was initiated by the upgrade process.
* `.cluster.aggregatorCA`
* `.cluster.serviceAccount`

### Talosctl
Talos supports both ECDSA and RSA certificates and keys for Kubernetes and etcd, with ECDSA being the default.
Talos <= 0.8 supports only RSA keys and certificates.

The `talosctl` CLI now requires an explicit set of nodes.
This can be configured with `talos config nodes` or set on the fly with `talos --nodes`.
The `talosctl gen config` utility generates configs in the 0.9 format by default, which is not compatible with
Talos 0.8; the old format can be generated with `talosctl gen config --talos-version=v0.8`.
16 changes: 0 additions & 16 deletions website/content/docs/v0.9/Learn More/upgrades.md
@@ -109,19 +109,3 @@ automatically?
**A.** Yes.

We provide the [Talos Controller Manager](https://github.com/talos-systems/talos-controller-manager) to perform this maintenance in a simple, controllable fashion.

## Upgrade Notes for Talos 0.8

Talos 0.8 comes with new [KSPP requirements](https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings) compliance check.

Following kernel arguments are mandatory for Talos to boot successfully:

- `init_on_alloc=1`: required by KSPP
- `slab_nomerge`: required by KSPP
- `pti=on`: required by KSPP

Talos installer automatically injects those args while installing Talos, so this mostly is required when PXE booting Talos.

## Kubernetes

Kubernetes upgrades with Talos are covered in a [separate document](../../guides/upgrading-kubernetes/).