
[epic] Bootstrap data synchronization #3040

Closed
brandond opened this issue Mar 10, 2021 · 6 comments
Assignees
Labels
kind/bug Something isn't working priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@brandond
Member

brandond commented Mar 10, 2021

Environmental Info:
K3s Version:
All versions

Node(s) CPU architecture, OS, and Version:
N/A

Cluster Configuration:
N/A

Describe the bug:
The cluster bootstrap data (CA certs, etc.) is only written to the datastore once, by the first node, after the initial startup generates the keying material. If any of the certificates expire and are renewed, or are otherwise altered by the end user, the bootstrap data in the datastore will contain stale data. This causes problems when the bootstrap data is used by new nodes joining the cluster, or when the cluster datastore is restored from backup.

Additionally, the bootstrap data is NEVER written to the datastore when using managed etcd. This means that the complete cluster state cannot be restored from an etcd snapshot.

Steps To Reproduce:

Cert expiry with external DB

  1. Start Node A as server using external DB
  2. Stop Node A, move the clock 10+ years into the future, past the CA certificate expiration
  3. Start Node A, note that certificates are renewed to extend expiry
  4. Join Node B as server to the cluster.
    Current behavior: Node B comes up with the original, expired certs from the datastore that need to be renewed locally.
    Desired behavior: Node B comes up with the renewed certificates as updated by Node A.
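The expiry condition in steps 2-3 can be checked without an actual clock jump. A hedged stand-in using a throwaway cert (in a real cluster you would point the same check at the `/var/lib/rancher/k3s/server/tls/*-ca.crt` files; the `/tmp` paths and CN here are made up):

```shell
# Hypothetical stand-in for the clock-jump repro: mint a CA that expires in 1 day,
# then use `openssl x509 -checkend` the way you would against the real
# /var/lib/rancher/k3s/server/tls/*-ca.crt files.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/short-ca.key \
  -out /tmp/short-ca.crt -days 1 -subj "/CN=short-lived-ca" 2>/dev/null

# -checkend N exits non-zero if the cert expires within the next N seconds;
# 172800s = 48h, so this short-lived CA fails the check:
if ! openssl x509 -in /tmp/short-ca.crt -noout -checkend 172800; then
  echo "CA expires within 48h; k3s would renew it on next start"
fi
```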

Cert expiry with user-provided certs

  1. Pre-create CA certificates on disk, signed by an existing trusted CA.
  2. Start Node A as server using any datastore type
  3. Stop Node A, move the clock into the future, past the CA certificate expiration
    Current behavior: Certificates are renewed to extend expiry, but are now self-signed. This may break things in other interesting ways?
    Desired behavior: K3s fails to start with error indicating that user-provided certificates cannot be renewed.
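A minimal sketch of step 1, pre-creating a server CA signed by an existing trusted root. All filenames, paths, and CNs below are illustrative; the real files would land under `/var/lib/rancher/k3s/server/tls/`:

```shell
# Hypothetical sketch: a trusted root, plus an intermediate "server CA" signed by it.
# k3s would be handed the intermediate; names and /tmp paths are illustrative only.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/root-ca.key \
  -out /tmp/root-ca.crt -days 7300 -subj "/CN=existing-trusted-root" 2>/dev/null

openssl req -newkey rsa:2048 -nodes -keyout /tmp/server-ca.key \
  -out /tmp/server-ca.csr -subj "/CN=k3s-server-ca" 2>/dev/null

# Sign the intermediate as a CA (basicConstraints=CA:TRUE):
printf 'basicConstraints=critical,CA:TRUE\n' > /tmp/ca-ext.cnf
openssl x509 -req -in /tmp/server-ca.csr -CA /tmp/root-ca.crt -CAkey /tmp/root-ca.key \
  -CAcreateserial -days 3650 -extfile /tmp/ca-ext.cnf \
  -out /tmp/server-ca.crt 2>/dev/null

# Sanity-check the chain before handing the intermediate to k3s:
openssl verify -CAfile /tmp/root-ca.crt /tmp/server-ca.crt
```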

Cert restoration with external DB

  1. Start Node A as server using external DB
  2. Stop Node A
  3. Delete certificates from disk
  4. Start Node A
    Current behavior: Certificates are restored from the database and the cluster starts up normally
    Desired behavior: Certificates are restored from the database and the cluster starts up normally

Cert replacement with external DB

  1. Start Node A as server using external DB
  2. Stop Node A
  3. Update --datastore-endpoint to point at a different cluster's database
  4. Start Node A
    Current behavior: CA certificates and other keying material from the original database are used instead of the certificates from the new cluster's database, breaking things in strange and interesting ways
    Desired behavior: CA certificates from new cluster's database are written to disk, and any other downstream keying materials (encryption configuration, ipsec keys, token signing certs, kubeconfig client certs, etc) are regenerated as well.

Cert restoration with managed etcd

  1. Start Node A as server using managed etcd
  2. Stop Node A
  3. Delete certificates from disk
  4. Start Node A
    Current behavior: New certificates are generated and the cluster fails to start properly (cert errors from kubectl, pods crash, etc)
    Desired behavior: Certificates are restored from the datastore via HTTP bootstrap from another node in the cluster - if possible. May need a --cluster-reset --cluster-reset-restore in order to properly extract the correct certs from the etcd datastore.
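For reference, the reset/restore invocation alluded to above looks roughly like this; the snapshot filename is made up, and `--cluster-reset-restore-path` is the server flag that pairs with `--cluster-reset` for restoring from an etcd snapshot:

```shell
# Sketch of the reset/restore path; snapshot name below is illustrative only.
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/on-demand-node-a-0000000000
```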

Kubeconfig restoration with any datastore

  1. Start Node A as server using any datastore
  2. Stop Node A
  3. Delete /var/lib/rancher/k3s/server/cred/*.kubeconfig
  4. Start Node A
    Current behavior: Components fail to start due to missing kubeconfigs, which are only generated if the certs+keys for the relevant kubeconfig are missing.
    Desired behavior: Kubeconfigs are regenerated

Token encryption rotation with any datastore type

  1. Start Node A as a server using any datastore, and --token=oldpass
  2. Restart Node A with --token=newpass
  3. Stop K3s on Node A, and delete /var/lib/rancher/k3s
  4. Start K3s on Node A. If using managed etcd, attempt to restore from an etcd snapshot. If using external datastore, no action is necessary.
    Current behavior: Restoration (or connection to current database) fails as the bootstrap data does not match, and new CA certificates and keying material are generated.
    Desired behavior: Bootstrap data is encrypted with new key when the token is changed, and subsequent restores or reconnections to the external datastore properly load the previous CA certs and keying material.
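The token rotation in steps 1-2 is just a restart with a different `--token`; sketched below, with the token values taken from the repro steps and the datastore endpoint left as a placeholder:

```shell
# Steps 1-2 as commands; 'oldpass'/'newpass' come from the repro above,
# and ${DATASTORE} stands in for whichever datastore endpoint is in use.
k3s server --datastore-endpoint="${DATASTORE}" --token=oldpass   # initial start
# ...stop k3s...
k3s server --datastore-endpoint="${DATASTORE}" --token=newpass   # restart with new token
```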

Additional context / logs:
Is preventing #2902 from actually working
Related to behavior described in #3015
Related to secrets encryption rotation needed for #3407

@brandond brandond added this to the 1.21 initial release milestone Mar 10, 2021
@brandond brandond added kind/bug Something isn't working priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 10, 2021
@davidnuzik
Contributor

Fixed in etcd so far but not sqlite. Bump to next milestone as this is not critical.

@galal-hussein
Contributor

As far as I can see, the issue is still not fixed in etcd:

  • start k3s server with --cluster-init
bash-5.1# md5sum /var/lib/rancher/k3s/server/tls/*-ca.crt
482a736a585e5f889b9c01a5ba606fc5  /var/lib/rancher/k3s/server/tls/client-ca.crt
3276baf0d72a5a3598bfcd209fa0f3d4  /var/lib/rancher/k3s/server/tls/request-header-ca.crt
99d3b20961505410426a4234bd1cb898  /var/lib/rancher/k3s/server/tls/server-ca.crt

bash-5.1# sudo ETCDCTL_API=3 etcd-v3.4.15-linux-amd64/etcdctl --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt get --prefix / --keys-only | grep "bootstrap/"
/bootstrap/614b7a1032e4
      
bash-5.1# rm -rf /var/lib/rancher/k3s/server/tls/

restart k3s server with --cluster-init

bash-5.1# md5sum /var/lib/rancher/k3s/server/tls/*-ca.crt
9b151203b64baa9680a3e0a94ca4956c  /var/lib/rancher/k3s/server/tls/client-ca.crt
215bc4dd6fa94668d29dea916d40cc0e  /var/lib/rancher/k3s/server/tls/request-header-ca.crt
11a7d01d3272e3f4e201788555ac04eb  /var/lib/rancher/k3s/server/tls/server-ca.crt

bash-5.1# sudo ETCDCTL_API=3 etcd-v3.4.15-linux-amd64/etcdctl --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt get --prefix / --keys-only | grep "bootstrap/"
/bootstrap/614b7a1032e4

@brandond
Member Author

brandond commented Jun 7, 2021

Yeah, as noted in our call earlier, the cluster-reset-restore path will overwrite the certs on disk with the bootstrap data from the datastore, but the logic around what we sync and when needs to be completely rethought.

@fapatel1

This depends on #3015; @briandowns to link the other issue that Hussein is working on.

@brandond brandond changed the title Bootstrap data is never updated once written, and is never written at all for managed etcd [epic] Bootstrap data synchronization Jun 16, 2021
@davidnuzik davidnuzik modified the milestones: v1.21.3+k3s1, v1.21.4+k3s1 Jul 7, 2021
@cjellick cjellick modified the milestones: v1.21.4+k3s1, v1.21.5+k3s1 Aug 13, 2021
@Oats87 Oats87 modified the milestones: v1.21.5+k3s1, v1.21.6+k3s1 Sep 20, 2021
@cwayne18
Member

cwayne18 commented Oct 5, 2021

The PR attached to #3015 should cover this epic; let's move it to "To test" once that lands.

@rancher-max rancher-max self-assigned this Oct 19, 2021
@rancher-max
Contributor

Validated on master-branch commit 8271d98a766b060463bc73ef66c5085b5797b4cc following the same steps as in #3015 (comment)

9 participants