
[epic] Bootstrap data synchronization #3040

Closed
brandond opened this issue Mar 10, 2021 · 6 comments
Assignees
Labels
kind/bug Something isn't working priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@brandond
Member

brandond commented Mar 10, 2021

Environmental Info:
K3s Version:
All versions

Node(s) CPU architecture, OS, and Version:
N/A

Cluster Configuration:
N/A

Describe the bug:
The cluster bootstrap data (CA certs, etc.) is only written to the datastore once, by the first node, after the initial startup generates the keying material. If any of the certificates expire and are renewed, or are otherwise altered by the end user, the bootstrap data in the datastore will contain stale data. This causes problems when the bootstrap data is used by new nodes joining the cluster, or when the cluster datastore is restored from backup.

Additionally, the bootstrap data is NEVER written to the datastore when using managed etcd. This means that the complete cluster state cannot be restored from an etcd snapshot.

Steps To Reproduce:

Cert expiry with external DB

  1. Start Node A as server using external DB
  2. Stop Node A, move the clock 10+ years into the future, past the CA certificate expiration
  3. Start Node A, note that certificates are renewed to extend expiry
  4. Join Node B as server to the cluster.
    Current behavior: Node B comes up with the original, expired certs from the datastore that need to be renewed locally.
    Desired behavior: Node B comes up with the renewed certificates as updated by Node A.
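The expiry condition in steps 2-3 can be checked without an actual clock jump. A hedged stand-in using a throwaway cert (in a real cluster you would point the same check at the `/var/lib/rancher/k3s/server/tls/*-ca.crt` files; the `/tmp` paths and CN here are made up):

```shell
# Hypothetical stand-in for the clock-jump repro: mint a CA that expires in 1 day,
# then use `openssl x509 -checkend` the way you would against the real
# /var/lib/rancher/k3s/server/tls/*-ca.crt files.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/short-ca.key \
  -out /tmp/short-ca.crt -days 1 -subj "/CN=short-lived-ca" 2>/dev/null

# -checkend N exits non-zero if the cert expires within the next N seconds;
# 172800s = 48h, so this short-lived CA fails the check:
if ! openssl x509 -in /tmp/short-ca.crt -noout -checkend 172800; then
  echo "CA expires within 48h; k3s would renew it on next start"
fi
```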

Cert expiry with user-provided certs

  1. Pre-create CA certificates on disk, signed by an existing trusted CA.
  2. Start Node A as server using any datastore type
  3. Stop Node A, move the clock into the future, past the CA certificate expiration
    Current behavior: Certificates are renewed to extend expiry, but are now self-signed. This may break things in other interesting ways?
    Desired behavior: K3s fails to start with error indicating that user-provided certificates cannot be renewed.
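A minimal sketch of step 1, pre-creating a server CA signed by an existing trusted root. All filenames, paths, and CNs below are illustrative; the real files would land under `/var/lib/rancher/k3s/server/tls/`:

```shell
# Hypothetical sketch: a trusted root, plus an intermediate "server CA" signed by it.
# k3s would be handed the intermediate; names and /tmp paths are illustrative only.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/root-ca.key \
  -out /tmp/root-ca.crt -days 7300 -subj "/CN=existing-trusted-root" 2>/dev/null

openssl req -newkey rsa:2048 -nodes -keyout /tmp/server-ca.key \
  -out /tmp/server-ca.csr -subj "/CN=k3s-server-ca" 2>/dev/null

# Sign the intermediate as a CA (basicConstraints=CA:TRUE):
printf 'basicConstraints=critical,CA:TRUE\n' > /tmp/ca-ext.cnf
openssl x509 -req -in /tmp/server-ca.csr -CA /tmp/root-ca.crt -CAkey /tmp/root-ca.key \
  -CAcreateserial -days 3650 -extfile /tmp/ca-ext.cnf \
  -out /tmp/server-ca.crt 2>/dev/null

# Sanity-check the chain before handing the intermediate to k3s:
openssl verify -CAfile /tmp/root-ca.crt /tmp/server-ca.crt
```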

Cert restoration with external DB

  1. Start Node A as server using external DB
  2. Stop Node A
  3. Delete certificates from disk
  4. Start Node A
    Current behavior: Certificates are restored from the database and the cluster starts up normally
    Desired behavior: Certificates are restored from the database and the cluster starts up normally

Cert replacement with external DB

  1. Start Node A as server using external DB
  2. Stop Node A
  3. Update --datastore-endpoint to point at a different cluster's database
  4. Start Node A
    Current behavior: CA certificates and other keying material from the original database are used instead of the certificates from the new cluster's database, breaking things in strange and interesting ways
    Desired behavior: CA certificates from new cluster's database are written to disk, and any other downstream keying materials (encryption configuration, ipsec keys, token signing certs, kubeconfig client certs, etc) are regenerated as well.

Cert restoration with managed etcd

  1. Start Node A as server using managed etcd
  2. Stop Node A
  3. Delete certificates from disk
  4. Start Node A
    Current behavior: New certificates are generated and the cluster fails to start properly (cert errors from kubectl, pods crash, etc)
    Desired behavior: Certificates are restored from the datastore via HTTP bootstrap from another node in the cluster - if possible. May need a --cluster-reset --cluster-reset-restore in order to properly extract the correct certs from the etcd datastore.
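For reference, the reset/restore invocation alluded to above looks roughly like this; the snapshot filename is made up, and `--cluster-reset-restore-path` is the server flag that pairs with `--cluster-reset` for restoring from an etcd snapshot:

```shell
# Sketch of the reset/restore path; snapshot name below is illustrative only.
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/on-demand-node-a-0000000000
```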

Kubeconfig restoration with any datastore

  1. Start Node A as server using any datastore
  2. Stop Node A
  3. Delete /var/lib/rancher/k3s/server/cred/*.kubeconfig
  4. Start Node A
    Current behavior: Components fail to start due to missing kubeconfigs, which are only generated if the certs+keys for the relevant kubeconfig are missing.
    Desired behavior: Kubeconfigs are regenerated

Token encryption rotation with any datastore type

  1. Start Node A as a server using any datastore, and --token=oldpass
  2. Restart Node A with --token=newpass
  3. Stop K3s on Node A, and delete /var/lib/rancher/k3s
  4. Start K3s on Node A. If using managed etcd, attempt to restore from an etcd snapshot. If using external datastore, no action is necessary.
    Current behavior: Restoration (or connection to current database) fails as the bootstrap data does not match, and new CA certificates and keying material are generated.
    Desired behavior: Bootstrap data is encrypted with new key when the token is changed, and subsequent restores or reconnections to the external datastore properly load the previous CA certs and keying material.
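The token rotation in steps 1-2 is just a restart with a different `--token`; sketched below, with the token values taken from the repro steps and the datastore endpoint left as a placeholder:

```shell
# Steps 1-2 as commands; 'oldpass'/'newpass' come from the repro above,
# and ${DATASTORE} stands in for whichever datastore endpoint is in use.
k3s server --datastore-endpoint="${DATASTORE}" --token=oldpass   # initial start
# ...stop k3s...
k3s server --datastore-endpoint="${DATASTORE}" --token=newpass   # restart with new token
```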

Additional context / logs:
Is preventing #2902 from actually working
Related to behavior described in #3015
Related to secrets encryption rotation needed for #3407

@brandond brandond added this to the 1.21 initial release milestone Mar 10, 2021
@brandond brandond added kind/bug Something isn't working priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 10, 2021
@davidnuzik
Contributor

Fixed in etcd so far but not sqlite. Bump to next milestone as this is not critical.

@galal-hussein
Contributor

As far as I can see, the issue is still not fixed in etcd:

  • start k3s server with --cluster-init
bash-5.1# md5sum /var/lib/rancher/k3s/server/tls/*-ca.crt
482a736a585e5f889b9c01a5ba606fc5  /var/lib/rancher/k3s/server/tls/client-ca.crt
3276baf0d72a5a3598bfcd209fa0f3d4  /var/lib/rancher/k3s/server/tls/request-header-ca.crt
99d3b20961505410426a4234bd1cb898  /var/lib/rancher/k3s/server/tls/server-ca.crt

bash-5.1# sudo ETCDCTL_API=3 etcd-v3.4.15-linux-amd64/etcdctl --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt get --prefix / --keys-only | grep "bootstrap/"
/bootstrap/614b7a1032e4
      
bash-5.1# rm -rf /var/lib/rancher/k3s/server/tls/

restart k3s server with --cluster-init

bash-5.1# md5sum /var/lib/rancher/k3s/server/tls/*-ca.crt
9b151203b64baa9680a3e0a94ca4956c  /var/lib/rancher/k3s/server/tls/client-ca.crt
215bc4dd6fa94668d29dea916d40cc0e  /var/lib/rancher/k3s/server/tls/request-header-ca.crt
11a7d01d3272e3f4e201788555ac04eb  /var/lib/rancher/k3s/server/tls/server-ca.crt

bash-5.1# sudo ETCDCTL_API=3 etcd-v3.4.15-linux-amd64/etcdctl --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt get --prefix / --keys-only | grep "bootstrap/"
/bootstrap/614b7a1032e4

@brandond
Member Author

brandond commented Jun 7, 2021

Yeah, as noted in our call earlier, the cluster-reset-restore path will overwrite the certs on disk with the bootstrap data from the datastore, but the logic around what we sync and when needs to be completely rethought.

@fapatel1

This depends on #3015; @briandowns to link the other issue that Hussein is working on.

@brandond brandond changed the title Bootstrap data is never updated once written, and is never written at all for managed etcd [epic] Bootstrap data synchronization Jun 16, 2021
@davidnuzik davidnuzik modified the milestones: v1.21.3+k3s1, v1.21.4+k3s1 Jul 7, 2021
@cjellick cjellick modified the milestones: v1.21.4+k3s1, v1.21.5+k3s1 Aug 13, 2021
@Oats87 Oats87 modified the milestones: v1.21.5+k3s1, v1.21.6+k3s1 Sep 20, 2021
@cwayne18
Member

cwayne18 commented Oct 5, 2021

The PR attached to #3015 should cover this epic; let's move it to "To test" once that lands.

@rancher-max rancher-max self-assigned this Oct 19, 2021
@rancher-max
Contributor

Validated on master-branch commit 8271d98a766b060463bc73ef66c5085b5797b4cc following the same steps as in #3015 (comment)

9 participants