Adding a k3s server node from a previous cluster causes 'x509: certificate signed by unknown authority' #2034

Closed
dkeightley opened this issue Jul 16, 2020 · 8 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dkeightley commented Jul 16, 2020

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least number of steps possible):

  1. Create a single-server cluster (a) backed by an external datastore
  2. Create a single-server cluster (b) with only the default local SQLite datastore
  3. Stop k3s on node (b), leaving k3s installed as-is
  4. Install k3s again on node (b), this time using the external datastore of (a), i.e. forming a two-server cluster sharing the same external datastore (see the command sketch after this list)
  5. The logs below can be observed
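
For reference, a minimal command sketch of steps 1-4, assuming the standard install script, a MySQL external datastore, and a systemd-based host; the datastore URL and token below are placeholders:

# node (a): single server backed by an external datastore
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="mysql://user:pass@tcp(192.168.1.10:3306)/k3s"

# node (b): single server with the default local SQLite datastore
curl -sfL https://get.k3s.io | sh -

# node (b): stop k3s but leave the installation (and its certs/data dir) in place
sudo systemctl stop k3s

# node (b): reinstall against (a)'s external datastore to form a two-server cluster
curl -sfL https://get.k3s.io | K3S_TOKEN=<token-from-node-a> sh -s - server \
  --datastore-endpoint="mysql://user:pass@tcp(192.168.1.10:3306)/k3s"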

Result:

node (a):

Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.726999721Z" level=info msg="Tunnel endpoint watch event: [192.168.1.121:6443 192.168.1.176:6443]"
Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.728139835Z" level=info msg="Tunnel endpoint watch event: [192.168.1.176:6443]"
Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.728303781Z" level=info msg="Stopped tunnel to 192.168.1.121:6443"
Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.728463728Z" level=info msg="Connecting to proxy" url="wss://192.168.1.121:6443/v1-k3s/connect"
Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.777540692Z" level=error msg="Failed to connect to proxy" error="x509: certificate signed by unknown authority"
Jul 16 04:11:04 osboxes k3s[16905]: time="2020-07-16T04:11:04.777884602Z" level=error msg="Remotedialer proxy error" error="x509: certificate signed by unknown authority"
Jul 16 04:11:05 osboxes k3s[16905]: time="2020-07-16T04:11:05.490119463Z" level=info msg="Tunnel endpoint watch event: [192.168.1.121:6443 192.168.1.176:6443]"
Jul 16 04:11:05 osboxes k3s[16905]: time="2020-07-16T04:11:05.490549928Z" level=info msg="Connecting to proxy" url="wss://192.168.1.121:6443/v1-k3s/connect"
Jul 16 04:11:05 osboxes k3s[16905]: time="2020-07-16T04:11:05.496783853Z" level=error msg="Failed to connect to proxy" error="x509: certificate signed by unknown authority"
Jul 16 04:11:05 osboxes k3s[16905]: time="2020-07-16T04:11:05.497033531Z" level=error msg="Remotedialer proxy error" error="x509: certificate signed by unknown authority"
Jul 16 04:11:06 osboxes k3s[16905]: time="2020-07-16T04:11:06.902346885Z" level=info msg="Active TLS secret k3s-serving (ver=1772) (count 8): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-192.168.1.121:192.168.1.121 listener.cattle.io/cn-192.168.1.176:192.168.1.176 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/hash:afe1d070f4360758c4709136f4c1002a31990ae67bc8258fcf852b0b10c779f9]"
Jul 16 04:11:07 osboxes k3s[16905]: time="2020-07-16T04:11:07.470714773Z" level=error msg="failed to authenticate request: x509: certificate signed by unknown authority"

node (b):

Jul 16 04:13:02 minibox k3s[23949]: E0716 04:13:02.776513   23949 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
Jul 16 04:13:02 minibox k3s[23949]: E0716 04:13:02.776665   23949 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
Jul 16 04:13:02 minibox k3s[23949]: time="2020-07-16T04:13:02.835369211Z" level=info msg="Connecting to proxy" url="wss://192.168.1.176:6443/v1-k3s/connect"
Jul 16 04:13:02 minibox k3s[23949]: time="2020-07-16T04:13:02.850108791Z" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Jul 16 04:13:02 minibox k3s[23949]: time="2020-07-16T04:13:02.850153505Z" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Jul 16 04:13:02 minibox k3s[23949]: E0716 04:13:02.913762   23949 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
  6. Node (b) will fail to join, although the k3s-serving secret will be updated and signed by the CA on node (b).

Other details that may be helpful:

  7. To recover the cluster, the following steps were used (may need revising).

7a. On each node, delete the node object and the k3s-serving secret, then uninstall k3s

kubectl --insecure-skip-tls-verify=true delete node $(hostname -s)
kubectl --insecure-skip-tls-verify=true -n kube-system delete secret k3s-serving
/usr/local/bin/k3s-uninstall.sh

7b. Reinstall k3s on at least two nodes (for me the issue didn't recover until I added two). Deleting the k3s-serving secret and restarting k3s may be needed.
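
A minimal sketch of the secret deletion and restart mentioned in 7b, assuming a systemd-managed server:

kubectl -n kube-system delete secret k3s-serving
sudo systemctl restart k3s
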
7c. To recover from the invalidated tokens, I had to clear all service account (SA) token secrets in all namespaces and delete all pods (note: many pods were stuck in Terminating, so I used forceful commands):

kubectl get secret -A | awk '{ if ($3 == "kubernetes.io/service-account-token") system("kubectl -n " $1 " delete secret " $2) }'
kubectl delete pods -A --all --force --grace-period=0 
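
To verify that 7c took effect, read-only checks like the following can confirm that the token controller re-issued the SA token secrets and that pods were recreated:

kubectl get secret -A | grep kubernetes.io/service-account-token
kubectl get pods -A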

Cluster information

  • Kubernetes version (use kubectl version): v1.18.4+k3s1 (97b7a0e)

gz#11262

@brandond (Member)

I believe this should probably be moved to k3s, but I will let @cjellick decide.

@cjellick cjellick transferred this issue from rancher/rancher Jul 16, 2020
@brandond (Member)

This is probably an odd corner case - k3s nodes don't expect to be hot-swapped into different clusters without the previous installation's state being cleaned out first. However, a node joining the cluster should fail to do so if its local certs don't match those on the other nodes.
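
Consistent with that, cleaning out node (b) before pointing it at the shared datastore avoids the stale-cert state; a rough sketch, assuming the bundled uninstall script from the standard install (datastore URL and token are placeholders as above):

sudo /usr/local/bin/k3s-uninstall.sh   # removes the k3s service and its data under /var/lib/rancher/k3s, including the previous cluster's certs
curl -sfL https://get.k3s.io | K3S_TOKEN=<token-from-node-a> sh -s - server \
  --datastore-endpoint="mysql://user:pass@tcp(192.168.1.10:3306)/k3s"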

@brandond brandond self-assigned this Jul 18, 2020
@davidnuzik davidnuzik added this to the v1.20 - Backlog milestone Sep 15, 2020
@brandond brandond added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 23, 2020
@dverbeek84

I have the exact same issue, only I want to recover the whole cluster after cluster failures.

@fapatel1

Tracking this in #3040.

@briandowns (Contributor)

PR #3398 should take care of this issue, as it introduces behavior that updates the certs on disk if they don't match and are older than the certificates in the datastore.
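
Until that lands, a rough way to check for the mismatch described here (illustrative only, not the PR's logic) is to compare the CA cert on disk with the CA bundle served by an existing server, assuming default k3s paths (192.168.1.176 is the existing server in the logs above):

sudo openssl x509 -noout -subject -enddate -fingerprint -in /var/lib/rancher/k3s/server/tls/server-ca.crt
curl -sk https://192.168.1.176:6443/cacerts | openssl x509 -noout -subject -enddate -fingerprint

If the fingerprints differ, the joining node is still using certs from its previous standalone cluster.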

@dkeightley (Author)

Is it possible to confirm which release the fix is included in? Thanks!

@zhoub commented Jul 16, 2022

Bump; would like to know which version has shipped the fix.

@brandond (Member)

This issue was closed like a year ago. Every currently supported version has the fix.
