
Restoring snapshot without pki bundle or rkestate file #1336

Closed
galal-hussein opened this issue May 8, 2019 · 17 comments

Comments

@galal-hussein
Contributor

galal-hussein commented May 8, 2019

For an etcd restore to work correctly, the certificates originally generated for the cluster must be present. In rke 0.1 we saved the certs as pki.bundle.tar.gz, and in 0.2 we moved to the cluster.rkestate file; we expect the state file to be present during the restore. However, we also save a copy of the state as a configmap named full-cluster-state, which we could fetch directly from the etcd snapshot. The steps to implement this might include:

1. Restore the etcd snapshot on any node from the cluster.yml
2. Fetch the state using an etcd client
3. Write the state locally on disk
4. Continue with the restoration
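The steps above can be sketched roughly as follows. This is an illustrative outline only, not RKE's actual implementation: the snapshot path, the kube-system namespace, and the use of an external decoder are assumptions, and it requires a throwaway etcd to run against.

```shell
# 1. Restore the snapshot into a scratch data dir (etcdctl v3 API):
ETCDCTL_API=3 etcdctl snapshot restore /opt/rke/etcd-snapshots/snp1 \
  --data-dir /tmp/etcd-restore

# 2. Start a temporary etcd on that data dir, then fetch the saved state.
#    Kubernetes stores configmaps under /registry/configmaps/<namespace>/<name>;
#    the stored value is protobuf-framed, so it needs decoding (e.g. with a
#    tool such as auger) before the embedded state is usable:
etcd --data-dir /tmp/etcd-restore &
ETCDCTL_API=3 etcdctl get /registry/configmaps/kube-system/full-cluster-state \
  --print-value-only > /tmp/full-cluster-state.raw

# 3-4. After decoding, write the state next to cluster.yml and continue:
#   cp /tmp/cluster.rkestate ./cluster.rkestate && rke etcd snapshot-restore ...
```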

gz#6680

@maggieliu

Discussed Solution:

  • Bundle the RKE state file and cluster.yml file into the snapshot going forward
  • When restoring:
              - The user can pass in an option for the RKE state file location, which will override the one bundled in the snapshot.
              - Check whether the certs are valid; if they have expired, prompt for cert rotation
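The expiry check in the discussed solution can be sketched with openssl. The file names here are placeholders standing in for certs pulled from the snapshot bundle; this is not RKE's actual code path.

```shell
# Generate a throwaway cert (placeholder for a cert from the snapshot bundle).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=kube-etcd" \
  -keyout /tmp/kube-etcd-key.pem -out /tmp/kube-etcd.pem -days 1 2>/dev/null

# "openssl x509 -checkend N" exits non-zero if the cert expires within N
# seconds; a restore tool could use this to decide whether to prompt for
# cert rotation before continuing.
if openssl x509 -checkend 0 -noout -in /tmp/kube-etcd.pem >/dev/null; then
  echo "certs valid, continue restore"
else
  echo "certs expired, prompt for rotation"
fi
# prints "certs valid, continue restore"
```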

@maggieliu maggieliu modified the milestones: v1.1.x - Rancher v2.4.x, v1.1 - Rancher v2.4.2 Mar 9, 2020
@superseb
Contributor

The issue here is that the way files are deployed in RKE (by providing the file contents in an environment variable to the container and echo-ing that variable to a file) hits the size limit on a single environment variable in the case of 3 nodes with custom certs signed using cfssl. This can also be reproduced simply by having a lot of nodes. The logic needs to be rewritten to launch a container and copy the state file into it.
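The limit described above can be demonstrated directly. On typical x86 Linux kernels a single argv or environment string is capped at MAX_ARG_STRLEN (32 pages, i.e. 128 KiB), so a multi-node custom-cert bundle passed as one variable can fail at exec time. A minimal, Linux-specific demonstration (the variable name CERTS is just an illustration):

```shell
# Build a ~2 MB value, well above Linux's 128 KiB per-string cap
# (and above typical total ARG_MAX limits on other platforms too).
big=$(head -c 2000000 /dev/zero | tr '\0' 'x')

# Passing it through exec fails with E2BIG ("Argument list too long");
# this is analogous to what RKE hits when echo-ing large cert bundles
# from an environment variable into a file.
if env CERTS="$big" true 2>/dev/null; then
  echo "exec succeeded"
else
  echo "exec failed: env string too large"
fi
# prints "exec failed: env string too large"
```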

@soumyalj

Re-opened, as executables are not available for the RCs v1.2.0-rc7 and v1.1.5-rc6.

@soumyalj
Copy link

Tested using v1.1.5-rc5.

  • Created a 3 node cluster using custom certs with the steps below:
rke cert generate-csr --cert-dir <custom-dir-path>

rke up --custom-certs --cert-dir <custom-dir-path>

rke etcd snapshot-save --name snp1 --config cluster.yml

The rke state file is not present in the snapshot zip folder:

root@soumyacheck1:/opt/rke/etcd-snapshots# ls
backup  snp1.zip
root@soumyacheck1:/opt/rke/etcd-snapshots# cd backup/
root@soumyacheck1:/opt/rke/etcd-snapshots/backup# ls
snp1
root@soumyacheck1:/opt/rke/etcd-snapshots/backup# 

Tested with v1.2.0-rc7

  • Created a 3 node cluster using custom certs with the same steps as above using v1.2.0-rc7; the rke state file is present in the snapshot folder.
root@ip-172-31-10-199:/opt/rke/etcd-snapshots# ls
backup  etc  snp1.zip
root@ip-172-31-10-199:/opt/rke/etcd-snapshots# cd etc/
root@ip-172-31-10-199:/opt/rke/etcd-snapshots/etc# ls
kubernetes
root@ip-172-31-10-199:/opt/rke/etcd-snapshots/etc# cd kubernetes/
root@ip-172-31-10-199:/opt/rke/etcd-snapshots/etc/kubernetes# ls
snp1.rkestate
root@ip-172-31-10-199:/opt/rke/etcd-snapshots/etc/kubernetes# 

Performed a restore using custom certs with the below command:

./rke_darwin-amd64-v120rc7 etcd snapshot-restore --name snp2 --custom-certs cert-dir /Users/soumya/rke/120certs --config cluster.yml 

Restore fails with the below error, though the pem key kube-etcd-abcd.pem is present in the cert directory:

INFO[0044] Initiating Kubernetes cluster                
FATA[0044] Failed to validates certificates from dir [./cluster_certs]: Failed to find etcd [kube-etcd-abcd] Certificate or Key

Restore succeeds when the --custom-certs and --cert-dir flags are not specified.

@sangeethah

Reopening the issue for the following reasons:

  1. In v1.1.5-rc5, we don't see the fix for the rke state file not being present in the snapshot zip folder when testing with a 3 node etcd cluster created with custom certs.

  2. In v1.2.0-rc7, we were able to validate that the rke state file is present in the snapshot zip folder when testing with a 3 node etcd cluster created with custom certs, but we were not able to restore using custom certs.

@sangeethah

#2191

@soumyalj

soumyalj commented Aug 19, 2020

Tested using v1.1.5-rc7.

  • Create a 3 node cluster using custom certs with the steps below:
rke cert generate-csr --cert-dir <custom-dir-path>

rke up --custom-certs --cert-dir <custom-dir-path>

rke etcd snapshot-save --name snp1 --config cluster.yml
  • Verified rke state file is present in the snapshot zip folder
root@ip-172-31-10-250:/opt/rke/etcd-snapshots# unzip snp1.zip 
Archive:  snp1.zip
warning:  stripped absolute path spec from /backup/snp1
  inflating: backup/snp1             
warning:  stripped absolute path spec from /etc/kubernetes/snp1.rkestate
  inflating: etc/kubernetes/snp1.rkestate  
  • Perform a restore specifying the custom certs as below:
./rke_darwin-amd64-v115rc7 etcd snapshot-restore --custom-certs --cert-dir /Users/soumya/rke/newfincerts --name snp1 --config clusterbkp.yml

Restore is successful. rkestate file is extracted from the snapshot. Logs below:

INFO[0002] Successfully started [etcd-extract-statefile] container on host [abcd] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [efgh] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [hijk] 
INFO[0002] Removing container [etcd-extract-statefile] on host [abcd], try #1 
INFO[0002] [remove/etcd-extract-statefile] Successfully removed container on host [abcd] 
INFO[0002] State file is successfully extracted from snapshot [snp1] 
INFO[0002] Restoring etcd snapshot snp1                 

  • Restore is also successful without specifying the custom certs
./rke_darwin-amd64-v115rc7 etcd snapshot-restore --name snp1 --config clusterbkp.yml

@soumyalj

Re-tested with v1.2.0-rc7

Created a 3 node cluster using custom certs with the steps below

rke cert generate-csr --cert-dir <custom-dir-path> 

rke up --custom-certs --cert-dir <custom-dir-path>

rke etcd snapshot-save --name snp1 --config cluster.yml

Verified rke state file is present in the snapshot zip folder
Perform a restore specifying the custom certs as below:

./rke_darwin-amd64-v120rc7 etcd snapshot-restore --name snp1 --custom-certs --cert-dir /Users/soumya/rke/certs --config clusterfin.yml

Restore is successful. The state file is extracted from the zip folder:

INFO[0001] Starting container [etcd-extract-statefile] on host [abcd], try #1 
INFO[0002] Successfully started [etcd-extract-statefile] container on host [abcd] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [abcd] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [abcd] 
INFO[0002] Removing container [etcd-extract-statefile] on host [abcd], try #1 
INFO[0002] [remove/etcd-extract-statefile] Successfully removed container on host [abcd] 
INFO[0002] State file is successfully extracted from snapshot [snp1] 
INFO[0002] Restoring etcd snapshot snp1      

Restore without specifying custom certs is also functional

./rke_darwin-amd64-v120rc7 etcd snapshot-restore --name snp1 --config clusterfin.yml

7 participants