diff --git a/docs/src/snap/howto/restore-quorum.md b/docs/src/snap/howto/restore-quorum.md index aeb15b721..99a8c4e8b 100755 --- a/docs/src/snap/howto/restore-quorum.md +++ b/docs/src/snap/howto/restore-quorum.md @@ -1,9 +1,9 @@ # Recovering a Cluster After Quorum Loss Highly available {{product}} clusters can survive losing one or more -nodes. [Dqlite], the default datastore, implements a [Raft] based protocol where -an elected leader holds the definitive copy of the database, which is then -replicated on two or more secondary nodes. +nodes. [Dqlite], the default datastore, implements a [Raft] based protocol +where an elected leader holds the definitive copy of the database, which is +then replicated on two or more secondary nodes. When the a majority of the nodes are lost, the cluster becomes unavailable. If at least one database node survived, the cluster can be recovered using the @@ -64,8 +64,8 @@ sudo snap stop k8s ## Recover the Database -Choose one of the remaining alive cluster nodes that has the most recent version -of the Raft log. +Choose one of the remaining alive cluster nodes that has the most recent +version of the Raft log. Update the ``cluster.yaml`` files, changing the role of the lost nodes to "spare" (2). Additionally, double check the addresses and IDs specified in @@ -73,7 +73,8 @@ Update the ``cluster.yaml`` files, changing the role of the lost nodes to files were moved across nodes. The following command guides us through the recovery process, prompting a text -editor with informative inline comments for each of the dqlite configuration files. +editor with informative inline comments for each of the dqlite configuration +files. ``` sudo /snap/k8s/current/bin/k8sd cluster-recover \ @@ -82,29 +83,40 @@ sudo /snap/k8s/current/bin/k8sd cluster-recover \ --log-level 0 ``` -Please adjust the log level for additional debug messages by increasing its value. -The command creates database backups before making any changes. +Please adjust the log level for additional debug messages by increasing its +value. The command creates database backups before making any changes. -The above command will reconfigure the Raft members and create recovery tarballs -that are used to restore the lost nodes, once the Dqlite configuration is updated. +The above command will reconfigure the Raft members and create recovery +tarballs that are used to restore the lost nodes, once the Dqlite +configuration is updated. ```{note} -By default, the command will recover both Dqlite databases. If one of the databases -needs to be skipped, use the ``--skip-k8sd`` or ``--skip-k8s-dqlite`` flags. -This can be useful when using an external Etcd database. +By default, the command will recover both Dqlite databases. If one of the +databases needs to be skipped, use the ``--skip-k8sd`` or ``--skip-k8s-dqlite`` +flags. This can be useful when using an external Etcd database. ``` -Once the "cluster-recover" command completes, restart the k8s services on the node: +```{note} +Non-interactive mode can be requested using the ``--non-interactive`` flag. +In this case, no interactive prompts or text editors will be displayed and +the command will assume that the configuration files have already been updated. + +This allows automating the recovery procedure. +``` + +Once the "cluster-recover" command completes, restart the k8s services on the +node: ``` sudo snap start k8s ``` -Ensure that the services started successfully by using ``sudo snap services k8s``. -Use ``k8s status --wait-ready`` to wait for the cluster to become ready. +Ensure that the services started successfully by using +``sudo snap services k8s``. Use ``k8s status --wait-ready`` to wait for the +cluster to become ready. -You may notice that we have not returned to an HA cluster yet: ``high availability: no``. -This is expected as we need to recover +You may notice that we have not returned to an HA cluster yet: +``high availability: no``. This is expected as we need to recover ## Recover the remaining nodes @@ -113,28 +125,34 @@ nodes. For k8sd, copy ``recovery_db.tar.gz`` to ``/var/snap/k8s/common/var/lib/k8sd/state/recovery_db.tar.gz``. When the k8sd -service starts, it will load the archive and perform the necessary recovery steps. +service starts, it will load the archive and perform the necessary recovery +steps. The k8s-dqlite archive needs to be extracted manually. First, create a backup of the current k8s-dqlite state directory: ``` -sudo mv /var/snap/k8s/common/var/lib/k8s-dqlite /var/snap/k8s/common/var/lib/k8s-dqlite.bkp +sudo mv /var/snap/k8s/common/var/lib/k8s-dqlite \ + /var/snap/k8s/common/var/lib/k8s-dqlite.bkp ``` Then, extract the backup archive: ``` sudo mkdir /var/snap/k8s/common/var/lib/k8s-dqlite -sudo tar xf recovery-k8s-dqlite-$timestamp-post-recovery.tar.gz -C /var/snap/k8s/common/var/lib/k8s-dqlite +sudo tar xf recovery-k8s-dqlite-$timestamp-post-recovery.tar.gz \ + -C /var/snap/k8s/common/var/lib/k8s-dqlite ``` Node specific files need to be copied back to the k8s-dqlite state dir: ``` -sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.crt /var/snap/k8s/common/var/lib/k8s-dqlite -sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.key /var/snap/k8s/common/var/lib/k8s-dqlite -sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/info.yaml /var/snap/k8s/common/var/lib/k8s-dqlite +sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.crt \ + /var/snap/k8s/common/var/lib/k8s-dqlite +sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.key \ + /var/snap/k8s/common/var/lib/k8s-dqlite +sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/info.yaml \ + /var/snap/k8s/common/var/lib/k8s-dqlite ``` Once these steps are completed, restart the k8s services: @@ -143,13 +161,15 @@ Once these steps are completed, restart the k8s services: sudo snap start k8s ``` -Repeat these steps for all remaining nodes. Once a quorum is achieved, the cluster -will be reported as "highly available": +Repeat these steps for all remaining nodes. Once a quorum is achieved, +the cluster will be reported as "highly available": ``` $ sudo k8s status cluster status: ready -control plane nodes: 10.80.130.168:6400 (voter), 10.80.130.167:6400 (voter), 10.80.130.164:6400 (voter) +control plane nodes: 10.80.130.168:6400 (voter), + 10.80.130.167:6400 (voter), + 10.80.130.164:6400 (voter) high availability: yes datastore: k8s-dqlite network: enabled diff --git a/src/k8s/cmd/k8sd/k8sd_cluster_recover.go b/src/k8s/cmd/k8sd/k8sd_cluster_recover.go index 766acc49f..204dbbabd 100755 --- a/src/k8s/cmd/k8sd/k8sd_cluster_recover.go +++ b/src/k8s/cmd/k8sd/k8sd_cluster_recover.go @@ -28,7 +28,7 @@ import ( "github.com/canonical/k8s/pkg/utils" ) -const recoveryConfirmation = `You should only run this command if: +const preRecoveryMessage = `You should only run this command if: - A quorum of cluster members is permanently lost - You are *absolutely* sure all k8s daemons are stopped (sudo snap stop k8s) - This instance has the most up to date database @@ -36,8 +36,17 @@ const recoveryConfirmation = `You should only run this command if: Note that before applying any changes, a database backup is created at: * k8sd (microcluster): /var/snap/k8s/common/var/lib/k8sd/state/db_backup..tar.gz * k8s-dqlite: /var/snap/k8s/common/recovery-k8s-dqlite--pre-recovery.tar.gz +` + +const recoveryConfirmation = "Do you want to proceed? (yes/no): " + +const nonInteractiveMessage = `Non-interactive mode requested. -Do you want to proceed? (yes/no): ` +The command will assume that the dqlite configuration files have already been +modified with the updated cluster member roles and addresses. + +Initiating the dqlite database recovery. +` const clusterK8sdYamlRecoveryComment = `# Member roles can be modified. Unrecoverable nodes should be given the role "spare". # @@ -75,6 +84,7 @@ const yamlHelperCommentFooter = "# ------- everything below will be written ---- var clusterRecoverOpts struct { K8sDqliteStateDir string + NonInteractive bool SkipK8sd bool SkipK8sDqlite bool } @@ -145,6 +155,8 @@ func newClusterRecoverCmd() *cobra.Command { cmd.Flags().StringVar(&clusterRecoverOpts.K8sDqliteStateDir, "k8s-dqlite-state-dir", "", "k8s-dqlite datastore location") + cmd.Flags().BoolVar(&clusterRecoverOpts.NonInteractive, "non-interactive", + false, "disable interactive prompts, assume that the configs have been updated") cmd.Flags().BoolVar(&clusterRecoverOpts.SkipK8sd, "skip-k8sd", false, "skip k8sd recovery") cmd.Flags().BoolVar(&clusterRecoverOpts.SkipK8sDqlite, "skip-k8s-dqlite", @@ -171,8 +183,8 @@ func recoveryCmdPrechecks(ctx context.Context) error { log.V(1).Info("Running prechecks.") - if !termios.IsTerminal(unix.Stdin) { - return fmt.Errorf("this command is meant to be run in an interactive terminal") + if !termios.IsTerminal(unix.Stdin) && !clusterRecoverOpts.NonInteractive { + return fmt.Errorf("interactive mode requested in a non-interactive terminal") } if clusterRecoverOpts.K8sDqliteStateDir == "" { @@ -182,21 +194,31 @@ func recoveryCmdPrechecks(ctx context.Context) error { return fmt.Errorf("k8sd state dir not specified") } - reader := bufio.NewReader(os.Stdin) - fmt.Print(recoveryConfirmation) + fmt.Print(preRecoveryMessage) + fmt.Print("\n") - input, err := reader.ReadString('\n') - if err != nil { - return fmt.Errorf("couldn't read user input, error: %w", err) - } - input = strings.TrimSuffix(input, "\n") + if clusterRecoverOpts.NonInteractive { + fmt.Print(nonInteractiveMessage) + fmt.Print("\n") + } else { + reader := bufio.NewReader(os.Stdin) + fmt.Print(recoveryConfirmation) + + input, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("couldn't read user input, error: %w", err) + } + input = strings.TrimSuffix(input, "\n") - if strings.ToLower(input) != "yes" { - return fmt.Errorf("cluster edit aborted; no changes made") + if strings.ToLower(input) != "yes" { + return fmt.Errorf("cluster edit aborted; no changes made") + } + + fmt.Print("\n") } if !clusterRecoverOpts.SkipK8sDqlite { - if err = ensureK8sDqliteMembersStopped(ctx); err != nil { + if err := ensureK8sDqliteMembersStopped(ctx); err != nil { return err } } @@ -376,59 +398,64 @@ func recoverK8sd() (string, error) { clusterYamlPath := path.Join(m.FileSystem.DatabaseDir, "cluster.yaml") clusterYamlCommentHeader := fmt.Sprintf("# K8sd cluster configuration\n# (based on the trust store and %s)\n", clusterYamlPath) - clusterYamlContent, err := yamlEditorGuide( - "", - false, - slices.Concat( - []byte(clusterYamlCommentHeader), - []byte("#\n"), - []byte(clusterK8sdYamlRecoveryComment), - []byte(yamlHelperCommentFooter), - []byte("\n"), - oldMembersYaml, - ), - false, - ) - if err != nil { - return "", fmt.Errorf("interactive text editor failed, error: %w", err) - } + clusterYamlContent := oldMembersYaml + if !clusterRecoverOpts.NonInteractive { + // Interactive mode requested (default). + // Assist the user in configuring dqlite. + clusterYamlContent, err = yamlEditorGuide( + "", + false, + slices.Concat( + []byte(clusterYamlCommentHeader), + []byte("#\n"), + []byte(clusterK8sdYamlRecoveryComment), + []byte(yamlHelperCommentFooter), + []byte("\n"), + oldMembersYaml, + ), + false, + ) + if err != nil { + return "", fmt.Errorf("interactive text editor failed, error: %w", err) + } - infoYamlPath := path.Join(m.FileSystem.DatabaseDir, "info.yaml") - infoYamlCommentHeader := fmt.Sprintf("# K8sd info.yaml\n# (%s)\n", infoYamlPath) - _, err = yamlEditorGuide( - infoYamlPath, - true, - slices.Concat( - []byte(infoYamlCommentHeader), - []byte("#\n"), - []byte(infoYamlRecoveryComment), - utils.YamlCommentLines(clusterYamlContent), - []byte("\n"), - []byte(yamlHelperCommentFooter), - ), - true, - ) - if err != nil { - return "", fmt.Errorf("interactive text editor failed, error: %w", err) - } + infoYamlPath := path.Join(m.FileSystem.DatabaseDir, "info.yaml") + infoYamlCommentHeader := fmt.Sprintf("# K8sd info.yaml\n# (%s)\n", infoYamlPath) + _, err = yamlEditorGuide( + infoYamlPath, + true, + slices.Concat( + []byte(infoYamlCommentHeader), + []byte("#\n"), + []byte(infoYamlRecoveryComment), + utils.YamlCommentLines(clusterYamlContent), + []byte("\n"), + []byte(yamlHelperCommentFooter), + ), + true, + ) + if err != nil { + return "", fmt.Errorf("interactive text editor failed, error: %w", err) + } - daemonYamlPath := path.Join(m.FileSystem.StateDir, "daemon.yaml") - daemonYamlCommentHeader := fmt.Sprintf("# K8sd daemon.yaml\n# (%s)\n", daemonYamlPath) - _, err = yamlEditorGuide( - daemonYamlPath, - true, - slices.Concat( - []byte(daemonYamlCommentHeader), - []byte("#\n"), - []byte(daemonYamlRecoveryComment), - utils.YamlCommentLines(clusterYamlContent), - []byte("\n"), - []byte(yamlHelperCommentFooter), - ), - true, - ) - if err != nil { - return "", fmt.Errorf("interactive text editor failed, error: %w", err) + daemonYamlPath := path.Join(m.FileSystem.StateDir, "daemon.yaml") + daemonYamlCommentHeader := fmt.Sprintf("# K8sd daemon.yaml\n# (%s)\n", daemonYamlPath) + _, err = yamlEditorGuide( + daemonYamlPath, + true, + slices.Concat( + []byte(daemonYamlCommentHeader), + []byte("#\n"), + []byte(daemonYamlRecoveryComment), + utils.YamlCommentLines(clusterYamlContent), + []byte("\n"), + []byte(yamlHelperCommentFooter), + ), + true, + ) + if err != nil { + return "", fmt.Errorf("interactive text editor failed, error: %w", err) + } } newMembers := []cluster.DqliteMember{} @@ -465,40 +492,53 @@ func recoverK8sd() (string, error) { func recoverK8sDqlite() (string, string, error) { k8sDqliteStateDir := clusterRecoverOpts.K8sDqliteStateDir + var err error + clusterYamlContent := []byte{} clusterYamlPath := path.Join(k8sDqliteStateDir, "cluster.yaml") clusterYamlCommentHeader := fmt.Sprintf("# k8s-dqlite cluster configuration\n# (%s)\n", clusterYamlPath) - clusterYamlContent, err := yamlEditorGuide( - clusterYamlPath, - true, - slices.Concat( - []byte(clusterYamlCommentHeader), - []byte("#\n"), - []byte(clusterK8sDqliteRecoveryComment), - []byte(yamlHelperCommentFooter), - ), - true, - ) - if err != nil { - return "", "", fmt.Errorf("interactive text editor failed, error: %w", err) - } - infoYamlPath := path.Join(k8sDqliteStateDir, "info.yaml") - infoYamlCommentHeader := fmt.Sprintf("# k8s-dqlite info.yaml\n# (%s)\n", infoYamlPath) - _, err = yamlEditorGuide( - infoYamlPath, - true, - slices.Concat( - []byte(infoYamlCommentHeader), - []byte("#\n"), - []byte(infoYamlRecoveryComment), - utils.YamlCommentLines(clusterYamlContent), - []byte("\n"), - []byte(yamlHelperCommentFooter), - ), - true, - ) - if err != nil { - return "", "", fmt.Errorf("interactive text editor failed, error: %w", err) + if clusterRecoverOpts.NonInteractive { + clusterYamlContent, err = os.ReadFile(clusterYamlPath) + if err != nil { + return "", "", fmt.Errorf( + "could not read k8s-dqlite cluster.yaml, error: %w", err) + } + } else { + // Interactive mode requested (default). + // Assist the user in configuring dqlite. + clusterYamlContent, err = yamlEditorGuide( + clusterYamlPath, + true, + slices.Concat( + []byte(clusterYamlCommentHeader), + []byte("#\n"), + []byte(clusterK8sDqliteRecoveryComment), + []byte(yamlHelperCommentFooter), + ), + true, + ) + if err != nil { + return "", "", fmt.Errorf("interactive text editor failed, error: %w", err) + } + + infoYamlPath := path.Join(k8sDqliteStateDir, "info.yaml") + infoYamlCommentHeader := fmt.Sprintf("# k8s-dqlite info.yaml\n# (%s)\n", infoYamlPath) + _, err = yamlEditorGuide( + infoYamlPath, + true, + slices.Concat( + []byte(infoYamlCommentHeader), + []byte("#\n"), + []byte(infoYamlRecoveryComment), + utils.YamlCommentLines(clusterYamlContent), + []byte("\n"), + []byte(yamlHelperCommentFooter), + ), + true, + ) + if err != nil { + return "", "", fmt.Errorf("interactive text editor failed, error: %w", err) + } } newMembers := []dqlite.NodeInfo{}