
Upgrades

Background

The versioning scheme used by Aerospike Server is <major>.<minor>.<patch>.<revision>. Changes introduced in each version can be categorized as follows:

  • Major upgrades usually introduce major features.

  • Minor upgrades usually introduce new features. Deprecated features may be removed, and upgrades to the storage or wire protocols may be introduced.

  • Patch upgrades usually introduce improvements to existing features. They may also deprecate (but not remove) existing features and configuration properties.

  • Revision upgrades usually introduce bug fixes only.

Aerospike Tools [1] (which aerospike-operator uses to implement the backup and restore functionality) follows a similar versioning scheme.

Historically, directly upgrading between revision versions (e.g. 4.0.0.1 to 4.0.0.4) and between patch versions (e.g. 3.15.0.2 to 3.15.1.3) has been fully supported, regardless of the source and target versions. The same does not hold for minor versions, however. For example, an upgrade from 3.12.1.3 to 3.16.0.6 is required to "touch base" at 3.13, and further requires updating the Aerospike configuration in order to upgrade the heartbeat and paxos protocols [2] before proceeding further with the upgrade. Upgrading between major versions has historically been a more involved procedure [3], and usually requires careful case-by-case analysis and planning.

The procedure recommended in the Aerospike documentation [4] for upgrading each node in a cluster is the following:

  1. Wait until there are no migrations in progress on the target node.

  2. Stop the Aerospike server process.

  3. Upgrade the version of Aerospike server.

  4. Update the configuration (if necessary).

  5. Cold-start [5] the Aerospike server process.

This procedure must then be repeated for the remaining nodes in the cluster. The Aerospike documentation further mentions [6] that when upgrading from a version greater than 3.13 it is not necessary to wait for migrations to complete on a node before proceeding to upgrade the next one (i.e. after Step 5); it is enough to wait for the node to rejoin the cluster. It should be noted, however, that cold-starting an Aerospike node holding a considerable amount of data can take a long time [7], and that the node only rejoins the cluster after the initial data loading process is complete.
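Step 1 of this procedure amounts to polling the target node until it reports that no partition migrations remain. As a minimal sketch, the following Go function checks this condition given the raw response to the Aerospike statistics info command (a semicolon-separated list of key=value pairs, as returned by asinfo -v statistics); the function name is hypothetical, and a real implementation would obtain the statistics string from the node via the info protocol:

[source,go]
----
package migrations

import (
	"strconv"
	"strings"
)

// noMigrationsInProgress reports whether an Aerospike node has finished all
// partition migrations, given the raw response to the "statistics" info
// command (e.g. "...;migrate_partitions_remaining=0;...").
func noMigrationsInProgress(stats string) bool {
	for _, pair := range strings.Split(stats, ";") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) == 2 && kv[0] == "migrate_partitions_remaining" {
			n, err := strconv.Atoi(strings.TrimSpace(kv[1]))
			return err == nil && n == 0
		}
	}
	// Be conservative: if the statistic is absent, assume that migrations
	// may still be in progress.
	return false
}
----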

Goals

  • Provide support for upgrading an existing Aerospike cluster to a more recent minor, patch, or revision version.

  • Provide adequate validation of the upgrade path before actually starting the upgrade process.

  • Perform the upgrade while causing no cluster downtime [8].

  • Ensure that no permanent data loss occurs as a result of an upgrade operation.

Non-Goals

  • Provide support for downgrading an existing Aerospike cluster.

  • Provide support for upgrading an existing Aerospike cluster to a different major version.

  • Implement automatic rollback or restore after a failed upgrade.

Design Overview

A version upgrade on a given Aerospike cluster is triggered by a change to the .spec.version field of the associated AerospikeCluster resource. The upgrade procedure performed by aerospike-operator on the target Aerospike cluster is the one recommended [4] in the Aerospike documentation and described above. For every pod in the cluster, aerospike-operator will:

  1. Wait until there are no migrations in progress.

  2. Delete the pod.

  3. Create a new pod running the target version of Aerospike.

ℹ️
Existing persistent volumes holding Aerospike namespace data will be reused when creating the new pod.
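
A minimal sketch of this per-pod loop, using client-go, is shown below. It is an illustration only, not the actual implementation: the migrationsDone helper, the image name, and the assumption that the Aerospike container is the first container in the pod spec are all simplifications:

[source,go]
----
package upgrade

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// upgradePod replaces a single Aerospike pod with one running targetVersion.
// The new pod reuses the old pod's spec, and hence the persistent volume
// claims holding Aerospike namespace data.
func upgradePod(ctx context.Context, kube kubernetes.Interface, pod *corev1.Pod, targetVersion string) error {
	// 1. Wait until there are no migrations in progress on the node.
	for !migrationsDone(pod) {
		time.Sleep(10 * time.Second)
	}
	// 2. Delete the pod running the source version. (A full implementation
	// would also wait for the deletion to complete before proceeding.)
	if err := kube.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	// 3. Create a new pod running the target version, reusing the same spec
	// (and hence the same persistent volume claims).
	newPod := pod.DeepCopy()
	newPod.ResourceVersion = ""
	newPod.UID = ""
	newPod.Status = corev1.PodStatus{}
	newPod.Spec.Containers[0].Image = fmt.Sprintf("aerospike/aerospike-server:%s", targetVersion)
	_, err := kube.CoreV1().Pods(pod.Namespace).Create(ctx, newPod, metav1.CreateOptions{})
	return err
}

// migrationsDone is a placeholder for an info-protocol query against the
// pod's Aerospike node (e.g. the migrate_partitions_remaining statistic).
func migrationsDone(pod *corev1.Pod) bool {
	return true
}
----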

By following the recommended procedure, aerospike-operator ensures maximum service and data availability during cluster maintenance in almost all scenarios [9]. Furthermore, and in order to ensure the safety of the data managed by the cluster, aerospike-operator will create a backup of each namespace [10] in the target cluster to cloud storage before actually starting the upgrade process. These backups can later be manually restored to a new Aerospike cluster should the upgrade process fail. An overview of the whole procedure is provided below:

Upgrade process provided by `aerospike-operator`

As mentioned above, aerospike-operator does its best to validate the transition between the source and target versions before actually starting the upgrade process. As such, every version of aerospike-operator will feature a whitelist of supported Aerospike versions, as well as of the supported transitions between them. New releases of Aerospike will be tracked and whitelisted by updated versions of aerospike-operator. These updates to aerospike-operator will also, whenever necessary, introduce custom code for handling a particular upgrade path (such as the "manual" upgrade steps required by Aerospike 3.13 [11] or 4.2 [12]).
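
For illustration purposes, such a whitelist can be thought of as a map from (source, target) version pairs to the properties of the corresponding upgrade path. The sketch below is hypothetical; the type names and version entries are not the ones actually shipped with aerospike-operator:

[source,go]
----
package versioning

// upgradePath describes whether a transition between two Aerospike versions
// is supported, and whether it requires custom handling (such as the
// configuration changes required when crossing 3.13).
type upgradePath struct {
	supported    bool
	needsSpecial bool
}

// transitions whitelists the upgrade paths known to this release of the
// operator (illustrative entries only).
var transitions = map[[2]string]upgradePath{
	{"4.0.0.4", "4.0.0.5"}: {supported: true},
	{"4.0.0.5", "4.2.0.3"}: {supported: true, needsSpecial: true},
}

// validateTransition returns the properties of the upgrade path from source
// to target, and whether that path is supported at all.
func validateTransition(source, target string) (upgradePath, bool) {
	p, ok := transitions[[2]string{source, target}]
	return p, ok && p.supported
}
----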

Alternatives Considered

An alternative upgrade procedure was initially considered to replace the one proposed in Design Overview. This alternative approach would involve creating a "surge pod" running the target Aerospike version before deleting a pod running the source Aerospike version. This would help ensure maximum service and data availability during the upgrade process. However, because in this scenario the existing persistent volumes would not be reused, this method would cause data loss in clusters containing namespaces with a replication factor of 1, and a different method would have to be considered for that scenario. As it is not practical to have different upgrade processes based on the replication factor of a namespace, this approach has been abandoned.

An alternative approach to automatic pre-upgrade backups was also considered. This alternative approach would involve backing up namespaces to persistent volumes rather than to cloud storage. Then, in case of a failed upgrade, the affected namespaces could be manually restored from the abovementioned persistent volumes. However, using this approach would mean that a different, separate backup and restore method would need to be supported and maintained (something that would likely cause confusion). Hence, this approach has also been discarded.


7. Even though the documentation mentions "40+ minutes" (per node) for a cold-start, such as in https://www.aerospike.com/docs/operations/manage/aerospike/fast_start, our tests show that it can take considerably longer depending on the amount of data stored in each node.
8. An exception must be made here for single-node clusters. In this scenario it is not possible to perform the upgrade procedure without cluster downtime.
9. For clusters using a replication factor of 1, full data availability during the upgrade procedure cannot be ensured.
10. The number of Aerospike namespaces per Aerospike cluster is currently limited to a single one.