The versioning scheme used by Aerospike Server is <major>.<minor>.<patch>.<revision>. Changes introduced in each version can be categorized as follows:
- Major upgrades usually introduce major features.
- Minor upgrades usually introduce new features. Deprecated features may be removed, and upgrades to the storage or wire protocols may be introduced.
- Patch upgrades usually introduce improvements to existing features. They may also deprecate (but not remove) existing features and configuration properties.
- Revision upgrades usually introduce bug fixes only.
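The four-component scheme can be made concrete with a short, illustrative Go snippet. The `parseVersion` and `compare` helpers below are hypothetical, not part of Aerospike or aerospike-operator:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a version string such as "4.0.0.4" into its four
// numeric components: major, minor, patch, revision.
func parseVersion(v string) ([4]int, error) {
	var out [4]int
	parts := strings.Split(v, ".")
	if len(parts) != 4 {
		return out, fmt.Errorf("expected 4 components, got %d", len(parts))
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, err
		}
		out[i] = n
	}
	return out, nil
}

// compare returns -1, 0 or 1 when a is older than, equal to, or newer
// than b, comparing components from major down to revision.
func compare(a, b [4]int) int {
	for i := 0; i < 4; i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	return 0
}

func main() {
	a, _ := parseVersion("3.15.0.2")
	b, _ := parseVersion("3.15.1.3")
	fmt.Println(compare(a, b)) // prints -1: the target is one patch upgrade ahead
}
```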
Aerospike Tools [1] (which aerospike-operator uses to implement the backup and restore functionality) follows a similar versioning scheme.
Historically, upgrading directly between revision versions (e.g., 4.0.0.1 to 4.0.0.4) and patch versions (e.g., 3.15.0.2 to 3.15.1.3) has been fully supported, regardless of the source and target versions. The same is not true of minor versions, however. For example, an upgrade from 3.12.1.3 to 3.16.0.6 must "touch base" at 3.13, and further requires updating the Aerospike configuration in order to upgrade the heartbeat and paxos protocols [2] before the upgrade can proceed. Upgrading between major versions has historically been a more involved procedure [3], usually requiring careful case-by-case analysis and planning.
The recommended flow [4] for upgrading an Aerospike cluster involves upgrading one node at a time using the following procedure:
1. Wait until there are no migrations in progress on the target node.
2. Stop the Aerospike server process.
3. Upgrade the version of the Aerospike server.
4. Update the configuration (if necessary).
5. Cold-start [5] the Aerospike server process.
This procedure must then be repeated for the remaining nodes in the cluster. The Aerospike documentation further mentions [6] that when upgrading from a version greater than 3.13, it is not necessary to wait for migrations to complete on a node before proceeding to upgrade the next node (i.e., after Step 5); it is enough to wait for the node to rejoin the cluster. It should be noted, however, that cold-starting an Aerospike node holding considerable amounts of data can take a long time [7], and that the node only rejoins the cluster after the initial data loading process is complete.
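The per-node flow above can be sketched as follows. All step names, function names, and the post-3.13 flag are illustrative only, not aerospike-operator's actual implementation:

```go
package main

import "fmt"

// upgradeNode returns the ordered steps applied to a single node.
func upgradeNode(name, target string, drainAfterRestart bool) []string {
	steps := []string{
		name + ": wait until no migrations are in progress",
		name + ": stop the Aerospike server process",
		name + ": upgrade the server to " + target,
		name + ": update the configuration if necessary",
		name + ": cold-start the Aerospike server process",
		name + ": wait for the node to rejoin the cluster",
	}
	if drainAfterRestart {
		// Only required when the source version is 3.13 or older.
		steps = append(steps, name+": wait for migrations to complete")
	}
	return steps
}

// rollingUpgrade applies the procedure to one node at a time.
func rollingUpgrade(nodes []string, target string, sourceAfter313 bool) []string {
	var log []string
	for _, n := range nodes {
		log = append(log, upgradeNode(n, target, !sourceAfter313)...)
	}
	return log
}

func main() {
	for _, step := range rollingUpgrade([]string{"node-0", "node-1", "node-2"}, "4.0.0.4", true) {
		fmt.Println(step)
	}
}
```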
- Provide support for upgrading an existing Aerospike cluster to a more recent minor, patch or revision version.
- Provide adequate validation of the upgrade path before actually starting the upgrade process.
- Perform the upgrade while causing no cluster downtime [8].
- Ensure that no permanent data loss occurs as a result of an upgrade operation.
- Provide support for downgrading an existing Aerospike cluster.
- Provide support for upgrading an existing Aerospike cluster to a different major version.
- Implement automatic rollback or restore after a failed upgrade.
A version upgrade on a given Aerospike cluster is triggered by a change to the .spec.version field of the associated AerospikeCluster resource. The upgrade procedure performed by aerospike-operator on the target Aerospike cluster is the one recommended [4] in the Aerospike documentation and described above. For every pod in the cluster, aerospike-operator will:
1. Wait until there are no migrations in progress.
2. Delete the pod.
3. Create a new pod running the target version of Aerospike.
ℹ️ Existing persistent volumes holding Aerospike namespace data will be reused when creating the new pod.
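The trigger described above can be illustrated with a hypothetical AerospikeCluster manifest. Only the .spec.version field and the AerospikeCluster kind come from this document; the API group, metadata, and remaining fields are invented for illustration:

```yaml
# Hypothetical manifest; apiVersion, name and nodeCount are placeholders.
apiVersion: aerospike.example.com/v1alpha1
kind: AerospikeCluster
metadata:
  name: as-cluster
spec:
  version: "4.0.0.4"   # bumping this field triggers the rolling upgrade
  nodeCount: 3
```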
By following the recommended procedure, aerospike-operator ensures maximum service and data availability during cluster maintenance in almost all scenarios [9]. Furthermore, in order to ensure the safety of the data managed by the cluster, aerospike-operator will create a backup of each namespace [10] in the target cluster to cloud storage before actually starting the upgrade process. These backups can later be manually restored to a new Aerospike cluster should the upgrade process fail. An overview of the whole procedure is provided below:
As mentioned above, aerospike-operator does its best to validate the transition between the source and target versions before actually starting the upgrade process. As such, every version of aerospike-operator will feature a whitelist of supported Aerospike versions, as well as of the transitions between them. New releases of Aerospike will be tracked and whitelisted by updated versions of aerospike-operator. These updates to aerospike-operator will also, whenever necessary, introduce custom code for handling a particular upgrade path (such as the "manual" upgrade steps required by Aerospike 3.13 [11] or 4.2 [12]).
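Such a whitelist might be sketched as a mapping from source versions to allowed targets. The map contents and function names below are purely illustrative; the real whitelist would ship with each aerospike-operator release:

```go
package main

import "fmt"

// supportedTransitions is a hypothetical whitelist mapping a source
// version to the target versions a given operator release would accept.
var supportedTransitions = map[string][]string{
	"4.0.0.1":  {"4.0.0.4", "4.0.0.5"},
	"3.15.0.2": {"3.15.1.3"},
}

// isTransitionSupported reports whether upgrading directly from source
// to target is whitelisted.
func isTransitionSupported(source, target string) bool {
	for _, t := range supportedTransitions[source] {
		if t == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isTransitionSupported("4.0.0.1", "4.0.0.4"))   // true
	fmt.Println(isTransitionSupported("3.12.1.3", "3.16.0.6")) // false: must touch base at 3.13
}
```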
An alternative upgrade procedure was initially considered to replace the one proposed in Design Overview. This alternative approach would involve creating a "surge pod" running the target Aerospike version before deleting a pod running the source Aerospike version. This would help ensure maximum service and data availability during the upgrade process. However, because in this scenario the existing persistent volumes would not be reused, this method would cause data loss in clusters containing namespaces with a replication factor of 1, and a different method would have to be devised for that scenario. As it is not practical to have different upgrade processes based on the replication factor of a namespace, this approach has been abandoned.
An alternative approach for automatic pre-upgrade backups was also considered. This alternative approach would involve backing up namespaces to persistent volumes rather than to cloud storage. Then, in case of a failed upgrade, the affected namespaces could be manually restored from the aforementioned persistent volume. However, using this approach would mean that a separate method for backup and restore would need to be supported and maintained alongside the existing one (something that would likely cause confusion). Hence, this approach has also been discarded.