Problem: users are only able to upgrade the cluster components that are part of the Kubernetes distribution. Any other core component of EKS Anywhere remains untouched after cluster creation.
This prevents users from getting the latest features and bug fixes from EKS Anywhere and its core components, as well as from receiving security patches.
- Simple: simple to use, simple to maintain, simple to build
As an EKS Anywhere user:
- I want to be able to enjoy the latest EKS Anywhere features and bug fixes
- I want to keep my cluster up to date with security patches
- I don’t want to have to stay informed about the latest versions, security patches or compatibility between components
In scope
- Upgrade core EKS Anywhere components to the latest versions available for a specific cli version
- Core CAPI
- CAPI providers
- Cert-manager
- Etcdadm CAPI provider
- CNIs (cilium)
- EKS Anywhere controller and CRDs
- Flux
Not in scope
- Install/upgrade add-ons
- Upgrade cli
- Upgrades to user defined component versions
Future scope
- Version changes report
- “Notifications” for newly available versions
- Automatic/scheduled upgrades
Upgrade all EKS Anywhere components to the versions declared in the latest bundles manifest when running the `upgrade cluster` command.
Since we already have:
- An interface that users use to upgrade clusters (the `upgrade cluster` command)
- A declarative list of all EKS Anywhere components at their latest available versions that have been tested together (the bundles manifest)
It seems like the simplest solution is to take advantage of both.
This solution fits very well with both the goal of a user experience as simple as possible and the concept of the versions bundle (a group of versioned components that work well together):
- We don’t introduce more concepts/commands. Users just have to use the same command they are already used to. The syntax stays the same and there is no need to define what needs to be upgraded and to what version. Moreover, users don't ever have to worry about the individual versions and dependencies between them. The cli takes care of this transparently.
- Components bundled together are always upgraded together. This reduces our testing matrix considerably, since there is no room for version misalignments.
This is also very similar to how EKS handles upgrades, taking care of bundling and versioning for the user. Here it has to be manually triggered (for now, due to management cluster limitations), as opposed to being automated.
The cli needs the ability to detect both component version changes and new builds of the same versioned component (such as a new container build due to a patch in a dependency like AL2). The process to upgrade some components could differ between these two scenarios.
Components should be upgraded before the new Kubernetes version is rolled out (if needed). It is more common for an old component to be incompatible with a new Kubernetes version than vice versa. There might be case-by-case exceptions, but right now there are none, at least with the current set of core components.
This also facilitates CAPI API version upgrades, since updating its components (CRDs and controllers) will also update the current objects in the cluster (through conversion webhooks). This allows us to maintain only one API version of provider templates, since these are only reapplied when rolling out the new Kubernetes version.
All the CAPI provider objects in the bundles manifest already contain a `version` field. Unfortunately, it only reflects the original upstream version the component was built from. We should add the build metadata information to that semver, like `v0.4.2+build-2`. This will allow us to detect new builds as well as preserve the original upstream version, which is important in order to verify API version changes. We might need to trim the build metadata when interacting with `clusterctl`; I haven’t tested whether it’s supported yet.
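As an illustration, a minimal bash sketch (the version string and variable names are hypothetical) of how the two parts of such a semver could be split, both to detect new builds of the same upstream version and to trim the metadata before handing a version to `clusterctl`:

```bash
# Hypothetical bundle version: upstream semver plus our build metadata.
bundle_version="v0.4.2+build-2"

# Upstream version (everything before '+'), e.g. for clusterctl or API contract checks.
upstream_version="${bundle_version%%+*}"   # -> v0.4.2

# Build metadata (everything after '+'), used to detect new builds of the same version.
build_metadata="${bundle_version#*+}"      # -> build-2

echo "upstream=${upstream_version} build=${build_metadata}"
```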
We will need to add such a field to all the other components (Flux, EKS Anywhere controllers and CRDs, Cilium...) that don’t have it yet.
All the CAPI providers, components and cert-manager can be updated with `clusterctl`. This works both with version bumps and new builds of the same version. We just need to create the proper overrides layer, configure the images in the `clusterctl` config file, and pass specific versions for every provider to the `upgrade apply` command.
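A minimal sketch of what that invocation could look like, assuming the overrides layer and image overrides are already in place (provider names, namespaces and versions here are purely illustrative):

```bash
# Upgrade the core CAPI components and every installed provider to explicit versions.
# clusterctl also takes care of cert-manager as part of this operation when needed.
clusterctl upgrade apply \
  --kubeconfig ./mgmt-cluster.kubeconfig \
  --core capi-system/cluster-api:v0.4.2 \
  --bootstrap capi-kubeadm-bootstrap-system/kubeadm:v0.4.2 \
  --control-plane capi-kubeadm-control-plane-system/kubeadm:v0.4.2 \
  --infrastructure capv-system/vsphere:v0.8.1
```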
Even if there is a CAPI API version change, the process is the same. The providers will take care of converting the objects in the cluster to the new API version. The only requirement is that all the installed providers need to be upgraded to a version that supports the new API version.
Both the controller and CRDs can be upgraded by just applying the new components manifest with kubectl. This should work fine for both new versions and new builds of the same version.
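A minimal sketch of that step, assuming the new components manifest has been rendered to a local file (the filename is hypothetical):

```bash
# Apply the new EKS Anywhere CRDs and controller manifest on the management cluster.
kubectl apply -f eksa-components.yaml
```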
We have to be careful and make sure that any changes to the CRDs are backwards compatible, since at this point there are already EKS Anywhere objects in the cluster. Examples of changes that are not backwards compatible include adding a required field, changing the type of a field, making an existing field required, and so on. It’s important to note that applying the manifest with the new CRD won’t fail, even if the changes are not backwards compatible. But the objects already in the cluster will become invalid, so any kind of update to those (such as a Flux reconciliation) will be rejected by the Kubernetes API server.
Updates to the controller code should also take this into consideration. They always need to consider how to handle existing objects in the cluster. For example, if changing the pattern to name certain CAPI objects in the provider templates, the controller has to be able to handle both the old and the new pattern.
If there is a new EKS Anywhere API version, we will need to write conversion webhooks.
The process to upgrade the Flux components is very similar to the installation process. It uses the `flux` cli:
- Write the “kustomization” layer to disk to set our own Flux images
- Run the `bootstrap` command. This will generate the new manifest (`flux-system/gotk-components.yaml`) and commit all files to the repository
- Run the `reconcile` command for `flux-system`, which tells Flux to pull the manifests from git and upgrade itself
This process should work for both new versions and new builds of the same version.
Full documentation: https://fluxcd.io/docs/installation/#bootstrap-upgrade
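A rough sketch of those steps, assuming a GitHub-backed repository (the owner, repository and path values are placeholders; the kustomization layer with our own Flux images is omitted):

```bash
# Re-run bootstrap: regenerates flux-system/gotk-components.yaml with the new
# component versions and commits the files to the repository.
flux bootstrap github \
  --owner "$GITHUB_OWNER" \
  --repository "$GITHUB_REPO" \
  --branch main \
  --path clusters/my-cluster

# Tell Flux to pull the updated manifests from git and upgrade itself.
flux reconcile kustomization flux-system --with-source
```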
The process to upgrade Cilium is a bit more involved. It uses `helm` to generate intermediate manifests and `kubectl` to apply them.
- Generate pre-flight checks with `helm` and apply them with `kubectl`. This “pre-pulls” the new images on the nodes to avoid `ErrImagePull` errors later during the upgrade
- Once it’s finished, delete the pre-flight components with `kubectl`
- Generate the new Cilium manifest using the right “upgrade compatibility” options with `helm` (the generated manifest is different depending on the minor version jump) and apply it with `kubectl`.
Depending on the implementation, we might be able to avoid creating some/all of these manifests by including them in the bundle manifest.
Full documentation: https://docs.cilium.io/en/v1.10/operations/upgrade/
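A condensed sketch of that flow (chart version, compatibility value and file names are illustrative; our image/registry overrides are omitted for brevity):

```bash
# Assumes the Cilium Helm repo has been added: helm repo add cilium https://helm.cilium.io/

# 1. Render and apply the pre-flight check, which pre-pulls the new images on every node.
helm template cilium cilium/cilium --version 1.10.5 \
  --namespace kube-system \
  --set preflight.enabled=true \
  --set agent=false \
  --set operator.enabled=false \
  > cilium-preflight.yaml
kubectl apply -f cilium-preflight.yaml

# 2. Once the pre-flight pods are ready, remove them.
kubectl delete -f cilium-preflight.yaml

# 3. Render the new Cilium manifest with the upgrade-compatibility option for the
#    minor version we are coming from, then apply it.
helm template cilium cilium/cilium --version 1.10.5 \
  --namespace kube-system \
  --set upgradeCompatibility=1.9 \
  > cilium.yaml
kubectl apply -f cilium.yaml
```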
As mentioned in the CAPI section, we don’t need any extra process to perform CAPI component upgrades that involve an API version change. However, there will be some extra effort needed to support `v1alpha4` in the cli. Most of this will probably apply to future API version changes as well.
- Add support for `v1alpha4` in the etcdadm provider. This includes writing conversion webhooks.
- Port all the patches made to upstream cluster-api to the `v1alpha4` CRDs and the new controller’s code (built on top of the latest patch of `v0.4.x`)
- Update all provider templates in the cli to generate `v1alpha4` CAPI specs
- Build compatible versions of dependencies like `clusterctl` and cert-manager
There are no backwards-breaking changes. However, the expected behavior of the `upgrade cluster` command will change, so it should be clearly communicated in the changelog.
We should include a section about how to “get the latest version of components” or something similar.
In the `upgrade cluster` command documentation, we should document that we will always upgrade internal EKS Anywhere components to the latest available version, even when the Kubernetes version remains unchanged.
Since the plan is to only support two minor versions of the cli at the same time, we should add E2E tests that:
- Create the cluster using the previous minor release and upgrade it using the latest version from the current branch
  - For `main`, this makes sure that the next (future) minor release is compatible with clusters created by the current latest release
  - When running this in a release branch, this makes sure that the next (future) patch release is compatible with clusters created by the previous minor release
When building a new bundle release for a certain version, we should at least run E2E tests that:
- Create a cluster using the current latest release of the cli and the current latest bundle, followed by upgrading the cluster using the same version of the cli and the soon-to-be-released bundles manifest