
Prevent removal of apiVersions while they are still used #8566

Open
sbueringer opened this issue Apr 25, 2023 · 20 comments
Labels
  • area/api: Issues or PRs related to the APIs
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Member

sbueringer commented Apr 25, 2023

What would you like to be added (User Story)?

As a user, it would be nice if Cluster API blocked the removal of CRD apiVersions as long as they are still in use.

Detailed Description

Today it is possible that during a Cluster API upgrade (e.g. when bumping an infra provider) an apiVersion is removed. If this apiVersion is still referenced in other resources, those resources will no longer reconcile.

Affected resources:

  • ClusterClass
  • Cluster, MD, MS, Machine, ...
    • Refs are probably not affected as they don't actually use the apiVersion from the object spec (they calculate the version they use based on the contract label on the CRD).
      • Would be good to double check this again.
    • OwnerRefs
    • They might not be super relevant as our controllers always keep the ownerRefs up to date

A possible solution is to implement a webhook which validates CRD updates. In this webhook we could check if the update on the CRD removes an apiVersion which is still referenced in relevant resources (at least ClusterClass).

Note: This webhook should probably have failurePolicy: Ignore to avoid deadlocks if Cluster API is not working for some reason.
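The core of the check described above could look roughly like the following minimal sketch in plain Go. The function names, the flattened reference-key format, and the example group/kind are illustrative assumptions; a real webhook would decode the old and new CRD objects from the admission request and list ClusterClasses through the API server:

```go
package main

import "fmt"

// droppedVersions returns the version names that a CRD update removes.
// oldVersions/newVersions are the names from the old and new spec.versions.
func droppedVersions(oldVersions, newVersions []string) []string {
	current := map[string]bool{}
	for _, v := range newVersions {
		current[v] = true
	}
	var dropped []string
	for _, v := range oldVersions {
		if !current[v] {
			dropped = append(dropped, v)
		}
	}
	return dropped
}

// stillReferenced reports which dropped versions of the given group/kind are
// still referenced. refs is a set of "group/version kind" keys collected from
// relevant resources (e.g. ClusterClass refs); the key format is made up for
// this sketch.
func stillReferenced(group, kind string, dropped []string, refs map[string]bool) []string {
	var inUse []string
	for _, v := range dropped {
		if refs[group+"/"+v+" "+kind] {
			inUse = append(inUse, v)
		}
	}
	return inUse
}

func main() {
	// Hypothetical state: one ClusterClass still references the v1alpha4
	// version of a DockerClusterTemplate.
	refs := map[string]bool{
		"infrastructure.cluster.x-k8s.io/v1alpha4 DockerClusterTemplate": true,
	}
	dropped := droppedVersions([]string{"v1alpha4", "v1beta1"}, []string{"v1beta1"})
	inUse := stillReferenced("infrastructure.cluster.x-k8s.io", "DockerClusterTemplate", dropped, refs)
	fmt.Println(inUse) // prints [v1alpha4] -> the webhook would deny the update
}
```

If `stillReferenced` returns a non-empty list, the webhook would deny the CRD update with a message naming the blocking resources.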

Anything else you would like to add?

Somewhat related: #6539 (as it might lead to us dropping apiVersion from all our refs except the ones in ClusterClass)

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 25, 2023
@sbueringer
Member Author

@sbueringer sbueringer added the area/api Issues or PRs related to the APIs label Apr 25, 2023
@killianmuldoon
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 25, 2023
@killianmuldoon
Contributor

One clarification - how does this relate to

Ensure clusterctl upgrade warns users if the upgrade would break them (e.g. v0.3 => v1.x (version without v1alpha3 apiVersion))
This should already be covered by our CRD migration code, but let's verify it.
This might only become relevant with v1.6 when we actually remove the apiVersion from the CRDs

From #8038

@sbueringer
Member Author

sbueringer commented Apr 25, 2023

Hm, totally independent. The point of #8038 is only that clusterctl should correctly fail the CRD update if we jump from a version which doesn't have v1beta1 to a version which drops v1alpha3 (basically because that's not a valid CRD update in general).

@fabriziopandini
Member

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 1, 2023
@sbueringer
Member Author

sbueringer commented May 2, 2023

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

  1. I would do it in the webhook so everyone profits, not only clusterctl users. (We don't need the discovery logic if we only implement it for core resources.)
  2. I think we don't have to encode that knowledge if we only implement it for core CAPI resources. Specifically, I think only ClusterClass matters; in all other resources we automatically bump the apiVersion based on the contract (which is also what providers should do). ClusterClass should be the only case where we intentionally use specific apiVersions (because we have to).
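For illustration, here is a hypothetical ClusterClass (names are made up) that pins an explicit apiVersion in one of its refs; removing that version from the referenced CRD would break the class:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  infrastructure:
    ref:
      # Explicit apiVersion: dropping v1beta1 from the DockerClusterTemplate
      # CRD would leave this reference dangling.
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerClusterTemplate
      name: example-cluster
```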

@ykakarap
Contributor

ykakarap commented May 11, 2023

So, a validating webhook for kind: CustomResourceDefinition that checks that an update to a CRD whose .spec.group is one of controlplane.cluster.x-k8s.io, bootstrap.cluster.x-k8s.io or infrastructure.cluster.x-k8s.io does not drop a version that is still used by at least one ClusterClass. Is my understanding correct?

@sbueringer
Member Author

Pretty much. I think we don't have to filter early by group. We can just take the group + kind etc. and check whether it occurs in a ClusterClass.

This way we can also support CRDs in other groups. AFAIK there is no contract requiring a provider to use one of those groups.

@ykakarap
Contributor

You are right! Being in those groups is not part of the contract (https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html#cluster-api-controllers). We will just have to run it for all CRDs.

@sbueringer
Member Author

I think that's okay. It's just important that we use failurePolicy: Ignore. This ensures that the cluster is not broken if CAPI is not up for some reason.
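A sketch of what the webhook registration could look like (all names and the path are illustrative assumptions, not an agreed design); the key point is failurePolicy: Ignore so the check fails open:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: capi-crd-version-protection
webhooks:
  - name: validation.crd.cluster.x-k8s.io
    # Fail open: if the CAPI webhook server is unreachable, the CRD update
    # proceeds unchecked instead of deadlocking the cluster.
    failurePolicy: Ignore
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: capi-webhook-service
        namespace: capi-system
        path: /validate-crd-versions
    rules:
      - apiGroups: ["apiextensions.k8s.io"]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["customresourcedefinitions"]
```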

This will make the whole thing a bit best-effort but I think that's acceptable.

@Ankitasw
Member

Ankitasw commented Jun 8, 2023

Is this issue up for grabs? I would like to work on it if no one has started.

@killianmuldoon
Contributor

Is this issue up for grabs? I would like to work on it if no one has started.

I think you can feel free to assign yourself - thanks for taking a look at this!

@Ankitasw
Member

Ankitasw commented Jun 8, 2023

/assign

@Ankitasw
Member

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 11, 2023
@k8s-ci-robot k8s-ci-robot removed the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jul 6, 2023
@killianmuldoon
Contributor

@Ankitasw have you been able to make progress on this?

@Ankitasw
Member

@killianmuldoon I couldn't start on this, and I might not have the bandwidth either. I'll unassign it for now and come back if it's still unpicked.

/unassign

@killianmuldoon
Contributor

/assign

Hopefully I'll get around to this, but if anyone else wants to, please feel free to pick it up. I'm mostly assigning myself to keep track of this one as part of #8038.

@killianmuldoon
Contributor

/unassign
/help

We should really try to find a solution for this as we move towards removing v1alpha4.

@Ankitasw in case you have more bandwidth now.

@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024