
Prevent removal of apiVersions while they are still used #8566

Open
sbueringer opened this issue Apr 25, 2023 · 20 comments
Labels
  • area/api: Issues or PRs related to the APIs
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Member

sbueringer commented Apr 25, 2023

What would you like to be added (User Story)?

As a user, it would be nice if Cluster API blocked the removal of CRD apiVersions as long as they are still in use.

Detailed Description

Today it is possible that during a Cluster API upgrade (e.g. when bumping an infra provider) an apiVersion is removed. If this apiVersion is still referenced in other resources, those resources will no longer reconcile.

Affected resources:

  • ClusterClass
  • Cluster, MD, MS, Machine, ...
    • Refs are probably not affected as they don't actually use the apiVersion from the object spec (they calculate the version they use based on the contract label on the CRD).
      • Would be good to double check this again.
    • OwnerRefs
    • They might not be super relevant as our controllers always keep the ownerRefs up to date

A possible solution is to implement a webhook which validates CRD updates. In this webhook we could check if the update on the CRD removes an apiVersion which is still referenced in relevant resources (at least ClusterClass).

Note: This webhook should probably have failurePolicy: Ignore to avoid deadlocks if Cluster API is not working for some reason.
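The core of the check described above could look roughly like the following minimal sketch in plain Go. The function names, the flattened reference-key format, and the example group/kind are illustrative assumptions; a real webhook would decode the old and new CRD objects from the admission request and list ClusterClasses through the API server:

```go
package main

import "fmt"

// droppedVersions returns the version names that a CRD update removes.
// oldVersions/newVersions are the names from the old and new spec.versions.
func droppedVersions(oldVersions, newVersions []string) []string {
	current := map[string]bool{}
	for _, v := range newVersions {
		current[v] = true
	}
	var dropped []string
	for _, v := range oldVersions {
		if !current[v] {
			dropped = append(dropped, v)
		}
	}
	return dropped
}

// stillReferenced reports which dropped versions of the given group/kind are
// still referenced. refs is a set of "group/version kind" keys collected from
// relevant resources (e.g. ClusterClass refs); the key format is made up for
// this sketch.
func stillReferenced(group, kind string, dropped []string, refs map[string]bool) []string {
	var inUse []string
	for _, v := range dropped {
		if refs[group+"/"+v+" "+kind] {
			inUse = append(inUse, v)
		}
	}
	return inUse
}

func main() {
	// Hypothetical state: one ClusterClass still references the v1alpha4
	// version of a DockerClusterTemplate.
	refs := map[string]bool{
		"infrastructure.cluster.x-k8s.io/v1alpha4 DockerClusterTemplate": true,
	}
	dropped := droppedVersions([]string{"v1alpha4", "v1beta1"}, []string{"v1beta1"})
	inUse := stillReferenced("infrastructure.cluster.x-k8s.io", "DockerClusterTemplate", dropped, refs)
	fmt.Println(inUse) // prints [v1alpha4] -> the webhook would deny the update
}
```

If `stillReferenced` returns a non-empty list, the webhook would deny the CRD update with a message naming the blocking resources.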

Anything else you would like to add?

Somewhat related: #6539 (as it might lead to us dropping apiVersion from all our refs except the ones in ClusterClass)

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 25, 2023
@sbueringer
Member Author

@sbueringer sbueringer added the area/api Issues or PRs related to the APIs label Apr 25, 2023
@killianmuldoon
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 25, 2023
@killianmuldoon
Contributor

One clarification - how does this relate to

Ensure clusterctl upgrade warns users if the upgrade would break them (e.g. v0.3 => v1.x (version without v1alpha3 apiVersion))
This should already be covered by our CRD migration code, but let's verify it.
This might only become relevant with v1.6 when we actually remove the apiVersion from the CRDs

From #8038

@sbueringer
Member Author

sbueringer commented Apr 25, 2023

Hm, totally independent. The point of #8038 is only that clusterctl should correctly fail the CRD update if we jump from a version which doesn't have v1beta1 to a version which drops v1alpha3 (basically because that's not a valid CRD update in general).

@fabriziopandini
Member

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 1, 2023
@sbueringer
Member Author

sbueringer commented May 2, 2023

This is a nice check to have, but it comes with a couple of problems to be addressed, top of mind:

  1. Where to do the check (webhook or clusterctl)
  2. How to encode the knowledge of where references exist

For 1, it seems to me that the simplest way forward is to rely on clusterctl, because it already has all the discovery logic that is required for performing such a check (I'm assuming the discovery logic that exists in clusterctl move can be re-used for this task).

For 2, we can start thinking of a solution that focuses on references existing in core CAPI types, so we cover 90% of the issue. Then, in a follow-up iteration, we can extend this for references existing in provider types.

  1. I would do it in the webhook so everyone profits, not only clusterctl users. (We don't need the discovery logic if we only implement it for core resources.)
  2. I think we don't have to encode that knowledge if we only implement it for core CAPI resources. Specifically, I think only ClusterClass matters; in all other resources we automatically bump the apiVersion based on the contract (which is also what providers should do). ClusterClass should be the only case where we intentionally use specific apiVersions (because we have to).
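For illustration, here is a hypothetical ClusterClass (names are made up) that pins an explicit apiVersion in one of its refs; removing that version from the referenced CRD would break the class:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  infrastructure:
    ref:
      # Explicit apiVersion: dropping v1beta1 from the DockerClusterTemplate
      # CRD would leave this reference dangling.
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerClusterTemplate
      name: example-cluster
```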

@ykakarap
Contributor

ykakarap commented May 11, 2023

So, a validating webhook for kind: CustomResourceDefinition that checks that an update to a CRD whose .spec.group is one of controlplane.cluster.x-k8s.io, bootstrap.cluster.x-k8s.io or infrastructure.cluster.x-k8s.io does not drop a version that is still used by at least one ClusterClass. Is my understanding correct?

@sbueringer
Member Author

Pretty much. I think we don't have to filter early by group. We can just take the group + kind etc. and check whether it occurs in a ClusterClass.

This way we can also support CRDs in other groups. AFAIK there is no contract requiring a provider to use one of those groups.

@ykakarap
Contributor

You are right! Being in those groups is not part of the contract (https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html#cluster-api-controllers). We will just have to run it for all CRDs.

@sbueringer
Member Author

I think that's okay. It's just important that we use failurePolicy: Ignore. This ensures that the cluster is not broken if CAPI is not up for some reason.
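A sketch of what the webhook registration could look like (all names and the path are illustrative assumptions, not an agreed design); the key point is failurePolicy: Ignore so the check fails open:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: capi-crd-version-protection
webhooks:
  - name: validation.crd.cluster.x-k8s.io
    # Fail open: if the CAPI webhook server is unreachable, the CRD update
    # proceeds unchecked instead of deadlocking the cluster.
    failurePolicy: Ignore
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: capi-webhook-service
        namespace: capi-system
        path: /validate-crd-versions
    rules:
      - apiGroups: ["apiextensions.k8s.io"]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["customresourcedefinitions"]
```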

This will make the whole thing a bit best-effort but I think that's acceptable.

@Ankitasw
Member

Ankitasw commented Jun 8, 2023

Is this issue up for grabs? I would like to work on it if no one has started.

@killianmuldoon
Contributor

Is this issue up for grabs? I would like to work on it if no one has started.

I think you can feel free to assign yourself - thanks for taking a look at this!

@Ankitasw
Member

Ankitasw commented Jun 8, 2023

/assign

@Ankitasw
Member

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 11, 2023
@k8s-ci-robot k8s-ci-robot removed the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jul 6, 2023
@killianmuldoon
Contributor

@Ankitasw have you been able to make progress on this?

@Ankitasw
Member

@killianmuldoon I couldn't start on this, and I might not have the bandwidth either. I'll unassign it for now and come back if it's still unpicked.

/unassign

@killianmuldoon
Contributor

/assign

Hopefully I'll get around to this, but if anyone else wants to, please feel free to pick it up. I'm mostly assigning myself to keep track of this one as part of #8038.

@killianmuldoon
Contributor

/unassign
/help

We should really try to find a solution for this as we move towards removing v1alpha4.

@Ankitasw in case you have more bandwidth now.

@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024