Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2839: Add KEP for in-use protection #2840

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mkimuram
Copy link
Contributor

  • One-line PR description: A generic feature to protect objects from deletion while it is in use
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 27, 2021
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 27, 2021
@bgrant0607
Copy link
Member

Cross-linking an old issue: kubernetes/kubernetes#10179

@mkimuram
Copy link
Contributor Author

mkimuram commented Aug 4, 2021

@lavalamp @bgrant0607

Sharing just a rough idea.

Just for preventing deletion, the below design will be enough (We still need to consider how to protect deletion of LienClaim itself). Also, update may be implemented in the same way, like adding "spec-update" to LienClaim.spec.prevents and appending uuid to *.metadata.spec-update-liens. However, it would be difficult to define preventing update for specific fields as we can specify through "field manager", by this way.

  1. User creates LienClaim in their namespace with specifying targets and prevents.
apiVersion: v1beta1
kind: LienClaim
metadata:
  namespace: ns1
  name: my-lien
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  prevents:
  - delete
  1. Lien controller watches LienClaim and create cluster-scoped Lien and append its name to *-liens, or delete-liens in the below example, to the target if a new LienClaim is created. Note that a name for Lien is generated as UUID.
apiVersion: v1beta1
kind: Lien
metadata:
  name: 12345678-1234-1234-1234-1234567890ab
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  source:
    namespace: ns1
    name: my-lien
  prevents:
  - delete
apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: important-secret
  delete-liens:
  - 12345678-1234-1234-1234-1234567890ab
type: Opaque
data:
  1. Lien controller may also update LienClaim status if 2 succeeds
apiVersion: v1beta1
kind: LienClaim
metadata:
  namespace: ns1
  name: my-lien
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  prevents:
  - delete
status:
  lien: 12345678-1234-1234-1234-1234567890ab
  phase: preventing
  1. On deletion of resources, lien admission controller checks if metadata.delete-liens is empty. If not, return error to block deletion.

  2. Lien controller watches LienClaim and delete the lien from *-liens on the target and delete Lien if the LienClaim is deleted.

@mkimuram
Copy link
Contributor Author

mkimuram commented Aug 4, 2021

From the viewpoint of consuming lien from secret protection, which is my biggest concern. How to create LienClaim before creating referencing objects and delete LienClaim after deleting referencing objects is need to be considered.
(I'm not sure if mutation hook or ownerReference can achieve it. Although, we can at least leave it for users.)

@lavalamp
Copy link
Member

lavalamp commented Aug 4, 2021

Why do you want to have lien objects? Why are text strings not sufficient (as used for finalizers). Adding additional object types adds more places to have races, I'm against it unless you can convince me it's absolutely necessary.

@mkimuram
Copy link
Contributor Author

mkimuram commented Aug 4, 2021

Why do you want to have lien objects?

It's to make it easier to manage who has control over the specific lien in the delete-liens field.

If multiple liens are set to a single resource, we will need to ensure that the specific lien is managed by the client/controller when deleting the lien. We may add the LienClaim's identity information to delete-liens but it will tend to be long and difficult to much.

Instead, we might just add LienClaim's uid. However, just adding uid, it is difficult for the owner of the resource to find from the uid. Also, I'm not sure if object uid can be regarded as unique. Creating an object with the name should ensure it.

Adding additional object types adds more places to have races, I'm against it unless you can convince me it's absolutely necessary.

Agree. If above concern can be solved by the other way, I agree to remove cluster-scoped Lien object.

@lavalamp
Copy link
Member

lavalamp commented Aug 4, 2021

You want to ACL liens? I don't see a need to enforce this any more than we do for finalizers. And I would not solve that problem by adding Lien or LienClaim objects. I don't think a solution here should require any additional object types.

@mkimuram
Copy link
Contributor Author

mkimuram commented Aug 4, 2021

If only talking about the use case for secret protection, the feature needed is like block deleting "Secret A" that can be used by "Pod B" and "PersistentVolume C" while the secret is used. So, something like below will be enough.

apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: A
  delete-liens: 
    - "ID to show that Pod B is using"
    - "ID to show that PersistentVolume C is using"
type: Opaque
data:

And to generalize it, I was thinking about a way to manage such ID for each reason to avoid conflict in deleting liens by lien system's side. However, we can leave it to consumers.

@mkimuram
Copy link
Contributor Author

mkimuram commented Aug 4, 2021

And also thinking about using a lien per controller, not per reason, like below.

apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: A
  delete-liens: 
    - "k8s.io/secret-protection-controller"
type: Opaque
data:

Then, I start to think about which will be easier "implementing own admission hook to block deletion" or "implementing controller to add lien"?

(Obviously, for users who would like to add such protection manually, lien systems is useful as this shows, but for controllers, it might not.)

@lavalamp
Copy link
Member

lavalamp commented Aug 4, 2021

There is no need for a lien controller. We would add code to kube-apiserver (reject deletion if there are liens, don't permit new liens if deletion timestamp is set).

We might want slightly more structure than raw text.

You can prototype today with e.g. specially formatted annotations and a webhook admission controller.

@lavalamp
Copy link
Member

lavalamp commented Aug 4, 2021

Sorry, I misread :) A lien per controller or per reason doesn't matter from the perspective of the lien mechanism, it's up to controller authors how they want to use the mechanism.

@mkimuram
Copy link
Contributor Author

@lavalamp

You can prototype today with e.g. specially formatted annotations and a webhook admission controller.

I've implemented a prototype of lien as this.
It is separated to 4 commits, but only 2 commits, the first commit and the last commit, will be the points of interest.

I will update the KEP based on this prototype.

Note that I've tested below way and work as shown below:

  • Deploy:
ENABLE_ADMISSION_PLUGINS=Lien hack/local-up-cluster.sh 
  • Test:
  1. Without liens, deletion is not blocked
kubectl create secret generic test-secret --from-literal='username=my-app' --from-literal='password=39528$vdg7Jb'
kubectl get secret test-secret -o jsonpath='{.metadata.liens}{"\n"}'

kubectl delete secret test-secret
secret "test-secret" deleted
  1. With liens, deletion is blocked, and once all liens are removed, deletion is not blocked
kubectl create secret generic test-secret --from-literal='username=my-app' --from-literal='password=39528$vdg7Jb'
kubectl patch secret test-secret -p '{"metadata":{"liens":["never delete me"]}}' --type=merge
secret/test-secret patched

kubectl get secret test-secret -o jsonpath='{.metadata.liens}{"\n"}'
[never delete me]

kubectl delete secret test-secret
Error from server (Forbidden): secrets "test-secret" is forbidden: deletion not allowed by liens

kubectl patch secret test-secret -p '{"metadata":{"liens":[]}}' --type=merge
secret/test-secret patched

kubectl delete secret test-secret
secret "test-secret" deleted

@mkimuram mkimuram force-pushed the issue/2839 branch 2 times, most recently from 4ec78cb to 659be2f Compare August 16, 2021 18:38
@mkimuram
Copy link
Contributor Author

@lavalamp

I've updated the KEP. PTAL

Also, I will upate the prototype of secret-protection to rely on this in-use protection mechanism to check the feasibility and also update the KEP for secret-protection.


To minimize issues caused by a cross namespace dependency, it should be considered to implement
a mechanism to opt-in/opt-out adding liens across namespace in such controllers.
For example, a controller may choose to add `Liens` only to resources that has `example.com/sample-controller/allow-cross-namespace-liens: true` annotation, if the dependency resource isn't in the same namespace.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend removing this whole section, honestly. I think it's a separate discussion for each controller and I don't think this KEP should recommend a general solution; I'm not convinced there's a problem at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I would like to confirm that this situation is allowed by design from community. If it is allowed, I agree to delete such a mechanism to avoid it completely.

@sftim
Copy link
Contributor

sftim commented Jun 3, 2023

Here's how a docs framing can help work out a viable way forward: write (or sketch) two things.

  1. The documentation page for liens, with security advice and including any relevant cautionary messages for cluster operators and controller authors.
  2. The blog announcement for when these reach on-by-default (assume there wasn't any previous detailed blog article), again including any message of caution.

Sometimes - rarely - I've tried documenting a thing, the required words of caution have seemed faintly ridiculous, and the implementers have felt the same on seeing a draft. There's a redesign or something to avoid the worst of the snags.
I'm proposing these docs / blog article sketches as a way to get the same feedback earlier.

@thockin
Copy link
Member

thockin commented Jun 5, 2023

@mkimuram is this something you want to push for in the current cycle?


Most of the design ideas come from [here](https://github.com/kubernetes/kubernetes/issues/10179).
Users or controllers can add a string to `Liens` to ask to protect a resource.
A newly introduced validation in api-server will reject deletion requests for resources with non-empty `Liens` and return ["409 Conflict" error code](https://datatracker.ietf.org/doc/html/rfc2616#section-10.4.10).
Copy link

@negz negz Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts the onus on clients to keep retrying if they want something to be deleted, right?

I prefer the declarative, eventually consistent nature of the current PVC protection implementation where a client can declare that it wants a thing to be deleted, and once it's safe to delete the thing, it is. We've built similar machinery to this in Crossplane and I believe it works pretty well.

For our use case "eventually consistent deletes" has nice symmetry with other operations. Most Crossplane resources correspond to some external system like a cloud provider. Let's say I want to create a GCP Subnet with Crossplane, but doing so requires the VPC it lives in to exist first. We don't reject the creation of the Subnet object in the API server because the VPC doesn't exist. Instead we accept the desired state (that a Subnet should exist) and our controller keeps trying to create it. Presumably it will eventually succeed when when the VPC exists.

Is there some way to address the requirement that deletes "don't start" (in the controller-reconciling-the-delete sense) without out-right rejecting the desire to delete a resource by an entity that has permission to do so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where a client can declare that it wants a thing to be deleted, and once it's safe to delete the thing, it is

This is a very different semantic than "I want to delete this" -> "no". The only way I know to achieve what you describe is with a finalizer, but this is not the right semantic. At that point, the object in question is VISIBLY being deleted and async controllers are doing their own cleanup. This is way to signal "begin the end-of-life process", not a way to delay it it.

Is there a different mechanism you are thinking of?

Copy link

@negz negz Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very different semantic than "I want to delete this" -> "no".

Right. This KEP appears to propose that all forms of "in-use protection" use "I want to delete this" -> "no" semantics. Those semantics appear to be different from how the contemporary PV and PVC implementations of in-use protection work, which are more like "I want to delete this" -> "okay, eventually". For Crossplane's use-case, I prefer the semantics of the contemporary implementations to what is proposed by this KEP.

My understanding is that PV and PVC in-use protection are more like "I want to delete this" -> "okay, eventually, when not in-use". The deleted objects transition into a Terminating state and go away once they're no longer in-use. We have the same "okay, eventually" semantics for Crossplane's ProviderConfig in-use protection, and are looking to generalize the pattern.

Is there a different mechanism you are thinking of?

I'm not sure what you're asking. Can you elaborate?

To be clear - I think I understand the issues with finalizers and agree that they're not a good solution. I also don't have a good solution in mind. 🙂 What I would like is that somehow:

  • If a client asks for an object be deleted
  • The object can't be deleted yet because doing so would break some other, dependent object
  • The delete (i.e. desired state) be accepted and reconciled when possible

I don't think my preferred semantics necessarily make sense for all forms of deletion protection. The "Motivation" section of this KEP states that it's attempting to solve for two use cases:

  1. Guaranteeing proper order of deletion (i.e. the PV and PVC use case, and Crossplane's)
  2. Avoiding accidental deletion of important resources

I feel that "I want to delete this" -> "no" semantics makes sense for the second use case, but not the first.

Copy link

@fabiand fabiand Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In KubeVirt we have a use-case with same or similar requirements:

  1. We are running our processes in the original pod
  2. There is an eviction request for this pod
  3. Upon this request we want to transfer some state of the processes to a new pod
  4. Upon the completed state transfer, the original pod is now free to be terminated

Important aspects are
a. Application can observe eviction request
b. Pod stays running until the state is transfered
c. caller of eviction is aware that the pod is preparing a shut-down

finalizers do not work, because the pod will be stuck in terminating state, even if the state transfer fails, and we need ot keep the original pod
pdbs protect the pods, but have no way to tell the caller that the eviction was observed and application is startng the process (we use this today, but it's leading to trouble, i.e. with descheduler).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a. Application can observe eviction request
b. Pod stays running until the state is transfered

That sounds like grounds for a separate enhancement. Right now evictions are, AIUI, either accepted immediately or denied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabiand We could model this a few ways, but evictions are generally "urgent", right? It's not clear to me how that should intersect with any sort of deletion delay.

Copy link

@negz negz Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin bringing your related comment back into this thread: 😄

To do what you want, we need a new lifecycle which includes some indefinite pre-delete state.

That may be the case, but what we want does seem in-scope for the goals defined by this design - i.e. guaranteeing proper order of deletion.

Take this manifest:

pod-with-volume.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"

Today I can kubectl apply -f pod-with-volume.yaml and despite there being dependencies between the resources, the create just works and the controllers eventually arrive at my desired state - that they're created. I can kubectl delete -f pod-with-volume.yaml too. My desired state (that all this stuff be deleted) is immediately accepted, and eventually it's all gone thanks to (ab)use of finalizers.

I believe under this proposal the kubectl delete -f pod-with-volume.yaml UX would change. The Pod would delete immediately but the PersistentVolume and PersistentVolumeClaim would fail with a 409 Conflict, and I'd basically have to retry a couple of times for everything to actually get deleted.

This does technically achieve the goal of guaranteeing proper order of deletion - it just seems like a pretty poor UX for doing so.

Copy link

@fabiand fabiand Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin I was not ware that evictions had the notion of urgency.
Evictions respect PDBs, evictions are less hard than DELETE. Thus isn't eviction rather the less urgent approach and delete the real urgent way of getting rid of workloads?
fwiw kubectl drain --force --grace-period=0 is urgent.

Thus to me a delay to an eviction is similar to systemd inhibitors - a way to allow workloads to shut down gracefully. or even fail. And that's the difference: Fail gracefully, give admin/owner time to eeact to this. Before an admin jumps in with --force --grace-period=0

and to me this relates to kubernetes/kubernetes#66811 - making eviction declarative.

@thockin
Copy link
Member

thockin commented Jun 10, 2023

Is there a different mechanism you are thinking of?

I'm not sure what you're asking. Can you elaborate?

I meant to ask if there is some other mechanism you know of that I don't :) It's a big project nowadays.

To do what you want, we need a new lifecycle which includes some indefinite pre-delete state.

@fabiand
Copy link

fabiand commented Sep 8, 2023

I've came across https://github.com/adobe/k8s-shredder#how-it-works

K8s-shredder will periodically run eviction loops, based on configured EvictionLoopInterval, trying to clean up all the pods from the parked nodes. Once all the pods are cleaned up, cluster-autoscaler should chime in and recycle the parked node.

To me this connects with the desire to make applications aware of interruption requests.
Without further effort, PDBs will reject evictions, apps don't know about it.

If we had an i.e. taint based draining, apps could watch this life-cycling - terminate themselfes properly, and free the node eventually.

k8s-shredder would not need to do (imperative) eviction loops, instead: Taint (declare) a node to be drained, watch for node to be free.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2024
@porridge
Copy link
Member

porridge commented Jan 22, 2024 via email

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@turkenh
Copy link

turkenh commented Feb 15, 2024

We have implemented a new API named Usage in Crossplane which is following a similar approach as proposed here, e.g. block deletion of objects in use with an admission webhook by returning a 409.

This works fine in most cases, except the degraded experience due to huge delays caused by the backoff in Kubernetes Garbage collector. We discussed some possible solutions during the design but couldn't find any elegant solution. Basically, we need a way to reset garbage collectors backoff on a given resources. Is there were a way to achieve this? Just like most controllers, if making a change on a given resource (e.g. add an annotation/label) could requeue the resource immediately to the garbage collector work queue, we could just do that. Apparently this is not possible.

Is there a way to reset garbage collectors backoff period by patching the resource somehow? Otherwise, the proposal here is in subject to the same problem I believe.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2024
@thockin
Copy link
Member

thockin commented May 15, 2024

I repeat myself: #2840 (comment)

@deads2k any softening of position?

@sftim
Copy link
Contributor

sftim commented May 16, 2024

You know what, I'd like to repeat #2840 (comment) as well. I still recommend: write viable docs for the thing, then build it if we like what we see.

@porridge
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024
@porridge
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2024
@porridge
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cli Categorizes an issue or PR as relevant to SIG CLI. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.