Postpone Deletion of a Persistent Volume Claim in case It Is Used by a Pod #1174
Conversation
- In case the PVC is used by a pod: sets condition Terminating to true for the PVC.

### Scheduler
Scheduler will check if a pod uses a PVC and if any of the PVCs has condition Terminating set to true. In case this is true the scheduling of the pod will end up with an error: "PVC (%pvcName) is in Terminating state".
Can you just check if the PVC has the deletion timestamp set?
anything that reacts to absence of a PV or PVC (scheduler, kubelet) should also react to the presence of a deletionTimestamp on a PV or PVC
Let me first think out loud about the `deletionTimestamp`. Its description states: "The resource will be deleted (no longer visible from resource lists, and not reachable by name) after the time in this field." So if the `deletionTimestamp` is set for a PVC, the PVC must be deleted after the time in the `deletionTimestamp`. However, if the PVC is used by a pod, it is not guaranteed that the pod won't still be using the PVC after that time. That's why I prefer not to set the `deletionTimestamp` on PVCs, but to use the Terminating condition to record that the PVC is going to be deleted.
On the other hand, the scheduler must be able to decide whether a PVC is going to be deleted, in order to fail scheduling of a pod that uses such a PVC.
So it would be better if the condition "PVC is going to be deleted" were: `deletionTimestampIsSet(PVC) || terminatingConditionIsTrue(PVC)`.
An object with a deletionTimestamp and non-empty finalizers blocks until finalizers are removed. The description of deletionTimestamp should be updated. The scheduler and kubelet should not schedule or mount based on PVs or PVCs with deletionTimestamps set
I assume that after `kubectl delete pvc pvc-name` the PVC may or may not have the `deletionTimestamp` set. So it's not possible to rely solely on `deletionTimestamp` for finding out whether a PVC is being deleted or not. Am I right?
I assume that after kubectl delete pvc pvc-name the PVC may or may not have the deletionTimestamp set
Not sure what you mean. An effective DELETE API call either sets the deletionTimestamp or completely removes the object from etcd.
So in the case of a PVC with a finalizer set, the DELETE API call will always set the `deletionTimestamp`, as the deletion has to be postponed because of the finalizer.
Thus, there can't be a PVC that is being deleted with `deletionTimestamp == nil`.
That means there is no need to use the Terminating condition to mark a PVC that is being deleted.
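For illustration, a minimal Go sketch of the check this conclusion allows, assuming current `k8s.io/api` packages and an illustrative finalizer name (neither is fixed by this proposal):

```go
// Minimal sketch (not code from this proposal): a PVC protected by a finalizer
// is "being deleted" exactly when its deletionTimestamp is non-nil.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcIsBeingDeleted reports whether the PVC has been marked for deletion.
// A DELETE call against a PVC that carries a finalizer only sets the
// deletionTimestamp; the object stays visible until the finalizer is removed.
func pvcIsBeingDeleted(pvc *v1.PersistentVolumeClaim) bool {
	return pvc.DeletionTimestamp != nil
}

func main() {
	now := metav1.Now()
	pvc := &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "data-pvc",
			Finalizers:        []string{"kubernetes.io/pvc-protection"}, // illustrative finalizer name
			DeletionTimestamp: &now,
		},
	}
	fmt.Println(pvcIsBeingDeleted(pvc)) // prints "true"
}
```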
When a PVC is created, a finalizer is added into its metadata.

### Finalizer
Finalizer can list pods and watch for pods. It caches status of pods that use a PVC. It also watches for PVC deletes.
Is this a new controller to handle PVC finalizer?
Yes, it's a new internal controller.
Finalizer can list pods and watch for pods. It caches status of pods that use a PVC. It also watches for PVC deletes.

Finalizer reacts to PVC delete in one of the below ways:
- In case the PVC is not used by a pod: removes the finalizer information from the PVC metadata.
Could you have a race condition between adding and deleting pods?
- Pod A is currently using PVC.
- Pod A is deleted.
- Soon after, Pod B is created and scheduled that uses PVC.
- PVC is deleted.
- Informer gets Pod delete for Pod A, sees no other Pod is using the PVC in its cache, and removes finalizer.
- Informer gets Pod add for Pod B, but finalizer is already deleted.
Presence of a PV or PVC deletion timestamp should fail setting up a PVC volume for a pod. That would prevent new pods from starting once the PV or PVC had started deletion
What if Pod B got setup with the volume before the informer receives the updates?
I agree with @msau42 that the informer may get the PVC delete before Pod B's setup, so if the finalizer is removed immediately after the PVC delete there is a chance for a race condition.
Currently, I can't think of any other solution to the race condition than to postpone the finalizer removal from the PVC for at least `deletionPostponePeriod`.
That's not really safe either. The controller could be down.
Isn't this best effort? How strong of a guarantee is required? The description above implies a strong guarantee isn't required, and is vague about what the consequences are.
That's not really safe either.
I agree.
The controller could be down.
If the controller is down, the finalizer is not being removed from PVCs, so PVCs are not being deleted and PVCs marked for deletion pile up.
In addition, the controller is not setting the finalizer on newly created PVCs.
Isn't this best effort? How strong of a guarantee is required? The description above implies a strong guarantee isn't required, and is vague about what the consequences are.
The worst possible consequence is data loss in some scenarios, which always depend on the storage type being used:
- in case dynamic provisioning is used and the reclaim policy is `Delete`, the PVC deletion triggers deletion of the associated storage asset and PV;
- the same applies for static provisioning with the `Delete` reclaim policy.
So, IMHO, it makes sense to try to find a solution for the above race condition.
…finalizer-info Automatic merge from submit-queue. Update deletionTimestamp with information about finalizer effect. Updating description of `deletionTimestamp` as advised [here](#1174 (comment)).
I addressed some of the comments in the update.

In case a pod is being scheduled and is using PVCs that do not have the PVCUsedByPod condition set, it will set this condition for these PVCs.

In case a pod is terminated and was using PVCs, the scheduler will update the PVCUsedByPod condition for these PVCs accordingly.
The scheduler does not handle pod termination today.
So it would mean extending the scheduler to handle pod termination. I assume that makes this a no-go alternative.
Note: the scheduler is the source of truth for pods that are being started. The information on active pods may be a little outdated, which means the deletion of a PVC may be postponed (the pod status in the scheduler is active while the pod is terminated in the API server), but this does not cause any harm.

The disadvantage is that the scheduler would become responsible for postponing PVC deletion, which would make the scheduler bigger.
Yeah I don't think it's in the scheduler's scope to handle PVC deletion.
I agree. So this is a no-go alternative.
2. Pod A is deleted.
3. Soon after, Pod B is created and scheduled that uses PVC.
4. PVC is deleted.
5. PVC deletion controller gets Pod delete for Pod A, sees no other Pod is using the PVC in its cache, and removes finalizer.
Could the race be solved if instead of removing the finalizer immediately, you resync all the Pods first?
Having the PVC deletion controller resync all the pods before removing the finalizer is not enough. There must also be a PVC resync done in the scheduler or in the kubelet. IMHO, these two resyncs will solve the race. I'll update the proposal accordingly.
There must be a PVC resync done in scheduler or in kubelet. IMHO, these two resyncs will solve the race. I'll update the proposal accordingly.
Not sure what you mean by resync in the kubelet... the kubelet does a live lookup of the PVC when setting up the pod's volume. If the PVC has a deletionTimestamp, the kubelet should not continue setting up the volume.
Not sure what you mean by resync in the kubelet... the kubelet does a live lookup of the PVC when setting up the pod's volume. If the PVC has a deletionTimestamp, the kubelet should not continue setting up the volume.
I didn't know that the kubelet does a live lookup of the PVC(s) used by the pod, so I wanted to add it.
If the PVC has a `deletionTimestamp`, the kubelet must not start the pod.
## Implementation

### API Sever, PVC Admission Controller, PVC Create
Typo : Sever -> Server
Thanks for spotting it.
Should we consider using the same process for postponing deletion of PVs that are actively used by pods?
Should we consider having the
### Kubelet
Kubelet currently does a live lookup of the PVC(s) that are used by a pod.

In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, the kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted"
Can log a specific message, but should follow the same code path as if the PVC was not found (fails in volume mount creation stage, retries N times, etc)
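As a sketch of the behaviour described here (not the kubelet's actual volume-setup code path), a guard like the following could treat a PVC carrying a `deletionTimestamp` the same way as a missing PVC so the existing retry behaviour applies, while returning the proposal's more specific message; the function name is hypothetical and the context-taking `Get` call matches recent client-go releases:

```go
// Illustrative sketch only: not the kubelet's real volume-setup code.
package volumecheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkPVCUsable does a live lookup of the PVC and fails when the PVC is
// missing or already marked for deletion, so the caller can follow the same
// retry path in both cases.
func checkPVCUsable(ctx context.Context, client kubernetes.Interface, namespace, pvcName, podName string) error {
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		// Covers the "PVC not found" case; volume setup is retried as today.
		return err
	}
	if pvc.DeletionTimestamp != nil {
		return fmt.Errorf("can't start pod (%s) because it's using PVC (%s) that is being deleted", podName, pvcName)
	}
	return nil
}
```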
In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, the kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted"

Note: as soon as all the PVC(s) with `deletionTimestamp` set are deleted (in case none of the PVC(s) have the finalizer set), the pod that now uses non-existent PVC(s) will be started and will be in pending state (waiting for the non-existent PVC(s) to be created). Otherwise, in case at least one PVC has the finalizer set, i.e. the PVC deletion controller is waiting for the scheduled pod to be terminated, there will be a deadlock: the kubelet waits for PVC deletion while the PVC deletion controller waits for pod termination. The deadlock can be resolved only by deleting the scheduled pod.
Kubelets do not retry volume mounts forever. When it fails, the pod should no longer block deletion of the PVC
### PVC Deletion Controller
PVC deletion controller is a new internal controller.

PVC deletion controller can list pods and watch for pods. It caches status of pods that use a PVC. It also watches for PVC deletes.
Typically, the pattern used here is informers on both resources feeding a single workqueue. In this case, the controller should watch both pods and PVCs:
For PVC add/update/delete events:
- if deletionTimestamp is nil and finalizer is missing, add to queue
- if deletionTimestamp is non-nil and finalizer is present and first, add to queue
For pod add events:
- if pod is terminated, add all referenced PVCs to queue
For pod update events:
- if pod is changing from non-terminated to terminated, add all referenced PVCs to queue
For pod delete events:
- add all referenced PVCs to queue
When processing the PVC queue:
- If PVC is not found in cache, skip
- If PVC is in cache with nil deletionTimestamp and missing finalizer, attempt to add finalizer, requeue on error
- If PVC is in cache with non-nil deletionTimestamp and present finalizer at first position, do live pod list in namespace, and if all pods referencing the PVC are not yet bound to a node or are terminated, attempt to remove the finalizer, requeue on error
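A hedged sketch of the informer-plus-single-workqueue wiring described above, assuming a recent client-go and the illustrative finalizer name `kubernetes.io/pvc-protection`; the terminated/non-terminated filtering on pod add/update events is omitted for brevity, and the queue-processing side is sketched later next to the proposal's own rules:

```go
// Sketch only: one workqueue of PVC keys fed by PVC and pod informers.
package pvcprotection

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

const protectionFinalizer = "kubernetes.io/pvc-protection" // illustrative name

// newQueue wires both informers into a single rate-limited queue of PVC keys.
func newQueue(factory informers.SharedInformerFactory) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	factory.Core().V1().PersistentVolumeClaims().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueuePVC(queue, obj) },
		UpdateFunc: func(_, newObj interface{}) { enqueuePVC(queue, newObj) },
	})

	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		// A real controller would also filter on the terminated transition and
		// unwrap cache.DeletedFinalStateUnknown tombstones on delete.
		AddFunc:    func(obj interface{}) { enqueuePodPVCs(queue, obj) },
		UpdateFunc: func(_, newObj interface{}) { enqueuePodPVCs(queue, newObj) },
		DeleteFunc: func(obj interface{}) { enqueuePodPVCs(queue, obj) },
	})
	return queue
}

// enqueuePVC queues the PVC when it still needs the finalizer added, or when
// it is waiting for the finalizer to be removed.
func enqueuePVC(queue workqueue.RateLimitingInterface, obj interface{}) {
	pvc, ok := obj.(*v1.PersistentVolumeClaim)
	if !ok {
		return
	}
	needsAdd := pvc.DeletionTimestamp == nil && !hasFinalizer(pvc.Finalizers)
	needsRemove := pvc.DeletionTimestamp != nil && hasFinalizer(pvc.Finalizers)
	if needsAdd || needsRemove {
		if key, err := cache.MetaNamespaceKeyFunc(pvc); err == nil {
			queue.Add(key)
		}
	}
}

// enqueuePodPVCs queues every PVC referenced by the pod's volumes.
func enqueuePodPVCs(queue workqueue.RateLimitingInterface, obj interface{}) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return
	}
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil {
			queue.Add(pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName)
		}
	}
}

func hasFinalizer(finalizers []string) bool {
	for _, f := range finalizers {
		if f == protectionFinalizer {
			return true
		}
	}
	return false
}
```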
That removes the need to track PVCs in use, removes the need for special logic at startup, unifies retry cases in case of conflicts on PVC updates (or other errors)
See https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md for more details about the queuing approach, or https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/serviceaccount/serviceaccounts_controller.go for an example of a controller feeding a single workqueue from informers on two separate resources
The PVC is the object we want to prevent getting deleted prematurely, so that is where the finalizer should go, not on the pod.
A new plugin for PVC admission controller will be created. The plugin will automatically add finalizer information into newly created PVC's metadata.

### Scheduler
Scheduler will check if a pod uses a PVC and if any of the PVCs has `deletionTimestamp` set. In case this is true the scheduling of the pod will end up with an error: "PVC (%pvcName) is in scheduled for deletion state".
this should just involve an additional check for a non-nil deletionTimestamp on the PVC, and following the same code path as if the PVC was not found
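As a sketch of that check (not the actual scheduler predicate code), assuming a client-go PVC lister is available: a missing PVC keeps its existing error handling, while a terminating PVC fails scheduling with a specific message.

```go
// Illustrative sketch only: not the real scheduler code path.
package schedulercheck

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// checkPodPVCs fails when any PVC referenced by the pod is missing or carries
// a non-nil deletionTimestamp.
func checkPodPVCs(pod *v1.Pod, pvcLister corelisters.PersistentVolumeClaimLister) error {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		name := vol.PersistentVolumeClaim.ClaimName
		pvc, err := pvcLister.PersistentVolumeClaims(pod.Namespace).Get(name)
		if err != nil {
			return err // same handling as "PVC not found" today
		}
		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("PVC (%s) is being deleted", name)
		}
	}
	return nil
}
```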
@liggitt - Just clarifying... My comment about adding a finalizer on a pod would be in addition to a finalizer on the PVC.
The additional finalizer doesn't get you anything. We don't actually care about preventing deletion of pods using PVCs, so there's no reason to block it with an additional finalizer on the Pod object.
A pod PVC delete finalizer could be used to handle a pod with a PVC/PV being deleted, instead of watching all pods for deletion. This finalizer could reduce the number of events that need to be handled by the controller, to only those pod events that could be related to PVC deletion.
A finalizing controller for pods referencing PVCs would still have to watch all pods (there's no filter to say "just watch pods containing finalizer X")
As described above, that can be determined by inspection of the pod spec on deletion/termination, without blocking the pod delete, requiring an additional update to remove the finalizer from the pod, and increasing latency of pod cleanup. The finalizer on the PVC is what we want.
I prefer to approach the PVC and PV deletion problems per partes. However, I expect that the solution to the PV deletion problem will capitalize on the experience with the PVC deletion fix and will probably be similar to it.
In case a finalizer is present on an object, the object's deletion is postponed until the finalizer is removed from the object by a controller. I agree with @liggitt that postponing pod deletion (adding a finalizer to a pod that has a PV/PVC) doesn't bring any benefit, as the pod can be deleted immediately and the
@liggitt thank you very much for your comments and proposals. I basically updated the design according to your proposals.
PVC deletion controller watches for both PVC and pod events that are processed as described below:
1. PVC add/update/delete events:
- If `deletionTimestamp` is `nil` and finalizer is missing, the PVC is added to PVC queue.
- If `deletionTimestamp` is `non-nil` and finalizer is present and first, the PVC is added to PVC queue.
Why does the finalizer position matter?
you're right, I was thinking finalizers were handled in order, but that is not the case. @pospispa you can strike " and first"
In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, the kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted". Kubelet will follow the same code path as if the PVC(s) do not exist.

### PVC Deletion Controller
PVC deletion controller is a new internal controller.
s/deletion/finalizing/ might make the intent clearer
4. Pod delete events:
- All referenced PVCs are added to PVC queue.

PVC and pod information are kept in a cache.
this is done inherently when using an informer (the cache should be the one feeding the informer)
The PVC queue holds PVCs that need to be processed according to the below rules:
- If PVC is not found in cache, the PVC is skipped.
- If PVC is in cache with `nil` `deletionTimestamp` and missing finalizer, the finalizer is added to the PVC. In case the finalizer-adding operation fails, the PVC is re-queued into the PVC queue.
- If PVC is in cache with `non-nil` `deletionTimestamp` and present finalizer at first position, a live pod list is done for the PVC namespace. If all pods referencing the PVC are not yet bound to a node or are terminated, the finalizer removal is attempted. In case the finalizer removal operation fails, the PVC is re-queued.
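For illustration, a hedged sketch of this processing loop, reusing the illustrative finalizer name and helper from the earlier wiring sketch, assuming a recent client-go, and treating "terminated" as pod phase Succeeded or Failed; returning an error is what re-queues the PVC:

```go
// Sketch only: the per-PVC sync step behind the rules above.
package pvcprotection

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

type controller struct {
	client    kubernetes.Interface
	pvcLister corelisters.PersistentVolumeClaimLister
}

func (c *controller) syncPVC(ctx context.Context, namespace, name string) error {
	pvc, err := c.pvcLister.PersistentVolumeClaims(namespace).Get(name)
	if apierrors.IsNotFound(err) {
		return nil // PVC no longer in the cache: skip
	}
	if err != nil {
		return err
	}

	switch {
	case pvc.DeletionTimestamp == nil && !hasFinalizer(pvc.Finalizers):
		// Add the finalizer; an error return causes a re-queue.
		clone := pvc.DeepCopy()
		clone.Finalizers = append(clone.Finalizers, protectionFinalizer)
		_, err = c.client.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, clone, metav1.UpdateOptions{})
		return err

	case pvc.DeletionTimestamp != nil && hasFinalizer(pvc.Finalizers):
		// Live pod list in the namespace: remove the finalizer only when no
		// pod referencing the PVC is bound to a node and still running.
		pods, err := c.client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		for i := range pods.Items {
			pod := &pods.Items[i]
			if podUsesPVC(pod, name) && pod.Spec.NodeName != "" && !podIsTerminated(pod) {
				return nil // still in use; a later pod event re-queues the PVC
			}
		}
		clone := pvc.DeepCopy()
		clone.Finalizers = removeString(clone.Finalizers, protectionFinalizer)
		_, err = c.client.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, clone, metav1.UpdateOptions{})
		return err
	}
	return nil
}

func podUsesPVC(pod *v1.Pod, pvcName string) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == pvcName {
			return true
		}
	}
	return false
}

func podIsTerminated(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}

func removeString(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}
```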
strike " at first position", finalizers are processed in any order
lgtm with a nit
PVC finalizing controller won't watch pods because the information whether a PVC is used by a pod or not is now maintained by the scheduler.
PVC finalizing controller will remove the finalizer information from a PVC in case the PVC has `deletionTimestamp` set and does not have the PVCUsedByPod condition set to true.

In this case the known race condition discussed in the implementation section won't happen on two different objects (PVC and pod), but on the same object, i.e. the PVC. Therefore, the probability of the race condition will be smaller.
This race is no longer discussed :-)
Removed the information about the race.
Should I already squash all the commits?
yes thanks
…a Pod Proposal for postponing deletion of Persistent Volume Claim in case it's used by a pod. It will fix issue kubernetes/kubernetes#45143
force-pushed from 76d182c to b378919
Done.
/lgtm
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.
Proposal for postponing deletion of Persistent Volume Claim in case it's used by a pod.
Implementation will fix issue kubernetes/kubernetes#45143
/sig storage
/cc @kubernetes/sig-storage-proposals