Postpone Deletion of a Persistent Volume Claim in case It Is Used by a Pod #1174

Merged
@@ -0,0 +1,106 @@
# Postpone Deletion of a Persistent Volume Claim in case It Is Used by a Pod

Status: Proposal

Version: GA

Implementation Owner: @pospispa

## Motivation

A user can delete a Persistent Volume Claim (PVC) that is being used by a pod. This may negatively affect the pod and may result in data loss.

For more details see issue https://github.com/kubernetes/kubernetes/issues/45143

## Proposal

Postpone the PVC deletion until the PVC is not used by any pod.

## User Experience

### Use Cases

1. A user deletes a PVC that is being used by a pod. This may negatively affect the pod and may result in data loss. As a user, I want PVC deletion to have no negative impact on any pod. As a user, I do not want to experience data loss.

#### Scenarios for data loss
Depending on the storage type, data loss occurs in one of the following scenarios:
- When dynamic provisioning is used and the reclaim policy is `Delete`, the PVC deletion triggers deletion of the associated storage asset and PV.
- The same applies to static provisioning with the `Delete` reclaim policy.

## Implementation

### API Server, PVC Admission Controller, PVC Create
A new plugin for the PVC admission controller will be created. The plugin will automatically add finalizer information to the metadata of each newly created PVC.
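
The admission step can be illustrated with a minimal Python sketch. It is not the actual plugin: the `PVC` dataclass is a hypothetical stand-in for the PVC metadata, `admit_create` is an illustrative name, and the finalizer string is an assumption, since this proposal does not name one.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed finalizer name; the proposal itself does not specify one.
PVC_FINALIZER = "kubernetes.io/pvc-protection"


@dataclass
class PVC:
    """Hypothetical stand-in for the metadata of a PersistentVolumeClaim."""
    name: str
    namespace: str = "default"
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None


def admit_create(pvc: PVC) -> PVC:
    """Mutating admission step: add the finalizer unless it is already present."""
    if PVC_FINALIZER not in pvc.finalizers:
        pvc.finalizers.append(PVC_FINALIZER)
    return pvc
```

Making the step idempotent means that a PVC submitted with the finalizer already set is left unchanged.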

### Scheduler
The scheduler will check whether a pod uses PVCs and whether any of those PVCs has `deletionTimestamp` set. If so, an error will be logged: "PVC (%pvcName) is in scheduled for deletion state", and the scheduler will behave as if the PVC was not found.
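
A rough Python sketch of this check follows. The `PVC` and `Pod` dataclasses and the `pvcs_usable` function are hypothetical stand-ins, not scheduler code; the only behavior taken from the text is that a PVC with `deletionTimestamp` set is logged and then treated like a missing PVC.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

log = logging.getLogger("scheduler-sketch")


@dataclass
class PVC:
    """Hypothetical stand-in for a PersistentVolumeClaim."""
    name: str
    namespace: str
    deletion_timestamp: Optional[str] = None


@dataclass
class Pod:
    """Hypothetical stand-in for a pod and the PVC names it references."""
    name: str
    namespace: str
    pvc_names: List[str] = field(default_factory=list)


def pvcs_usable(pod: Pod, pvc_cache: Dict[Tuple[str, str], PVC]) -> bool:
    """Return False if any PVC used by the pod is missing or being deleted."""
    for pvc_name in pod.pvc_names:
        pvc = pvc_cache.get((pod.namespace, pvc_name))
        if pvc is None:
            return False  # PVC not found
        if pvc.deletion_timestamp is not None:
            log.error("PVC (%s) is in scheduled for deletion state", pvc_name)
            return False  # same outcome as "not found"
    return True
```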

### Kubelet
Kubelet currently does a live lookup of the PVC(s) used by a pod.

If any PVC used by the pod has `deletionTimestamp` set, kubelet won't start the pod and will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted". Kubelet will follow the same code path as if the PVC(s) did not exist.

### PVC Finalizing Controller
The PVC finalizing controller is a new internal controller.

The PVC finalizing controller watches both PVC and pod events, which are processed as follows:
1. PVC add/update/delete events:
- If `deletionTimestamp` is `nil` and the finalizer is missing, the PVC is added to the PVC queue.
- If `deletionTimestamp` is not `nil` and the finalizer is present, the PVC is added to the PVC queue.
2. Pod add events:
- If the pod is terminated, all referenced PVCs are added to the PVC queue.
3. Pod update events:
- If the pod is changing from a non-terminated to a terminated state, all referenced PVCs are added to the PVC queue.
4. Pod delete events:
- All referenced PVCs are added to the PVC queue.
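
The event rules above can be sketched in Python as follows. This is an illustration under assumptions, not controller code: the dataclasses and the `EventHandlers` class are hypothetical, the finalizer name is assumed, and "terminated" is assumed to mean a pod phase of `Succeeded` or `Failed`, which the proposal does not define.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple

PVC_FINALIZER = "kubernetes.io/pvc-protection"  # assumed name, not from the proposal
Key = Tuple[str, str]  # (namespace, PVC name)


@dataclass
class PVC:
    name: str
    namespace: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None


@dataclass
class Pod:
    name: str
    namespace: str
    phase: str = "Running"
    pvc_names: List[str] = field(default_factory=list)


def is_terminated(pod: Pod) -> bool:
    # Assumption: a pod counts as terminated once it reaches a final phase.
    return pod.phase in ("Succeeded", "Failed")


class EventHandlers:
    """Hypothetical handlers that decide which PVCs enter the PVC queue."""

    def __init__(self) -> None:
        self.queue: Deque[Key] = deque()

    def _enqueue_pod_pvcs(self, pod: Pod) -> None:
        for pvc_name in pod.pvc_names:
            self.queue.append((pod.namespace, pvc_name))

    def on_pvc_event(self, pvc: PVC) -> None:
        # Rule 1: enqueue when a finalizer must be added or may be removed.
        has_finalizer = PVC_FINALIZER in pvc.finalizers
        being_deleted = pvc.deletion_timestamp is not None
        if (not being_deleted and not has_finalizer) or (being_deleted and has_finalizer):
            self.queue.append((pvc.namespace, pvc.name))

    def on_pod_add(self, pod: Pod) -> None:
        # Rule 2: only terminated pods are of interest on add.
        if is_terminated(pod):
            self._enqueue_pod_pvcs(pod)

    def on_pod_update(self, old: Pod, new: Pod) -> None:
        # Rule 3: react only to the non-terminated -> terminated transition.
        if not is_terminated(old) and is_terminated(new):
            self._enqueue_pod_pvcs(new)

    def on_pod_delete(self, pod: Pod) -> None:
        # Rule 4: a deleted pod always triggers a re-check of its PVCs.
        self._enqueue_pod_pvcs(pod)
```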

PVC and pod information is kept in a cache, which the informers maintain inherently.

The PVC queue holds PVCs that need to be processed according to the following rules:
- If the PVC is not found in the cache, the PVC is skipped.
- If the PVC is in the cache with a `nil` `deletionTimestamp` and the finalizer is missing, the finalizer is added to the PVC. If adding the finalizer fails, the PVC is re-queued into the PVC queue.
- If the PVC is in the cache with a `deletionTimestamp` that is not `nil` and the finalizer is present, a live pod list is done for the PVC namespace. If every pod referencing the PVC is either not yet bound to a node or terminated, the finalizer removal is attempted. If the finalizer removal fails, the PVC is re-queued.
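
The processing rules above can be sketched as one function. Everything here is illustrative: the dataclasses, `process_pvc`, and the injected `list_pods` and `update_pvc` callables are hypothetical stand-ins for the cache, the live pod list, and the API update; the finalizer name and the `Succeeded`/`Failed` definition of "terminated" are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

PVC_FINALIZER = "kubernetes.io/pvc-protection"  # assumed name, not from the proposal
Key = Tuple[str, str]  # (namespace, PVC name)


@dataclass
class PVC:
    name: str
    namespace: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None


@dataclass
class Pod:
    name: str
    namespace: str
    pvc_names: List[str] = field(default_factory=list)
    node_name: Optional[str] = None  # None means not yet bound to a node
    phase: str = "Pending"


def pod_blocks_deletion(pod: Pod, pvc: PVC) -> bool:
    """A pod blocks deletion if it references the PVC, is bound, and is not terminated."""
    uses_pvc = pod.namespace == pvc.namespace and pvc.name in pod.pvc_names
    terminated = pod.phase in ("Succeeded", "Failed")  # assumed definition
    return uses_pvc and pod.node_name is not None and not terminated


def process_pvc(
    key: Key,
    pvc_cache: Dict[Key, PVC],
    list_pods: Callable[[str], List[Pod]],
    update_pvc: Callable[[PVC, List[str]], None],
) -> bool:
    """Process one queue item. Return True if the PVC should be re-queued."""
    pvc = pvc_cache.get(key)
    if pvc is None:
        return False  # not in cache: skip

    has_finalizer = PVC_FINALIZER in pvc.finalizers
    if pvc.deletion_timestamp is None and not has_finalizer:
        desired = pvc.finalizers + [PVC_FINALIZER]
    elif pvc.deletion_timestamp is not None and has_finalizer:
        # Live pod list for the PVC namespace.
        if any(pod_blocks_deletion(p, pvc) for p in list_pods(pvc.namespace)):
            return False  # still in use; a later pod event re-adds the PVC
        desired = [f for f in pvc.finalizers if f != PVC_FINALIZER]
    else:
        return False  # nothing to do

    try:
        update_pvc(pvc, desired)
    except Exception:
        return True  # add or remove failed: re-queue
    return False
```

In this sketch a PVC that is still in use is not re-queued; it relies on the pod termination and deletion events to put the PVC back on the queue.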

### CLI
If a PVC has `deletionTimestamp` set, the commands `kubectl get pvc` and `kubectl describe pvc` will show that the PVC is in the terminating state.

### Client/Server Backwards/Forwards compatibility

N/A

## Alternatives considered

1. Check in an admission controller whether a PVC can be deleted by listing all pods and checking whether any of them uses the PVC. This was discussed and rejected in PR https://github.com/kubernetes/kubernetes/pull/46573

Other alternatives were discussed in issue https://github.com/kubernetes/kubernetes/issues/45143

### Scheduler Live Lookups PVC(s) Instead of Kubelet
The implementation proposes that kubelet does a live lookup of the PVC(s) used by a pod before it starts the pod, so that it does not start a pod that uses a PVC with `deletionTimestamp` set.

An alternative is for the scheduler to do a live lookup of the PVC(s) used by a pod, so that it does not schedule a pod that uses a PVC with `deletionTimestamp` set.

However, a live lookup carries a performance penalty. Because kubelet already pays this penalty, it is better to do the live lookup in kubelet.

### Scheduler Maintains PVCUsedByPod Information in PVC
The scheduler will maintain information on both pods and PVCs from the API server.

When a pod is being scheduled and uses PVCs that do not have the PVCUsedByPod condition set, the scheduler will set this condition on these PVCs.

When a pod that was using PVCs is terminated, the scheduler will update the PVCUsedByPod condition on these PVCs accordingly.

Review comment (Member):

The scheduler does not handle pod termination today.

Reply (Author):

So it will mean extending scheduler to handle pod termination. I assume it means that it is a no-go alternative.


The PVC finalizing controller won't watch pods, because the information on whether a PVC is used by a pod is now maintained by the scheduler.

When the PVC finalizing controller gets an update for a PVC that has `deletionTimestamp` set, it will do a live lookup of this PVC to get the up-to-date value of its PVCUsedByPod condition. If PVCUsedByPod is not true, it will remove the finalizer information from this PVC.

### Scheduler In the Role of PVC Finalizing Controller
The scheduler will be responsible for removing the finalizer information from PVCs that are being deleted.

The scheduler will therefore watch pods and PVCs and maintain an internal cache of both.

When a PVC is deleted, the scheduler will do one of the following:
- If the PVC is used by a pod, it will add the PVC to its internal set of PVCs that are waiting for deletion.
- If the PVC is not used by a pod, it will remove the finalizer information from the PVC metadata.

Note: the scheduler is the source of truth for pods that are being started. Its information on active pods may be slightly outdated, so deletion of a PVC may be postponed (the pod is still active in the scheduler while it is already terminated in the API server), but this does not cause any harm.

The disadvantage is that the scheduler becomes responsible for postponing PVC deletion, which makes the scheduler bigger.

Review comment (Member):

Yeah I don't think it's in the scheduler's scope to handle PVC deletion.

Reply (Author):

I agree. So this is a no-go alternative.