
Minimizing need to delete provisioned volume #65100

Closed
wants to merge 1 commit into from

Conversation

sbezverk
Contributor

@sbezverk sbezverk commented Jun 14, 2018

Current behaviour of the PV controller, if for some reason the PV object cannot be saved by the API server, is to delete the just-provisioned volume. This is not desired behaviour because provision/de-provision operations can be long for some storage backends. This PR changes this behaviour and proposes to store the successfully provisioned volume's information as an annotation on the PVC object. On a PV object save failure, the controller will re-attempt to bind the PVC to the PV, but it will not need to provision a volume since that has already been done; it only needs to retrieve the PV definition from the PVC's annotation. On any error, the PV controller switches back to the old logic.
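
Roughly, the recovery path proposed here could be sketched as follows. This is only an illustration of the idea: the annotation keys mirror the names referenced in the diff and review comments (annVolumeAlreadyProvisioned, annStoredVolumeData), but their string values, the JSON serialization, and the helper function are assumptions, not the PR's actual code.

package controller

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
)

// Placeholder annotation keys; the real keys used by the PR are not shown here.
const (
	annVolumeAlreadyProvisioned = "volume.alpha.kubernetes.io/already-provisioned"
	annStoredVolumeData         = "volume.alpha.kubernetes.io/stored-volume-data"
)

// recoverProvisionedVolume returns the PV stashed on the claim by a previous,
// partially successful provisioning attempt. ok is false when there is nothing
// to recover, in which case the controller falls back to provisioning from scratch.
func recoverProvisionedVolume(claim *v1.PersistentVolumeClaim) (pv *v1.PersistentVolume, ok bool) {
	if claim.Annotations[annVolumeAlreadyProvisioned] != "yes" {
		return nil, false
	}
	data, found := claim.Annotations[annStoredVolumeData]
	if !found {
		return nil, false
	}
	volume := &v1.PersistentVolume{}
	// The stored PV definition is assumed to be serialized as JSON.
	if err := json.Unmarshal([]byte(data), volume); err != nil {
		return nil, false
	}
	return volume, true
}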

Signed-off-by: Serguei Bezverkhi sbezverk@cisco.com

NONE

Signed-off-by: Serguei Bezverkhi <sbezverk@cisco.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 14, 2018
@k8s-ci-robot k8s-ci-robot requested review from msau42 and saad-ali June 14, 2018 13:39
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sbezverk
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

Assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sbezverk
Contributor Author

/sig storage
/assign @saad-ali

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 14, 2018
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 14, 2018

@sbezverk: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-bazel-test
Commit: 1fa644e
Rerun command: /test pull-kubernetes-bazel-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jsafrane
Member

There are several issues in this PR:

  1. A PV contains sensitive information, like the ID of the volume, that can't be edited by regular users who can create PVCs. With this PR, a rogue user could craft a PVC whose annStoredVolumeData carries the ID of someone else's volume and steal its data.

  2. If saving the PV fails due to API server throttling, it's very likely that patching the PVC will fail too.

If we have trouble with in-tree volumes (and I haven't noticed any), I would propose adding exponential backoff to writing the PV; right now it tries 5 times with a 10-second sleep in between.
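
For reference, a minimal sketch of what such a backoff could look like using the wait package from k8s.io/apimachinery; the parameters and the savePVWithBackoff helper are illustrative assumptions, not what the PV controller actually does today.

package controller

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// savePVWithBackoff retries a PV create call with exponential backoff instead
// of a fixed number of attempts with a constant sleep. The create argument
// would wrap the API call that persists the PV object.
func savePVWithBackoff(create func() error) error {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // initial delay
		Factor:   2.0,             // double the delay after each failed attempt
		Jitter:   0.1,
		Steps:    6, // give up after six attempts
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := create(); err != nil {
			// Treat the error as retriable; ExponentialBackoff gives up
			// (returning an error) once Steps is exhausted.
			return false, nil
		}
		return true, nil
	})
}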

@sbezverk
Contributor Author

@jsafrane Would it be an acceptable solution if I found a way to encrypt annStoredVolumeData?

@@ -22,6 +22,8 @@ import (
"strings"
"time"

Contributor

There should be no empty line.

ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, events.ProvisioningFailed, strerr)
return
volRecovered := false
if claim.ObjectMeta.Annotations[annVolumeAlreadyProvisioned] == "yes" {
Contributor

if annotation, ok := claim.ObjectMeta.Annotations[annVolumeAlreadyProvisioned]; ok && annotation == "yes" {

A little suggestion. :=)

@jsafrane
Member

Would it be an acceptable solution if I found a way to encrypt annStoredVolumeData?

It looks extremely ugly to me. You bring security somewhere it should not be. It opens a whole new can of worms, e.g. you need a way to prevent replay attacks against a different PVC.

IMO, internal provisioners are quite fine as they are now. And the external ones have a wide variety of ways to fix themselves, starting from increasing the timeout or using CRDs in their own namespace.

Note that deleting unwanted PVs is quite complex. The current code prefers deleting volumes to save space and to keep the volumes in the storage backend in sync with PVs. This PR may leave orphan volumes in the storage backend, with no PVs for them, if the user deletes the PVC between provisioning retries. Both ways have their pros and cons.

@sbezverk
Contributor Author

@jsafrane Thanks for the comments. One question though: should the external provisioner stay in sync with the logic of the in-tree PV controller? If they behave differently (with the suggested changes applied only to the external provisioner), it might result in a different experience for the user. I am not sure whether it was a goal to provide a seamless user experience regardless of whether in-tree or out-of-tree controllers are used.

@liggitt
Member

liggitt commented Jun 15, 2018

@jsafrane Would it be an acceptable solution if I found a way to encrypt annStoredVolumeData?

I don't think putting data on the PVC is a good idea.

A couple of more fundamental questions:

  1. Shouldn't CreateVolume be idempotent? If the controller crashes after calling CreateVolume but before creating the PV, we shouldn't end up with orphaned volumes.
  2. Can the controller pre-create the PV in a Pending state, then do the CSI provision call, then update the PV with capacity/attribute data from the results of the CreateVolume call and update the PV state?

@sbezverk
Contributor Author

@liggitt It should not (it really depends on the CSI driver implementation), but even for the hostpath driver we check whether the request is for an already existing volume, and if so, no new volume gets created. In addition, the external provisioner will generate the same volume name in the CreateVolume request for the same PVC.
WRT point 2, I am building a PoC to see how it behaves in a large-scale environment.
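
For illustration, an idempotent CreateVolume keyed on the request name could look roughly like the sketch below. This is a toy stand-in, not the hostpath driver's code: the hostpathDriver type, its in-memory map, and the derived volume ID are assumptions, and the field names follow the v1 CSI Go bindings (github.com/container-storage-interface/spec/lib/go/csi), which are newer than this PR.

package driver

import (
	"context"
	"sync"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// hostpathDriver is a minimal stand-in holding an in-memory volume store.
type hostpathDriver struct {
	mu            sync.Mutex
	volumesByName map[string]*csi.Volume
}

// CreateVolume is idempotent: a replayed request with the same name returns the
// volume created earlier instead of provisioning a duplicate, so a retry after a
// failed PV save costs nothing on the storage backend.
func (d *hostpathDriver) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if vol, ok := d.volumesByName[req.GetName()]; ok {
		return &csi.CreateVolumeResponse{Volume: vol}, nil
	}
	vol := &csi.Volume{
		// Deriving the ID from the request name keeps retries stable; a real
		// driver would also create the backing storage here.
		VolumeId:      "vol-" + req.GetName(),
		CapacityBytes: req.GetCapacityRange().GetRequiredBytes(),
	}
	d.volumesByName[req.GetName()] = vol
	return &csi.CreateVolumeResponse{Volume: vol}, nil
}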

@jsafrane
Member

  2. Can the controller pre-create the PV in a Pending state, then do the CSI provision call, then update the PV with capacity/attribute data from the results of the CreateVolume call and update the PV state?

We did that in the first release of alpha dynamic provisioning and it proved to be unreliable and error-prone. For example, such a PV will get bound to a PVC, the scheduler will schedule pods that use this PVC assuming it has complete topology labels (which it does not have), and the A/D controller will try to attach it or kubelet will try to mount it.

Sure, we could extend the PV controller not to bind to such PVs, but IMO that breaks the API.

@jsafrane
Member

Another problem we had: such incomplete PVs go through admission plugins. We have a plugin that fills in topology labels, and it got confused by PVs with no real volumes behind them. We can fix that easily; however, it again shows that this is an API change and users would need to change their admission handlers too.

@liggitt
Member

liggitt commented Jun 18, 2018

does requiring CreateVolume to be idempotent resolve the need to delete the created volume if writing the PV encounters an error and the provision needs to be requeued?

@msau42
Member

msau42 commented Jun 18, 2018

@liggitt makes a good point. If CreateVolume is idempotent then we shouldn't need to delete the volume if PV creation fails. I'm not sure if in-tree Provision() is idempotent, but at least CSI should be.

@jsafrane
Member

The in-tree provisioner calls Delete() because there is no PV for the volume yet:

  1. User creates a PVC
  2. Driver creates a volume
  3. Provisioner fails to save the PV
  4. [Provisioner deletes the volume]
  5. User deletes the PVC.

After step 5, no provisioning is called because there is nothing to provision; Kubernetes does not need the PV at this time. If Kubernetes did not delete the volume at step 4, the volume would never be deleted.

@liggitt
Member

liggitt commented Jun 19, 2018

The in-tree provisioner calls Delete() because there is no PV for the volume yet:

  1. User creates a PVC
  2. Driver creates a volume
  3. Provisioner fails to save the PV
  4. [Provisioner deletes the volume]
  5. User deletes the PVC.

After step 5, no provisioning is called because there is nothing to provision; Kubernetes does not need the PV at this time. If Kubernetes did not delete the volume at step 4, the volume would never be deleted.

I expected something like this (a rough sketch follows the list):

  • When step 3 fails, the provisioner should requeue the PVC to be processed.
  • The workqueue item would have sufficient information to locate the PVC, the PV, and the created volume.
  • When syncing, if both the PV and the PVC no longer exist, the provisioner would delete the created volume (tolerating a "not found" error).
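
A minimal sketch of that sync step, assuming a workqueue payload that carries the claim key, the PV name, and the backend volume handle. The retryItem and retrySyncer types and the injected lookup functions are illustrative only; in a real controller they would be lister/clientset calls and the volume plugin's Delete().

package controller

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryItem is a hypothetical workqueue payload with enough information to
// locate the PVC, the PV, and the backend volume created for them.
type retryItem struct {
	claimNamespace, claimName string
	pvName                    string
	volumeHandle              string // ID returned by CreateVolume
}

// retrySyncer abstracts the API lookups and the backend delete so the sketch
// does not depend on a particular client version.
type retrySyncer struct {
	getPVC       func(namespace, name string) (*v1.PersistentVolumeClaim, error)
	getPV        func(name string) (*v1.PersistentVolume, error)
	deleteVolume func(handle string) error
	savePV       func(pv *v1.PersistentVolume) error
	pendingPV    func(item retryItem) *v1.PersistentVolume
}

// sync deletes the orphaned backend volume only when both the PVC and the PV
// are gone; otherwise it retries saving the PV for the still-existing PVC.
func (s *retrySyncer) sync(item retryItem) error {
	_, pvcErr := s.getPVC(item.claimNamespace, item.claimName)
	_, pvErr := s.getPV(item.pvName)
	if apierrors.IsNotFound(pvcErr) && apierrors.IsNotFound(pvErr) {
		// Nobody wants this volume anymore: delete it, tolerating "not found".
		if err := s.deleteVolume(item.volumeHandle); err != nil && !apierrors.IsNotFound(err) {
			return fmt.Errorf("deleting orphaned volume %s: %v", item.volumeHandle, err)
		}
		return nil
	}
	// Otherwise keep trying to persist the PV that was already provisioned.
	return s.savePV(s.pendingPV(item))
}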

@msau42
Member

msau42 commented Jun 19, 2018

To handle the case where both PVC and PV are missing, the provisioner would need to keep an in-memory cache of created volumes, which would be lost on restarts.

@liggitt
Member

liggitt commented Jun 19, 2018

To handle the case where both PVC and PV are missing, the provisioner would need to keep an in-memory cache of created volumes, which would be lost on restarts.

Then we're back to persisting local state prior to calling CreateVolume. The PV object seems the most coherent object to do that on.

@jsafrane
Member

Then we're back to persisting local state prior to calling CreateVolume. The PV object seems the most coherent object to do that on.

And we're back to API breakage. Until now, PVs have only been fully provisioned volumes, ready for binding and scheduling. With PVs for not-fully-provisioned volumes, we need to change at least the PV controller (not to bind the PVC until provisioning is complete), the scheduler (to wait for the PV to be fully provisioned and to get topology labels in case someone force-binds the PVC), and kubelet (to do the same when a pod is scheduled directly, e.g. by a DaemonSet). There is an unknown number of external components that may need this change too, and IMO this counts as API breakage.

@sbezverk
Contributor Author

@jsafrane IMHO, forcing the storage backend to create/delete/create/delete volumes as a result of API-server-related issues is not right; each subsystem should deal with its own issues internally, minimizing exposure to other subsystems. As for changing the API, I am not sure why it is considered breaking; it would make the API subsystem more robust, and I think it should be explored.

@wongma7
Contributor

wongma7 commented Jun 21, 2018

Yeah, this is not worth breaking the PV API over; it is complicated enough, and adding another phase before Available is IMO out of the question.

I believe the motivation for this PR is ultimately kubernetes-csi/external-provisioner#68. This supposed issue of PVs failing to save and wasting storage backend create API calls is just a symptom of the true problem, kubernetes-csi/external-provisioner#68, which causes the API server throttling in the first place. Otherwise we have zero evidence that a PV failing to save 5 times in a row is a common enough occurrence that we need to change this code.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2018
@k8s-ci-robot
Contributor

@sbezverk: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vladimirvivien
Member

@wongma7 should this be closed?

@wongma7
Contributor

wongma7 commented Jul 24, 2018

@vladimirvivien yes

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 22, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 21, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
