
generic ephemeral inline volumes KEP #1701

Merged
merged 1 commit into from
May 19, 2020

Conversation

@pohly (Contributor) commented Apr 20, 2020:

This describes an alternative, more general approach for specifying volumes inside a pod spec ("inline"), with a lifecycle of the volumes that is tied to the lifecycle of the pod ("ephemeral").

Enhancement issue: #1698

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 20, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Apr 20, 2020
@pohly pohly changed the title generic inline volumes KEP WIP: generic inline volumes KEP Apr 20, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2020
@msau42 (Member) commented Apr 23, 2020:

/assign @saad-ali

@pohly (Contributor, Author) commented Apr 23, 2020:

During a design review meeting (recording: https://www.youtube.com/watch?v=ANAlOZ5GhCs) we agreed to move ahead with investigating this because it looks promising.

We identified the need to clarify where the PVC creation will be handled. Options:

  • a webhook (but might not be viable because ownership relationship depends on pod uid, which might not be available yet)
  • volume scheduling code in kube-scheduler: doable, but not the right place
  • a new controller alongside the existing PV controller in kube-controller-manager: most promising solution

I'll investigate that.
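
The controller option can be sketched roughly as follows. This is a hypothetical illustration (the local types and the `newPVCForPod` helper are mine, not from the KEP): the controller derives a deterministic PVC name and makes the pod the controlling owner of the PVC, so the garbage collector deletes the PVC together with the pod. This is precisely the step a webhook cannot do reliably, because the pod UID may not exist yet at admission time:

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types; a real controller would
// use k8s.io/api/core/v1 and k8s.io/apimachinery instead.
type OwnerReference struct {
	Kind               string
	Name               string
	UID                string
	Controller         bool
	BlockOwnerDeletion bool
}

type PVC struct {
	Name            string
	Namespace       string
	OwnerReferences []OwnerReference
}

// newPVCForPod derives the PVC name from the pod and volume names and sets
// the pod as the controlling owner, tying the PVC's lifecycle to the pod's.
func newPVCForPod(podName, podUID, namespace, volumeName string) PVC {
	return PVC{
		Name:      podName + "-" + volumeName,
		Namespace: namespace,
		OwnerReferences: []OwnerReference{{
			Kind:               "Pod",
			Name:               podName,
			UID:                podUID,
			Controller:         true,
			BlockOwnerDeletion: true,
		}},
	}
}

func main() {
	pvc := newPVCForPod("my-app-0", "1234-abcd", "default", "scratch")
	fmt.Println(pvc.Name, pvc.OwnerReferences[0].UID)
}
```

Because the controller runs after the pod exists, the UID is always available, which is what makes this placement more robust than a webhook.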

@saad-ali (Member):

Meeting Notes:

  • Admission plugin? Webhook? Or controller?
    • @jsafrane - pod with a new type of volume - an admission plugin that sees this volume, creates the PVC and replaces the volume.
      • @pohly - what happens if that inline spec gets updated? Mutable fields like size? Reclaim policy? I guess we could make it all immutable by only looking at it at creation time and making the spec immutable.
    • @jsafrane - API folks say no more new admission plugins; they suggest a webhook instead.
    • @pohly - That's why I'd prefer a volume controller or the scheduler.
    • @msau42 - architecturally the scheduler doesn't make sense.
    • @jsafrane - A separate controller should create PVCs instead of the scheduler.
    • @msau42 - agreed, it should live in the PV controller or a new volume controller.
    • TODO: @pohly will investigate where to implement this.
  • SIG Arch is likely to ask whether the use case is strong enough to justify changing the pod spec.
    • @pohly - So many features would need to be added to inline volumes; this would make it easy.
  • @jsafrane - I don't like that it uses PersistentVolumeClaim for something that is not persistent.
    • @saad-ali - agreed.
    • @msau42 - we could do a v2.
    • @pohly - not necessary; this is just a case where the PVC is known to go away.

@pohly (Contributor, Author) commented Apr 28, 2020:

a webhook (but might not be viable because ownership relationship depends on pod uid, which might not be available yet)

I think we can rule out this option exactly for this reason: while "admission controllers may sometimes have side effects" (https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/), they also need a "corresponding reclamation or reconciliation process" because pod creation comes later. Not being able to rely on the ownership relation seems complicated.

a new controller alongside the existing PV controller in kube-controller-manager

Let's focus on that.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 11, 2020
@pohly pohly changed the title WIP: generic inline volumes KEP generic inline volumes KEP May 11, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2020
@pohly (Contributor, Author) commented May 11, 2020:

@jsafrane: I've finished updating the document with the kube-controller-manager approach. I think it is ready for review now.

If I remember correctly, we volunteered you as reviewer because you said that you like this approach, right? 😛 Can you have a look?

@jsafrane (Member) left a comment:

I am still not sure about naming...

A new volume source will be introduced:

```
typedef InlineVolumeSource struct {
```
Member:

typedef -> type :-)


```
typedef InlineVolumeSource struct {
    VolumeClaimTemplate PersistentVolumeClaim
```
Member:

IMO, we will need a separate type, PersistentVolumeClaimTemplate, but that can be solved during API review.

Contributor Author:

The current approach follows how volumes are created for a StatefulSet. But yes, let's discuss this during the API review.

```
typedef InlineVolumeSource struct {
    VolumeClaimTemplate PersistentVolumeClaim
    ReadOnly bool
```
@jsafrane (Member), May 12, 2020:

By default, newly provisioned volumes are empty and ReadOnly does not make sense. Is it for cloned volumes / snapshot restore? Please describe somewhere below how ReadOnly can be useful.

Contributor Author:

The driver itself or a mechanism like the image populator might create the volume with data, which then might be supposed to be left unmodified. Cloning is indeed another example.

But I'm making this up on-the-fly 😁 The original motivation was to have feature-parity with PersistentVolumeClaimVolumeSource, without having considered all implications.

It's definitely useful to write this down, so let me add a use-case.

Comment on lines 244 to 245
Alternatively, a label on a namespace could be used to disable the
feature just for that namespace.
Member:

Follow PodSecurityPolicy (PSP), where admin can allow/deny volume types used in a pod. Just add a new enum value for InlineVolumeSource to PSP.Spec.Volumes.

https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems

Contributor Author:

Good idea, I hadn't thought of that.

The `GenericInlineVolumes` feature gate controls whether:
- the new controller is active in the `kube-controller-manager`,
- new pods can be created with a `InlineVolumeSource`,
- kubelet and scheduler accept such volumes.
Member:

Add attach/detach controller, PV/PVC protection controller, resize controller, VolumePluginMgr and all volume plugins. Basically everything that touches Pod.Spec.Volumes must be updated.

Contributor Author:

Code that merely copies a Pod.Spec.Volumes IMHO shouldn't need changes: if it gets an InlineVolumeSource, it can pass that through and let those components who really have to do something special with it check the feature gate.

Let's check during the implementation phase what changes are needed where. For now I rephrased that line item a bit to be more open-ended.

Comment on lines 334 to 335
downgrade), pods using generic inline volumes will no longer be
scheduled or started.
Member:

If I remember scheduler correctly, you can't prevent a pod with an unknown VolumeSource from scheduling.

The pod gets stuck in kubelet, where VolumeManager can't find a volume plugin for an unknown VolumeSource and reports errors.

Contributor Author:

I've removed the "be scheduled".

Comment on lines 294 to 300
Ideally, provisioning should use late binding (aka
`WaitForFirstConsumer`). The initial implementation assumes that the
user is taking care of that by selecting that provisioning mode in the
storage class for the PVC.

Later, `kube-scheduler` and `external-provisioner` can be changed to
automatically enable late binding for PVCs which are owned by a pod.
Member:

How does late binding relate to this feature? While it's a good thing to use, especially when using topology, this KEP will work even when immediate binding is used for storage that does not have any topology.

Contributor Author:

I couldn't find a reason why late binding should not be used. The volume is clearly going to be created for use by this one pod, so using late binding seems like the right thing to do (don't create volumes unless the pod can start, better scheduling) in all cases.

By enabling late binding for inline volumes, we reduce the number of storage classes that have to be defined in the cluster. It might not be relevant, though, if all classes already have late binding enabled.


## Design Details

### PVC meta data
Member:

Please mention somewhere that the newly created PVC is subject to namespace quota like any other PVC and does not let users cheat the quota.

Contributor Author:

Probably belongs under risks and mitigations. I've added something there.


## Design Details

### PVC meta data
Member:

Related to the above, please note that the new controller reports any errors on PVC creation as events on the Pod. This way, the user can see, for example, quota errors.

Contributor Author:

Added under implementation details, together with name collision.
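
The name-collision handling mentioned here can be sketched roughly as follows. This is a hypothetical illustration with simplified local types and invented helper names (`pvcName`, `checkOwnership`), not the actual controller code:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-in for PVC metadata; real code would use the Kubernetes
// API types. OwnerUID records which pod the PVC was created for.
type PVC struct {
	Name     string
	OwnerUID string
}

// pvcName derives a deterministic PVC name from the pod and volume names,
// which is what makes collisions with pre-existing PVCs possible.
func pvcName(podName, volumeName string) string {
	return podName + "-" + volumeName
}

// checkOwnership sketches the collision check: if a PVC with the derived
// name already exists but is owned by a different pod (for example, an
// unrelated PVC the user created), the controller must not touch it and
// instead reports the conflict - in the real controller, as an event on
// the pod.
func checkOwnership(existing *PVC, podUID string) error {
	if existing == nil {
		return nil // no PVC yet, the controller may create one
	}
	if existing.OwnerUID != podUID {
		return errors.New("PVC " + existing.Name + " exists but is not owned by this pod")
	}
	return nil
}

func main() {
	name := pvcName("web-0", "scratch")
	fmt.Println(name)
	fmt.Println(checkOwnership(&PVC{Name: name, OwnerUID: "other-uid"}, "pod-uid"))
}
```

The deterministic naming is what allows the controller to be stateless; the ownership check is what turns an accidental name collision into a reported error instead of silent reuse.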

A new volume source will be introduced:

```
typedef InlineVolumeSource struct {
```
Member:

Naming is hard. How is this InlineVolumeSource different from other inline volume sources (say ISCSIVolumeSource)? The name should imply that it is ephemeral (won't persist data beyond Pod lifetime)... EphemeralVolume?

Contributor Author:

Including "ephemeral" in the name forever rules out the possibility to make the volumes persistent. I don't think it'll be useful, but I'd prefer to not paint us into a corner here.

The more important aspect is that this inline volume uses the claim mechanism and normal provisioning. How about ClaimVolumeSource? It is more accurate, but doesn't roll off the tongue as easily.

Contributor Author:

VolumeClaimVolumeSource is another option, albeit with Volume in the name twice.

@pohly (Contributor, Author) left a comment:

I've pushed an update.


Comment on lines 382 to 383
Yes, by disabling the feature gates. Existing pods will continue to run and
volumes will be deleted once they stop. Existing pods with generic inline
Member:

Existing pods will continue to run and volumes will be deleted once they stop.

If someone just restarts kubelet with the feature gate disabled, it won't be able to unmount inline volumes from running Pods. I haven't found documentation on how to actually flip a feature gate in a running cluster. Can you please add a note that the node must be drained before flipping the feature gate on kubelet, just to be on the safe side?

Avoid getting hung up on specific details and instead aim to get the goals of
the KEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
Contributor:

Can you check the ones that are complete?

Contributor Author:

I wasn't sure which those are. The only safe one seems to be "Enhancement issue in release milestone, which links to KEP dir". I've marked that.

"Design details are appropriately documented" I think is okay, but perhaps someone else should confirm that?

"Test plan is in place" (yes), "giving consideration to SIG Architecture and SIG Testing input (no?)" - do they need to review?

"Graduation criteria is in place" - done.

- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Contributor:

Can you check the ones already done?

```
type EphemeralVolumeSource struct {
    // Required. Defined as a pointer because in the future there
    // might be alternative ways to specify how the ephemeral
    // volume gets provisioned.
```
Contributor:

Specify what it means if VolumeClaimTemplate is nil.

Contributor Author:

Currently it must not be nil. I've reworded a bit.

```
type VolumeSource struct {
    ...
    CSI *CSIVolumeSource
    PersistentVolumeClaim *PersistentVolumeClaimVolumeSource
```
Contributor:

Add comments above to describe what this is and to differentiate it from "Ephemeral" below.

Contributor Author:

I was assuming that readers are familiar with the existing API. Do I really need to explain it?

I've reworded the introduction a bit and added links.

Recent releases of memcached added [support for using Persistent
Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
of normal DRAM. When deploying memcached through one of the app
controllers, `InlineVolumeSource` makes it possible to request a volume
Contributor:

s/InlineVolumeSource/EphemeralVolumeSource

Contributor Author:

Fixed, also below.


For those, it might make sense to mount the volume read-only inside
the pod to prevent accidental modification of the data.

Contributor:

I see that "mounting the volume read-only inside the pod to prevent accidental modification of the data" is a valid use case, but the examples given above, such as restoring a volume from a snapshot, usually happen because the original volume is corrupted and needs to be restored. After the restore, however, the volume will continue to be used by the application, so it cannot be read-only.

I think a valid case would be to have a pod created with read-only volumes created from snapshots, and then copy that data to someplace else, such as a backup device.

Contributor Author:

Added " For example, the goal might be to just retrieve the data and/or copy it elsewhere."


The `GenericInlineVolumes` feature gate controls whether:
- the new controller is active in the `kube-controller-manager`,
- new pods can be created with a `InlineVolumeSource`,
Contributor:

s/InlineVolumeSource/EphemeralVolumeSource

The `GenericInlineVolumes` feature gate controls whether:
- the new controller is active in the `kube-controller-manager`,
- new pods can be created with a `InlineVolumeSource`,
- anything that specifically acts upon an `InlineVolumeSource` (scheduler,
Contributor:

s/InlineVolumeSource/EphemeralVolumeSource


### Feature gate

The `GenericInlineVolumes` feature gate controls whether:
Contributor:

Should this be called GenericEphemeralVolume now?

Contributor Author:

Okay. Do I now also need to rename the KEP to match?

Contributor:

Yes

Contributor Author:

I don't agree (see my other comment), but if that's what it takes to get the KEP merged... done.

kubelet, etc.) instead of merely copying it (statefulset controller)
accepts the new volume source.

Existing pods with such a volumes will not be scheduled respectively
Contributor:

s/such a volumes/such volumes or such a volume


### Modifying volumes

Once the PVC for an inline volume has been created, it can be updated
Contributor:

s/inline/ephemeral


When downgrading to a cluster which does not support generic inline
volumes (either by disabling the feature flag or an actual version
downgrade), pods using generic inline volumes will no longer be
Contributor:

s/inline/ephemeral

Contributor:

This one too.

@pohly (Contributor, Author), May 18, 2020:

Updated. I consider the full name of the feature "generic ephemeral inline volumes". Both are relevant aspects of the volumes, so only mentioning one of them is arbitrary. "Generic" has to be added to distinguish from the existing "CSI ephemeral inline volumes".

In the API, "inline" is redundant because it is implied by the usage, so there it gets dropped.


* **How can this feature be enabled / disabled in a live cluster?**
- [X] Feature gate
- Feature gate name: GenericInlineVolumes
Contributor:

GenericEphemeralVolumes

- kubelet

* **Does enabling the feature change any default behavior?**
If users are allowed to create pods but not PVCs, then generic inline volumes
Contributor:

s/inline/ephemeral

Contributor:

This one is missed.


[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

Contributor:

Remove sections that you have not filled out.

Contributor Author:

Done.

```
beta: ""
stable: ""
feature-gate:
  name: GenericInlineVolumes
```
Contributor:

GenericEphemeralVolumes

@pohly (Contributor, Author) commented May 15, 2020:

I've updated the KEP and fixed kep.yaml. Please have another look.


* **Can the feature be disabled once it has been enabled (i.e. can we rollback
the enablement)?**
Yes, by disabling the feature gates. Existing pods with generic inline
Contributor:

This one is missed: s/inline/ephemeral

- kubelet

* **Does enabling the feature change any default behavior?**
If users are allowed to create pods but not PVCs, then generic inline volumes
Contributor:

This one is missed.

A new controller in `kube-controller-manager` is responsible for
creating new PVCs for each such inline volume. It does that:
Contributor:

Here's another one: s/inline/ephemeral

## Example

Here is a full example for a higher-level object that uses ephemeral
inline volumes:
Contributor:

"inline" can be removed?


### Upgrade / Downgrade Strategy

When downgrading to a cluster which does not support generic inline
Contributor:

This one is missed.


When downgrading to a cluster which does not support generic inline
volumes (either by disabling the feature flag or an actual version
downgrade), pods using generic inline volumes will no longer be
Contributor:

This one too.

Yes, by disabling the feature gates. Existing pods with generic inline
volumes that haven't started yet will not be able to start up anymore, because
kubelet does not know what to do with the volume. It also will not know how
to unmount generic inline volumes which would cause pods to get stuck, so nodes
Contributor:

s/inline/ephemeral

@pohly pohly changed the title generic inline volumes KEP generic ephemeral inline volumes KEP May 18, 2020
@pohly (Contributor, Author) commented May 18, 2020:

@xing-yang I've renamed the KEP, please check again.

@@ -0,0 +1,30 @@
title: generic inline volumes
Contributor:

ephemeral

Contributor Author:

Thanks, fixed.

A new controller in `kube-controller-manager` is responsible for
creating new PVCs for each such inline volume. It does that:
Contributor:

ephemeral

Contributor Author:

Fixed.

Let's hope this is the last one 😄 🤞

@xing-yang (Contributor):

Can you squash your commits?

@pohly (Contributor, Author) commented May 18, 2020:

Can you squash your commits?

Done, force-pushed.

```
    hostPath:
      path: /var/log
  - name: scratch
    ephemeral:
```


In general, this all looks good.

The only thing I'm not sure of is the naming here.

I would have loved to have the volume type ephemeral for the local inline ephemeral use case we have now. But I was told it had to hang off of the csi driver.

Having some ephemeral be under csi and some under volumeSource would be hard to explain to users I think.

At some point, maybe pod api v2 whenever that happens, the csi driver should probably fold back out so it looks like its more at the root of the volumeSource struct. As all drivers but configmap/secret/emptydir/hostpath will be moving to csi, so having csi in the api kind of stops making sense at some point.

All in all, I'm not against this. I think in the long run it's more usable. But, I think in the intermediate time it's kind of confusing to end users. Perhaps we should design the api here such that it could be used for all kinds of ephemeral storage. And we treat the csi version as an alias and deprecate it? In my mind, the local ephemeral drivers are just an optimization (shortcuts) of the usage of general ephemeral inline. So to a user, ideally they should look the same. They shouldn't have to care that they are different.

Contributor Author


But, I think in the intermediate time its kind of confusing to end users. Perhaps we should design the api here such that it could be used for all kinds of ephemeral storage. And we treat the csi version as an alias and deprecate it?

This is part of the long term plan and the reason why VolumeClaimTemplate is a pointer: we can add a new, alternative pointer to a struct with the values for a "CSI ephemeral inline volume", allow that as alternative and then with a suitable deprecation period, remove the CSI entry in VolumeSource in favor of this unified Ephemeral entry.
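A minimal sketch of why keeping `VolumeClaimTemplate` as a pointer matters for this plan: a sibling pointer field (e.g. a future CSI-style source, shown only as a commented placeholder) could be added later, with exactly one of the two set. The types below are illustrative stand-ins, not the real `k8s.io/api` definitions; only the `VolumeClaimTemplate` and `ReadOnly` field names come from the KEP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative stand-in; the real template embeds full PVC metadata and spec.
type PersistentVolumeClaimTemplate struct {
	Labels map[string]string `json:"labels,omitempty"`
}

// EphemeralVolumeSource keeps VolumeClaimTemplate as a *pointer* so that a
// second, alternative pointer could be added later without breaking the
// serialized form: unset pointers simply disappear from the JSON/YAML.
type EphemeralVolumeSource struct {
	VolumeClaimTemplate *PersistentVolumeClaimTemplate `json:"volumeClaimTemplate,omitempty"`
	// Hypothetical future alternative, folded in from the deprecated CSI
	// inline source (name invented for illustration):
	// CSI *CSIEphemeralSource `json:"csi,omitempty"`
	ReadOnly bool `json:"readOnly,omitempty"`
}

// render marshals the source the way the API server would serialize it.
func render(src EphemeralVolumeSource) string {
	b, _ := json.Marshal(src)
	return string(b)
}

func main() {
	src := EphemeralVolumeSource{
		VolumeClaimTemplate: &PersistentVolumeClaimTemplate{
			Labels: map[string]string{"type": "fluentd-elasticsearch-volume"},
		},
	}
	fmt.Println(render(src))
}
```

Because each variant hangs off its own optional field, the YAML representation stays unambiguous: whichever key is present selects the variant.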


I can see how that might work with the raw API, but what about the YAML representation? How will it know what type of object to generate?

Member


Yeah, we might need a wrapper struct here, PersistentVolumeClaimTemplate, to disambiguate and be consistent with the existing PodTemplate.

Contributor Author


The YAML example was broken, the API itself was fine. The correct example is:

volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: scratch
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: fluentd-elasticsearch-volume
            spec:

I.e. the volumeClaimTemplate field name was missing.

Fixed.

Existing pods with such a volume will not be started when the feature
gate is off.

### Modifying volumes
Member


What about updating the volume template in a higher level controller like deployment or statefulset? Would that cause a new rollout of the deployment/statefulset?


I would expect that to cause an update, as it's just a change from the deployment/statefulset/daemonset/etc's perspective. I think that's also a good behavior.

Member


Lifecycle should be tied to the pod. If higher level controller deletes pod and creates a new one, we should delete PVC and create a new one.

Contributor Author


At some point, the controller which created the pods has to delete the existing ones and create them anew, because the Volumes field in the pod spec is immutable (I checked). The new pods then start also with new PVCs/PVs.

I have not checked whether a deployment can update volumes in the replicaset.
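The lifecycle coupling described above rests on the owner reference being keyed on the pod UID, so a PVC created for one pod instance is never reused by a recreated pod of the same name. A hedged sketch of that check, with minimal stand-in types (real code would use the `k8s.io/apimachinery` metadata types; the function name is illustrative):

```go
package main

import "fmt"

// Minimal stand-ins for the relevant object metadata.
type OwnerReference struct {
	Kind       string
	Name       string
	UID        string
	Controller bool
}

type PVC struct {
	Name            string
	OwnerReferences []OwnerReference
}

// isOwnedByPod reports whether the PVC carries a controller owner
// reference matching this exact pod instance (name *and* UID), the
// precondition the KEP imposes before scheduler or kubelet may use
// the claim for that pod.
func isOwnedByPod(pvc PVC, podName, podUID string) bool {
	for _, ref := range pvc.OwnerReferences {
		if ref.Kind == "Pod" && ref.Name == podName && ref.UID == podUID && ref.Controller {
			return true
		}
	}
	return false
}

func main() {
	pvc := PVC{
		Name: "my-pod-scratch",
		OwnerReferences: []OwnerReference{
			{Kind: "Pod", Name: "my-pod", UID: "1234", Controller: true},
		},
	}
	// Same pod instance: the claim may be used.
	fmt.Println(isOwnedByPod(pvc, "my-pod", "1234"))
	// Recreated pod with the same name but a new UID: must be ignored,
	// and garbage collection will delete the old claim.
	fmt.Println(isOwnedByPod(pvc, "my-pod", "9999"))
}
```

Matching on UID rather than name is what makes deletion-and-recreation by a higher-level controller safe: the new pod never sees the stale claim as its own.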

owning-sig: sig-storage
participating-sigs:
- sig-storage
status: provisional
Member


Is this ready to be updated to implementable?


Is this to go in as alpha next, or as beta?

Contributor Author


I hope so. Let me make the change, otherwise it will miss 1.19.

Member

@saad-ali saad-ali left a comment


Left some questions about details. But overall this looks good to me! Nice work @pohly


Contributor Author

@pohly pohly left a comment


Given that we are close to merging, I amended the commit instead of creating a new one. The diff is:

diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/README.md b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
index dd606d1..4772f55 100644
--- a/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
@@ -140,3 +140,3 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 This KEP proposes an alternative mechanism for specifying volumes
-inside a pod spec. Those inline volumes then get converted to normal
+inside a pod spec. Those ephemeral volumes then get converted to normal
 volume claim objects for provisioning, so storage drivers do not need
@@ -144,6 +144,10 @@ to be modified.
 
-The lifecycle of those volumes is ephemeral: they will get deleted
-once the pod terminates.
+The lifecycle of those volumes is ephemeral: they will get created
+when the pod is created (for immediate [volume
+binding](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode))
+or when the pod starts (for `WaitForFirstConsumer` volume binding),
+and deleted once the pod terminates.
 
-Because this is expected to be slower than [CSI ephemeral inline
+Because volume creation and deletion is expected to be slower than
+with [CSI ephemeral inline
 volumes](https://github.com/kubernetes/enhancements/issues/596), both
@@ -181,2 +185,6 @@ by a more traditional storage system because:
 - Storage capacity tracking can be enabled also for such volumes.
+- Eventual API alignment with CSI ephemeral inline volumes, as well as
+  current ephemeral in-tree volumes (EmptyDir, etc.), to ensure
+  consistency across ephemeral volume APIs and minimize user
+  confusion.
 
@@ -200,3 +208,3 @@ type EphemeralVolumeSource struct {
     // Required, must not be nil.
-    VolumeClaimTemplate *PersistentVolumeClaim
+    VolumeClaimTemplate *PersistentVolumeClaimTemplate
     ReadOnly             bool
@@ -207,5 +215,10 @@ This mimics a
 [`PersistentVolumeClaimVolumeSource`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PersistentVolumeClaimVolumeSource),
-except that it contains a `PersistentVolumeClaim` instead of
-referencing one by name.  A full `PersistentVolumeClaim` is used for
-consistency with the StatefulSet API.
+except that it contains a `PersistentVolumeClaimTemplate` instead of
+referencing a `PersistentVolumeClaim` by name. The content of this
+`PersistentVolumeClaimTemplate` is identical to
+`PersistentVolumeClaim` and just uses a different name to clarify the
+intend, which is to serve as template for creating a volume claim
+object. Embedding a full object is similar to the StatefulSet API and
+allows storing labels and annotations. The name follows the example
+set by [`PodTemplate`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PodTemplate).
 
@@ -368,4 +381,4 @@ new controller will accidentally modify unrelated PVCs.
 
-The volume scheduling library and kubelet must check that a PVC is
-owned by the pod before using it. Otherwise they must ignored it. This
+The volume scheduling library and kubelet must verify that the PVC has
+an owner reference to the pod before using it. Otherwise they must ignored it. This
 ensures that a volume is not used accidentally for a pod in case of a
@@ -430,2 +443,5 @@ automatically enable late binding for PVCs which are owned by a pod.
   merged with `EphemeralVolumeSource`
+- Decide whether in-tree ephemeral volume sources, like EmptyDir (GA
+  already), should also be added EphemeralVolumeSource for sake of API
+  consistency
 - Tests are in Testgrid and linked in KEP
diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
index 2435ed3..bb1b683 100644
--- a/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
@@ -7,3 +7,3 @@ participating-sigs:
   - sig-storage
-status: provisional
+status: implementable
 creation-date: 2020-04-17


This KEP proposes an alternative mechanism for specifying volumes
inside a pod spec. Those inline volumes then get converted to normal
volume claim objects for provisioning, so storage drivers do not need
to be modified.

The lifecycle of those volumes is ephemeral: they will get deleted
once the pod terminates.

Because this is expected to be slower than CSI ephemeral inline
volumes, both approaches will need to be supported.
@pohly
Contributor Author

pohly commented May 19, 2020

The diff is:
...

+The lifecycle of those volumes is ephemeral: they will get created
+when the pod is created (for immediate volume binding)
+or when the pod starts (for `WaitForFirstConsumer` volume binding),
+and deleted once the pod terminates.
...

That last change hadn't made it into the squashed commit and was still sitting in my tree - sorry! I've squashed it and force-pushed, now the commit is a080b55.

@saad-ali
Member

Thanks @pohly

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7faa578 into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020