
podresources: add Watch endpoint #1926

Closed
wants to merge 5 commits

Conversation

ffromani
Contributor

@ffromani ffromani commented Aug 6, 2020

Extend the protocol with a simple implementation of ListAndWatch
to enable monitoring agents to be notified of resource allocation
changes.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 6, 2020
@k8s-ci-robot
Contributor

Welcome @fromanirh!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 6, 2020
@k8s-ci-robot
Contributor

Hi @fromanirh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 6, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 6, 2020
@ffromani
Contributor Author

ffromani commented Aug 6, 2020

@dashpole might be interested in reviewing this PR.

@swatisehgal
Contributor

This feature would be useful for the topology exporter agent that we intend to introduce as part of the Topology Aware Scheduling work.

Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
The GRPC Service can return:
- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node.
- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates.
Contributor

Should the stream be of individual pods, rather than the entire list when it changes? We would need to do something to signal deletion... But I would worry that a high rate of pod churn could make this very expensive.

Contributor

It's reasonable to introduce a threshold on the kubelet side, configurable via KubeletConfiguration, so that notifications are not sent too often; one notification would then contain a batch of pod resources (as already described in this KEP). I think it's worth mentioning in this KEP, but implementation-wise I think it's step 2.
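
As an illustration only, here is a minimal Go sketch of the kind of kubelet-side coalescing suggested above. The notifier type and the minInterval knob are hypothetical names invented for this sketch, not existing KubeletConfiguration fields:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// podUpdate is a stand-in for a single pod's resource-allocation change.
type podUpdate struct {
	Namespace, Name string
}

// coalescingNotifier batches updates and emits them at most once per
// minInterval, so one notification carries a bunch of pod resources.
type coalescingNotifier struct {
	mu          sync.Mutex
	pending     []podUpdate
	minInterval time.Duration // hypothetical knob, e.g. surfaced via KubeletConfiguration
	out         chan []podUpdate
}

func newCoalescingNotifier(minInterval time.Duration) *coalescingNotifier {
	n := &coalescingNotifier{minInterval: minInterval, out: make(chan []podUpdate, 16)}
	go n.loop()
	return n
}

// Notify records a change; it never blocks the allocation path.
func (n *coalescingNotifier) Notify(u podUpdate) {
	n.mu.Lock()
	n.pending = append(n.pending, u)
	n.mu.Unlock()
}

// loop flushes the accumulated updates at most once per interval.
func (n *coalescingNotifier) loop() {
	ticker := time.NewTicker(n.minInterval)
	defer ticker.Stop()
	for range ticker.C {
		n.mu.Lock()
		batch := n.pending
		n.pending = nil
		n.mu.Unlock()
		if len(batch) > 0 {
			n.out <- batch
		}
	}
}

func main() {
	n := newCoalescingNotifier(500 * time.Millisecond)
	n.Notify(podUpdate{"default", "pod-a"})
	n.Notify(podUpdate{"default", "pod-b"})
	fmt.Println("batched notification:", <-n.out) // both updates arrive in one notification
}
```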

Contributor Author

@ffromani ffromani Aug 7, 2020

@dashpole the intention is definitely to stream only individual pod changes. The API will return:

  1. initially, the state of all the pods (exactly the same output as the current List() API)
  2. then, only the pods whose resource allocation changed for any reason: newly created pods, deleted pods, or updated allocations.
    Regarding deletion, I was thinking of just sending a message with all the resources (currently devices only) cleared.

So the monitoring app observes:

  1. a message which contains pod P1, with container C1 (devices d1, d2) and container C2 (device d3)
    (some time passes, the pod gets deleted)
  2. a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)
    So the monitoring app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".

Does that make sense?

Also, I believe this should be documented. Process-wise, is it OK to add this to the KEP, should it be added as a comment in the .proto file, or both?
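
To make the convention above concrete, here is a rough Go sketch of how a monitoring agent could maintain its view when a pod message with all device lists empty means "clear this pod's allocation". The types are simplified stand-ins invented for the sketch, not the actual podresources generated stubs:

```go
package main

import "fmt"

// Simplified stand-ins for the streamed messages (not the real proto-generated types).
type ContainerResources struct {
	Name    string
	Devices []string
}

type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// allocations is the agent's view: pod key -> container name -> devices.
type allocations map[string]map[string][]string

func key(p PodResources) string { return p.Namespace + "/" + p.Name }

// apply updates the view from one streamed message. A message in which every
// container has an empty device list is interpreted as "this pod released all
// its resources", i.e. the deletion convention discussed above.
func (a allocations) apply(p PodResources) {
	empty := true
	for _, c := range p.Containers {
		if len(c.Devices) > 0 {
			empty = false
			break
		}
	}
	if empty {
		delete(a, key(p))
		return
	}
	view := map[string][]string{}
	for _, c := range p.Containers {
		view[c.Name] = c.Devices
	}
	a[key(p)] = view
}

func main() {
	a := allocations{}
	a.apply(PodResources{Name: "P1", Namespace: "ns", Containers: []ContainerResources{
		{Name: "C1", Devices: []string{"d1", "d2"}}, {Name: "C2", Devices: []string{"d3"}},
	}})
	// Later the pod is deleted: the kubelet streams the same pod with devices cleared.
	a.apply(PodResources{Name: "P1", Namespace: "ns", Containers: []ContainerResources{
		{Name: "C1"}, {Name: "C2"},
	}})
	fmt.Println(a) // map[] — the agent has cleared P1's allocation
}
```

With the explicit action field proposed later in this thread, this inference becomes unnecessary.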

Contributor Author

Also: yes, we can certainly add configurable thresholds; I'll add that to the KEP.

Contributor

Thanks, that addresses my primary concern. I'm not entirely sold on "pod without resources" indicating deletion being the best way to represent it, but as long as we consider some alternatives and still prefer it, it is reasonable.

Contributor Author

@ffromani ffromani Aug 7, 2020

No problem regarding better ways to represent deletions: I'm very open to alternatives.
To elaborate on my rationale: I considered this approach because it required no extra changes to the API; the diff is minimal and the semantics seemed clear enough.

Contributor

@dashpole the intention is definitely to stream only individual pod changes. The API will return:

  1. initially, the state of all the pods (exactly the same output as the current List() API)

I would like to avoid it

  2. then, only the pods whose resource allocation changed for any reason: newly created pods, deleted pods, or updated allocations.
    Regarding deletion, I was thinking of just sending a message with all the resources (currently devices only) cleared.

Maybe just add an action field to ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED


Contributor Author

@dashpole the intention is definitely to stream only individual pod changes. The API will return:

  1. initially, the state of all the pods (exactly the same output as the current List() API)

I would like to avoid it

Which issues do you see with this approach?

  2. then, only the pods whose resource allocation changed for any reason: newly created pods, deleted pods, or updated allocations.
    Regarding deletion, I was thinking of just sending a message with all the resources (currently devices only) cleared.

Maybe just add an action field to ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED

If we send the complete allocation with each message, then apart from the DELETED case we are still discussing, I don't really see the benefit of a separate action field: could you please elaborate on this?
I see some benefit if each message provided the resource allocation delta (the changes from the previous message), but I'd like to avoid sending deltas; it seems more robust (and not much less efficient) to send the total allocation each time.

Member

Re serving initial state as part of "watch": kubernetes/kubernetes#13969
This is causing us a lot of burden and is misleading to users.

I would really prefer these to be separate calls for list vs watch (stream). There is a question of how you ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in the watch and wasn't in the list yet).

Maybe just add an action field to ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED

That would be consistent with the k8s watch, so I would really support this one.
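
For comparison, a small Go sketch of what consuming an explicit action field looks like on the client side, mirroring the ADDED/UPDATED/DELETED semantics of the core Kubernetes watch. The types below are hypothetical stand-ins for the proposed proto messages, not generated podresources stubs:

```go
package main

import "fmt"

// WatchPodAction mirrors the proposed enum (hypothetical Go stand-in).
type WatchPodAction int

const (
	Added WatchPodAction = iota
	Updated
	Deleted
)

type WatchPodResourcesResponse struct {
	Action WatchPodAction
	Pod    string // namespace/name; real messages would carry full PodResources
}

// handle applies one watch event to the agent's set of known allocations.
// With an explicit action there is no need to infer deletion from an
// empty device list.
func handle(known map[string]bool, ev WatchPodResourcesResponse) {
	switch ev.Action {
	case Added, Updated:
		known[ev.Pod] = true
	case Deleted:
		delete(known, ev.Pod)
	}
}

func main() {
	known := map[string]bool{}
	handle(known, WatchPodResourcesResponse{Action: Added, Pod: "ns/P1"})
	handle(known, WatchPodResourcesResponse{Action: Deleted, Pod: "ns/P1"})
	fmt.Println(known) // map[]
}
```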

Contributor Author


Ok, all of those are good points, and I was not aware of the issue you mentioned. I will update the KEP.


Address reviewers' comments:
1. Add explicit Watch endpoint so APIs are composable (not bundled in
   ListAndWatch)
2. Add explicit action field in the Watch() endpoint response

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani
Contributor Author

Thanks for all the comments. I think I addressed all of them except the consistency concern between List and Watch, which I'm still thinking about. I will update again once I'm happy with a proposal.
I didn't add the threshold mentioned here #1926 (comment) because I think it pertains more to the implementation side; but if reviewers disagree I'll mention it in the next update.

Last but not least: I'll squash commits when we reach agreement.

@ffromani ffromani changed the title podresources: add ListAndWatch function podresources: add Watch endpoint Aug 11, 2020

// WatchPodResourcesResponse is the response returned by Watch function
message WatchPodResourcesResponse {
    WatchPodAction action = 1;
Contributor Author

I wonder if just exposing the pod resourceVersion here is a good way forward

* Missed reference to "ListAndWatch", now replaced by "Watch"
* renamed UPDATED->MODIFIED
  To be more compliant with kube naming standards
  (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes)

Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 20, 2020
@ffromani
Contributor Author

ffromani commented Sep 8, 2020

PR with the implementation: kubernetes/kubernetes#94612

Member

@derekwaynecarr derekwaynecarr left a comment

I have a few questions, both related to this KEP and to how it relates to closing out the KubeletPodResources feature that is presently in beta.

Please clarify the following:

  • The PodResources message does not include the UID; this means we are not able to differentiate across space and time, as two different pods with the same name and namespace appear identical. I think we should emit the UID, especially when adding watch.
  • The gRPC interface for PodResources is v1alpha1, but the feature is beta. Should we just move this to v1 now and commit to backward compatibility on this interface? I do not see the need for a new feature gate related to this new operation, but I would like to clean up our gRPC surfaces to stable, forward-compatible boundaries.
  • How is the resource version derived? Is it the value on the pod itself?
  • Does the watch guarantee ordering? For example, if pod A is deleted, device D is reclaimed, and pod C is started and assigned device D, am I guaranteed to get those events emitted in order? If so, how and where are you integrating emitting the watch events within the kubelet subsystems?
  • How do you intend to test the watch semantics in e2e?

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
    int64 resource_version = 4;
Member

Is this the pod resource version as stored in etcd, or something else?

Contributor Author

Yes, that's the intention: to enable client code to reconcile the data from Watch() with the data it gets from List().

// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
message WatchPodResourcesRequest {}

enum WatchPodAction {
Member

When is each action emitted? Can you clarify when MODIFIED would be used in the life of a pod?

Contributor Author

@ffromani ffromani Sep 30, 2020

Actions should be emitted:

  • ADDED: when resources are assigned to the pod (I'm thinking about HintProvider's Allocate())
  • DELETED: when resources are claimed back (I'm thinking about UpdateAllocatedDevices())

I'll document this better in the KEP text.

In hindsight we most likely don't need MODIFIED, so I will just remove it.

To keep the implementation as simple as possible, the kubelet does *not* store any historical list of changes.

In order to make sure not to miss any updates, client applications can:
1. call the `Watch` endpoint to get a stream of changes.
Member

What triggers watch events to be emitted in kubelet code flows?

Contributor Author

After some experiments, I think ADDED should be triggered after a successful allocation from the topology manager (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/topology_manager.go#L232), while DELETED should be triggered once the devices are claimed back (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L1034).
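
Purely as an illustration of where such hooks could sit (this is not the kubelet's actual plumbing, and all names below are made up), here is a Go sketch of a tiny broadcaster that the allocation path could call after a successful Allocate() and the reclaim path could call when devices are returned:

```go
package main

import (
	"fmt"
	"sync"
)

type action string

const (
	added   action = "ADDED"
	deleted action = "DELETED"
)

type event struct {
	Action action
	PodKey string // namespace/name
}

// broadcaster fans events out to every open Watch stream.
// In a real implementation each subscriber would back one gRPC Watch call.
type broadcaster struct {
	mu   sync.Mutex
	subs []chan event
}

func (b *broadcaster) subscribe() <-chan event {
	ch := make(chan event, 32)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

func (b *broadcaster) emit(ev event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		select {
		case ch <- ev:
		default: // drop rather than block the allocation/reclaim path
		}
	}
}

func main() {
	b := &broadcaster{}
	w := b.subscribe()
	// Hook after a successful allocation in the topology manager:
	b.emit(event{Action: added, PodKey: "ns/P1"})
	// Hook when the devices are claimed back:
	b.emit(event{Action: deleted, PodKey: "ns/P1"})
	fmt.Println(<-w, <-w)
}
```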

@derekwaynecarr
Member

/assign @dashpole @derekwaynecarr

@derekwaynecarr
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2020
@derekwaynecarr
Member

I see this feature as an evolution of the existing beta KubeletPodResources feature, where users just require the new watch in addition to list.

To track this properly in the project, we should do the following:

  • create an issue in this repo (unless one already exists for this feature gate that @dashpole has handy)
  • update the design document to match the updated KEP template that has been rolled out, scoped to the prior issue
  • we can then get the issue and KEP into a mergeable state before the enhancement freeze for 1.20

Do you agree with the above, @dashpole?

@ffromani
Contributor Author

I have a few questions, both related to this KEP and to how it relates to closing out the KubeletPodResources feature that is presently in beta.

Please clarify the following:

* The PodResources message does not include the UID; this means we are not able to differentiate across space and time, as two different pods with the same name and namespace appear identical. I think we should emit the UID, especially when adding watch.

Good point, it needs to be added.

* The gRPC interface for PodResources is v1alpha1, but the feature is beta. Should we just move this to v1 now and commit to backward compatibility on this interface? I do not see the need for a new feature gate related to this new operation, but I would like to clean up our gRPC surfaces to stable, forward-compatible boundaries.

I'm not sure what the question is here: I certainly want to meet the beta graduation criteria (good e2e tests, for example). From my perspective, having code which is solid and backward compatible enough not to require a feature gate is a fair requirement implementation-wise. But I'm not sure I got your point :)

* How is the resource version derived? Is it the value on the pod itself?

Yes. This just seemed the simplest and safest approach, but I'm open to alternatives here.

* Does the watch guarantee ordering? For example, if pod A is deleted, device D is reclaimed, and pod C is started and assigned device D, am I guaranteed to get those events emitted in order? If so, how and where are you integrating emitting the watch events within the kubelet subsystems?

Initially I thought ordering was not needed because this interface is meant to enable resource accounting, but thinking about it, some ordering guarantees are indeed needed to avoid misreporting in corner cases (pod deletion/addition at full resource capacity comes to mind, but there are likely more). So we need to guarantee ordering, and I'll document this in the KEP.

I think we covered emitting the events in the inline comments.

* How do you intend to test the watch semantics in e2e?

I'll elaborate the test plans (does this need to be in the KEP? I'm asking because the original one doesn't mention it), but I think we'll start with:

  • the fundamentals: basic reporting (events emitted when expected)
  • consistent ordering of event reporting in corner cases (no resources, full capacity, ...)
  • the same in concurrent scenarios
  • making sure clients can safely reconcile List() and Watch() outputs

e2e tests are already an integral part of the prototype implementation in this WIP PR: kubernetes/kubernetes@8c0e617
To cover the last point (List() and Watch() reconciliation) I also want to add e2e tests covering the existing podresources API behaviour (alongside unit tests, obviously).

@ffromani
Contributor Author

The next step for me is to reflect the review comments in the KEP PR; then I'll consolidate this PR into the existing #1884, and finally I'll close this one. This is meant to simplify the process and make review easier.

Add a new field to the API response objects to allow client applications
to consume both the `List` and `Watch` endpoints.

The issue here is enabling client applications not to lose any updates
when both APIs are used.

The straightforward option is to follow the generic k8s approach (see the
link below) and let the kubelet keep a historical window of the most
recent changes, so client applications have the chance to issue `List`
and, shortly after, `Watch`, starting from the resourceVersion returned
by `List`. The underlying assumption is indeed that `Watch` happens
"shortly" after `List`; otherwise the system cannot guarantee the
absence of gaps.

Implementing this support, however, requires keeping the aforementioned
sliding window of changes, which requires careful implementation to
address scalability and safety guarantees.

However, the `podresources` API is a special-purpose API, so, while it is
good to follow the generic API concepts as much as possible, it also
allows some small differences which can help keep the implementation
simple and safe.

This patch proposes the simplest possible approach to reconcile the
`List` and `Watch` responses, providing the `resource_version` field and
suggesting a small change in the client applications' programming model.

Inspired by the concepts found on
https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
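
One possible reading of the "small change in the client applications' programming model" mentioned above is: open the Watch stream first, then call List, and skip any watch events that are not newer than the listed state. Below is a hedged Go sketch under that assumption; all types are invented for illustration, since the proposed Watch endpoint and the resource_version field are not part of the released podresources API:

```go
// Package reconcile sketches one possible client flow; none of these types
// exist in the released podresources API.
package reconcile

// Pod is a simplified stand-in for a PodResources message plus its resource version.
type Pod struct {
	Key             string // namespace/name
	ResourceVersion int64  // int64 per the proto excerpt reviewed above
}

// Lister and Watcher abstract the two proposed endpoints.
type Lister interface{ List() []Pod }
type Watcher interface{ Events() <-chan Pod }

// Sync builds and maintains a gap-free view: open the Watch stream first,
// then List, then apply only the watch events that are newer than the listed
// state. This keeps the kubelet free from storing a history of changes.
func Sync(l Lister, w Watcher, apply func(Pod)) {
	events := w.Events() // 1. subscribe before listing, so nothing is missed
	seen := map[string]int64{}

	for _, p := range l.List() { // 2. List returns the complete current allocation
		seen[p.Key] = p.ResourceVersion
		apply(p)
	}

	for ev := range events { // 3. skip events already reflected in the List result
		if rv, ok := seen[ev.Key]; ok && ev.ResourceVersion <= rv {
			continue
		}
		seen[ev.Key] = ev.ResourceVersion
		apply(ev)
	}
}
```

The alternative is the generic approach described in the commit message, where the kubelet keeps a sliding window of changes and clients pass the resourceVersion returned by List when starting Watch.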
@ffromani ffromani force-pushed the kep-list-and-watch branch from aea8a6e to 1346c2a Compare October 1, 2020 11:20
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fromanirh
To complete the pull request process, please assign derekwaynecarr after the PR has been reviewed.
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

@fromanirh: The following test failed, say /retest to rerun all failed tests:

Test name: pull-enhancements-verify
Commit: 1346c2a
Details: link
Rerun command: /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ffromani
Contributor Author

ffromani commented Oct 1, 2020

The next step for me is to reflect the review comments in the KEP PR; then I'll consolidate this PR into the existing #1884, and finally I'll close this one. This is meant to simplify the process and make review easier.

Comments addressed and updated content added on top of #1884; closing this one to reduce confusion.

@ffromani ffromani closed this Oct 1, 2020
@k8s-ci-robot
Contributor

@fromanirh: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2020
ffromani added a commit to ffromani/enhancements that referenced this pull request Oct 23, 2020
Per comments during the review of the Watch endpoint proposal:
1. kubernetes#1926 (comment)
2. kubernetes#1926 (review)

The agreed semantics of a Watch() response message:
a. it refers to a single pod. The intention was always to stream only individual pod changes
   (kubernetes#1926 (comment))
b. it must allow the client to reconcile with the response of the List()
   endpoint, and thus must include a pod resource version.

This patch thus adds the missing resource version field and removes the
`repeated` attribute from the `PodResources` field.
Removing `repeated` is the simplest possible change that aligns the
proposal with the intention.

Alternatively, it would be possible to change the proto so that a
`WatchPodResourcesResponse` object could convey information about
multiple pods; however, the performance and UX benefits of this more
invasive change are unclear, so we avoid it for the moment.

This change was missing because it was lost in a rebase.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet

7 participants