Add GetAllocatableResource to PodResource API
To simplify the KEP and make it easier to understand, and
to comply with the new process, we extract the unit of work still
ongoing in this KEP from kubernetes#1884

Work in this area was done during the 1.20 and 1.21 cycles  in
kubernetes/kubernetes#95734

Rationale, discussion and documentation for all the changes including
the one proposed in this KEP have been described in
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments
and are reported here where relevant

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani committed Feb 4, 2021
1 parent 996261d commit 2d90789
Showing 3 changed files with 312 additions and 0 deletions.
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-node/2403.yaml
@@ -0,0 +1,3 @@
kep-number: 2403
stable:
  approver: "@johnbelamaric"
264 changes: 264 additions & 0 deletions keps/sig-node/2403-pod-resources-allocatable-resources/README.md
@@ -0,0 +1,264 @@
title: Extend kubelet pod resource assignment endpoint to return allocatable resources

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Node Feature Discovery](#node-feature-discovery)
    - [Topology aware scheduling](#topology-aware-scheduling)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Proposed API](#proposed-api)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
    - [Beta to G.A Graduation](#beta-to-ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature enablement and rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
  - [Add a new endpoint](#add-a-new-endpoint)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/2403)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- ~~ [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] ~~
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This document presents an addition to the kubelet pod resources endpoint (pod resources API) that lets third-party consumers learn about the
allocatable compute resources of a node (devices and CPUs). Combined with the existing pod resources API endpoint, this allows them to properly evaluate the node capacity.

## Motivation

### Goals

* Enable node monitoring agents to know the allocatable compute resources on a node, so that they can properly calculate the node compute resource utilization.

## Proposal

### User Stories

#### Node Feature Discovery

Enable the Node Feature Discovery to [expose hardware topology information](https://github.com/kubernetes-sigs/node-feature-discovery/issues/333).

#### Topology aware scheduling

This interface can be used to track the allocated resources, together with information about the NUMA topology of the worker node, in a general way.
It can also be used to determine the resources available on the worker node. The kubelet is the best source of this information because it manages the concrete resource assignment. The information can then be used for NUMA-aware scheduling.
By combining the information reported by the `List` API, which pertains to the current allocation, with the information reported by the `GetAllocatableResources` API, a monitoring agent can reliably report
compute device utilization and availability.
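
As a minimal sketch of this combination, the Go snippet below subtracts the CPU IDs assigned to containers (as `List` would report them) from the allocatable CPU IDs (as `GetAllocatableResources` would report them). The struct types are hand-written stand-ins for the generated gRPC types, `FreeCPUs` is a hypothetical helper, and the sketch assumes that `cpu_ids` lists only exclusively assigned CPUs.

```go
package main

import (
	"fmt"
	"sort"
)

// ContainerResources is a hand-written stand-in for the proto message of the
// same name; only the fields used here are included.
type ContainerResources struct {
	Name   string
	CPUIDs []int64
}

// PodResources is a hand-written stand-in for the proto message of the same name.
type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// FreeCPUs (hypothetical helper) returns the allocatable CPU IDs that are not
// assigned to any container, sorted in ascending order.
func FreeCPUs(allocatable []int64, pods []PodResources) []int64 {
	used := make(map[int64]bool)
	for _, pod := range pods {
		for _, cnt := range pod.Containers {
			for _, id := range cnt.CPUIDs {
				used[id] = true
			}
		}
	}
	free := []int64{}
	for _, id := range allocatable {
		if !used[id] {
			free = append(free, id)
		}
	}
	sort.Slice(free, func(i, j int) bool { return free[i] < free[j] })
	return free
}

func main() {
	// Made-up sample data: four allocatable CPUs, two of them assigned.
	allocatable := []int64{0, 1, 2, 3}
	pods := []PodResources{{
		Name:       "pod-a",
		Namespace:  "default",
		Containers: []ContainerResources{{Name: "c0", CPUIDs: []int64{1, 2}}},
	}}
	fmt.Println(FreeCPUs(allocatable, pods)) // prints [0 3]
}
```

The same subtraction could be applied to device IDs, keyed by resource name.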


### Risks and Mitigations

This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
- What are the risks associated with the API service itself?
- What are the risks associated with the data itself?

| Risk | Impact | Mitigation |
| --------------------------------------------------------- | ------------- | ---------- |
| Too many requests risk impacting the kubelet performance | High | Implement rate limiting and/or passive caching; follow best practices for gRPC resource management. |
| Improper access to the data | Low | The server listens on a root-owned unix socket. Access can be limited with proper pod security policies. |


## Design Details

### Proposed API

We propose to extend the existing pod resources gRPC service of the Kubelet, listening on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock`.

The gRPC service will expose an additional endpoint:
- `GetAllocatableResources`, which returns a single `AllocatableResourcesResponse`, enabling monitoring applications to query the allocatable set of resources available on the node.

The extended interface is shown in proto below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
    rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
}

// AllocatableResourcesRequest is the request made to the GetAllocatableResources endpoint
message AllocatableResourcesRequest {}

// AllocatableResourcesResponse contains information about all the devices known by the kubelet
message AllocatableResourcesResponse {
    repeated ContainerDevices devices = 1;
    repeated int64 cpu_ids = 2;
}

// ListPodResourcesRequest is the request made to the PodResources service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by the List function
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
}

// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
    repeated int64 cpu_ids = 3;
}

// TopologyInfo describes the hardware topology of the resource
message TopologyInfo {
    repeated NUMANode nodes = 1;
}

// NUMANode is the representation of a NUMA node
message NUMANode {
    int64 ID = 1;
}

// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
    string resource_name = 1;
    repeated string device_ids = 2;
    TopologyInfo topology = 3;
}
```
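
To illustrate how a consumer might use the `devices` field of `AllocatableResourcesResponse`, the sketch below groups reported devices by NUMA node. The Go structs are hand-written stand-ins for the generated types, `DevicesPerNUMANode` is a hypothetical helper, and the resource names are made up.

```go
package main

import "fmt"

// The types below mirror the proto messages of the same name; they are
// hand-written stand-ins, not the generated gRPC types.
type NUMANode struct{ ID int64 }

type TopologyInfo struct{ Nodes []NUMANode }

type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
	Topology     *TopologyInfo
}

// DevicesPerNUMANode (hypothetical helper) counts, for each NUMA node ID and
// resource name, how many device IDs are reported. An entry that lists several
// NUMA nodes is counted on each of them; entries without topology are skipped.
func DevicesPerNUMANode(devs []ContainerDevices) map[int64]map[string]int {
	out := make(map[int64]map[string]int)
	for _, d := range devs {
		if d.Topology == nil {
			continue
		}
		for _, n := range d.Topology.Nodes {
			if out[n.ID] == nil {
				out[n.ID] = make(map[string]int)
			}
			out[n.ID][d.ResourceName] += len(d.DeviceIDs)
		}
	}
	return out
}

func main() {
	// Made-up sample data standing in for AllocatableResourcesResponse.devices.
	devs := []ContainerDevices{
		{
			ResourceName: "example.com/gpu",
			DeviceIDs:    []string{"gpu-0", "gpu-1"},
			Topology:     &TopologyInfo{Nodes: []NUMANode{{ID: 0}}},
		},
		{
			ResourceName: "example.com/nic",
			DeviceIDs:    []string{"nic-0"},
			Topology:     &TopologyInfo{Nodes: []NUMANode{{ID: 1}}},
		},
	}
	fmt.Println(DevicesPerNUMANode(devs))
}
```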

### Test Plan

The implementation PR adds a suite of E2E tests which cover both the existing `List` endpoint already implemented in the podresources API and
the new proposed `GetAllocatableResources` API.

### Graduation Criteria

#### Alpha
- [X] Implement the new service API.
- [X] Ensure proper e2e node tests are in place.

#### Alpha to Beta Graduation
- [X] The new API is consumed by other public software components (e.g. NFD).
- [X] No major bugs reported in the previous cycle.

#### Beta to G.A Graduation
- [X] Allowing time for feedback (1 year).
- [X] Risks have been addressed.

### Upgrade / Downgrade Strategy

With gRPC the version is part of the service name.
Old and new versions should always be served by the kubelet.

For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.

For a vendor, changes in the API should always be backwards compatible.
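
The reason several versions can be served side by side is that gRPC routes each call by a path of the form `/<proto package>.<service>/<method>`, so a versioned package name yields distinct paths. The sketch below only builds such paths; the package names `v1alpha1` and `v1` are assumptions for illustration, since the proto excerpt in this KEP does not show a package statement.

```go
package main

import "fmt"

// FullMethod builds the path gRPC uses to route a call:
// "/<proto package>.<service>/<method>".
func FullMethod(pkg, service, method string) string {
	return fmt.Sprintf("/%s.%s/%s", pkg, service, method)
}

func main() {
	// "v1alpha1" and "v1" are assumed package names, used here only to show
	// that different versions map to different paths.
	for _, pkg := range []string{"v1alpha1", "v1"} {
		fmt.Println(FullMethod(pkg, "PodResources", "List"))
	}
}
```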

### Version Skew Strategy

Kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.

## Production Readiness Review Questionnaire
### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [X] Feature gate (also fill in values in `kep.yaml`).
    - Feature gate name: `KubeletPodResourcesGetAllocatable`.
    - Components depending on the feature gate: N/A.

* **Does enabling the feature change any default behavior?** No
* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** Yes, through feature gates.
* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from kubelet.
* **Are there any tests for feature enablement/disablement?** No, however no data is created or deleted.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?** Kubelet would fail to start. Errors would be caught in the CI.
* **What specific metrics should inform a rollback?** Not Applicable, metrics wouldn't be available.
* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not Applicable.
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.

### Monitoring requirements
* **How can an operator determine if the feature is in use by workloads?**
  - Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
  - Look at hostPath mounts of privileged containers.
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
  - [X] Metrics
    - Metric name: `pod_resources_endpoint_requests_total`
    - Components exposing the metric: kubelet

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A or refer to Kubelet SLIs.
* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.
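
As a rough sketch of how an operator-side tool could check the metric named above, the snippet below sums all samples of a metric in a Prometheus text exposition payload. `SumMetric` is a hypothetical helper, the sample payload and its label are made up, and the exact metric name and labels exposed by a given kubelet version should be checked against its `/metrics` output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// SumMetric (hypothetical helper) adds up all samples of the named metric in
// a Prometheus text exposition payload. Samples may carry labels
// ("name{...} value") or not ("name value"). The boolean result reports
// whether at least one sample was found.
func SumMetric(payload, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines, HELP and TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// Reject longer metric names that merely share the prefix.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ' && rest[0] != '\t') {
			continue
		}
		if rest[0] == '{' {
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				continue
			}
			rest = rest[end+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		// The first field is the value; an optional timestamp may follow.
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}

func main() {
	// Made-up payload; the label name is illustrative only.
	payload := "# TYPE pod_resources_endpoint_requests_total counter\n" +
		"pod_resources_endpoint_requests_total{version=\"v1\"} 12\n"
	total, ok := SumMetric(payload, "pod_resources_endpoint_requests_total")
	fmt.Println(total, ok)
}
```

A value that grows between two scrapes suggests that some client is calling the endpoint.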


### Dependencies

* **Does this feature depend on any specific services running in the cluster?** Not applicable.

### Scalability

* **Will enabling / using this feature result in any new API calls?** No.
* **Will enabling / using this feature result in introducing new API types?** No.
* **Will enabling / using this feature result in any new calls to cloud provider?** No.
* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. The feature is outside any existing paths in the kubelet.
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** Flooding the API with requests (a denial-of-service attack) can lead to resource exhaustion. This is planned to be addressed as part of G.A.
The feature only collects data when a request comes in; the data is then garbage collected. The amount of data collected is proportional to the number of pods on the node.

### Troubleshooting

* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
* **What are other known failure modes?** No known failure modes.
* **What steps should be taken if SLOs are not being met to determine the problem?** N/A

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- 2021-02-02: KEP extracted from [previous iteration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments)
- 2021-02-04: KEP polished, added feature gate, clarified the graduation criteria.

## Alternatives

### Add a new endpoint
* Pros:
  * No changes to existing APIs
* Cons:
  * Requires the client to consume two APIs
  * This work nicely fits in the boundaries and purpose of the podresources API
  * The changes proposed in this KEP are very low-risk and backward compatible
45 changes: 45 additions & 0 deletions keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml
@@ -0,0 +1,45 @@
title: Extend kubelet pod resource assignment endpoint to return allocatable resources
kep-number: 2403
authors:
- "@fromanirh"
- "@alexeyperevalov"
owning-sig: sig-node
participating-sigs: []
status: implementable
creation-date: "2021-02-02"
reviewers:
- "@derekwaynecarr"
- "@renaudwastaken"
approvers:
- "@sig-node-leads"
prr-approvers: []
see-also:
- "keps/sig-node/606-compute-device-assignment/"
- "keps/sig-node/2043-pod-resource-concrete-assigments/"
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.21"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.21"
  beta: "v1.22"
  stable: "v1.23"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "KubeletPodResourcesGetAllocatable"
    components:
      - kubelet
disable-supported: false

# The following PRR answers are required at beta release
metrics:
- pod_resources_endpoint_requests_total
