Adding a proposal for hardware accelerators #844
Conversation
## Introduction

Hardware Accelerators are becoming a widely used commodity across various industries.
Accelerators have bring down computing latency and/or costs significantly.
nit: Accelerators can bring* ?
Oops. Fixing it.
## Non Goals
* Support for Cloud Gaming, Simulations, Remote Desktops and other workloads
  * Support for these workloads will be tackled once support for ML and DL matures
maybe link to what differs about these workloads for those unfamiliar
The difference doesn't matter much from a K8s perspective. Hence removing it.
Any further differentiation amongst hardware accelerators using the resource name will not be considered “portable” across Kubernetes clusters.
It is expected that accelerator hardware vendors will define and manage Resource Types.

Nodes are expected to be homogenous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
i thought we wanted to support heterogeneous devices via resource class in the future? am i misinterpreting the homogeneous expectation?
ok, i see clarification below. i would prefer we phrase this differently. we wish to support heterogeneous devices in the future, but initially we will tackle the homogeneous case first.
Ack
### Timelines

* Current target is `v1.9`
this assumes device api is alpha in 1.8, correct?
Yes.
a few questions, but generally, looks good to me and is consistent with what we have discussed in community.
/lgtm
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@derekwaynecarr can I get an LGTM?
Exposing all hardware accelerators as well known (first class) Compute Resource Types will bloat the API and compromise portability.
For this reason, Hardware Accelerators are expected to be handled as “Extended Compute Resources”.

Kubernetes nucleus will recommend and document a general purpose resource name for each family of accelerators - examples include `nvidia-gpu`, `amd-gpu`, `google-tpu`, etc., with a standard prefix `extensions.kubernetes.io`. This naming scheme partially mimics PCI ID - `<Vendor Name>-<Device Type>`.
If you want to go down this road, someone has to manage a registry of vendor names (e.g. "nvidia" and never "nVidia" or "NVidia" or "n-vidia"), and do some sort of verification of ownership and non-infringement, and so on. The normal "use your own domain as the prefix" is a bit more self-service.
One of the goals is to provide guidelines similar to the PCI ID schema in Linux.
Custom domains may not be consistent across different drivers (plugins) for the same hardware. WDYT?
My concern is the implication that we will manage a database of abstract names in perpetuity.
@thockin As discussed offline, can you take a look at the language and suggest appropriate changes?
the alternative is to have a name like the following:
`<vendor-name>.extensions.kubernetes.io/<device-type>`
which seems to satisfy @thockin's concern.
@vishh any thoughts on how to determine whether an integer resource allows overcommits or not? Right now, we have no-overcommit limitation for ResourceNvidiaGPU but not for the general OIR. Given that we want to deprecate ResourceNvidiaGPU in the future, how can we know a particular vendor resource allows overcommit or not?
@jiayingz there are at least a few options available there:
1. Encode whether the resource can be overcommitted somehow in the resource name.
2. Maintain a list of either overcommit or no-overcommit resource names and use them in admission control.
3. Don't allow overcommit for now, except for first-class resources. A future API for resources could allow enabling overcommit on a case-by-case basis.
   i. Do this in a way that soft-breaks OIR (currently they can be overcommitted)
   ii. Special-case OIR to preserve their overcommit behavior.

(1) seems likely to require a future breaking change.
(2) potentially couples admission to extended resource names.
(3.i) might be OK but we don't have a good way to determine the extent of the breakage across the ecosystem so I would vote not to do this without a deprecation cycle.
(3.ii) could be a decent stopgap, made more tolerable if we deprecate OIR at the same time. The breakage due to feature removal would be that `pod.alpha.kubernetes.io/opaque-int-<foo>` resources can no longer be overcommitted.
Extended or Opaque resources do not need overcommit.
I will update this PR to reflect the fact that a curated domain name is no longer necessary for plugins.
Is it legitimate for two different authors to produce drivers for the same resource? Does it matter here if an NV-provided driver and an OSS driver both claim to provide ".../nvidia-gpu" ? Or do we want the resource name to reflect the driver provider as well as the device?
From a user perspective, the resource name is an API that gives them access to a set of resources (character device files, user space libraries, kernel APIs, etc.) Identifying and documenting that API will be key to the project. That would then make it possible to have alternate implementations of device plugins for the same hardware.
There may be legal issues that vendors may potentially raise, but I do not have a good handle on that.
Thoughts?
Instead of building a special solution for Nvidia GPUs in Kubernetes, a standard extension pipeline called "Hardware Device Plugin" [has been proposed](https://docs.google.com/a/google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit?usp=drive_web) to support arbitrary hardware (and virtual) devices without requiring device specific changes to Kubernetes nucleus.
SW for hardware accelerators are expected to be shipped via standard containers. These containers are expected to be deployed on every node with accelerators. These containers are expected to install necessary SW for initializing hardware accelerators, register themselves with the Kubelet via standard device plugin APIs and exposing accelerators as consumable compute resources via Kubernetes APIs.
Kubelet will handle allocation of hardware accelerators to pods and containers.
Kubelet will communicate with the plugins to ensure that the necessary environment (SW, devices, env variables, etc.) to access hardware accelerators assigned to a pod/container are made accessible within the pod/container sandbox.
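To make the deployment model above concrete, here is a minimal sketch of how a vendor's device-plugin container could be rolled out to accelerator nodes with a DaemonSet. The image name and node label are hypothetical placeholders, and the kubelet socket directory is an assumption based on common device-plugin practice rather than anything defined in this proposal.

```yaml
# Sketch only: deploy a hypothetical device-plugin container to every node
# that carries an accelerator label. Names, image, and paths are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-accelerator-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: example-accelerator-device-plugin
  template:
    metadata:
      labels:
        name: example-accelerator-device-plugin
    spec:
      nodeSelector:
        example.com/accelerator: "true"               # hypothetical node label
      containers:
      - name: device-plugin
        image: example.com/accelerator-plugin:latest  # hypothetical image
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins  # registers with the kubelet here (assumed path)
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins
```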
Where is the spec for kubelet <-> plugin? Is this another form of CSI-style plugin? I'd like to converge these models if we can.
It is being designed via this proposal - https://github.com/RenaudWasTaken/community/blob/f1b462b12353df8c4467aec2ace251c7424c9078/contributors/design-proposals/device-plugin.md
There are several open issues with the Spec and so I don't think it's ready for a thorough review. It will provide a good overview though.
IIUC, CSI APIs are very closely tied to filesystem APIs. The main overlap between CSI and the hardware device plugin API that I can see is with local storage devices. Since Storage resource APIs are completely different from other resources and they need a notion of Identity, I felt these two APIs are catering to different needs.
If you can think of a means to converge to a single API, I'm all ears @thockin
What I really meant was convergence in transport and feel, rather than API. If they were all grpc with the same basic assumptions about auth and discovery, with the same basic concepts, it would help a lot.
That said, there's a proposal on the table to change storage stuff to a much more declarative model, and it has some compelling aspects, too.
@thockin Got it. We have 3 more months to graduate the resource/device plugin API to beta. If the CSI API infra gets finalized by then, we can definitely re-use it.
I don't think it's optional. We have to be driving plugins towards convergence. No "if" here. Or we have to have a good reason why we can't and never intend to do so.
@vishh For discovery in particular, device plugin proposal assumes plugins call kubelet for registration; whereas in csi, if implemented in kubernetes, operator negotiate plugin unix socket path between kubelet and plugin, and kubelet is responsible to watch for new plugin. Wondering what's the reason for this difference?
do you recall the location of the proposal :) @thockin

> That said, there's a proposal on the table to change storage stuff to a much more declarative model, and it has some compelling aspects, too.
Agree, 3.ii seems a good stopgap. Do we have a time plan to deprecate OIR? We can remove this OIR special-case at the time OIR is deprecated.
Jiaying
/lgtm cancel //PR changed after LGTM, removing LGTM. @derekwaynecarr @mindprince @thockin @vishh
@mindprince you get the lgtm :)
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@thockin updated the naming scheme to reflect recent conversations. PTAL if you have some cycles.
I am fine with the naming scheme behind the vendor domain. I think we will need the resource API we discussed, and I think we need to get that in probably before beta?
Kubernetes nucleus will recommend and document a general purpose resource name for each family of accelerators - examples include `nvidia.com/gpu`, `amd.com/gpu`, `google.com/tpu`, etc., with a standard domain name that unambiguously identifies the vendor of a hardware followed by the hardware type `<hardware-vendor-domain>/<hardware-type>`.
It is expected that the hardware vendors will work with the kubernetes community to keep their resource names consistent across kubernetes clusters.
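Under this scheme, a pod would request an accelerator the same way it requests CPU or memory. A minimal sketch follows, using the `nvidia.com/gpu` name from the paragraph above; the container image is a placeholder.

```yaml
# Sketch only: a pod consuming one unit of a vendor-domain extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: cuda-workload
    image: example.com/cuda-app:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1                # <hardware-vendor-domain>/<hardware-type>
```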
Nodes are expected to be homogenous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
The expectation stated here for homogenous nodes contradicts later sections in the same paragraph. Possibly rephrase to state priority order?
Nodes are expected to be homogenous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
Users can expose “extended resources” with other names and consume them in their own clusters.
The admission logic will be extended to allow any resource with a non empty non-default (not `kubernetes.io`) domain name.
The scheduler will be extended to treat such extended resources as an integer resource to begin with.
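As an illustration of what that looks like from the API side, a node advertising a cluster-local extended resource would simply report an integer quantity in its capacity; `example.com/widget` below is a made-up resource name, not part of this proposal.

```yaml
# Sketch only: fragment of a Node object's status advertising a
# cluster-local extended resource alongside first-class resources.
status:
  capacity:
    cpu: "8"
    memory: 32Gi
    example.com/widget: "4"    # counted as an opaque integer by the scheduler
  allocatable:
    cpu: "8"
    memory: 32Gi
    example.com/widget: "4"
```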
Note we will not support overcommitting initially?
Hardware Accelerators are expensive and typically have unique hardware architectures.
Programming against these accelerators, improving performance and utilization is non-trivial.
Certain generic metrics like `utilization` and `usage_time`, and vendor specific metrics are expected to be exposed via cAdvisor and made available to monitoring solutions.
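The proposal does not define concrete metric names; purely as a hedged illustration, the generic metrics mentioned above might surface along these lines, with vendors free to add device-specific series.

```yaml
# Illustrative only: hypothetical generic accelerator metrics; the actual
# names and units would be settled by the monitoring work, not this proposal.
- name: accelerator_utilization          # fraction of the sampling window the device was busy
  unit: ratio
- name: accelerator_usage_time_seconds   # cumulative time the device has been in use
  unit: seconds
- name: accelerator_memory_used_bytes    # vendor-reported device memory in use
  unit: bytes
```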
What is being done here explicitly?
Same question here.
Also, the proposal seems to suggest adding vendor specific monitoring capability to cadvisor, is this what we do right now? I feel like this is not a sustainable solution.
## Beta

### Requirements
Is basic monitoring implied here?
Accelerators are preferred over CPUs mainly for performance reasons.
Accelerators typically have extreme requirements at the hardware level in terms of power, hardware interconnect bandwidth, latency, etc.
These high performance devices require careful placement of user workloads on specific CPUs, Memory banks and Accelerator devices to reduce latency and guarantee application performance.
Kubernetes will support support performance isolation for these hardware accelerators, by allowing hardware device plugins to expose a hardware topology graph where each edge represents latency to access one or more CPUs.
typo? two support
@ddysher Thanks a lot for the information! Could you explain a bit more on what "operator negotiate plugin unix socket path between kubelet and plugin, and kubelet is responsible to watch for new plugin" means? I haven't followed the CSI design closely, although it is on our agenda to compare different plugin systems including CSI, CNI, and DevicePlugin and make sure their communication models stay consistent, as @thockin suggested :).
Jiaying
@jiayingz I mean the operator (human) is responsible for passing the CSI endpoint to both the server (e.g. kubelet) and the plugin. The plugin will create and listen on the endpoint (e.g. a socket), while the server monitors the endpoint. If the server observes that the endpoint exists, it will establish the connection. I wasn't involved in the CSI discussion; this is my understanding after going through the spec.
Accelerators typically have extreme requirements at the hardware level in terms of power, hardware interconnect bandwidth, latency, etc.
These high performance devices require careful placement of user workloads on specific CPUs, Memory banks and Accelerator devices to reduce latency and guarantee application performance.
Kubernetes will support support performance isolation for these hardware accelerators, by allowing hardware device plugins to expose a hardware topology graph where each edge represents latency to access one or more CPUs.
Kubelet will combine graphs from multiple plugins along with the node’s NUMA topology to handle hardware device assignment.
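The shape of that topology graph is left open by the proposal. Purely as an illustration of the idea, a plugin might report something like the sketch below, where each edge gives a relative access latency between a device and a set of CPUs; every field name here is hypothetical.

```yaml
# Sketch only: a hypothetical topology report from a device plugin on a
# two-socket node with one accelerator per NUMA node.
devices:
- id: accel-0
  numaNode: 0
  edges:
  - cpus: [0, 1, 2, 3]      # CPUs local to the device
    relativeLatency: 1
  - cpus: [4, 5, 6, 7]      # CPUs across the interconnect
    relativeLatency: 4
- id: accel-1
  numaNode: 1
  edges:
  - cpus: [4, 5, 6, 7]
    relativeLatency: 1
  - cpus: [0, 1, 2, 3]
    relativeLatency: 4
```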
I was under the impression that the scheduler was the component which placed Pods on a target node and allocated resources from that target node to the Pod. If that is the case, this seems to be carving out some special types of resources that will not be allocated by the scheduler but instead will be allocated/scheduled by the Kubelet. Is my understanding of that correct?
Here is a comment found in the design doc
https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit?disco=AAAABKV195w
This proposal captures the current state of affairs in the community around support for Hardware Accelerators.
This proposal is meant to help set expectations across the community.
The workloads aspect of hardware accelerators is intentionally pending. I intend to extend this proposal in the near future with workload specific user journeys.
@jiayingz @mindprince @thockin @derekwaynecarr @davidopp