
Adding a proposal for hardware accelerators #844

Closed
wants to merge 2 commits into from

Conversation

vishh
Contributor

@vishh vishh commented Jul 24, 2017

This proposal captures the current state of affairs in the community around support for Hardware Accelerators.
This proposal is meant to help set expectations across the community.

The workloads aspect of hardware accelerators is intentionally pending. I intend to extend this proposal in the near future with workload specific user journeys.

@jiayingz @mindprince @thockin @derekwaynecarr @davidopp

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 24, 2017
## Introduction

Hardware Accelerators are becoming a widely used commodity across various industries.
Accelerators can bring down computing latency and/or costs significantly.
Member

nit: Accelerators can bring* ?

Contributor Author

Oops. Fixing it.


## Non Goals
* Support for Cloud Gaming, Simulations, Remote Desktops and other workloads
* Support for these workloads will be tackled once support for ML and DL matures
Member

maybe link to what differs about these workloads for those unfamiliar

Contributor Author

The difference doesn't matter much from K8s perspective. Hence removing it.

Any further differentiation amongst hardware accelerators using the resource name will not be considered “portable” across Kubernetes clusters.
It is expected that accelerator hardware vendors will define and manage Resource Types.

Nodes are expected to be homogeneous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
Member

i thought we wanted to support heterogeneous devices via resource class in the future? am i misinterpreting the homogeneous expectation?

Member

ok, i see clarification below. i would prefer we phrase this differently. we wish to support heterogeneous devices in the future, but we will tackle the homogeneous case first.

Contributor Author

Ack


### Timelines

* Current target is `v1.9`
Member

this assumes device api is alpha in 1.8, correct?

Contributor Author

Yes.

@derekwaynecarr
Member

a few questions, but generally, looks good to me and is consistent with what we have discussed in community.

Member

@rohitagarwal003 rohitagarwal003 left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 25, 2017
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Jul 25, 2017

@derekwaynecarr can I get an LGTM?

Exposing all hardware accelerators as well-known (first-class) Compute Resource Types will bloat the API and compromise portability.
For this reason, Hardware Accelerators are expected to be handled as “Extended Compute Resources”.

Kubernetes nucleus will recommend and document a general purpose resource name for each family of accelerators - examples include `nvidia-gpu`, `amd-gpu`, `google-tpu`, etc., with a standard prefix `extensions.kubernetes.io`. This naming scheme partially mimics PCI ID - `<Vendor Name>-<Device Type>`.
Member

If you want to go down this road, someone has to manage a registry of vendor names (e.g. "nvidia" and never "nVidia" or "NVidia" or "n-vidia"), and do some sort of verification of ownership and non-infringement, and so on. The normal "use your own domain as the prefix" is a bit more self-service.

Contributor Author

One of the goals is to provide guidelines similar to the PCI ID schema in Linux.
Custom domains may not be consistent across different drivers (plugins) for the same hardware. WDYT?

Member

My concern is the implication that we will manage a database of abstract names in perpetuity.

Contributor Author

@thockin As discussed offline, can you take a look at the language and suggest appropriate changes?

Member

the alternative is to have a name like the following:

`<vendor-name>.extensions.kubernetes.io/<device-type>`

which seems to satisfy @thockin's concern.

Contributor

@vishh any thoughts on how to determine whether an integer resource allows overcommit or not? Right now, we have a no-overcommit limitation for ResourceNvidiaGPU but not for the general OIR. Given that we want to deprecate ResourceNvidiaGPU in the future, how can we know whether a particular vendor resource allows overcommit or not?

Contributor

@ConnorDoyle ConnorDoyle Aug 11, 2017

@jiayingz there are at least a few options available there:

  1. Encode whether the resource can be overcommitted somehow in the resource name.
  2. Maintain a list of either overcommit or no-overcommit resource names and use them in admission control.
  3. Don't allow overcommit for now, except for first-class resources. A future API for resources could allow enabling overcommit on a case-by-case basis.
    i. Do this in a way that soft-breaks OIR (currently they can be overcommitted).
    ii. Special-case OIR to preserve their overcommit behavior.

(1) seems likely to require a future breaking change.
(2) potentially couples admission to extended resource names.
(3.i) might be OK, but we don't have a good way to determine the extent of the breakage across the ecosystem, so I would vote not to do this without a deprecation cycle.
(3.ii) could be a decent stopgap, made more tolerable if we deprecate OIR at the same time. The breakage due to feature removal would be that pod.alpha.kubernetes.io/opaque-int-<foo> resources can no longer be overcommitted.
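
A minimal sketch of how the "no overcommit" options above could be enforced at admission time, assuming the simple rule that any resource outside the `kubernetes.io` domain is an extended resource whose limit must equal its request. The helper names are illustrative assumptions, not the actual apiserver/kubelet admission code.

```go
package main

import (
	"fmt"
	"strings"
)

// isExtendedResource assumes (per the naming scheme in this proposal) that any
// resource with a domain prefix other than kubernetes.io, e.g. "nvidia.com/gpu",
// is an extended resource.
func isExtendedResource(name string) bool {
	parts := strings.SplitN(name, "/", 2)
	return len(parts) == 2 && parts[0] != "kubernetes.io"
}

// validateNoOvercommit rejects workloads whose extended-resource limits differ
// from their requests, which is what "no overcommit" means for integer resources.
func validateNoOvercommit(requests, limits map[string]int64) error {
	for name, limit := range limits {
		if !isExtendedResource(name) {
			continue
		}
		if req, ok := requests[name]; !ok || req != limit {
			return fmt.Errorf("extended resource %q must have limit == request (no overcommit)", name)
		}
	}
	return nil
}

func main() {
	err := validateNoOvercommit(
		map[string]int64{"nvidia.com/gpu": 1},
		map[string]int64{"nvidia.com/gpu": 2},
	)
	fmt.Println(err) // prints the rejection: limit (2) does not equal request (1)
}
```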

Contributor Author

Extended or Opaque resources do not need overcommit.
I will update this PR to reflect the fact that a curated domain name is no longer necessary for plugins.

Member

Is it legitimate for two different authors to produce drivers for the same resource? Does it matter here if an NV-provided driver and an OSS driver both claim to provide ".../nvidia-gpu"? Or do we want the resource name to reflect the driver provider as well as the device?

Contributor Author

From a user perspective, the resource name is an API that gives them access to a set of resources (character device files, user space libraries, kernel APIs, etc.). Identifying and documenting that API will be key to the project. That would then make it possible to have alternate implementations of device plugins for the same hardware.
There may be legal issues that vendors could raise, but I do not have a good handle on that.
Thoughts?

Instead of building a special solution for Nvidia GPUs in Kubernetes, a standard extension pipeline called "Hardware Device Plugin" [has been proposed](https://docs.google.com/a/google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit?usp=drive_web) to support arbitrary hardware (and virtual) devices without requiring device-specific changes to the Kubernetes nucleus.
Software for hardware accelerators is expected to be shipped via standard containers. These containers are expected to be deployed on every node with accelerators. These containers are expected to install the software necessary for initializing hardware accelerators, register themselves with the Kubelet via standard device plugin APIs, and expose accelerators as consumable compute resources via Kubernetes APIs.
Kubelet will handle allocation of hardware accelerators to pods and containers.
Kubelet will communicate with the plugins to ensure that the necessary environment (software, devices, environment variables, etc.) to access hardware accelerators assigned to a pod/container is made accessible within the pod/container sandbox.
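
To illustrate the pipeline described above, here is a hedged Go sketch of the register-then-advertise flow. The types and method names are hypothetical placeholders (the actual gRPC API is defined in the separate device plugin proposal referenced later in this thread), but the sequence follows the prose: the plugin container registers itself with the Kubelet and then advertises its devices as an extended resource.

```go
package main

import "fmt"

// RegisterRequest is what a plugin might send to the Kubelet on startup.
// These field names are illustrative, not the real device plugin API.
type RegisterRequest struct {
	ResourceName string // e.g. "nvidia.com/gpu"
	Endpoint     string // unix socket the Kubelet connects back to
}

// Device is one schedulable unit of the advertised resource.
type Device struct {
	ID      string
	Healthy bool
}

// Kubelet models the Kubelet-side registration endpoint.
type Kubelet interface {
	Register(req RegisterRequest) error
}

// fakeKubelet is a stand-in so the example runs without a real Kubelet.
type fakeKubelet struct{}

func (fakeKubelet) Register(req RegisterRequest) error {
	fmt.Printf("registered plugin for %s at %s\n", req.ResourceName, req.Endpoint)
	return nil
}

func main() {
	var k Kubelet = fakeKubelet{}

	// 1. The plugin container starts on a node with accelerators, installs any
	//    required software, and registers itself with the Kubelet.
	_ = k.Register(RegisterRequest{
		ResourceName: "nvidia.com/gpu",
		Endpoint:     "/var/lib/kubelet/plugins/nvidia.sock", // hypothetical path
	})

	// 2. The plugin reports its devices; the Kubelet then exposes them as an
	//    extended resource and handles allocation to pods/containers.
	devices := []Device{{ID: "GPU-0", Healthy: true}, {ID: "GPU-1", Healthy: true}}
	fmt.Printf("advertising %d healthy devices\n", len(devices))
}
```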
Member

Where is the spec for kubelet <-> plugin? Is this another form of CSI-style plugin? I'd like to converge these models if we can.

Contributor Author

It is being designed via this proposal - https://github.com/RenaudWasTaken/community/blob/f1b462b12353df8c4467aec2ace251c7424c9078/contributors/design-proposals/device-plugin.md

There are several open issues with the spec, so I don't think it's ready for a thorough review. It will provide a good overview, though.

Contributor Author

IIUC, CSI APIs are very closely tied to filesystem APIs. The main overlap between CSI and the hardware device plugin API that I can see is with local storage devices. Since storage resource APIs are completely different from other resources and they need a notion of identity, I felt these two APIs are catering to different needs.
If you can think of a means to converge to a single API, I'm all ears @thockin

Member

What I really meant was convergence in transport and feel, rather than API. If they were all gRPC with the same basic assumptions about auth and discovery, with the same basic concepts, it would help a lot.

That said, there's a proposal on the table to change storage stuff to a much more declarative model, and it has some compelling aspects, too.

Contributor Author

@thockin Got it. We have 3 more months to graduate the resource/device plugin API to beta. If the CSI API infra gets finalized by then, we can definitely reuse it.

Member

I don't think it's optional. We have to be driving plugins towards convergence. No "if" here. Or we have to have a good reason why we can't and never intend to do so.

Contributor

@vishh For discovery in particular, the device plugin proposal assumes plugins call the kubelet for registration; whereas in CSI, if implemented in Kubernetes, the operator negotiates the plugin's unix socket path between the kubelet and the plugin, and the kubelet is responsible for watching for new plugins. Wondering what the reason for this difference is?

do you recall the location of the proposal :) @thockin

That said, there's a proposal on the table to change storage stuff to a much more declarative model, and it has some compelling aspects, too.

@jiayingz
Copy link
Contributor

jiayingz commented Aug 11, 2017 via email

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
@k8s-github-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @derekwaynecarr @mindprince @thockin @vishh

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 15, 2017
@thockin
Member

thockin commented Aug 22, 2017

@mindprince you get the lgtm :)

@thockin thockin removed their assignment Aug 22, 2017
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Aug 22, 2017

@thockin updated the naming scheme to reflect recent conversations. PTAL if you have some cycles.
@derekwaynecarr can you re-review this patch?

cc @jiayingz @mindprince

Member

@derekwaynecarr derekwaynecarr left a comment

I am fine with the naming scheme behind the vendor domain. I think we will need the resource API we discussed, and I think we need to get that in probably before beta?

Kubernetes nucleus will recommend and document a general purpose resource name for each family of accelerators - examples include `nvidia.com/gpu`, `amd.com/gpu`, `google.com/tpu`, etc., with a standard domain name that unambiguously identifies the vendor of the hardware, followed by the hardware type: `<hardware-vendor-domain>/<hardware-type>`.
It is expected that the hardware vendors will work with the Kubernetes community to keep their resource names consistent across Kubernetes clusters.
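
For illustration, a container consuming an accelerator under this naming scheme would simply set a limit on the vendor-domain resource. The sketch below uses the standard Kubernetes Go client types; the pod and image names are made-up placeholders. The equivalent YAML sets the `nvidia.com/gpu` key under `spec.containers[].resources.limits`.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-training-job"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "example.com/trainer:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resource named <hardware-vendor-domain>/<hardware-type>.
						corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Spec.Containers[0].Resources.Limits)
}
```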

Nodes are expected to be homogeneous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
Member

The expectation stated here for homogeneous nodes contradicts later sections in the same paragraph. Possibly rephrase to state the priority order?

Nodes are expected to be homogeneous and any attributes specific to hardware accelerators are expected to be exposed as node labels in Kubernetes to begin with.
Users can expose “extended resources” with other names and consume them in their own clusters.
The admission logic will be extended to allow any resource with a non-empty, non-default (not `kubernetes.io`) domain name.
The scheduler will be extended to treat such extended resources as integer resources to begin with.
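
A minimal sketch of what "treat such extended resources as integer resources" could mean in the scheduler's fit check: requested counts are simply compared against the node's remaining capacity, with no overcommit. The types below are simplified assumptions, not the actual scheduler code.

```go
package main

import "fmt"

type resourceList map[string]int64

// fitsExtended reports whether a pod's extended-resource requests fit into the
// node's remaining capacity (allocatable minus what other pods already requested).
func fitsExtended(podRequests, nodeAllocatable, nodeRequested resourceList) bool {
	for name, want := range podRequests {
		free := nodeAllocatable[name] - nodeRequested[name]
		if want > free {
			return false
		}
	}
	return true
}

func main() {
	ok := fitsExtended(
		resourceList{"nvidia.com/gpu": 2}, // pod asks for 2 GPUs
		resourceList{"nvidia.com/gpu": 4}, // node advertises 4
		resourceList{"nvidia.com/gpu": 3}, // 3 already requested by other pods
	)
	fmt.Println(ok) // false: only 1 GPU left on the node
}
```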
Member

Note we will not support overcommitting initially?


Hardware Accelerators are expensive and typically have unique hardware architectures.
Programming against these accelerators and improving performance and utilization is non-trivial.
Certain generic metrics like `utilization` and `usage_time`, and vendor specific metrics are expected to be exposed via cAdvisor and made available to monitoring solutions.
Member

What is being done here explicitly?

Contributor

Same question here.

Also, the proposal seems to suggest adding vendor-specific monitoring capability to cAdvisor; is this what we do right now? I feel like this is not a sustainable solution.


## Beta

### Requirements
Member

Is basic monitoring implied here?

Accelerators are preferred over CPUs mainly for performance reasons.
Accelerators typically have extreme requirements at the hardware level in terms of power, hardware interconnect bandwidth, latency, etc.
These high performance devices require careful placement of user workloads on specific CPUs, memory banks, and accelerator devices to reduce latency and guarantee application performance.
Kubernetes will support performance isolation for these hardware accelerators by allowing hardware device plugins to expose a hardware topology graph where each edge represents latency to access one or more CPUs.
Contributor

@ScorpioCPH ScorpioCPH Sep 11, 2017

typo? two support.

@castrojo
Member

This change is Reviewable

@jiayingz
Contributor

jiayingz commented Oct 17, 2017 via email

@ddysher
Contributor

ddysher commented Oct 17, 2017

@jiayingz I mean the operator (human) is responsible for passing the CSI endpoint to both the server (e.g. kubelet) and the plugin. The plugin will create and listen on the endpoint (e.g. a socket), while the server monitors the endpoint. If the server observes that the endpoint exists, it will establish the connection.

I wasn't involved in the CSI discussion; this is my understanding after going through the spec.

Accelerators typically have extreme requirements at the hardware level in terms of power, hardware interconnect bandwidth, latency, etc.
These high performance devices require careful placement of user workloads on specific CPUs, memory banks, and accelerator devices to reduce latency and guarantee application performance.
Kubernetes will support performance isolation for these hardware accelerators by allowing hardware device plugins to expose a hardware topology graph where each edge represents latency to access one or more CPUs.
Kubelet will combine graphs from multiple plugins along with the node’s NUMA topology to handle hardware device assignment.
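
To make the topology idea concrete, here is a hypothetical sketch of the kind of graph a plugin could report and how the Kubelet might use it when picking CPUs for a device. All names are illustrative assumptions; no such API exists in this proposal yet.

```go
package main

import "fmt"

// TopologyEdge says how "far" (in relative latency terms) a device is from a
// set of CPUs, e.g. same NUMA node vs. across a socket or PCIe switch hop.
type TopologyEdge struct {
	DeviceID string
	CPUs     []int // CPU IDs reachable over this edge
	Latency  int   // relative cost; lower means closer
}

// pickCPUs returns the lowest-latency CPU set for the given device, roughly
// what the Kubelet would do after merging plugin graphs with NUMA topology.
func pickCPUs(edges []TopologyEdge, deviceID string) []int {
	var best *TopologyEdge
	for i := range edges {
		e := &edges[i]
		if e.DeviceID != deviceID {
			continue
		}
		if best == nil || e.Latency < best.Latency {
			best = e
		}
	}
	if best == nil {
		return nil
	}
	return best.CPUs
}

func main() {
	graph := []TopologyEdge{
		{DeviceID: "GPU-0", CPUs: []int{0, 1, 2, 3}, Latency: 10}, // same NUMA node
		{DeviceID: "GPU-0", CPUs: []int{4, 5, 6, 7}, Latency: 40}, // remote NUMA node
	}
	fmt.Println(pickCPUs(graph, "GPU-0")) // [0 1 2 3]
}
```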
Contributor

I was under the impression that the scheduler was the component which placed Pods on a target node and allocated resources from that target node to the Pod. If that is the case, this seems to be carving out some special types of resources that will not be allocated by the scheduler but instead will be allocated/scheduled by the Kubelet. Is my understanding of that correct?

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 6, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 20, 2018
@rohitagarwal003
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 20, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 18, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 18, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023