Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRA for 1.31 #125488

Merged
merged 15 commits into from
Jul 22, 2024
Merged

DRA for 1.31 #125488

merged 15 commits into from
Jul 22, 2024

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 13, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is an implementation of the revised API and features from kubernetes/enhancements#4709.

Which issue(s) this PR fixes:

Related-to:

Fixes: #125665, #124041, #125210

Special notes for your reviewer:

Several of these changes where already proposed earlier in separate PRs and/or might get split out. Here's a list:

In this PR, please only review commits starting with "DRA: remove immediate allocation".

Does this PR introduce a user-facing change?

DRA: new API and several new features

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]:  https://github.com/kubernetes/enhancements/issues/4381

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet area/test sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added area/apiserver area/code-generation area/kubectl kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cli Categorizes an issue or PR as relevant to SIG CLI. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 14, 2024
pohly and others added 10 commits July 22, 2024 18:09
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
  of similar devices in a single request
- no class for ResourceClaims, instead individual
  device requests are associated with a mandatory
  DeviceClass

For the sake of simplicity, optional basic types (ints, strings) where the null
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.

The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.

The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
Publishing ResourceSlices now supports network-attached devices and the new
v1alpha3 API.  The logic for splitting up across different slices is missing.
This adds the ability to select specific requests inside a claim for a
container.

NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.

The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports claim,
API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.

Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Now changes to the
claim info cache are getting logged. While at it, naming of variables and some
existing log output gets harmonized.

Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
The advantages of using a validation admission policy (VAP) are that no changes
are needed in Kubernetes and that admins have full flexibility if and how they
want to control which users are allowed to use "admin access" in their
requests.

The downside is that without admins taking actions, the feature is enabled
out-of-the-box in a cluster. Documentation for DRA will have to make it very
clear that something needs to be done in multi-tenant clusters.

The test/e2e/testing-manifests/dra/admin-access-policy.yaml shows how to do
this. The corresponding E2E tests ensures that it actually works as intended.

For some reason, adding the namespace to the message expression leads to a
type check errors, so it's currently commented out.
The resource claim controller is completely agnostic to the claim spec. It
doesn't care about classes or devices, therefore it needs no changes in 1.31
besides the v1alpha2 -> v1alpha3 renaming from a previous commit.
The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured where it might be
useful for out-of-tree components.

Besides the new features (amount, admin access) and API it now supports
backtracking when the initial device selection doesn't lead to a complete
allocation of all claims.

Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.

The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling             73s   default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
This fixes the message (node name and "cluster-scoped" were switched) and
simplifies the VAP:
- a single matchCondition short circuits completely unless they're a user
  we care about
- variables to extract the userNodeName and objectNodeName once
  (using optionals to gracefully turn missing claims and fields into empty strings)
- leaves very tiny concise validations

Co-authored-by: Jordan Liggitt <liggitt@google.com>
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
kubernetes@0449cef,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation gets updated.
Copy link
Member

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

for scheduler changes.

PS: there was one nit for integration test coverage. For now DRA is using scheduler_perf tests for both perf and functionality tests. However, scheduler_perf is not included in a PR's presubmit-CI, so it makes sense to add some integration tests under test/integration/scheduler. Not a blocker though.

@pohly
Copy link
Contributor Author

pohly commented Jul 22, 2024

/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

Setup failed with a flake (? "Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory").

#125488 (review):

However, scheduler_perf is not included in a PR's presubmit-CI

It is enabled, and there is one test scenario ("SchedulingWithResourceClaimTemplateStructured") for the new API. More can and will be added once we start investigating performance a bit more. Right now, it's more about functionality.

@klueska: would you do us the honor and LGTM together with lifting the hold?

We have positive reviews and approval from (not collecting links, but they are there...):

Thanks everyone!

@klueska
Copy link
Contributor

klueska commented Jul 22, 2024

Gladly. Thanks everyone! One step closer...

/approve
/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 22, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ce1d7771db93f922ecda8bf61041a6be1c84de6d

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, klueska, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit d21b172 into kubernetes:master Jul 22, 2024
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 22, 2024
pohly added a commit to pohly/kubernetes that referenced this pull request Jul 31, 2024
This behavior was useful for kubernetes#125488 but
wasn't obvious when reading the documentation.
hungnguyen243 pushed a commit to hungnguyen243/kubernetes that referenced this pull request Aug 16, 2024
The manual deep comparison code is hard to maintain (would need to be updated
in kubernetes#125488) and error prone.

In fact, one test case failed when doing a full automatic comparison with
cmp.Diff because it wasn't setting allMemory.
hungnguyen243 pushed a commit to hungnguyen243/kubernetes that referenced this pull request Aug 16, 2024
This behavior was useful for kubernetes#125488 but
wasn't obvious when reading the documentation.
SoulPancake pushed a commit to SoulPancake/k8s that referenced this pull request Sep 11, 2024
The manual deep comparison code is hard to maintain (would need to be updated
in kubernetes#125488) and error prone.

In fact, one test case failed when doing a full automatic comparison with
cmp.Diff because it wasn't setting allMemory.
SoulPancake pushed a commit to SoulPancake/k8s that referenced this pull request Sep 11, 2024
This behavior was useful for kubernetes#125488 but
wasn't obvious when reading the documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/code-generation area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubectl area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

finish DRA for 1.31