
[YUNIKORN-2976] Handle multiple require node allocations per node #1001

Closed

Conversation

Contributor

@wilfred-s wilfred-s commented Dec 3, 2024

What is this PR for?

If an allocation requires a specific node, the scheduler should not consider any other node. Multiple allocations that require the same node should be allowed to reserve that node at the same time. A required node allocation must be placed on the node before anything else: if other, non-required-node reservations exist on the node, remove the existing reservations that do not require that node. Make sure that the releases are tracked correctly in the partition.

After the repeat-count removal, reservations can be simplified:

  • track reservations using the allocation key
  • remove the composite-key setup
  • remove the collection listener call on reserve or unreserve of a node

Clean up testing of the node collection.
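
Below is a minimal Go sketch of what this simplification can look like. It is illustrative only; the types, fields, and helpers are assumptions for this sketch, not the actual yunikorn-core code:

```go
package objects

// reservation ties one allocation (ask) to the node it has reserved.
// After the repeat-count removal the allocation key is unique, so it can
// serve as the map key on its own; no composite key (e.g. a joined
// appID|nodeID|askKey string) is needed any more.
type reservation struct {
	allocKey string // unique allocation key
	appID    string
	nodeID   string
}

// node keeps its reservations keyed by allocation key, so several
// required-node allocations can reserve the same node at the same time,
// each under its own key.
type node struct {
	nodeID       string
	reservations map[string]*reservation
}

func (n *node) reserve(r *reservation) {
	n.reservations[r.allocKey] = r
}

// unReserve removes a single reservation and reports whether the
// allocation actually held one, so the caller (the partition) can keep
// its release tracking consistent.
func (n *node) unReserve(allocKey string) bool {
	if _, ok := n.reservations[allocKey]; ok {
		delete(n.reservations, allocKey)
		return true
	}
	return false
}
```

Keying by the allocation key only works because the repeat-count removal made each ask map to exactly one allocation, so the key is unique per reservation.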

What type of PR is it?

  • Bug Fix
  • Improvement

Todos

  • e2e test in the shim requesting multiple daemon sets at once

What is the Jira issue?

YUNIKORN-2976

How should this be tested?

Unit tests are updated; a new e2e test should be added.


codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 79.54545% with 54 lines in your changes missing coverage. Please review.

Project coverage is 81.99%. Comparing base (0356a3a) to head (35c3cad).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
pkg/scheduler/objects/application.go 75.69% 40 Missing and 4 partials ⚠️
pkg/scheduler/partition.go 47.36% 9 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
+ Coverage   81.56%   81.99%   +0.43%     
==========================================
  Files          97       97              
  Lines       15646    15621      -25     
==========================================
+ Hits        12761    12809      +48     
+ Misses       2601     2535      -66     
+ Partials      284      277       -7     


Contributor

@pbacsko pbacsko left a comment


Wow, nice change :)

First, a general observation: there seem to be too many uncovered paths in application.go. I'm sure some of those already existed before, but this change touches a critical part of the allocation cycle, so I'd be happy to see those codecov reports go away, like:

#L883 - L887
#L944 - L947
#L1106 - L1120
#L1143
#L1241 (this looks weird, because the flow continues)
#L1317 (looks weird too)
#L1362

Perhaps they're covered indirectly from the partition code, but it's hard to reason about.

Contributor

@pbacsko pbacsko left a comment


Some more comments after round #1.

Resolved review threads:
  • pkg/scheduler/objects/reservation.go (two threads)
  • pkg/scheduler/objects/application.go (outdated)
Contributor

craigcondit commented Dec 3, 2024

I've installed this patch on a cluster where I've spun up a kwok-based autoscaler. I encountered a scenario where a monitoring pod (a daemonset member) failed to schedule on a full node by triggering preemption. The logs ping-pong back and forth between these two messages now:

2024-12-03T23:37:20.251Z    INFO    core.scheduler.queue    objects/queue.go:1438    allocation found on queue    {"queueName": "root.monitoring", "appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "resultType": "Reserved", "allocation": "allocationKey ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, applicationID monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, Resource map[pods:1], Allocated false"}
2024-12-03T23:37:20.251Z    INFO    core.scheduler.partition    scheduler/partition.go:957    Application is already reserved on node    {"appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "nodeID": "kind-worker-6fhtj"}

Events for the pod:

Normal  Scheduling     4m13s  yunikorn  monitoring/kube-prometheus-stack-prometheus-node-exporter-5b6p7 is queued and waiting for allocation
Normal  Informational  4m12s  yunikorn  Unschedulable request 'ac69ed37-d4d7-4c0f-92bd-8f89a16cec74' with required node 'kind-worker-6fhtj', no preemption victim found
Normal  Scheduling     4m5s   yunikorn  monitoring/kube-prometheus-stack-prometheus-node-exporter-5b6p7 is queued and waiting for allocation

I'm using admissionController.filtering.generateUniqueAppId: true to use per-pod App ID generation, if that makes a difference. Restarting YK doesn't seem to help.

Update: Manually killing a pod to make room doesn't make the logs go away, but does schedule the pod. This would seem to indicate that the reservations aren't being cleaned up properly.

@wilfred-s
Contributor Author

Sorry for the late response:

The logs ping-pong back and forth between these two messages now:

2024-12-03T23:37:20.251Z    INFO    core.scheduler.queue    objects/queue.go:1438    allocation found on queue    {"queueName": "root.monitoring", "appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "resultType": "Reserved", "allocation": "allocationKey ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, applicationID monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, Resource map[pods:1], Allocated false"}
2024-12-03T23:37:20.251Z    INFO    core.scheduler.partition    scheduler/partition.go:957    Application is already reserved on node    {"appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "nodeID": "kind-worker-6fhtj"}

One function was not updated to allow multiple daemon-set pods to reserve multiple nodes. A test case was also missing from that unit test.

The follow-up commits:
  • added more unit tests
  • removed the function getAllocationReservation and the calls to it
  • removed the function IsAllocationReserved and the calls to it
  • rewrote IsReservedOnNode as NodeReservedForAsk and made the caller check the nodeID
  • fixed the logic around a previously reserved allocation in tryRequiredNode
  • put a safeguard into the partition to handle the broken case mentioned in the review
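
To illustrate the IsReservedOnNode → NodeReservedForAsk rework, here is a hypothetical sketch under assumed types. The Application stand-in and its reservations field are illustrative, not the actual yunikorn-core code:

```go
package objects

type reservation struct {
	allocKey string
	nodeID   string
}

// Application is a stand-in with just the field this sketch needs:
// the reservations held by the application, keyed by allocation key.
type Application struct {
	reservations map[string]*reservation
}

// NodeReservedForAsk returns the ID of the node the given ask is
// reserved on, or "" when it holds no reservation. Unlike the old
// IsReservedOnNode it takes no nodeID: the caller compares the
// returned node ID against the node it is currently processing.
func (sa *Application) NodeReservedForAsk(allocKey string) string {
	if r, ok := sa.reservations[allocKey]; ok {
		return r.nodeID
	}
	return ""
}
```

On the caller side this becomes something like `if reservedNode := app.NodeReservedForAsk(key); reservedNode != "" && reservedNode != node.nodeID { /* reserved elsewhere: skip or clean up */ }`, which keeps the node comparison in one place instead of hiding it inside the application.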
@wilfred-s wilfred-s requested a review from pbacsko December 12, 2024 10:24
@wilfred-s
Contributor Author

Filed two follow-up Jiras to add unit tests for tryNodesNoReserve and tryPlaceholderAllocate on the application.

Contributor

@craigcondit craigcondit left a comment


+1 LGTM. I've verified the updated patch on a cluster with a kwok autoscaler, and have been unable to reproduce the original issues. This looks good.

craigcondit pushed a commit that referenced this pull request Dec 13, 2024
Closes: #1001

Signed-off-by: Craig Condit <ccondit@apache.org>
rhh777 pushed a commit to rhh777/yunikorn-core that referenced this pull request Dec 25, 2024
Closes: apache#1001

Signed-off-by: Craig Condit <ccondit@apache.org>
(cherry picked from commit b02e15e)