-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2832] [core] Add non-Yunikorn allocation tracking logic #975
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #975 +/- ##
==========================================
+ Coverage 81.03% 81.49% +0.45%
==========================================
Files 97 97
Lines 12523 12632 +109
==========================================
+ Hits 10148 10294 +146
+ Misses 2104 2055 -49
- Partials 271 283 +12 ☔ View full report in Codecov by Sentry. |
f35f054
to
724f3de
Compare
724f3de
to
d61863c
Compare
d61863c
to
b836716
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor things, and some questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions.
if !allocated { | ||
return false, false, fmt.Errorf("trying to add a foreign request (non-allocation) %s", allocationKey) | ||
} | ||
if alloc.GetNodeID() == "" { | ||
return false, false, fmt.Errorf("node ID is empty for allocation %s", allocationKey) | ||
} | ||
if node == nil { | ||
return false, false, fmt.Errorf("failed to find node %s for allocation %s", nodeID, allocationKey) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return error means the foreign allocation will be sent back to shim as "rejected allocation"
May I check what will be the followup actions after shim receives the rejected foreign allocations?
The reject reasons for foreign allocations:
- foreign request is non-allocated
- foreign allocation don't have node ID empty
- foreign allocation failed to find node in partition
Besides, should we add a flag of allocation type to &si.RejectedAllocation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it's sent back and it needs to be handled properly.
Currently, this happens inside scheduler_callback.go
:
for _, reject := range response.RejectedAllocations {
// request rejected by the scheduler, reject it
log.Log(log.ShimRMCallback).Debug("callback: response to rejected allocation",
zap.String("allocationKey", reject.AllocationKey))
dispatcher.Dispatch(NewRejectTaskEvent(reject.ApplicationID, reject.AllocationKey,
fmt.Sprintf("task %s allocation from application %s is rejected by scheduler",
reject.AllocationKey, reject.ApplicationID)))
}
This doesn't look right, we don't need a task reject event for a non-Yunikorn allocation. Therefore, we need changes in this code path (my feeling is that there's not much we can do other than logging something) but, this is yet to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, there is nothing we can do except logging.
If we want to log the information back to the Pod (Pod event? or other places?), we need to determine whether "This rejected allocation is YuniKorn or Non-YuniKorn." How do we identify it?
Will we keep the Non-YuniKorn allocation ID in shim?
My previous question assumes we don't keep the Non-YuniKorn allocation ID list in shim.
Besides, should we add a flag of allocation type to &si.RejectedAllocation?
Therefore, we might need a flag from si.RejectedAllocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, maybe we can just keep the rejected allocation in core?
Actually, I prefer this because there is nothing helpful the shim can do after receiving the rejected Non-YuniKorn allcoations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way we would ever reject a non-YK alloc in the core is if the node was not found. This would indicate a bug in the shim, and so there's no appropriate way to handle it. So logging loudly is probably sufficient (ideally with a "BUG:" prefix). Since it should be unreachable code, sending back a reject seems unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good except for debug instead of warning in logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
What is this PR for?
Add foreign pod tracking logic to the scheduler core.
Note: existing code which updates
occupiedResources
is not removed in this PR. Removal will be done as a follow-up task.Updating the allocation resource usage be implemented in YUNIKORN-1765.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2832
How should this be tested?
Screenshots (if appropriate)
Questions: