This repository has been archived by the owner on May 25, 2023. It is now read-only.

Examine all pending tasks in a job #540

Merged
merged 2 commits into kubernetes-retired:master
Jan 7, 2019

Conversation

Jeffwan
Contributor

@Jeffwan Jeffwan commented Jan 6, 2019

What this PR does / why we need it:
The allocation process exits when it meets the first task that cannot fit onto a node. This PR makes sure it checks all pending tasks and tries to fit each of them onto a node.

I0105 21:54:11.361090    2921 allocate.go:42] Enter Allocate ...
I0105 21:54:11.361107    2921 allocate.go:57] Added Job <default/qj-1> into Queue <default>
I0105 21:54:11.361123    2921 allocate.go:61] Try to allocate resource to 1 Queues
I0105 21:54:11.361137    2921 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0105 21:54:11.361156    2921 allocate.go:102] Try to allocate resource to 4 tasks of Job <default/qj-1>

# 1st task
I0105 21:54:11.361175    2921 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0105 21:54:11.361188    2921 allocate.go:120] Considering Task <default/qj-1-64jpk> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3690.00, memory 16568913920.00, GPU 0.00>
I0105 21:54:11.361241    2921 allocate.go:132] Binding Task <default/qj-1-64jpk> to node <ip-192-168-46-32.us-west-2.compute.internal>
I0105 21:54:11.361271    2921 session.go:170] After allocated Task <default/qj-1-64jpk> to Node <ip-192-168-46-32.us-west-2.compute.internal>: idle <cpu 1690.00, memory 16568913920.00, GPU 0.00>, used <cpu 2310.00, memory 146800640.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>
I0105 21:54:11.361299    2921 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0105 21:54:11.361315    2921 allocate.go:102] Try to allocate resource to 3 tasks of Job <default/qj-1>

# 2nd task
I0105 21:54:11.361330    2921 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0105 21:54:11.361344    2921 allocate.go:120] Considering Task <default/qj-1-qzdzn> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0105 21:54:11.361404    2921 allocate.go:120] Considering Task <default/qj-1-qzdzn> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3890.00, memory 16715714560.00, GPU 0.00>
I0105 21:54:11.361462    2921 allocate.go:132] Binding Task <default/qj-1-qzdzn> to node <ip-192-168-71-35.us-west-2.compute.internal>
I0105 21:54:11.361482    2921 session.go:170] After allocated Task <default/qj-1-qzdzn> to Node <ip-192-168-71-35.us-west-2.compute.internal>: idle <cpu 1890.00, memory 16715714560.00, GPU 0.00>, used <cpu 2110.00, memory 0.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>
I0105 21:54:11.361511    2921 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0105 21:54:11.361526    2921 allocate.go:102] Try to allocate resource to 2 tasks of Job <default/qj-1>

# 3rd task
I0105 21:54:11.361540    2921 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0105 21:54:11.361555    2921 allocate.go:120] Considering Task <default/qj-1-7wbdd> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0105 21:54:11.361628    2921 allocate.go:120] Considering Task <default/qj-1-7wbdd> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1890.00, memory 16715714560.00, GPU 0.00>


# 4th task
I0105 21:54:11.361703    2921 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0105 21:54:11.361722    2921 allocate.go:120] Considering Task <default/qj-1-blb4z> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0105 21:54:11.361818    2921 allocate.go:120] Considering Task <default/qj-1-blb4z> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1890.00, memory 16715714560.00, GPU 0.00>

# end
I0105 21:54:11.361894    2921 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0105 21:54:11.361910    2921 allocate.go:81] Can not find jobs for queue default.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #537

Special notes for your reviewer:
If a task is assigned, the job is added back into the job list and the entire process reruns.
If the current task is not assigned, the loop does not break out; instead, it tries to fit all of the remaining tasks.
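The change can be sketched as follows. This is a minimal illustration only, not the actual kube-batch `allocate.go`: the `Task` and `Node` types and their fields are hypothetical simplifications of the scheduler's richer `TaskInfo`/`NodeInfo` structures, and the real code works against a session and queues rather than plain slices.

```go
package main

import "fmt"

// Task and Node are hypothetical simplifications for illustration.
type Task struct {
	Name string
	CPU  int // requested millicores
}

type Node struct {
	Name string
	Idle int // idle millicores
}

// allocate tries to fit every pending task onto some node. Before this PR
// the loop stopped at the first task that could not fit; now an unplaceable
// task is skipped and the remaining pending tasks are still examined.
func allocate(tasks []Task, nodes []Node) []string {
	var bound []string
	for _, t := range tasks {
		placed := false
		for i := range nodes {
			if nodes[i].Idle >= t.CPU {
				nodes[i].Idle -= t.CPU // bind task, deduct idle resources
				bound = append(bound, t.Name+"->"+nodes[i].Name)
				placed = true
				break
			}
		}
		if !placed {
			// Key change: do not return here; keep examining the
			// rest of the pending tasks.
			continue
		}
	}
	return bound
}

func main() {
	// Mirrors the log above: two nodes, four 2000m tasks; only two fit.
	nodes := []Node{{"node-a", 3690}, {"node-b", 3890}}
	tasks := []Task{{"t1", 2000}, {"t2", 2000}, {"t3", 2000}, {"t4", 2000}}
	fmt.Println(allocate(tasks, nodes)) // [t1->node-a t2->node-b]
}
```

With the old early-exit behavior, the third task's failure to fit would have ended the loop; here tasks three and four are still considered against every node, matching the log output above.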

Release note:

Examine all pending tasks in a job

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 6, 2019
@k8s-ci-robot k8s-ci-robot requested review from jinzhejz and k82cn January 6, 2019 06:11
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 6, 2019
@k82cn
Contributor

k82cn commented Jan 7, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 7, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 7, 2019
@k82cn
Contributor

k82cn commented Jan 7, 2019

BTW, please also add e2e test for this case :)

@k8s-ci-robot k8s-ci-robot merged commit 4e350d1 into kubernetes-retired:master Jan 7, 2019
@k82cn
Contributor

k82cn commented Jan 7, 2019

/cc @jiaxuanzhou , would this be your case?

@jiaxuanzhou
Contributor

@k82cn not actually, #539 does.

k8s-ci-robot added a commit that referenced this pull request Jan 7, 2019
…ase-0.3

Automated cherry pick of #540: Examine all pending tasks in one loop
@Jeffwan
Contributor Author

Jeffwan commented Jan 7, 2019

@k82cn Sure. I will add an e2e test case for this change.

@k82cn
Contributor

k82cn commented Jan 7, 2019

Sure. I will add an e2e test case for this change.

Please refer to Task Priority e2e test on how to create tasks/pods with different resource request.

@k82cn
Contributor

k82cn commented Jan 7, 2019

@k82cn not actually, #539 does.

Great !! That's top priority in backlog :)

@k82cn k82cn added this to the v0.4 milestone Jan 26, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
Successfully merging this pull request may close these issues.

Allocate action should examine all pending tasks in a job