
Auto-scaling controller does not scale-down the Job #450

Closed

Yancey1989 opened this issue Oct 30, 2017 · 5 comments

@Yancey1989 (Collaborator) commented Oct 30, 2017

I submitted 10 auto-scaling jobs with min-instance=2 and max-instance=20, but job-0 has 20 trainers while job-9 has only 2.
There are many PENDING trainer pods, yet the controller logs the following:

time="2017-10-30T05:25:31Z" level=debug msg="Dry run scale job mnist7: current 0, additional 0, remaining resource: autoscaler.ClusterResource{NodeCount:133, GPURequest:0, GPULimit:0, GPUTotal:0, CPURequestMilli:1269510, CPULimitMilli:1848000, CPUTotalMilli:2348000, MemoryRequestMega:1186525, MemoryLimitMega:1186588, MemoryTotalMega:13665210}"

In the logs CPURequestMilli < CPUTotalMilli, but given all the pending trainer pods it should actually be CPURequestMilli > CPUTotalMilli.
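
For reference, with the numbers in that log line the controller sees 2348000 - 1269510 = 1078490 idle CPU millicores (only about 54% of the cluster's CPU requested), so from its point of view there is presumably no pressure to scale anything down, even though many trainer pods are stuck in PENDING.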

Yancey1989 changed the title to "Auto-scaling controller does not scale-down the Job" on Oct 30, 2017
@typhoonzero (Collaborator) commented:
This is a bug: the scale-down diff drives spec.parallelism below zero, as shown in these logs:

 time="2017-10-30T02:10:40Z" level=debug msg="Dry run scale job mnist4: current -32, additional 0, remaining resource: autoscaler.ClusterResource{NodeCount:133, GPURequest:0, GPULimit:0, GPUTotal:0, CPURequestMilli:2339510, CPULimitMilli:3933000, CPUTotalMilli:2348000, MemoryRequestMega:2105646, MemoryLimitMega:2380589, MemoryTotalMega:13665210}"
 time="2017-10-30T02:10:40Z" level=info msg="Scaling plan: map[mnist4:-32]"
 time="2017-10-30T02:10:40Z" level=info msg="Scaling job mnist4, diff: -32."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 0."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 1."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 2."
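
A minimal Go sketch of the kind of guard that would avoid the negative parallelism above; clampDiff and the example values are hypothetical, not the actual controller code:

```go
package main

import "fmt"

// clampDiff limits a scaling diff so that the resulting trainer count stays
// within [minInstance, maxInstance]; in particular it can never go negative.
// Hypothetical helper for illustration only.
func clampDiff(current, diff, minInstance, maxInstance int) int {
	target := current + diff
	if target < minInstance {
		target = minInstance
	}
	if target > maxInstance {
		target = maxInstance
	}
	return target - current
}

func main() {
	// The plan above asked for a diff of -32; with an assumed current
	// parallelism of 6, the clamped diff is -4, leaving min-instance=2
	// trainers instead of pushing spec.parallelism to a negative value.
	fmt.Println(clampDiff(6, -32, 2, 20)) // prints -4
}
```

Equivalently, the controller could clamp the final parallelism itself before updating the Job, so the Kubernetes API never receives an invalid value.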

@Yancey1989 (Collaborator, Author) commented Oct 30, 2017

Reopening this issue because the problem remains.

@typhoonzero (Collaborator) commented:
I checked the log; it seems the total CPU request the controller reports is less than the actual value in the cluster.
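
One plausible cause (an assumption, not confirmed from the controller source) is that Pending pods' requests are not being counted. A minimal sketch of summing CPU requests over all unfinished pods, with a made-up helper and toy data:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// totalCPURequestMilli sums the CPU requests (in millicores) of all pods that
// have not finished, including Pending pods. If Pending pods were skipped,
// the reported CPURequestMilli would end up well below the real demand.
func totalCPURequestMilli(pods []v1.Pod) int64 {
	var total int64
	for _, pod := range pods {
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if req, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
				total += req.MilliValue()
			}
		}
	}
	return total
}

func main() {
	// Two toy pods: one Running, one Pending. Both count toward the total.
	pods := []v1.Pod{
		{
			Status: v1.PodStatus{Phase: v1.PodRunning},
			Spec: v1.PodSpec{Containers: []v1.Container{{
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
				},
			}}},
		},
		{
			Status: v1.PodStatus{Phase: v1.PodPending},
			Spec: v1.PodSpec{Containers: []v1.Container{{
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("250m")},
				},
			}}},
		},
	}
	fmt.Printf("total CPU request: %dm\n", totalCPURequestMilli(pods)) // 750m
}
```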

@helinwang (Collaborator) commented:
I have added this PR as an improvement: #456

@helinwang (Collaborator) commented:
I think this is fixed; please reopen if it is not.
