
Auto-scaling controller does not scale-down the Job #450

Closed

Yancey1989 opened this issue Oct 30, 2017 · 5 comments

@Yancey1989 (Collaborator) commented Oct 30, 2017

I submitted 10 auto-scaling jobs with min-instance=2 and max-instance=20, but job-0 has 20 trainers while job-9 has only 2.
There are many PENDING trainer pods, yet the controller logs the following:

time="2017-10-30T05:25:31Z" level=debug msg="Dry run scale job mnist7: current 0, additional 0, remaining resource: autoscaler.ClusterResource{NodeCount:133, GPURequest:0, GPULimit:0, GPUTotal:0, CPURequestMilli:1269510, CPULimitMilli:1848000, CPUTotalMilli:2348000, MemoryRequestMega:1186525, MemoryLimitMega:1186588, MemoryTotalMega:13665210}"

In the logs CPURequestMilli < CPUTotalMilli, but given all the pending trainer pods it should actually be CPURequestMilli > CPUTotalMilli.
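
For reference, with the numbers in that log line the controller sees 2348000 - 1269510 = 1078490 idle CPU millicores (only about 54% of the cluster's CPU requested), so from its point of view there is presumably no pressure to scale anything down, even though many trainer pods are stuck in PENDING.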

Yancey1989 changed the title to "Auto-scaling controller does not scale-down the Job" on Oct 30, 2017
@typhoonzero (Collaborator) commented:
This is a bug: the scale-down diff drives spec.parallelism below zero, as shown in these logs:

 time="2017-10-30T02:10:40Z" level=debug msg="Dry run scale job mnist4: current -32, additional 0, remaining resource: autoscaler.ClusterResource{NodeCount:133, GPURequest:0, GPULimit:0, GPUTotal:0, CPURequestMilli:2339510, CPULimitMilli:3933000, CPUTotalMilli:2348000, MemoryRequestMega:2105646, MemoryLimitMega:2380589, MemoryTotalMega:13665210}"
 time="2017-10-30T02:10:40Z" level=info msg="Scaling plan: map[mnist4:-32]"
 time="2017-10-30T02:10:40Z" level=info msg="Scaling job mnist4, diff: -32."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 0."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 1."
 time="2017-10-30T02:10:40Z" level=error msg="Error updating trainer job Job.batch \"mnist4-trainer\" is invalid: spec.parallelism: Invalid value: -38: must be greater than or equal to 0, retry remaining: 2."
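
A minimal Go sketch of the kind of guard that would avoid the negative parallelism above; clampDiff and the example values are hypothetical, not the actual controller code:

```go
package main

import "fmt"

// clampDiff limits a scaling diff so that the resulting trainer count stays
// within [minInstance, maxInstance]; in particular it can never go negative.
// Hypothetical helper for illustration only.
func clampDiff(current, diff, minInstance, maxInstance int) int {
	target := current + diff
	if target < minInstance {
		target = minInstance
	}
	if target > maxInstance {
		target = maxInstance
	}
	return target - current
}

func main() {
	// The plan above asked for a diff of -32; with an assumed current
	// parallelism of 6, the clamped diff is -4, leaving min-instance=2
	// trainers instead of pushing spec.parallelism to a negative value.
	fmt.Println(clampDiff(6, -32, 2, 20)) // prints -4
}
```

Equivalently, the controller could clamp the final parallelism itself before updating the Job, so the Kubernetes API never receives an invalid value.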

@Yancey1989 (Collaborator, Author) commented Oct 30, 2017

Reopening this issue because the problem remains.

@typhoonzero (Collaborator) commented:
I checked the log; it seems the total CPU request the controller reports is less than the actual value in the cluster.
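
One plausible cause (an assumption, not confirmed from the controller source) is that Pending pods' requests are not being counted. A minimal sketch of summing CPU requests over all unfinished pods, with a made-up helper and toy data:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// totalCPURequestMilli sums the CPU requests (in millicores) of all pods that
// have not finished, including Pending pods. If Pending pods were skipped,
// the reported CPURequestMilli would end up well below the real demand.
func totalCPURequestMilli(pods []v1.Pod) int64 {
	var total int64
	for _, pod := range pods {
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if req, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
				total += req.MilliValue()
			}
		}
	}
	return total
}

func main() {
	// Two toy pods: one Running, one Pending. Both count toward the total.
	pods := []v1.Pod{
		{
			Status: v1.PodStatus{Phase: v1.PodRunning},
			Spec: v1.PodSpec{Containers: []v1.Container{{
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
				},
			}}},
		},
		{
			Status: v1.PodStatus{Phase: v1.PodPending},
			Spec: v1.PodSpec{Containers: []v1.Container{{
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("250m")},
				},
			}}},
		},
	}
	fmt.Printf("total CPU request: %dm\n", totalCPURequestMilli(pods)) // 750m
}
```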

@helinwang (Collaborator) commented:
I have added this PR as an improvement: #456

@helinwang (Collaborator) commented:
I think this is fixed; please reopen if it is not.
