Cannot fetch new tasks after some trainers have been scaled down #5279

Closed
Yancey1989 opened this issue Nov 1, 2017 · 2 comments

@Yancey1989
Contributor

The logs are as follows:

t=2017-11-01T09:12:39+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:12:40+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:12:42+0000 lvl=warn msg="No more available task." pendingLen=3 doneLen=3 failedLen=0 curPass=2 todoLen=0 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:13:10+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:21:25+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]

t=2017-11-01T09:27:29+0000 lvl=warn msg="Task failed, re-dispatch." task="{Meta:{ID:8674665223307489707 Epoch:3} Chunks:[{Path:/data/mnist/mnist-train-00040 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00041 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00042 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00043 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00044 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00045 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00046 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00047 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00048 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00049 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}}]}" num failed=1 stack="[github.com/PaddlePaddle/Paddle/go/master/service.go:336 github.com/PaddlePaddle/Paddle/go/master/service.go:351]"

It looks like scaling down leads to task timeouts.

@typhoonzero
Contributor

The default task timeout is 20 minutes. Scaling down causes a trainer to stop, and the task that trainer was working on waits up to 20 minutes before it is re-dispatched.
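
For reference, here is a minimal sketch of that behavior. It is not the actual master implementation in go/master/service.go; all type and function names are illustrative. A task handed to a trainer sits in a pending set with a deadline, and a periodic check moves expired tasks back to the todo queue, so a task orphaned by a scaled-down trainer only becomes available again after the timeout.

```go
// Illustrative sketch of timeout-based task re-dispatch (names are made up).
package main

import (
	"fmt"
	"sync"
	"time"
)

type task struct {
	ID       int
	Deadline time.Time
}

type taskQueues struct {
	mu      sync.Mutex
	todo    []task
	pending map[int]task
	timeout time.Duration
}

// getTask hands the next todo task to a trainer and records its deadline.
func (q *taskQueues) getTask() (task, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.todo) == 0 {
		return task{}, false // "No more available task."
	}
	t := q.todo[0]
	q.todo = q.todo[1:]
	t.Deadline = time.Now().Add(q.timeout)
	q.pending[t.ID] = t
	return t, true
}

// requeueExpired moves pending tasks whose deadline has passed back to todo,
// which is what eventually un-blocks the remaining trainers after a
// scale-down kills some of their peers.
func (q *taskQueues) requeueExpired() {
	q.mu.Lock()
	defer q.mu.Unlock()
	now := time.Now()
	for id, t := range q.pending {
		if now.After(t.Deadline) {
			delete(q.pending, id)
			q.todo = append(q.todo, t)
			fmt.Printf("task %d timed out, re-dispatch\n", id)
		}
	}
}

func main() {
	q := &taskQueues{
		todo:    []task{{ID: 1}, {ID: 2}, {ID: 3}},
		pending: map[int]task{},
		timeout: 20 * time.Minute, // the default mentioned above
	}
	// A background checker like this enforces the 20-minute wait.
	go func() {
		for range time.Tick(time.Minute) {
			q.requeueExpired()
		}
	}()
	if t, ok := q.getTask(); ok {
		fmt.Printf("dispatched task %d, deadline %s\n", t.ID, t.Deadline)
	}
}
```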

@Yancey1989
Contributor Author

Steps to reproduce in the auto-scaling experiment:

  1. Submit a fault-tolerant job with 10 trainers: ./run.sh start case1.
  2. Scale up the job to 30 trainers: kubectl scale --replicas=30 jobs/mnist0-trainer
  3. Wait for all the trainers to begin training.
  4. Scale down the job to 3 trainers: kubectl scale --replicas=3 jobs/mnist0-trainer
  5. The remaining trainers hang at https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/creator.py#L123 (see the sketch below).
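
The sketch below (illustrative names only, not the actual go/master/client.go) shows why the trainer appears to hang in step 5: when the master reports that no task is available, because every remaining task is pending on trainers that were scaled away, the client backs off and retries, so the Python reader that wraps it blocks until the master's timeout re-dispatches those tasks.

```go
// Illustrative sketch of a trainer-side client blocking on task fetch.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoMoreTask = errors.New("no more available task")

// fakeMaster stands in for the RPC call to the master; here it always
// reports that no task is available, as in the logs above.
func fakeMaster() (int, error) {
	return 0, errNoMoreTask
}

// getTaskBlocking retries until the master hands out a task or the given
// budget is exhausted; this retry loop is the blocking behavior the Python
// reader surfaces at creator.py#L123.
func getTaskBlocking(budget time.Duration) (int, error) {
	deadline := time.Now().Add(budget)
	for time.Now().Before(deadline) {
		if id, err := fakeMaster(); err == nil {
			return id, nil
		}
		time.Sleep(3 * time.Second) // back off and ask again
	}
	return 0, fmt.Errorf("gave up after %s", budget)
}

func main() {
	if _, err := getTaskBlocking(10 * time.Second); err != nil {
		fmt.Println("trainer still waiting:", err)
	}
}
```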
