Bug Or Feature - Experiment enters failed state but trials left running #855

Closed
jlewi opened this issue Oct 4, 2019 · 5 comments

@jlewi
Contributor

jlewi commented Oct 4, 2019

/kind bug

What steps did you take and what happened:
I'm not sure if this is a bug or a feature.

I had an experiment enter a failed state because too many trials failed.

All of the existing trials that were in the running state continued to run.

I can see advantages and disadvantages to not terminating existing trials when an experiment fails.

@johnugeorge
Member

Currently, the behavior is the same as you described. Do you have any particular use case?

@jlewi
Contributor Author

jlewi commented Oct 8, 2019

I don't have a specific use case to warrant one behavior or the other.

Training jobs can run for a long time, so if they have been running for 10 hours you don't necessarily want to just abort them; you might want to let them run to completion.

I guess if the user wants to kill them they can always do so explicitly, whereas if they are killed automatically, resuming the jobs may not be possible.

@gaocegege
Member

@johnugeorge Did we fix it in #861?

@johnugeorge
Member

Yes. Currently, we let existing running trials continue and complete even if the experiment has entered a terminal state.
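
For readers who want the behavior spelled out, here is a toy model in Python. It is only a sketch and not the controller's actual code: the `Experiment` and `Trial` classes, the `max_failed_trials` threshold, and the `reconcile` method are all hypothetical names made up for illustration. It shows an experiment moving to a failed state once too many trials have failed, while the trials that are still running are left untouched.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


@dataclass
class Trial:
    name: str
    status: Status = Status.RUNNING


@dataclass
class Experiment:
    # Hypothetical threshold: the experiment fails once more than this many trials fail.
    max_failed_trials: int
    trials: list = field(default_factory=list)
    status: Status = Status.RUNNING

    def reconcile(self):
        """Update the experiment status from its trials without modifying any trial."""
        failed = sum(t.status is Status.FAILED for t in self.trials)
        if failed > self.max_failed_trials:
            self.status = Status.FAILED
        # Running trials are deliberately left alone here, so they can
        # continue and complete even though the experiment is terminal.

    def running_trials(self):
        """Names of trials that are still running."""
        return [t.name for t in self.trials if t.status is Status.RUNNING]


def demo():
    exp = Experiment(max_failed_trials=1)
    exp.trials = [
        Trial("t1", Status.FAILED),
        Trial("t2", Status.FAILED),
        Trial("t3"),
        Trial("t4"),
    ]
    exp.reconcile()
    return exp.status, exp.running_trials()
```

In this sketch, calling `demo()` marks the experiment as failed while `t3` and `t4` stay in the running state. Stopping them would be a separate, explicit step taken by the user.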

@johnugeorge
Member

Closing the issue as it works as expected.
