-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sync up worker status all the time #299
Conversation
/assign @YujiOshima @richardsliu |
} | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation looks misaligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it works well (I had verified it). else
match with line 347 (not 348)
} | ||
} | ||
} else { | ||
// for some reason, metricsCollector for this worker cannot be found (deleted by anyone accidentally or even failed to be created) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the job finishes early but metric collector is not spawned yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, metrics including objectiveMetrics will not be collected, studyJob will have no idea which group of hyperParameter have better performance. That is, studyJob will study nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a worker job changes to "Running", metric collector cronjob is spawned rightly. That is, when a worker job changes to "Complete", but its corresponding metric collector cronjob cannot be found. In this case, I think it is deleted by anyone accidentally or even failed to be created.
Thanks for your explanations. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: richardsliu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Now, when metricsCollector for a worker failed to be created, the worker job status in studyjob.status.trials will not be updated (blocked in Running), and the corresponding job will not be deleted even completed.
Fixes: #298
This change is