Include MetricsUnavailable condition to Complete in Trial #1877
Conversation
Force-pushed from bb7fdd1 to 657b316
Kubeflow presubmit tests have been removed from Prow due to the migration. We need to figure out how to enable tests again so that broken code doesn't slip through. We need to think about GitHub Actions.

Thanks for letting me know, @johnugeorge.

Yes. There are some plans, but the completion time is not known yet. Hence, I'm thinking about whether we need to move to GitHub Actions. Thoughts?

I see. Generally, I agree with your opinion. However, even if the AWS test-infra is rebuilt, we should not go back to it, since it would be hard for us to move back from GitHub Actions. What do you think, @johnugeorge?

The tricky issue is the amount of resources required for the Katib tests. I am still not sure whether we will be able to fit all the tests in GitHub Actions.

I see.
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format, because the trial-controller sets the Succeeded condition to False on the Trial when metrics are unavailable in the Katib DB, as described in kubeflow#1343. So we also include the MetricsUnavailable condition in Complete for the Trial.
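To illustrate the intended behavior, here is a minimal, self-contained sketch; the condition type and the isComplete helper below are simplified stand-ins, not the actual Katib API. A Trial now counts as complete when it has succeeded, failed, or ended with metrics unavailable:

package main

import "fmt"

// trialConditionType is a simplified stand-in for the Trial condition types
// defined in the Katib v1beta1 API.
type trialConditionType string

const (
	trialSucceeded          trialConditionType = "Succeeded"
	trialFailed             trialConditionType = "Failed"
	trialMetricsUnavailable trialConditionType = "MetricsUnavailable"
)

// isComplete sketches the change described above: a Trial whose metrics are
// unavailable is treated as complete, just like a succeeded or failed Trial.
func isComplete(cond trialConditionType) bool {
	switch cond {
	case trialSucceeded, trialFailed, trialMetricsUnavailable:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isComplete(trialMetricsUnavailable)) // true after this change
	fmt.Println(isComplete("Running"))               // false: the Trial is still running
}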
Force-pushed from 657b316 to 877e6e6
I have rebased.

@kubeflow/wg-automl-leads Please take a look.
@@ -192,7 +202,9 @@ func UpdateExperimentStatusCondition(collector *ExperimentsCollector, instance *
 	}

 	// First check if MaxFailedTrialCount is reached.
-	if (instance.Spec.MaxFailedTrialCount != nil) && (failedTrialsCount > *instance.Spec.MaxFailedTrialCount) {
+	if (instance.Spec.MaxFailedTrialCount != nil) &&
+		((*instance.Spec.MaxFailedTrialCount != *instance.Spec.MaxTrialCount && failedTrialsCount > *instance.Spec.MaxFailedTrialCount) ||
Can you explain how this condition differs from the current one?
@johnugeorge Thanks for your review!
Currently, if *instance.Spec.MaxFailedTrialCount is equal to *instance.Spec.MaxTrialCount and failedTrialsCount is equal to *instance.Spec.MaxFailedTrialCount, katib skips this condition and sets Completed on the Experiment status.
In other words, katib sets Completed on the Experiment status even though the number of failed Trials has reached *instance.Spec.MaxFailedTrialCount. (For example, with maxTrialCount and maxFailedTrialCount both set to 3, an Experiment whose three Trials all fail is still marked Completed.)
In this implementation, under the same conditions, katib does not skip this check and sets Failed on the Experiment status.
How does *instance.Spec.MaxTrialCount matter here? If I understand correctly, the status is not set correctly when the number of failed Trials reaches *instance.Spec.MaxFailedTrialCount. Then, isn't the condition failedTrialsCount >= *instance.Spec.MaxFailedTrialCount enough?
In the current implementation, katib creates *instance.Spec.MaxFailedTrialCount + 1 Trials if all Trials fail.
So, I implemented this part to keep that behavior.
Should I change it to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount?
Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? The Experiment should be completed when the number of failed Trials reaches MaxFailedTrialCount.
/cc @andreyvelich
Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? Experiment should be completed when failed trials reach MaxFailedTrialCount.

Yes, I agree with @johnugeorge.
However, as you say, there is an inconsistency between maxFailedTrialCount and maxTrialCount, so we need to change the definition of maxFailedTrialCount in order to change this part to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount.
@tenzen-y Please go ahead and make this change - failedTrialsCount >= *instance.Spec.MaxFailedTrialCount
@johnugeorge Sure.
Done.
To avoid setting Failed on the Experiment status when both failedTrialsCount and *instance.Spec.MaxFailedTrialCount are equal to 0, I added the condition failedTrialsCount != 0 to this part.
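For reference, here is a minimal, self-contained sketch of the check that comes out of this discussion; the experimentSpec struct below is a simplified stand-in for the Experiment spec fields, not the actual Katib type:

package main

import "fmt"

// experimentSpec is a simplified stand-in for the Experiment spec fields
// discussed in this thread.
type experimentSpec struct {
	MaxTrialCount       *int32
	MaxFailedTrialCount *int32
}

// experimentFailed mirrors the condition agreed on above: the Experiment is
// marked Failed once at least one Trial has failed and the number of failed
// Trials reaches MaxFailedTrialCount.
func experimentFailed(spec experimentSpec, failedTrialsCount int32) bool {
	return spec.MaxFailedTrialCount != nil &&
		failedTrialsCount != 0 &&
		failedTrialsCount >= *spec.MaxFailedTrialCount
}

func main() {
	max := int32(3)
	spec := experimentSpec{MaxFailedTrialCount: &max}
	fmt.Println(experimentFailed(spec, 0)) // false: no Trials have failed yet
	fmt.Println(experimentFailed(spec, 3)) // true: failed Trials reached MaxFailedTrialCount
}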
Force-pushed from ddece93 to 1a0ac41
Force-pushed from 1a0ac41 to ce2d82f
-	failedTrialsCount := instance.Status.TrialsFailed
+	completedTrialsCount := instance.Status.TrialsSucceeded + instance.Status.TrialsFailed + instance.Status.TrialsKilled + instance.Status.TrialsEarlyStopped + instance.Status.TrialMetricsUnavailable
+	failedTrialsCount := instance.Status.TrialsFailed + instance.Status.TrialMetricsUnavailable
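To make the relationship between these counters concrete, here is a small, self-contained sketch; the trialsStatus struct is a simplified stand-in for the Trial counters on the Experiment status, not the actual Katib type:

package main

import "fmt"

// trialsStatus is a simplified stand-in for the Trial counters kept on the
// Experiment status.
type trialsStatus struct {
	TrialsSucceeded         int32
	TrialsFailed            int32
	TrialsKilled            int32
	TrialsEarlyStopped      int32
	TrialMetricsUnavailable int32
}

func main() {
	s := trialsStatus{TrialsSucceeded: 4, TrialsFailed: 1, TrialMetricsUnavailable: 2}

	// Metrics-unavailable Trials are terminal, so they count as completed...
	completedTrialsCount := s.TrialsSucceeded + s.TrialsFailed + s.TrialsKilled +
		s.TrialsEarlyStopped + s.TrialMetricsUnavailable

	// ...and, from the user's point of view, as failed, because their results
	// cannot be used for hyperparameter optimization.
	failedTrialsCount := s.TrialsFailed + s.TrialMetricsUnavailable

	fmt.Println(completedTrialsCount, failedTrialsCount) // 7 3
}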
Discussion point:
Is the TrialMetricsUnavailable state a failed state for a user?
I think that the TrialMetricsUnavailable state is a failed Trial for a user, since katib cannot use that Trial's result to optimize hyperparameters when the metrics-collector container fails to collect metrics, even though the training itself succeeded.
What do you think? @johnugeorge
I think so, since it is an unrecoverable end state. We have to cover this in the docs as well to explain what a "Failed" Trial is.
That makes sense.
Looking at the katib docs, we don't have any section that describes the Experiment status, so it might be better to explain the Experiment status in the paragraph describing the Experiment.
Does that sound good to you? @johnugeorge
Sure. We will wait for a day to get other reviews as well.
/hold till review is completed

I will wait 1-2 days for comments from other reviewers.

This PR needs two doc changes
…ntroller sets Failed to Experiment status
…, we need to add condition,
Force-pushed from e08f09f to 5dbc5f3
/hold cancel

Thanks @tenzen-y
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnugeorge, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format, because the trial-controller sets the Succeeded condition to False on the Trial when metrics are unavailable in the Katib DB, as described in #1343.
So I also included the MetricsUnavailable condition in Complete for the Trial.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1343
Checklist: