Include MetricsUnavailable condition to Complete in Trial #1877
Conversation
Force-pushed from bb7fdd1 to 657b316
Kubeflow presubmit tests have been removed from Prow due to the migration. We need to figure out how to enable tests again so that broken code doesn't slip through. We need to think about GitHub Actions.

Thanks for letting me know, @johnugeorge.

Yes. There are some plans, but the completion time is not known yet. Hence, I'm thinking about whether we need to move to GitHub Actions. Thoughts?

I see. Generally, I agree with your opinion. However, even if the AWS test-infra is rebuilt, we should not go back to it, since it would be hard for us to move back from GitHub Actions. What do you think, @johnugeorge?

The tricky issue is the amount of resources required for the Katib tests. I am still not sure whether we will be able to fit all the tests in GitHub Actions.

I see.
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format, because the trial-controller sets the Succeeded condition to False on the Trial when metrics are unavailable in the Katib DB, as described in kubeflow#1343. So we also include the MetricsUnavailable condition in Complete for the Trial.
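To illustrate the intended behavior, here is a minimal, self-contained sketch; the condition type and the isComplete helper below are simplified stand-ins, not the actual Katib API. A Trial now counts as complete when it has succeeded, failed, or ended with metrics unavailable:

package main

import "fmt"

// trialConditionType is a simplified stand-in for the Trial condition types
// defined in the Katib v1beta1 API.
type trialConditionType string

const (
	trialSucceeded          trialConditionType = "Succeeded"
	trialFailed             trialConditionType = "Failed"
	trialMetricsUnavailable trialConditionType = "MetricsUnavailable"
)

// isComplete sketches the change described above: a Trial whose metrics are
// unavailable is treated as complete, just like a succeeded or failed Trial.
func isComplete(cond trialConditionType) bool {
	switch cond {
	case trialSucceeded, trialFailed, trialMetricsUnavailable:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isComplete(trialMetricsUnavailable)) // true after this change
	fmt.Println(isComplete("Running"))               // false: the Trial is still running
}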
Force-pushed from 657b316 to 877e6e6
I have rebased.

@kubeflow/wg-automl-leads Please take a look.
@@ -192,7 +202,9 @@ func UpdateExperimentStatusCondition(collector *ExperimentsCollector, instance *
 	}

 	// First check if MaxFailedTrialCount is reached.
-	if (instance.Spec.MaxFailedTrialCount != nil) && (failedTrialsCount > *instance.Spec.MaxFailedTrialCount) {
+	if (instance.Spec.MaxFailedTrialCount != nil) &&
+		((*instance.Spec.MaxFailedTrialCount != *instance.Spec.MaxTrialCount && failedTrialsCount > *instance.Spec.MaxFailedTrialCount) ||
Can you explain how this condition differs from the current one?
@johnugeorge Thanks for your review!
Currently, if *instance.Spec.MaxFailedTrialCount is equal to *instance.Spec.MaxTrialCount and failedTrialsCount is equal to *instance.Spec.MaxFailedTrialCount, katib skips this condition and sets Completed on the Experiment status.
In other words, katib sets Completed on the Experiment status even though the number of failed Trials has reached *instance.Spec.MaxFailedTrialCount. (For example, with maxTrialCount and maxFailedTrialCount both set to 3, an Experiment whose three Trials all fail is still marked Completed.)
In this implementation, under the same conditions, katib does not skip this check and sets Failed on the Experiment status.
How does *instance.Spec.MaxTrialCount matter here? If I understand correctly, the status is not set correctly when the number of failed Trials reaches *instance.Spec.MaxFailedTrialCount. Then, isn't the condition failedTrialsCount >= *instance.Spec.MaxFailedTrialCount enough?
In the current implementation, katib creates *instance.Spec.MaxFailedTrialCount + 1 Trials if all Trials fail.
So, I implemented this part to keep that behavior.
Should I change it to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount?
Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? The Experiment should be completed when the number of failed Trials reaches MaxFailedTrialCount.
/cc @andreyvelich
Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? Experiment should be completed when failed trials reach MaxFailedTrialCount.

Yes, I agree with @johnugeorge.
However, as you say, there is an inconsistency between maxFailedTrialCount and maxTrialCount, so we need to change the definition of maxFailedTrialCount in order to change this part to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount.
@tenzen-y Please go ahead and make this change - failedTrialsCount >= *instance.Spec.MaxFailedTrialCount
@johnugeorge Sure.
Done.
To avoid setting Failed on the Experiment status when both failedTrialsCount and *instance.Spec.MaxFailedTrialCount are equal to 0, I added the condition failedTrialsCount != 0 to this part.
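For reference, here is a minimal, self-contained sketch of the check that comes out of this discussion; the experimentSpec struct below is a simplified stand-in for the Experiment spec fields, not the actual Katib type:

package main

import "fmt"

// experimentSpec is a simplified stand-in for the Experiment spec fields
// discussed in this thread.
type experimentSpec struct {
	MaxTrialCount       *int32
	MaxFailedTrialCount *int32
}

// experimentFailed mirrors the condition agreed on above: the Experiment is
// marked Failed once at least one Trial has failed and the number of failed
// Trials reaches MaxFailedTrialCount.
func experimentFailed(spec experimentSpec, failedTrialsCount int32) bool {
	return spec.MaxFailedTrialCount != nil &&
		failedTrialsCount != 0 &&
		failedTrialsCount >= *spec.MaxFailedTrialCount
}

func main() {
	max := int32(3)
	spec := experimentSpec{MaxFailedTrialCount: &max}
	fmt.Println(experimentFailed(spec, 0)) // false: no Trials have failed yet
	fmt.Println(experimentFailed(spec, 3)) // true: failed Trials reached MaxFailedTrialCount
}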
Force-pushed from ddece93 to 1a0ac41
Force-pushed from 1a0ac41 to ce2d82f
-	failedTrialsCount := instance.Status.TrialsFailed
+	completedTrialsCount := instance.Status.TrialsSucceeded + instance.Status.TrialsFailed + instance.Status.TrialsKilled + instance.Status.TrialsEarlyStopped + instance.Status.TrialMetricsUnavailable
+	failedTrialsCount := instance.Status.TrialsFailed + instance.Status.TrialMetricsUnavailable
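To make the relationship between these counters concrete, here is a small, self-contained sketch; the trialsStatus struct is a simplified stand-in for the Trial counters on the Experiment status, not the actual Katib type:

package main

import "fmt"

// trialsStatus is a simplified stand-in for the Trial counters kept on the
// Experiment status.
type trialsStatus struct {
	TrialsSucceeded         int32
	TrialsFailed            int32
	TrialsKilled            int32
	TrialsEarlyStopped      int32
	TrialMetricsUnavailable int32
}

func main() {
	s := trialsStatus{TrialsSucceeded: 4, TrialsFailed: 1, TrialMetricsUnavailable: 2}

	// Metrics-unavailable Trials are terminal, so they count as completed...
	completedTrialsCount := s.TrialsSucceeded + s.TrialsFailed + s.TrialsKilled +
		s.TrialsEarlyStopped + s.TrialMetricsUnavailable

	// ...and, from the user's point of view, as failed, because their results
	// cannot be used for hyperparameter optimization.
	failedTrialsCount := s.TrialsFailed + s.TrialMetricsUnavailable

	fmt.Println(completedTrialsCount, failedTrialsCount) // 7 3
}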
Discussion point:
Is the TrialMetricsUnavailable state a failed state for a user?
I think that the TrialMetricsUnavailable state is a failed Trial for a user, since katib cannot use that Trial's result to optimize hyperparameters when the metrics-collector container fails to collect metrics, even though the training itself succeeded.
What do you think? @johnugeorge
I think so, since it is an unrecoverable end state. We have to cover this in the docs as well to explain what a "Failed" Trial is.
That makes sense.
Looking at the katib docs, we don't have any section that describes the Experiment status, so it might be better to explain the Experiment status in the paragraph describing the Experiment.
Does that sound good to you? @johnugeorge
Sure. We will wait for a day to get other reviews as well.
/hold till review is completed

I will wait 1-2 days for comments from other reviewers.

This PR needs two doc changes
…ntroller sets Failed to Experiment status
…, we need to add condition,
Force-pushed from e08f09f to 5dbc5f3
/hold cancel

Thanks @tenzen-y
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnugeorge, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format, because the trial-controller sets the Succeeded condition to False on the Trial when metrics are unavailable in the Katib DB, as described in #1343.
So I also included the MetricsUnavailable condition in Complete for the Trial.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1343
Checklist: