Include MetricsUnavailable condition to Complete in Trial #1877

Merged

Conversation

tenzen-y
Member

@tenzen-y tenzen-y commented May 28, 2022

What this PR does / why we need it:
It is hard for users to find out why a Trial failed when the training code outputs logs in an incorrect format, because the trial-controller sets the Succeeded condition to False on the Trial if metrics are unavailable in the Katib DB, as described in #1343.
So this PR also treats the MetricsUnavailable condition as a Complete state for a Trial.
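
As a rough illustration of the idea, the sketch below treats MetricsUnavailable as one more terminal condition when deciding whether a Trial is complete. The types and the `isTrialCompleted` helper are simplified, hypothetical stand-ins and not Katib's actual API.

```go
package main

import "fmt"

// TrialConditionType is a hypothetical, simplified stand-in for a Trial condition type.
type TrialConditionType string

const (
	TrialRunning            TrialConditionType = "Running"
	TrialSucceeded          TrialConditionType = "Succeeded"
	TrialFailed             TrialConditionType = "Failed"
	TrialKilled             TrialConditionType = "Killed"
	TrialEarlyStopped       TrialConditionType = "EarlyStopped"
	TrialMetricsUnavailable TrialConditionType = "MetricsUnavailable"
)

// TrialCondition pairs a condition type with whether it currently holds.
type TrialCondition struct {
	Type   TrialConditionType
	Status bool
}

// isTrialCompleted reports whether any terminal condition is true.
// MetricsUnavailable is listed among the terminal conditions, which is
// the behavior this PR description asks for.
func isTrialCompleted(conds []TrialCondition) bool {
	for _, c := range conds {
		if !c.Status {
			continue
		}
		switch c.Type {
		case TrialSucceeded, TrialFailed, TrialKilled, TrialEarlyStopped, TrialMetricsUnavailable:
			return true
		}
	}
	return false
}

func main() {
	// Succeeded is False, but MetricsUnavailable is True, so the Trial counts as complete.
	conds := []TrialCondition{
		{Type: TrialSucceeded, Status: false},
		{Type: TrialMetricsUnavailable, Status: true},
	}
	fmt.Println(isTrialCompleted(conds)) // prints true
}
```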

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1343

Checklist:

  • Docs included if any changes are user facing

@coveralls

coveralls commented May 28, 2022

Coverage Status

Coverage decreased (-0.03%) to 73.796% when pulling 5dbc5f3 on tenzen-y:deal-metrics-unavilable-as-completed into c9001d8 on kubeflow:master.

@tenzen-y tenzen-y force-pushed the deal-metrics-unavilable-as-completed branch from bb7fdd1 to 657b316 Compare May 28, 2022 21:32
@johnugeorge
Member

Kubeflow-presubmit-tests have been removed from prow due to migration. We need to figure out how to enable tests again so that code is not broken.

kubeflow/testing#1003

We need to think about moving to GitHub Actions.

@tenzen-y
Member Author

tenzen-y commented May 30, 2022

Thanks for letting me know, @johnugeorge.
As commented in that PR, it seems that there are some plans for a new kubeflow test-infra from the Kubeflow AWS team.
Have those plans been dropped?

@johnugeorge
Member

Yes. There are some plans, but the completion time is not known yet. Hence, I am wondering whether we need to move to GitHub Actions.

Thoughts?

@tenzen-y
Member Author

> Yes. There are some plans, but the completion time is not known yet. Hence, I am wondering whether we need to move to GitHub Actions.
>
> Thoughts?

I see. Generally, I agree with your opinion.

However, once we have moved to GitHub Actions, we should not go back even if the AWS test-infra is rebuilt, since migrating back from GitHub Actions would be hard for us.

What do you think, @johnugeorge?

@johnugeorge
Member

The tricky issue is the amount of resources required for Katib tests. I am still not sure whether we will be able to fit all tests in GitHub Actions.

@tenzen-y
Member Author

> The tricky issue is the amount of resources required for Katib tests. I am still not sure whether we will be able to fit all tests in GitHub Actions.

I see.
First of all, I will create an issue to track migrating to GitHub Actions.

It is hard for users to find out why a Trial failed when the training code outputs logs in an incorrect format,
because the trial-controller sets the Succeeded condition to False on the Trial if metrics are unavailable in the Katib DB, as described in kubeflow#1343.
So we also treat the MetricsUnavailable condition as a Complete state for a Trial.
@tenzen-y tenzen-y force-pushed the deal-metrics-unavilable-as-completed branch from 657b316 to 877e6e6 Compare June 6, 2022 08:05
@tenzen-y
Member Author

tenzen-y commented Jun 6, 2022

I have rebased.

@tenzen-y
Member Author

tenzen-y commented Jun 6, 2022

@kubeflow/wg-automl-leads Please take a look.

@@ -192,7 +202,9 @@ func UpdateExperimentStatusCondition(collector *ExperimentsCollector, instance *
 	}

 	// First check if MaxFailedTrialCount is reached.
-	if (instance.Spec.MaxFailedTrialCount != nil) && (failedTrialsCount > *instance.Spec.MaxFailedTrialCount) {
+	if (instance.Spec.MaxFailedTrialCount != nil) &&
+		((*instance.Spec.MaxFailedTrialCount != *instance.Spec.MaxTrialCount && failedTrialsCount > *instance.Spec.MaxFailedTrialCount) ||
Member

Can you explain how this condition differs from the current one?

Member Author

@tenzen-y tenzen-y Jun 6, 2022

@johnugeorge Thanks for your review!
Currently, if *instance.Spec.MaxFailedTrialCount is equal to *instance.Spec.MaxTrialCount and failedTrialsCount is equal to *instance.Spec.MaxFailedTrialCount, Katib skips this condition and sets the Experiment status to Completed.

In other words, Katib sets the Experiment status to Completed even though the number of failed Trials has reached *instance.Spec.MaxFailedTrialCount.

With this implementation, under the same circumstances, Katib no longer skips this condition and sets the Experiment status to Failed.
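
The boundary case described above can be reproduced with a small sketch. `experimentStatusOld` is a hypothetical, heavily simplified model of the pre-PR decision (strict `>` on failed Trials, then a check against the maximum Trial count); it is an assumption for illustration and not the controller's real code path.

```go
package main

import "fmt"

// experimentStatusOld is a hypothetical model of the original logic:
// the Experiment fails only when failed Trials strictly exceed maxFailed,
// and otherwise completes once completed Trials reach maxTrials.
func experimentStatusOld(failed, completed int32, maxFailed, maxTrials *int32) string {
	if maxFailed != nil && failed > *maxFailed {
		return "Failed"
	}
	if maxTrials != nil && completed >= *maxTrials {
		return "Completed"
	}
	return "Running"
}

func main() {
	maxFailed, maxTrials := int32(3), int32(3)
	// All three Trials failed, so failed == completed == 3.
	// 3 > 3 is false, so the strict check is skipped and the result is "Completed".
	fmt.Println(experimentStatusOld(3, 3, &maxFailed, &maxTrials)) // prints Completed
}
```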

Member

Why does *instance.Spec.MaxTrialCount matter here? If I understand correctly, the status is not set correctly when the number of failed Trials reaches *instance.Spec.MaxFailedTrialCount. In that case, isn't the condition failedTrialsCount >= *instance.Spec.MaxFailedTrialCount enough?

Member Author

@tenzen-y tenzen-y Jun 6, 2022

In the current implementation, Katib creates *instance.Spec.MaxFailedTrialCount + 1 Trials if all Trials fail.
So I implemented this part to preserve that behavior.
Should I change it to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount?

Member

@johnugeorge johnugeorge Jun 6, 2022

Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? Experiment should be completed when failed trials reach MaxFailedTrialCount.
/cc @andreyvelich

Member Author

> Isn't failedTrialsCount >= *instance.Spec.MaxFailedTrialCount the right one? Experiment should be completed when failed trials reach MaxFailedTrialCount.

Yes, I agree with @johnugeorge.
However, as you say, there is an inconsistency between maxFailedTrialCount and maxTrialCount, so we need to change the definition of maxFailedTrialCount in order to switch this part to failedTrialsCount >= *instance.Spec.MaxFailedTrialCount.

Member

@tenzen-y Please go ahead and make this change - failedTrialsCount >= *instance.Spec.MaxFailedTrialCount

Member Author

@johnugeorge Sure.

Member Author

Done.

Member Author

To avoid setting the Experiment status to Failed when failedTrialsCount and *instance.Spec.MaxFailedTrialCount are both equal to 0, I added the condition failedTrialsCount != 0 to this part.
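
Putting the thread's conclusions together, the agreed check can be sketched as follows. This combines the `>=` comparison requested by the reviewer with the `failedTrialsCount != 0` guard mentioned above; the helper name `reachedMaxFailed` is hypothetical, and the merged code may be structured differently.

```go
package main

import "fmt"

// reachedMaxFailed is a hypothetical helper modelling the agreed condition:
// the limit is set, at least one Trial has failed, and the number of failed
// Trials has reached (not only exceeded) the limit.
func reachedMaxFailed(failedTrialsCount int32, maxFailedTrialCount *int32) bool {
	return maxFailedTrialCount != nil &&
		failedTrialsCount != 0 &&
		failedTrialsCount >= *maxFailedTrialCount
}

func main() {
	zero, three := int32(0), int32(3)

	fmt.Println(reachedMaxFailed(0, &zero))  // false: nothing has failed yet
	fmt.Println(reachedMaxFailed(3, &three)) // true: the limit is reached
	fmt.Println(reachedMaxFailed(2, &three)) // false: still below the limit
	fmt.Println(reachedMaxFailed(5, nil))    // false: no limit configured
}
```

Without the non-zero guard, an Experiment with a limit of 0 would satisfy `0 >= 0` before any Trial had run.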

@tenzen-y tenzen-y force-pushed the deal-metrics-unavilable-as-completed branch from ddece93 to 1a0ac41 Compare June 6, 2022 12:50
@tenzen-y tenzen-y force-pushed the deal-metrics-unavilable-as-completed branch from 1a0ac41 to ce2d82f Compare June 6, 2022 12:52
-	failedTrialsCount := instance.Status.TrialsFailed
+	completedTrialsCount :=
+		instance.Status.TrialsSucceeded + instance.Status.TrialsFailed + instance.Status.TrialsKilled + instance.Status.TrialsEarlyStopped + instance.Status.TrialMetricsUnavailable
+	failedTrialsCount := instance.Status.TrialsFailed + instance.Status.TrialMetricsUnavailable
Member

@johnugeorge johnugeorge Jun 6, 2022

Discussion point:
Is the TrialMetricsUnavailable state a failed state for a user?

Member Author

I think the TrialMetricsUnavailable state is a failed Trial from the user's point of view: even if the training itself succeeds, Katib cannot use that Trial's result to optimize hyperparameters when the metrics-collector container fails to collect metrics.

What do you think? @johnugeorge
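
The counting under discussion can be sketched in isolation. The `ExperimentStatus` struct below is a trimmed, hypothetical copy that keeps only the counters appearing in the diff, and `trialCounts` is an illustrative helper rather than a function from the Katib code base.

```go
package main

import "fmt"

// ExperimentStatus is a trimmed, hypothetical stand-in holding only the
// per-state Trial counters referenced in the diff.
type ExperimentStatus struct {
	TrialsSucceeded         int32
	TrialsFailed            int32
	TrialsKilled            int32
	TrialsEarlyStopped      int32
	TrialMetricsUnavailable int32
}

// trialCounts returns the completed and failed Trial counts, counting
// metrics-unavailable Trials as both completed and failed.
func trialCounts(s ExperimentStatus) (completed, failed int32) {
	completed = s.TrialsSucceeded + s.TrialsFailed + s.TrialsKilled +
		s.TrialsEarlyStopped + s.TrialMetricsUnavailable
	failed = s.TrialsFailed + s.TrialMetricsUnavailable
	return completed, failed
}

func main() {
	s := ExperimentStatus{TrialsSucceeded: 5, TrialsFailed: 1, TrialMetricsUnavailable: 2}
	completed, failed := trialCounts(s)
	fmt.Println(completed, failed) // prints 8 3
}
```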

Member

@johnugeorge johnugeorge Jun 7, 2022

I think so, since it is an unrecoverable end state. We have to include this in the docs as well, to explain what a "Failed" Trial is.

Member Author

That makes sense.
As far as I can see, the Katib docs have no section describing the Experiment status, so it might be better to explain the Experiment status in this paragraph describing Experiment.

Does that sound good to you? @johnugeorge

Member

Sure. We will wait for a day to get other reviews as well.

@johnugeorge
Member

/hold till review is completed

@tenzen-y
Member Author

tenzen-y commented Jun 7, 2022

I will wait 1-2 days for comments from other reviewers.

@johnugeorge
Member

This PR needs two doc changes:

  1. Update the definition of maxFailedTrialCount
  2. Add a definition of 'Failed Trial'

@tenzen-y tenzen-y force-pushed the deal-metrics-unavilable-as-completed branch from e08f09f to 5dbc5f3 Compare June 7, 2022 14:37
@johnugeorge
Member

/hold cancel

@johnugeorge
Member

Thanks @tenzen-y

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

@google-oss-prow google-oss-prow bot merged commit ab2f596 into kubeflow:master Jun 8, 2022
@tenzen-y tenzen-y deleted the deal-metrics-unavilable-as-completed branch June 8, 2022 04:32
Successfully merging this pull request may close these issues.

Metrics unavailable Trial condition
3 participants