Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extracting metric value in multiple ways #1140

Merged
merged 7 commits into from
Jun 12, 2020

Conversation

sperlingxx
Copy link
Member

@sperlingxx sperlingxx commented Apr 13, 2020

Currently, katib trial controller simply takes the best metric value among obversations. This approach will get the wrong objective value if training program doesn't preserve model (checkpoint) with best performance and regard it as final result. Actually, a lot of training programs simply save the model after final step.
So, it is necessary to support an another objective value extracting strategy: extracting the latest recorded value, which is the purpose of this PR.


After discussion, I found that the approach raised in #987 is better, in which we extract all min, max and latest metric value from observation logs. When converting trial observations into api_pb.Observation for feeding suggestion service, value will be chosen among min, max and latest metric according to MetricStrategies, a new field of commonv1alpha3.ObjectiveSpec.

@kubeflow-bot
Copy link

This change is Reviewable

@sperlingxx
Copy link
Member Author

sperlingxx commented Apr 13, 2020

@andreyvelich @gaocegege @johnugeorge Could you help to review this?

@@ -36,10 +36,21 @@ type TrialSpec struct {
// Whether to retain the trial run object after completed.
RetainRun bool `json:"retainRun,omitempty"`

// Describes how objective value will be extracted from collected metrics
ObjectiveExtract ObjectiveExtractType `json:"objectiveExtract,omitempty"`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we have an issue for this #987. Maybe instead of set up another env variable, in each run we can have min, max and latest in Objective.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich The idea extracting min, max and latest objective in each run looks good to me. But I think we also need an additional field like ObjectiveExtractType to describe which kind of objective chosen to be compared with goal.
Of course env variable DefaultObjectiveExtract is not necessary since the default choice can be figured out by ObjectiveType.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, maybe it should be part of Experiment?
What do you think @johnugeorge ?

Copy link
Member Author

@sperlingxx sperlingxx Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich @johnugeorge @gaocegege
After a detailed review, I found the modification to common_types.Observation will cause enormous changes including corresponding interface of all existing suggestion services.
Maybe we can keep suggestionapi.Observation (pb spec) unchanged ? Only change common_types.Observation to the style like

observation:
 metrics:
   - name: loss
     min: 0.0001
     max: 0.1234
     latest: 0.1111

@sperlingxx
Copy link
Member Author

/retest

@sperlingxx sperlingxx changed the title support extracting latest objective value as an optional approach extracting metric value in multiple ways Apr 15, 2020
@sperlingxx
Copy link
Member Author

/retest

@sperlingxx
Copy link
Member Author

@johnugeorge @andreyvelich @gaocegege I think current PR is also ready.

@@ -51,6 +51,8 @@ type ObjectiveSpec struct {
// This can be empty if we only care about the objective metric.
// Note: If we adopt a push instead of pull mechanism, this can be omitted completely.
AdditionalMetricNames []string `json:"additionalMetricNames,omitempty"`
// This field is allowed to missing, experiment defaulter (webhook) will fill it.
MetricStrategies map[string]MetricStrategy `json:"metricStrategies,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, I can see that you send Metrics to Suggestion according to MetricStrategies. Is it right way?
Maybe some Suggestions need min, max and latest Observation? Do we need to be consistent between manager: https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1alpha3/api.proto#L177 and Trial API?
/cc @johnugeorge @gaocegege

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we modify definition of observation in api.proto, we need to apply corresponding modifications on all suggestion services.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand it. You think we don't need it in suggestions?

Copy link
Member Author

@sperlingxx sperlingxx Apr 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Yes. For suggestion service, as far as I know, the final reward of training program is the only necessary value. It doesn't care about historical records of metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a huge change to API. Can we place this change in the next release? Maybe v1beta1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @gaocegege. Let's keep it for next release.

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR itself LGTM.

But prefer to propose in the next release.

@sperlingxx

@@ -51,6 +51,8 @@ type ObjectiveSpec struct {
// This can be empty if we only care about the objective metric.
// Note: If we adopt a push instead of pull mechanism, this can be omitted completely.
AdditionalMetricNames []string `json:"additionalMetricNames,omitempty"`
// This field is allowed to missing, experiment defaulter (webhook) will fill it.
MetricStrategies map[string]MetricStrategy `json:"metricStrategies,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a huge change to API. Can we place this change in the next release? Maybe v1beta1.

@sperlingxx
Copy link
Member Author

The PR itself LGTM.

But prefer to propose in the next release.

@sperlingxx

I'm okay with that.

@sperlingxx
Copy link
Member Author

sperlingxx commented May 30, 2020

@gaocegege @andreyvelich I've migrated codes to v1beta1. And I merged modifications of Metric schema in current PR and #1120 . So, the Metric schema is

type Metric struct {
	Name   string  `json:"name,omitempty"`
	Min    float64 `json:"min,omitempty"`
	Max    float64 `json:"max,omitempty"`
	Latest string  `json:"latest,omitempty"`
}

Latest is changed to string-type in order to contain non-numeric values, while Min and Max are still float-type. Considering that max/min values become nonsense when metric values are non-numeric, they will be filled with math.NaN.

@gaocegege
Copy link
Member

Thanks! I love the feature.

/lgtm

/cc @johnugeorge @andreyvelich

}
// fetch additional metrics if exists
metricLogs := reply.ObservationLog.MetricLogs
for _, metricName := range instance.Spec.Objective.AdditionalMetricNames {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that correct to add AdditionalMetricNames in GetTrialObservationLog function?
I believe this function is needed to add Observation to Trial instance, right?
Currently, Trial Observation is related only to ObjectiveMetricName.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

Copy link
Member Author

@sperlingxx sperlingxx May 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Yes, I am trying to append addtional metrics into Trial.Status.Observation. So, we can get full metrics of each trial via kubernetes API. Are you considering this modification may confuse our users/developers ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I am just worried if the Suggestions will work correct.
I checked here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/internal/trial.py#L33, we add only objective metric to target metric. So, I think it should work fine.

@andreyvelich
Copy link
Member

@sperlingxx Can you submit one example with metricsStrategies, please?
You can show how to specify Strategy for various Experiment metrics.

return nil, fmt.Errorf("failed to parse timestamps %s: %e", metricLog.TimeStamp, err)
}
timestamp, _ := timestamps[metricLog.Metric.Name]
if timestamp == nil || !timestamp.After(currentTime) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have more than one metric with the same timestamp?
Maybe we can extract Latest metric using latest element from the metricLogs for each metric name, what do you think?

Copy link
Member Author

@sperlingxx sperlingxx Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. And I think, with current implementation, katib will use latest element if there exists multiple metricLogs with same timestamp.
The condition !timestamp.After(currentTime) ensure latest metric value will update as long as current timestamp is not earlier than historically latest timestamp. So, when they are equal, the updation will still be applied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change it to !currentTime.After(timestamp) ?
Since timestamp <= currentTime, because currentTime is the new value from metricLog and timestamp historical value that we have recorded ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sperlingxx Have you tried to test it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andreyvelich , I missed this comment sereval days ago.
I think !timestamp.After(currentTime) represents not ({timestamp} > {currentTime}), which equals to {timestamp} <= {currentTime}.
And there is a test case on extracting latest records if their values are same. Here is the link.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sperlingxx Got it. Thank you

@sperlingxx
Copy link
Member Author

@sperlingxx Can you submit one example with metricsStrategies, please?
You can show how to specify Strategy for various Experiment metrics.

Do you mean providing a full executable example like those in examples folder?

@andreyvelich
Copy link
Member

@sperlingxx Can you submit one example with metricsStrategies, please?
You can show how to specify Strategy for various Experiment metrics.

Do you mean providing a full executable example like those in examples folder?

Yes, just a simple one yaml example, you can use random algorithm.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 2, 2020
@sperlingxx
Copy link
Member Author

@sperlingxx Can you submit one example with metricsStrategies, please?
You can show how to specify Strategy for various Experiment metrics.

Do you mean providing a full executable example like those in examples folder?

Yes, just a simple one yaml example, you can use random algorithm.

Done.

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks for your contribution! 🎉 👍

/cc @johnugeorge @andreyvelich

@andreyvelich
Copy link
Member

@sperlingxx Thank you for doing this!
Overall lgtm, only this question: #1140 (comment)

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/cc @johnugeorge

@johnugeorge
Copy link
Member

Sorry for being late.
/lgtm

@sperlingxx can you rebase?

@sperlingxx
Copy link
Member Author

Sorry for being late.
/lgtm

@sperlingxx can you rebase?

It's Done!

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sperlingxx!
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit a918e08 into kubeflow:master Jun 12, 2020
sperlingxx added a commit to sperlingxx/katib that referenced this pull request Jul 9, 2020
* migrate this PR to v1beta1

* fix

* fix

* fix

* add example for metric strategies

* modify MetricStrategy specification

* fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants