extracting metric value in multiple ways #1140

sperlingxx · 2020-04-13T06:20:35Z

Currently, the Katib trial controller simply takes the best metric value among the observations. This yields the wrong objective value if the training program doesn't preserve the model (checkpoint) with the best performance and treat it as the final result. In practice, a lot of training programs simply save the model after the final step.
So it is necessary to support another strategy for extracting the objective value: taking the latest recorded value, which is the purpose of this PR.

After discussion, I found that the approach raised in #987 is better: we extract the min, max and latest metric values from the observation logs. When converting trial observations into api_pb.Observation to feed the suggestion service, the value is chosen among the min, max and latest metric according to MetricStrategies, a new field of commonv1alpha3.ObjectiveSpec.
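
For readers skimming the thread, here is a minimal sketch of the idea described above: collapse one metric's history into min, max and latest, then pick one of them by strategy. The names `Strategy`, `Summary`, `Summarize` and `Pick` are hypothetical and chosen for illustration; they are not the identifiers used in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// Strategy names which aggregate of a metric's history is reported.
// Hypothetical type for illustration, not the type defined in the PR.
type Strategy string

const (
	StrategyMin    Strategy = "min"
	StrategyMax    Strategy = "max"
	StrategyLatest Strategy = "latest"
)

// Summary holds the three aggregates extracted from one metric's log.
type Summary struct {
	Min, Max, Latest float64
}

// Summarize scans values in recorded order and returns min, max and latest.
// ok is false when there are no values to summarize.
func Summarize(values []float64) (s Summary, ok bool) {
	if len(values) == 0 {
		return Summary{}, false
	}
	s = Summary{Min: math.Inf(1), Max: math.Inf(-1)}
	for _, v := range values {
		s.Min = math.Min(s.Min, v)
		s.Max = math.Max(s.Max, v)
		s.Latest = v // the last element seen wins
	}
	return s, true
}

// Pick returns the aggregate selected by the given strategy.
func Pick(s Summary, st Strategy) (float64, error) {
	switch st {
	case StrategyMin:
		return s.Min, nil
	case StrategyMax:
		return s.Max, nil
	case StrategyLatest:
		return s.Latest, nil
	}
	return 0, fmt.Errorf("unknown strategy %q", st)
}

func main() {
	// A loss curve whose best value is not the final one.
	s, _ := Summarize([]float64{0.1234, 0.0001, 0.1111})
	for _, st := range []Strategy{StrategyMin, StrategyMax, StrategyLatest} {
		v, _ := Pick(s, st)
		fmt.Printf("%s=%v\n", st, v)
	}
}
```

With a training program that only saves the final checkpoint, the "latest" strategy reports the value that matches the saved model, while "min" would report a value the saved model never reproduces.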

sperlingxx · 2020-04-13T07:35:32Z

@andreyvelich @gaocegege @johnugeorge Could you help to review this?

andreyvelich · 2020-04-13T12:41:11Z

pkg/apis/controller/trials/v1alpha3/trial_types.go

@@ -36,10 +36,21 @@ type TrialSpec struct {
 	// Whether to retain the trial run object after completed.
 	RetainRun bool `json:"retainRun,omitempty"`

+	// Describes how objective value will be extracted from collected metrics
+	ObjectiveExtract ObjectiveExtractType `json:"objectiveExtract,omitempty"`
+


Actually, we have an issue for this: #987. Maybe instead of setting up another env variable, each run could have min, max and latest in Objective.

/cc @gaocegege

@andreyvelich The idea of extracting the min, max and latest objective in each run looks good to me. But I think we also need an additional field like ObjectiveExtractType to describe which kind of objective value is compared with the goal.
Of course, the env variable DefaultObjectiveExtract is not necessary, since the default choice can be derived from ObjectiveType.
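
A rough sketch of how such a default could be derived from the objective type, assuming minimize maps to min, maximize maps to max, and anything else falls back to latest. The type `ObjectiveType` and the functions `DefaultStrategy` and `FillStrategies` are hypothetical, and applying the same default to every metric the user left out is also an assumption made here.

```go
package main

import "fmt"

// ObjectiveType mirrors the minimize/maximize choice of an experiment.
// Hypothetical type for illustration only.
type ObjectiveType string

const (
	Minimize ObjectiveType = "minimize"
	Maximize ObjectiveType = "maximize"
)

// DefaultStrategy derives a strategy name from the objective type.
// Assumed rule: minimize -> "min", maximize -> "max", otherwise "latest".
func DefaultStrategy(t ObjectiveType) string {
	switch t {
	case Minimize:
		return "min"
	case Maximize:
		return "max"
	default:
		return "latest"
	}
}

// FillStrategies returns a new map holding one strategy per metric name.
// Entries the user set explicitly are kept; missing ones get the default.
func FillStrategies(t ObjectiveType, metricNames []string, user map[string]string) map[string]string {
	out := make(map[string]string, len(metricNames))
	for _, name := range metricNames {
		if s, ok := user[name]; ok {
			out[name] = s
			continue
		}
		out[name] = DefaultStrategy(t)
	}
	return out
}

func main() {
	names := []string{"loss", "accuracy"}
	user := map[string]string{"accuracy": "max"}
	fmt.Println(FillStrategies(Minimize, names, user))
}
```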

Then, maybe it should be part of Experiment?
What do you think @johnugeorge ?

@andreyvelich @johnugeorge @gaocegege
After a detailed review, I found that modifying common_types.Observation will cause enormous changes, including to the corresponding interface of all existing suggestion services.
Maybe we can keep suggestionapi.Observation (pb spec) unchanged and only change common_types.Observation to a style like

observation:
  metrics:
    - name: loss
      min: 0.0001
      max: 0.1234
      latest: 0.1111

sperlingxx · 2020-04-15T07:22:18Z

/retest

sperlingxx · 2020-04-15T09:14:04Z

/retest

sperlingxx · 2020-04-15T16:17:04Z

@johnugeorge @andreyvelich @gaocegege I think current PR is also ready.

andreyvelich · 2020-04-15T19:49:30Z

pkg/apis/controller/common/v1alpha3/common_types.go

@@ -51,6 +51,8 @@ type ObjectiveSpec struct {
 	// This can be empty if we only care about the objective metric.
 	// Note: If we adopt a push instead of pull mechanism, this can be omitted completely.
 	AdditionalMetricNames []string `json:"additionalMetricNames,omitempty"`
+	// This field is allowed to missing, experiment defaulter (webhook) will fill it.
+	MetricStrategies map[string]MetricStrategy `json:"metricStrategies,omitempty"`


After this change, I can see that you send Metrics to the Suggestion according to MetricStrategies. Is that the right way?
Maybe some Suggestions need the min, max and latest Observation? Do we need to be consistent between the manager (https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1alpha3/api.proto#L177) and the Trial API?
/cc @johnugeorge @gaocegege

If we modify the definition of Observation in api.proto, we need to apply corresponding modifications to all suggestion services.

Yes, I understand. So you think we don't need it in suggestions?

@andreyvelich Yes. As far as I know, the final reward of the training program is the only value the suggestion service needs. It doesn't care about the historical records of metrics.

I think it is a huge change to API. Can we place this change in the next release? Maybe v1beta1.

Agree with @gaocegege. Let's keep it for next release.

gaocegege

The PR itself LGTM.

But prefer to propose in the next release.

@sperlingxx

gaocegege · 2020-04-17T08:47:42Z

pkg/apis/controller/common/v1alpha3/common_types.go

@@ -51,6 +51,8 @@ type ObjectiveSpec struct {
 	// This can be empty if we only care about the objective metric.
 	// Note: If we adopt a push instead of pull mechanism, this can be omitted completely.
 	AdditionalMetricNames []string `json:"additionalMetricNames,omitempty"`
+	// This field is allowed to missing, experiment defaulter (webhook) will fill it.
+	MetricStrategies map[string]MetricStrategy `json:"metricStrategies,omitempty"`


I think it is a huge change to API. Can we place this change in the next release? Maybe v1beta1.

sperlingxx · 2020-04-17T09:00:52Z

> The PR itself LGTM.
>
> But prefer to propose in the next release.
>
> @sperlingxx

I'm okay with that.

sperlingxx · 2020-05-30T11:45:42Z

@gaocegege @andreyvelich I've migrated the code to v1beta1, and I merged the Metric schema modifications from the current PR and #1120. So the Metric schema is

type Metric struct {
	Name   string  `json:"name,omitempty"`
	Min    float64 `json:"min,omitempty"`
	Max    float64 `json:"max,omitempty"`
	Latest string  `json:"latest,omitempty"`
}

Latest is changed to a string type so that it can hold non-numeric values, while Min and Max are still floats. Since max/min values are meaningless when metric values are non-numeric, they are filled with math.NaN in that case.
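
To make the NaN behaviour concrete, here is a small sketch under stated assumptions: raw metric values arrive as strings in recorded order, and a single non-numeric value is enough to leave Min and Max as NaN. The helper `BuildMetric` is hypothetical, and the all-or-nothing handling of mixed values is an assumption rather than something stated in this thread.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// Metric follows the schema quoted above, without the JSON tags.
type Metric struct {
	Name   string
	Min    float64
	Max    float64
	Latest string
}

// BuildMetric is a hypothetical helper: it keeps the last raw value as
// Latest and computes Min/Max only when every value parses as a float.
func BuildMetric(name string, raw []string) Metric {
	m := Metric{Name: name, Min: math.NaN(), Max: math.NaN()}
	if len(raw) == 0 {
		return m
	}
	m.Latest = raw[len(raw)-1]

	lo, hi := math.Inf(1), math.Inf(-1)
	for _, r := range raw {
		v, err := strconv.ParseFloat(r, 64)
		if err != nil {
			// Assumption: one non-numeric value makes min/max meaningless,
			// so both stay NaN.
			return m
		}
		lo = math.Min(lo, v)
		hi = math.Max(hi, v)
	}
	m.Min, m.Max = lo, hi
	return m
}

func main() {
	fmt.Printf("%+v\n", BuildMetric("loss", []string{"0.5", "0.1", "0.3"}))
	fmt.Printf("%+v\n", BuildMetric("status", []string{"running", "done"}))
}
```

Callers that consume Min or Max would need to check them with math.IsNaN before comparing, since any ordinary comparison against NaN is false.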

gaocegege · 2020-05-30T14:16:23Z

Thanks! I love the feature.

/lgtm

/cc @johnugeorge @andreyvelich

andreyvelich · 2020-05-30T20:13:44Z

pkg/controller.v1beta1/trial/managerclient/managerclient.go

+	}
+	// fetch additional metrics if exists
+	metricLogs := reply.ObservationLog.MetricLogs
+	for _, metricName := range instance.Spec.Objective.AdditionalMetricNames {


Is it correct to add AdditionalMetricNames in the GetTrialObservationLog function?
I believe this function is needed to add the Observation to the Trial instance, right?
Currently, the Trial Observation is related only to ObjectiveMetricName.

I think so.

@andreyvelich Yes, I am trying to append additional metrics to Trial.Status.Observation, so that we can get the full metrics of each trial via the Kubernetes API. Are you concerned that this modification may confuse our users/developers?

OK, I am just worried about whether the Suggestions will work correctly.
I checked here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/internal/trial.py#L33. We add only the objective metric to the target metric, so I think it should work fine.

andreyvelich · 2020-06-01T17:07:08Z

@sperlingxx Can you submit one example with metricStrategies, please?
You can show how to specify a Strategy for various Experiment metrics.

andreyvelich · 2020-06-01T17:13:48Z

pkg/controller.v1beta1/trial/trial_controller_util.go

+			return nil, fmt.Errorf("failed to parse timestamps %s: %e", metricLog.TimeStamp, err)
+		}
+		timestamp, _ := timestamps[metricLog.Metric.Name]
+		if timestamp == nil || !timestamp.After(currentTime) {


What if we have more than one metric with the same timestamp?
Maybe we can extract the Latest metric using the last element of metricLogs for each metric name. What do you think?

I agree. And I think that, with the current implementation, Katib will use the latest element if there are multiple metricLogs with the same timestamp.
The condition !timestamp.After(currentTime) ensures that the latest metric value is updated as long as the current timestamp is not earlier than the latest timestamp recorded so far. So when they are equal, the update is still applied.

Should we change it to !currentTime.After(timestamp)?
Since timestamp <= currentTime, because currentTime is the new value from metricLog and timestamp is the historical value that we have recorded?

@sperlingxx Have you tried to test it?

Hi @andreyvelich, I missed this comment several days ago.
I think !timestamp.After(currentTime) means not ({timestamp} > {currentTime}), which is equivalent to {timestamp} <= {currentTime}.
And there is a test case on extracting the latest record when their values are the same. Here is the link.

@sperlingxx Got it. Thank you
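
The tie-breaking behaviour discussed in this thread can be checked with a small standalone sketch. It assumes RFC 3339 timestamps and uses a hypothetical `MetricLog` struct and `LatestValues` function; only the comparison `!prev.After(current)` is taken from the discussion.

```go
package main

import (
	"fmt"
	"time"
)

// MetricLog is a simplified, hypothetical stand-in for one log entry.
type MetricLog struct {
	Name      string
	Value     string
	TimeStamp string // assumed to be RFC 3339
}

// LatestValues returns, per metric name, the value with the newest timestamp.
// When two entries share a timestamp, the one appearing later in logs wins.
func LatestValues(logs []MetricLog) (map[string]string, error) {
	latest := make(map[string]string)
	seen := make(map[string]time.Time)
	for _, l := range logs {
		current, err := time.Parse(time.RFC3339Nano, l.TimeStamp)
		if err != nil {
			return nil, fmt.Errorf("failed to parse timestamp %q: %w", l.TimeStamp, err)
		}
		prev, ok := seen[l.Name]
		// !prev.After(current) means prev <= current, so an equal
		// timestamp still overwrites the stored value.
		if !ok || !prev.After(current) {
			seen[l.Name] = current
			latest[l.Name] = l.Value
		}
	}
	return latest, nil
}

func main() {
	logs := []MetricLog{
		{Name: "loss", Value: "0.3", TimeStamp: "2020-06-01T10:00:01Z"},
		{Name: "loss", Value: "0.2", TimeStamp: "2020-06-01T10:00:02Z"},
		{Name: "loss", Value: "0.4", TimeStamp: "2020-06-01T10:00:02Z"}, // tie: this one wins
		{Name: "loss", Value: "0.9", TimeStamp: "2020-06-01T10:00:00Z"}, // older: ignored
	}
	got, err := LatestValues(logs)
	fmt.Println(got, err)
}
```

Swapping the condition to `current.After(prev)` would keep the first of two equal-timestamp entries instead, which is the difference the two reviewers were working through.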

sperlingxx · 2020-06-02T05:50:59Z

> @sperlingxx Can you submit one example with metricStrategies, please?
> You can show how to specify a Strategy for various Experiment metrics.

Do you mean providing a full executable example like those in examples folder?

andreyvelich · 2020-06-02T11:52:48Z

> > @sperlingxx Can you submit one example with metricStrategies, please?
> > You can show how to specify a Strategy for various Experiment metrics.
>
> Do you mean providing a full executable example like those in examples folder?

Yes, just one simple YAML example; you can use the random algorithm.

sperlingxx · 2020-06-02T13:00:07Z

> > > @sperlingxx Can you submit one example with metricStrategies, please?
> > > You can show how to specify a Strategy for various Experiment metrics.
> >
> > Do you mean providing a full executable example like those in examples folder?
>
> Yes, just one simple YAML example; you can use the random algorithm.

Done.

gaocegege

/lgtm

Thanks for your contribution! 🎉 👍

/cc @johnugeorge @andreyvelich

andreyvelich · 2020-06-08T16:34:39Z

@sperlingxx Thank you for doing this!
Overall lgtm, only this question: #1140 (comment)

andreyvelich

/lgtm
/cc @johnugeorge

johnugeorge · 2020-06-12T05:04:09Z

Sorry for being late.
/lgtm

@sperlingxx can you rebase?

sperlingxx · 2020-06-12T09:41:10Z

> Sorry for being late.
> /lgtm
>
> @sperlingxx can you rebase?

It's Done!

andreyvelich

Thanks @sperlingxx!
/approve

k8s-ci-robot · 2020-06-12T18:12:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2020-06-12T18:32:24Z

/lgtm

* migrate this PR to v1beta1
* fix
* fix
* fix
* add example for metric strategies
* modify MetricStrategy specification
* fix

k8s-ci-robot requested review from Akado2009 and jinan-zhou April 13, 2020 06:20

k8s-ci-robot added the size/M label Apr 13, 2020

andreyvelich reviewed Apr 13, 2020

k8s-ci-robot requested a review from gaocegege April 13, 2020 12:41

k8s-ci-robot added size/L and removed size/M labels Apr 15, 2020

sperlingxx changed the title ~~support extracting latest objective value as an optional approach~~ extracting metric value in multiple ways Apr 15, 2020

andreyvelich reviewed Apr 15, 2020

k8s-ci-robot requested a review from johnugeorge April 15, 2020 19:53

gaocegege reviewed Apr 17, 2020

sperlingxx force-pushed the extract_latest_objective branch from f8ac41e to 62e09cf Compare May 30, 2020 10:38

k8s-ci-robot requested a review from andreyvelich May 30, 2020 14:16

k8s-ci-robot assigned gaocegege May 30, 2020

k8s-ci-robot added the lgtm label May 30, 2020

andreyvelich reviewed May 30, 2020

andreyvelich reviewed Jun 1, 2020

k8s-ci-robot removed the lgtm label Jun 2, 2020

k8s-ci-robot added size/XL and removed size/L labels Jun 8, 2020

gaocegege reviewed Jun 8, 2020

k8s-ci-robot requested a review from andreyvelich June 8, 2020 09:19

k8s-ci-robot added the lgtm label Jun 8, 2020

andreyvelich approved these changes Jun 9, 2020

k8s-ci-robot assigned andreyvelich Jun 9, 2020

k8s-ci-robot assigned johnugeorge Jun 12, 2020

sperlingxx added 7 commits June 12, 2020 16:31

migrate this PR to v1beta1

f328a88

fix

a45f418

fix

e86e1bb

fix

1a071d5

add example for metric strategies

aa6502c

modify MetricStrategy specification

9e24c8c

fix

108b06e

sperlingxx force-pushed the extract_latest_objective branch from dee9fec to 108b06e Compare June 12, 2020 08:32

k8s-ci-robot removed the lgtm label Jun 12, 2020

andreyvelich approved these changes Jun 12, 2020

k8s-ci-robot added the approved label Jun 12, 2020

k8s-ci-robot added the lgtm label Jun 12, 2020

k8s-ci-robot merged commit a918e08 into kubeflow:master Jun 12, 2020

sperlingxx added a commit to sperlingxx/katib that referenced this pull request Jul 9, 2020

extracting metric value in multiple ways (kubeflow#1140)

1a8759d

* migrate this PR to v1beta1
* fix
* fix
* fix
* add example for metric strategies
* modify MetricStrategy specification
* fix

andreyvelich mentioned this pull request Aug 21, 2020

[Release 1.2] Feature Planning / Roadmap kubeflow/kubeflow#5224

Closed

andreyvelich mentioned this pull request Oct 17, 2020

[Release] Katib 0.10 release for Kubeflow 1.2 #1360

Closed

12 tasks

jbottum mentioned this pull request Nov 19, 2020

Kubeflow 1.2 Blog post kubeflow/community#455

Closed
