Objective metrics cast as int, erroneously exceed target with random optimizer #1040

timothyjlaurent · 2020-02-02T05:58:18Z

/kind bug

What steps did you take and what happened:
I’m running the tfjob-example.yaml. The Objective is to maximize accuracy with a Goal of 0.99.

The Experiment stopped prematurely because "Trial has succeeded". The Experiment records accuracy with a value of 1, which is more than the 0.99 Goal, and it shows up as an int in the Katib UI. In the UI, however, the Trial with the winning hyperparameters has a value of 0.986... So it looks like the float value is being cast to an int, which stops the Experiment prematurely.
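
To make the suspected failure mode concrete, here is a minimal sketch in Python. It is not Katib's code: `goal_reached` is a hypothetical helper, and `0.986` stands in for the truncated value shown in the UI. It only illustrates how converting the metric to an integer before the comparison could make a maximize objective look satisfied. Note that a plain truncating cast would give 0, not 1, so the observed 1 would match rounding rather than truncation.

```python
def goal_reached(value: float, goal: float, objective_type: str) -> bool:
    """Hypothetical goal check: True when `value` meets `goal`."""
    if objective_type == "maximize":
        return value >= goal
    return value <= goal


goal = 0.99
observed = 0.986  # assumed stand-in for the "0.986..." shown in the UI

# Comparing the float directly: the goal is not reached.
reached_as_float = goal_reached(observed, goal, "maximize")

# Rounding to the nearest integer first turns 0.986 into 1,
# which is above the 0.99 goal.
reached_as_rounded = goal_reached(round(observed), goal, "maximize")

# A truncating conversion yields 0 and would never reach the goal.
truncated = int(observed)

print(reached_as_float, reached_as_rounded, truncated)
```

If the stored metric really were rounded like this, any trial with accuracy of 0.5 or higher would end the Experiment, which is consistent with it stopping after only 5 of 12 trials.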

What did you expect to happen:

I expected the Experiment to continue until a value exceeded the Goal, and the reported accuracy to remain a float value.

Anything else you would like to add:

Here is the description of the Experiment:

Name:         katib-tfjob-example-cpu
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Experiment
Metadata:
  Creation Timestamp:  2020-02-02T04:57:42Z
  Finalizers:
    update-prometheus-metrics
  Generation:        1
  Resource Version:  34076667
  Self Link:         /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/katib-tfjob-example-cpu
  UID:               8b2ee6a1-4578-11ea-a653-0a392894a425
Spec:
  Algorithm:
    Algorithm Name:        random
    Algorithm Settings:    <nil>
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Metrics Collector Spec:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:                   0.99
    Objective Metric Name:  accuracy_1
    Type:                   maximize
  Parallel Trial Count:     3
  Parameters:
    Feasible Space:
      Max:           0.05
      Min:           0.01
    Name:            --learning_rate
    Parameter Type:  double
    Feasible Space:
      Max:           200
      Min:           100
    Name:            --batch_size
    Parameter Type:  int
  Trial Template:
    Go Template:
      Raw Template:  apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: {{.Trial}}
  namespace: {{.NameSpace}}
spec:
 tfReplicaSpecs:
  Worker:
    replicas: 1
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            imagePullPolicy: Always
            command:
              - "python"
              - "/var/tf_mnist/mnist_with_summaries.py"
              - "--log_dir=/train/metrics"
              {{- with .HyperParameters}}
              {{- range .}}
              - "{{.Name}}={{.Value}}"
              {{- end}}
              {{- end}}
Status:
  Completion Time:  2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time:  2020-02-02T04:57:42Z
    Last Update Time:      2020-02-02T04:57:42Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Experiment has succeeded because Objective goal has reached
    Reason:                ExperimentSucceeded
    Status:                True
    Type:                  Succeeded
  Current Optimal Trial:
    Observation:
      Metrics:
        Name:   accuracy_1
        Value:  1
    Parameter Assignments:
      Name:          --learning_rate
      Value:         0.011044242440414936
      Name:          --batch_size
      Value:         144
  Start Time:        2020-02-02T04:57:42Z
  Trials:            5
  Trials Succeeded:  5
Events:              <none>

The winning Trial also showed a value of 1 for accuracy:

Name:         katib-tfjob-example-cpu-clql64rr
Namespace:    kubeflow
Labels:       experiment=katib-tfjob-example-cpu
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Trial
Metadata:
  Creation Timestamp:  2020-02-02T04:58:56Z
  Finalizers:
    clean-metrics-in-db
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  katib-tfjob-example-cpu
    UID:                   8b2ee6a1-4578-11ea-a653-0a392894a425
  Resource Version:        34074063
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/katib-tfjob-example-cpu-clql64rr
  UID:                     b751ea80-4578-11ea-a653-0a392894a425
Spec:
  Metrics Collector:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:                   0.99
    Objective Metric Name:  accuracy_1
    Type:                   maximize
  Parameter Assignments:
    Name:    --learning_rate
    Value:   0.011044242440414936
    Name:    --batch_size
    Value:   144
  Run Spec:  apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: katib-tfjob-example-cpu-clql64rr
  namespace: kubeflow
spec:
 tfReplicaSpecs:
  Worker:
    replicas: 1
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            imagePullPolicy: Always
            command:
              - "python"
              - "/var/tf_mnist/mnist_with_summaries.py"
              - "--log_dir=/train/metrics"
              - "--learning_rate=0.011044242440414936"
              - "--batch_size=144"
Status:
  Completion Time:  2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time:  2020-02-02T04:58:56Z
    Last Update Time:      2020-02-02T04:58:56Z
    Message:               Trial is created
    Reason:                TrialCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Trial is running
    Reason:                TrialRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Trial has succeeded
    Reason:                TrialSucceeded
    Status:                True
    Type:                  Succeeded
  Observation:
    Metrics:
      Name:    accuracy_1
      Value:   1
  Start Time:  2020-02-02T04:58:56Z
Events:
  Type    Reason        Age                From              Message
  ----    ------        ----               ----              -------
  Normal  JobCreated    43m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has been created
  Normal  JobSucceeded  36m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has succeeded
  Normal  JobDeleted    36m (x2 over 36m)  trial-controller  Job katib-tfjob-example-cpu-clql64rr has been deleted

Shot of 'winning' Trial.

Environment:

Kubeflow version: 0.7.1
Minikube version:
Kubernetes version: (use kubectl version): v1.14.6
OS (e.g. from /etc/os-release):


issue-label-bot · 2020-02-02T05:58:27Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
bug	0.98


johnugeorge · 2020-02-02T16:20:46Z

I think this is just the UI bug reported in #884.

timothyjlaurent · 2020-02-05T17:36:12Z

OK I'll close, then.

k8s-ci-robot added the kind/bug label Feb 2, 2020

issue-label-bot bot added the bug label Feb 2, 2020

jlewi removed the bug label Feb 2, 2020

timothyjlaurent closed this as completed Feb 5, 2020
