Objective metrics cast as int, erroneously exceed target with random optimizer #1040

Closed
timothyjlaurent opened this issue Feb 2, 2020 · 3 comments

timothyjlaurent commented Feb 2, 2020

/kind bug

What steps did you take and what happened:
I’m running tfjob-example.yaml. The Objective is to maximize accuracy with a Goal of 0.99.

The experiment stopped prematurely with "Trial has succeeded". The Experiment records an accuracy of 1, shown as an int in the Katib UI, which exceeds the 0.99 Goal. In the UI, however, the Trial with the winning hyperparameters has a value of 0.986... So it looks like the float value is being cast to an int, which stops training prematurely.
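
To make the suspicion concrete, here is a minimal sketch of how an integer conversion could interact with a goal check. It is not Katib's code: `meets_goal` is a hypothetical helper, `0.986` stands in for the truncated "0.986..." from the UI, and treating the goal as reached with `>=` is an assumption.

```python
# Hypothetical stand-in for the truncated "0.986..." value shown in the UI.
raw_accuracy = 0.986
goal = 0.99  # Goal from the Experiment spec


def meets_goal(value, goal):
    """Hypothetical check for a 'maximize' objective (assumes >=)."""
    return value >= goal


truncated = int(raw_accuracy)   # truncation toward zero
rounded = round(raw_accuracy)   # rounding to the nearest integer

print(truncated, meets_goal(truncated, goal))        # 0 False
print(rounded, meets_goal(rounded, goal))            # 1 True
print(raw_accuracy, meets_goal(raw_accuracy, goal))  # 0.986 False
```

Under these assumptions, a reported value of 1 would be consistent with rounding rather than a truncating cast. The sketch cannot tell whether such rounding would happen in the comparison itself or only when the value is displayed.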

What did you expect to happen:

I expected the Experiment to continue until a value exceeded the Goal, and the reported accuracy to remain a float value.
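
The expected behaviour can be sketched as a simple stopping rule. This is only an illustration under stated assumptions: `run_until_goal` is a hypothetical function, the list of accuracies is made up, parallel trials are ignored, and the goal counts as reached with `>=`. Only the Goal (0.99) and Max Trial Count (12) come from the Experiment spec.

```python
def run_until_goal(accuracies, goal, max_trial_count):
    """Return (trials_run, best_value) for a 'maximize' objective.

    Stops at the first trial whose float metric reaches the goal,
    or after max_trial_count trials, whichever comes first.
    """
    best = float("-inf")
    trials_run = 0
    for value in accuracies[:max_trial_count]:
        trials_run += 1
        best = max(best, float(value))  # keep the metric as a float
        if best >= goal:
            break
    return trials_run, best


# Hypothetical trial results, invented for illustration only.
observed = [0.971, 0.986, 0.979, 0.983, 0.975, 0.991, 0.98]
print(run_until_goal(observed, goal=0.99, max_trial_count=12))  # (6, 0.991)
```

With these made-up values, a trial at 0.986 does not end the run. The loop only stops once a float value of at least 0.99 is observed.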

Anything else you would like to add:

Here is the description of the Experiment:

Name:         katib-tfjob-example-cpu
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Experiment
Metadata:
  Creation Timestamp:  2020-02-02T04:57:42Z
  Finalizers:
    update-prometheus-metrics
  Generation:        1
  Resource Version:  34076667
  Self Link:         /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/katib-tfjob-example-cpu
  UID:               8b2ee6a1-4578-11ea-a653-0a392894a425
Spec:
  Algorithm:
    Algorithm Name:        random
    Algorithm Settings:    <nil>
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Metrics Collector Spec:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:                   0.99
    Objective Metric Name:  accuracy_1
    Type:                   maximize
  Parallel Trial Count:     3
  Parameters:
    Feasible Space:
      Max:           0.05
      Min:           0.01
    Name:            --learning_rate
    Parameter Type:  double
    Feasible Space:
      Max:           200
      Min:           100
    Name:            --batch_size
    Parameter Type:  int
  Trial Template:
    Go Template:
      Raw Template:  apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: {{.Trial}}
  namespace: {{.NameSpace}}
spec:
 tfReplicaSpecs:
  Worker:
    replicas: 1
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            imagePullPolicy: Always
            command:
              - "python"
              - "/var/tf_mnist/mnist_with_summaries.py"
              - "--log_dir=/train/metrics"
              {{- with .HyperParameters}}
              {{- range .}}
              - "{{.Name}}={{.Value}}"
              {{- end}}
              {{- end}}
Status:
  Completion Time:  2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time:  2020-02-02T04:57:42Z
    Last Update Time:      2020-02-02T04:57:42Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Experiment has succeeded because Objective goal has reached
    Reason:                ExperimentSucceeded
    Status:                True
    Type:                  Succeeded
  Current Optimal Trial:
    Observation:
      Metrics:
        Name:   accuracy_1
        Value:  1
    Parameter Assignments:
      Name:          --learning_rate
      Value:         0.011044242440414936
      Name:          --batch_size
      Value:         144
  Start Time:        2020-02-02T04:57:42Z
  Trials:            5
  Trials Succeeded:  5
Events:              <none>

The winning Trial also showed a value of 1 for accuracy:

Name:         katib-tfjob-example-cpu-clql64rr
Namespace:    kubeflow
Labels:       experiment=katib-tfjob-example-cpu
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Trial
Metadata:
  Creation Timestamp:  2020-02-02T04:58:56Z
  Finalizers:
    clean-metrics-in-db
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  katib-tfjob-example-cpu
    UID:                   8b2ee6a1-4578-11ea-a653-0a392894a425
  Resource Version:        34074063
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/katib-tfjob-example-cpu-clql64rr
  UID:                     b751ea80-4578-11ea-a653-0a392894a425
Spec:
  Metrics Collector:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:                   0.99
    Objective Metric Name:  accuracy_1
    Type:                   maximize
  Parameter Assignments:
    Name:    --learning_rate
    Value:   0.011044242440414936
    Name:    --batch_size
    Value:   144
  Run Spec:  apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: katib-tfjob-example-cpu-clql64rr
  namespace: kubeflow
spec:
 tfReplicaSpecs:
  Worker:
    replicas: 1
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            imagePullPolicy: Always
            command:
              - "python"
              - "/var/tf_mnist/mnist_with_summaries.py"
              - "--log_dir=/train/metrics"
              - "--learning_rate=0.011044242440414936"
              - "--batch_size=144"
Status:
  Completion Time:  2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time:  2020-02-02T04:58:56Z
    Last Update Time:      2020-02-02T04:58:56Z
    Message:               Trial is created
    Reason:                TrialCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Trial is running
    Reason:                TrialRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-02-02T05:06:12Z
    Last Update Time:      2020-02-02T05:06:12Z
    Message:               Trial has succeeded
    Reason:                TrialSucceeded
    Status:                True
    Type:                  Succeeded
  Observation:
    Metrics:
      Name:    accuracy_1
      Value:   1
  Start Time:  2020-02-02T04:58:56Z
Events:
  Type    Reason        Age                From              Message
  ----    ------        ----               ----              -------
  Normal  JobCreated    43m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has been created
  Normal  JobSucceeded  36m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has succeeded
  Normal  JobDeleted    36m (x2 over 36m)  trial-controller  Job katib-tfjob-example-cpu-clql64rr has been deleted

Screenshot of the 'winning' Trial (image not available).

Environment:

  • Kubeflow version: 0.7.1
  • Minikube version:
  • Kubernetes version: (use kubectl version): v1.14.6
  • OS (e.g. from /etc/os-release):
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.98


@issue-label-bot issue-label-bot bot added the bug label Feb 2, 2020
@jlewi jlewi removed the bug label Feb 2, 2020
@johnugeorge (Member)

I think this is just the UI bug reported in #884.

@timothyjlaurent (Author)

OK, I'll close this, then.
