What steps did you take and what happened:
I’m running tfjob-example.yaml. The Objective is to maximize accuracy with a Goal of 0.99.
The Experiment stopped prematurely: the Trial reports "Trial has succeeded", and the Experiment records accuracy with a value of 1, displayed as an int in the Katib UI, which is above the 0.99 Goal. In the UI, however, the Trial with the winning hyperparameters has a value of 0.986... So it looks like the float value is being cast to an int, which stops training prematurely.
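To make the suspicion concrete, here is a minimal sketch of how different float-to-int conversions would interact with the Goal check. It is not Katib code: `observed` is an illustrative stand-in for the truncated "0.986..." value from the UI, and the `>=` comparison is an assumption about how a maximize Goal is evaluated.

```python
GOAL = 0.99       # Objective goal from the Experiment spec
observed = 0.986  # illustrative stand-in for the "0.986..." shown in the UI

# Three ways the metric could end up being stored before the Goal check.
conversions = {
    "float, unchanged": observed,
    "int() truncation": int(observed),
    "round() to nearest": round(observed),
}

# Assumption: for a maximize objective, the Goal is reached when value >= goal.
reached = {label: value >= GOAL for label, value in conversions.items()}

for label, value in conversions.items():
    print(f"{label:20} value={value!r:6} reaches goal: {reached[label]}")
```

Only the rounded value passes the check. A plain truncating cast would give 0 rather than 1, so if a conversion is responsible, it is more likely a rounding or formatting step than a cast.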
What did you expect to happen:
The Experiment would continue until a value exceeded the Goal, and the reported accuracy would remain a float value.
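The expected behavior can be sketched as a simplified stopping rule. The function name `experiment_should_stop` and its logic are hypothetical, written only to illustrate the expectation, and are not taken from the Katib controller; the numbers come from the Experiment in this report (Goal 0.99, Max Trial Count 12, 5 Trials completed).

```python
def experiment_should_stop(best_value, goal, objective_type,
                           trials_completed, max_trial_count):
    """Hypothetical, simplified stopping rule (not Katib's actual logic)."""
    if objective_type == "maximize":
        goal_reached = best_value >= goal
    elif objective_type == "minimize":
        goal_reached = best_value <= goal
    else:
        raise ValueError(f"unknown objective type: {objective_type!r}")
    # Stop when the Goal is reached or the trial budget is used up.
    return goal_reached or trials_completed >= max_trial_count


# With the float value seen in the UI (0.986 is an illustrative stand-in),
# the Experiment should keep going after 5 of 12 Trials.
stop_with_float = experiment_should_stop(0.986, 0.99, "maximize", 5, 12)

# With the recorded value of 1, the same rule stops the Experiment early.
stop_with_recorded = experiment_should_stop(1, 0.99, "maximize", 5, 12)
```

Under this rule, the Experiment would have continued with the float value and stops only because the recorded value is 1.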
Anything else you would like to add:
Here is the description of the Experiment:
Name: katib-tfjob-example-cpu
Namespace: kubeflow
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v1alpha3
Kind: Experiment
Metadata:
  Creation Timestamp: 2020-02-02T04:57:42Z
  Finalizers:
    update-prometheus-metrics
  Generation: 1
  Resource Version: 34076667
  Self Link: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/katib-tfjob-example-cpu
  UID: 8b2ee6a1-4578-11ea-a653-0a392894a425
Spec:
  Algorithm:
    Algorithm Name: random
    Algorithm Settings: <nil>
  Max Failed Trial Count: 3
  Max Trial Count: 12
  Metrics Collector Spec:
    Collector:
      Kind: TensorFlowEvent
    Source:
      File System Path:
        Kind: Directory
        Path: /train
  Objective:
    Goal: 0.99
    Objective Metric Name: accuracy_1
    Type: maximize
  Parallel Trial Count: 3
  Parameters:
    Feasible Space:
      Max: 0.05
      Min: 0.01
    Name: --learning_rate
    Parameter Type: double
    Feasible Space:
      Max: 200
      Min: 100
    Name: --batch_size
    Parameter Type: int
  Trial Template:
    Go Template:
      Raw Template: apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: {{.Trial}}
  namespace: {{.NameSpace}}
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              imagePullPolicy: Always
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/metrics"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
Status:
  Completion Time: 2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time: 2020-02-02T04:57:42Z
    Last Update Time: 2020-02-02T04:57:42Z
    Message: Experiment is created
    Reason: ExperimentCreated
    Status: True
    Type: Created
    Last Transition Time: 2020-02-02T05:06:12Z
    Last Update Time: 2020-02-02T05:06:12Z
    Message: Experiment is running
    Reason: ExperimentRunning
    Status: False
    Type: Running
    Last Transition Time: 2020-02-02T05:06:12Z
    Last Update Time: 2020-02-02T05:06:12Z
    Message: Experiment has succeeded because Objective goal has reached
    Reason: ExperimentSucceeded
    Status: True
    Type: Succeeded
  Current Optimal Trial:
    Observation:
      Metrics:
        Name: accuracy_1
        Value: 1
    Parameter Assignments:
      Name: --learning_rate
      Value: 0.011044242440414936
      Name: --batch_size
      Value: 144
  Start Time: 2020-02-02T04:57:42Z
  Trials: 5
  Trials Succeeded: 5
Events: <none>
The winning Trial also showed a value of 1 for accuracy:
Name: katib-tfjob-example-cpu-clql64rr
Namespace: kubeflow
Labels: experiment=katib-tfjob-example-cpu
Annotations: <none>
API Version: kubeflow.org/v1alpha3
Kind: Trial
Metadata:
  Creation Timestamp: 2020-02-02T04:58:56Z
  Finalizers:
    clean-metrics-in-db
  Generation: 1
  Owner References:
    API Version: kubeflow.org/v1alpha3
    Block Owner Deletion: true
    Controller: true
    Kind: Experiment
    Name: katib-tfjob-example-cpu
    UID: 8b2ee6a1-4578-11ea-a653-0a392894a425
  Resource Version: 34074063
  Self Link: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/katib-tfjob-example-cpu-clql64rr
  UID: b751ea80-4578-11ea-a653-0a392894a425
Spec:
  Metrics Collector:
    Collector:
      Kind: TensorFlowEvent
    Source:
      File System Path:
        Kind: Directory
        Path: /train
  Objective:
    Goal: 0.99
    Objective Metric Name: accuracy_1
    Type: maximize
  Parameter Assignments:
    Name: --learning_rate
    Value: 0.011044242440414936
    Name: --batch_size
    Value: 144
  Run Spec: apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: katib-tfjob-example-cpu-clql64rr
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              imagePullPolicy: Always
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/metrics"
                - "--learning_rate=0.011044242440414936"
                - "--batch_size=144"
Status:
  Completion Time: 2020-02-02T05:06:12Z
  Conditions:
    Last Transition Time: 2020-02-02T04:58:56Z
    Last Update Time: 2020-02-02T04:58:56Z
    Message: Trial is created
    Reason: TrialCreated
    Status: True
    Type: Created
    Last Transition Time: 2020-02-02T05:06:12Z
    Last Update Time: 2020-02-02T05:06:12Z
    Message: Trial is running
    Reason: TrialRunning
    Status: False
    Type: Running
    Last Transition Time: 2020-02-02T05:06:12Z
    Last Update Time: 2020-02-02T05:06:12Z
    Message: Trial has succeeded
    Reason: TrialSucceeded
    Status: True
    Type: Succeeded
  Observation:
    Metrics:
      Name: accuracy_1
      Value: 1
  Start Time: 2020-02-02T04:58:56Z
Events:
  Type    Reason        Age                From              Message
  ----    ------        ----               ----              -------
  Normal  JobCreated    43m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has been created
  Normal  JobSucceeded  36m                trial-controller  Job katib-tfjob-example-cpu-clql64rr has succeeded
  Normal  JobDeleted    36m (x2 over 36m)  trial-controller  Job katib-tfjob-example-cpu-clql64rr has been deleted
/kind bug
[Screenshot: 'winning' Trial]
Environment:
kubectl version: v1.14.6
/etc/os-release: