TensorBoard logs: "Unexpected error: corrupted record at 0" #924
Comments
#920 fixed this issue (unfortunately the fix is not included in 0.7.0).

Great, that's good to know, thanks! I'll give the custom fix a try; otherwise I'll come back to this with the next release.
@karlschriek were you able to test this out to see if it fixed your problem? I think I'm having a similar problem with the TensorFlowEvent collector: I'm writing out my metric scalars via a summary writer, along the lines of the sketch below.
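A minimal sketch of the writing side, assuming TF 2.x's tf.summary API; the log directory and dummy loss values are illustrative:

import tensorflow as tf

# File writer emitting TFEvent records under the directory the
# TensorFlowEvent collector watches (path: /job in the experiment spec).
writer = tf.summary.create_file_writer("/job/logs")

with writer.as_default():
    for step in range(100):
        # Dummy value standing in for the real metric; the tag must match
        # spec.objective.objectiveMetricName ("test_loss") for Katib to
        # associate the value with the trial.
        tf.summary.scalar("test_loss", 1.0 / (step + 1), step=step)
    writer.flush()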
The collector appears to be parsing the events file correctly and finding the scalars. However, the metrics do not appear to be getting saved correctly in the experiment, as they're not appearing in the UI (values are always 0). My Experiment config is:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: gcp-demo1-tune
spec:
  parallelTrialCount: 5
  maxTrialCount: 20
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.00
    objectiveMetricName: test_loss
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /job
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning-rate
      parameterType: discrete
      feasibleSpace:
        list: ["0.0001", "0.0005", "0.001", "0.005", "0.01", "0.05", "0.1", "0.5", "1.0"]
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: "TFJob"
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          cleanPodPolicy: None
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/PROJECTID/gcp-demo1:training
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "-m"
                        - "trainer.task"
                        - "task"
                        - "--batch-size=128"
                        - "--epochs=1"
                        - "--chunk-size=5000000"
                        - "--cycle-length=8"
                        - "--job-dir=/job"
                        - "--table-id=finaltaxi_encoded_sampled_small"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
                      resources:
                        limits:
                          cpu: '4'
                          memory: '40G'
                      env:
                        - name: GOOGLE_APPLICATION_CREDENTIALS
                          value: "/etc/secrets/user-gcp-sa.json"
                      volumeMounts:
                        - name: sa
                          mountPath: "/etc/secrets"
                          readOnly: true
                  volumes:
                    - name: sa
                      secret:
                        secretName: user-gcp-sa

I can provide my TF code as well if helpful.
@eriklincoln I haven't really had a chance to try it yet. It's also unlikely that I'll get to it before January.
So I've finally had the chance to come back to this today. @hougangliu, I edited the katib-config ConfigMap to use the updated metrics collector image.
@karlschriek Hi Karl, I had the exact same issue, which got resolved after using the updated metrics collector image.
@swarajoturkar thanks for the tip, I'll try it out! Do you know if this is documented somewhere?
@karlschriek Yes, in fact. Here is some code I used to read the event files:
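A minimal sketch of that approach, assuming TF 2.x (using tf.compat.v1.train.summary_iterator); the event-file path is illustrative:

import tensorflow as tf

def read_scalars(event_file):
    """Yield (step, tag, value) for every scalar in a TFEvent file."""
    # summary_iterator walks the TFRecord-encoded events one by one; a
    # truncated file raises DataLossError ("corrupted record at ...").
    for event in tf.compat.v1.train.summary_iterator(event_file):
        for value in event.summary.value:
            if value.HasField("simple_value"):  # classic scalar summaries
                yield event.step, value.tag, value.simple_value
            elif value.HasField("tensor"):  # TF 2.x tf.summary.scalar
                yield event.step, value.tag, float(tf.make_ndarray(value.tensor))

for step, tag, val in read_scalars("/job/logs/events.out.tfevents.12345.worker-0"):
    print(step, tag, val)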
Tried using …

If I set this to …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/kind bug
What steps did you take and what happened:
I am trying to construct a simple MNIST example that makes use of Keras with TFJob and logs out metrics using the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs. The workers reach the completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields "Unexpected error: corrupted record at 0". Below follows the full code and details of what I am doing.
1. model.py

This sits within a Docker image (let's just call it my_images/keras_mnist for simplicity's sake) running TensorFlow 2.0 (it's based on tensorflow/tensorflow:latest-gpu-py3). It is based on the official tutorial for running Keras in a distributed manner, as found here:

https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o
The model code has two callbacks: the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent, and a little custom StdOutCallback, used for testing purposes only, which writes out metrics to standard out in the format acc=0.71. The latter is meant to be picked up by a collector of type StdOut.
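A minimal sketch of the two callbacks as described, assuming TF 2.x Keras; the StdOutCallback implementation, model, and paths are illustrative reconstructions rather than the original code:

import tensorflow as tf

class StdOutCallback(tf.keras.callbacks.Callback):
    """Print metrics to stdout as key=value pairs (e.g. acc=0.71),
    the format Katib's StdOut collector parses."""
    def on_epoch_end(self, epoch, logs=None):
        for name, value in (logs or {}).items():
            print(f"{name}={value:.4f}")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(
    x_train, y_train, epochs=1,
    callbacks=[
        # Vanilla TensorBoard callback: writes TFEvent files under the
        # directory the TensorFlowEvent collector watches ("/job" here).
        tf.keras.callbacks.TensorBoard(log_dir="/job/logs"),
        StdOutCallback(),
    ],
)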
2. YAML files

I use two different YAML files. The first, std_out.yaml, uses a StdOut collector and runs through without any problems. It is only really used for testing (and out of curiosity about how the different kinds of collectors work). What I would prefer is to catch the TensorBoard logs; for that I've set up tf_events.yaml, which uses a TensorFlowEvent collector.
Running kubectl apply -f tf_events.yaml results in the error being logged in the metrics-collector sidecar, as written at the top of this post.

What did you expect to happen:
The metrics-collector's logs seem to suggest that it was able to find the TensorBoard logs and that it will attempt to parse them. I would expect the parsing to work (or, at the very least, to receive a message explaining why it doesn't).

Environment:
Kubernetes version (kubectl version): 1.12