TensorBoard logs: "Unexpected error: corrupted record at 0" #924

Closed
karlschriek opened this issue Nov 21, 2019 · 11 comments

@karlschriek

/kind bug

What steps did you take and what happened:

I am trying to construct a simple MNIST example that uses Keras with TFJob and logs metrics via the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs.

The workers reach the Completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields the following:

/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1574336799.tfjob-example-tf-events-xxxxxxx-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-events-xxxxxxx 0 metrics will be reported.

Below is the full code, along with the details of what I am doing.

1. model.py

This sits within a Docker image (let's just call it my_images/keras_mnist for simplicity's sake) running TensorFlow 2.0 (it's based on tensorflow/tensorflow:latest-gpu-py3).

It is based on the official tutorial for running Keras in a distributed manner, as found here:
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o

import tensorflow_datasets as tfds
tfds.disable_progress_bar()
import tensorflow as tf
import argparse


FLAGS = None

BUFFER_SIZE = 10000

class StdOutCallback(tf.keras.callbacks.ProgbarLogger):
    # A simple callback that piggy-backs on the ProgbarLogger callback; it prints metrics to stdout.
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        for k in self.params['metrics']:
            if k in logs:
                print("{}={}".format(k,logs[k]))

def build_and_compile_cnn_model(dropout_rate, lr):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dropout(rate=dropout_rate),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
      metrics=['accuracy']
  )
  return model

def make_datasets_unbatched(dataset_name):
  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, info = tfds.load(name='mnist',
                            with_info=True,
                            as_supervised=True)

  return datasets[dataset_name].map(scale).cache().shuffle(BUFFER_SIZE)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_size_x', type=int, default=10)
    parser.add_argument('--input_size_y', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--dropout_rate', type=float, default=0.4)
    parser.add_argument('--log_dir', type=str, default="./tensorboard/metrics")
    parser.add_argument('--number_of_workers', type=int, default=1)
    args, unparsed = parser.parse_known_args()

    return args, unparsed

def main():
    FLAGS, unparsed = parse_arguments()
    GLOBAL_BATCH_SIZE = 512 * FLAGS.number_of_workers
    
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir, update_freq="batch")      
    std_out = StdOutCallback()

    with strategy.scope():
        train_datasets = make_datasets_unbatched("train").batch(GLOBAL_BATCH_SIZE)
        multi_worker_model = build_and_compile_cnn_model(FLAGS.dropout_rate, FLAGS.lr)
        
    multi_worker_model.fit(x=train_datasets,
                           epochs=3,
                           callbacks=[tensorboard, std_out],
                           verbose=0  # to disable progress bar
                           )

if __name__ == '__main__':
    main()

The above code has two callbacks. The first is the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent. The second is a small custom StdOutCallback, used for testing purposes only, which writes metrics to standard out in the format acc=0.71; this is meant to be picked up by a collector of kind StdOut.

2. YAML files

I use two different YAML files. The first uses a StdOut collector and runs through without any problems. It looks like this:

std_out.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-std-out
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

The above is only really used for testing (and out of curiosity about how the different kinds of collectors work). What I would prefer is to capture the TensorBoard logs. For that I've set up the following YAML:

tf_events.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-tf-event
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /tensorboard
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

Running kubectl apply -f tf_events.yaml results in the error shown at the top of this post being logged in the metrics-collector sidecar.

What did you expect to happen:

The metrics-collector's logs seem to suggest that it was able to find the TensorBoard logs and that it will attempt to parse them. I would expect the parsing to work (or, at the very least, to receive a message explaining why it doesn't).

Environment:

  • TensorFlow version: 2.0
  • Kubeflow version: 0.7.0
  • Kubernetes version: (use kubectl version): 1.12
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-afa464", GitCommit:"afa464ce9760bd08a53a207f505b133b93366ea3", GitTreeState:"clean", BuildDate:"2019-10-22T21:42:57Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
@hougangliu
Member

#920 fixed this issue (unfortunately the fix is not included in 0.7.0).
You can kubectl edit configmap katib-config -n kubeflow to use an image that includes the fix, as below:

"TensorFlowEvent": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector"
      }
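
For context, a minimal sketch of what the edited ConfigMap could look like (this assumes the v1alpha3 layout in which the metrics-collector-sidecar data key holds a JSON map from collector kind to image; check it against your installed katib-config before applying):

apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "TensorFlowEvent": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector"
      }
    }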

@karlschriek
Author

karlschriek commented Dec 2, 2019

Great, that's good to know, thanks! I'll give the custom fix a try; otherwise I'll come back to this with the next release.

@eriklincoln

eriklincoln commented Dec 3, 2019

@karlschriek were you able to test this out to see if it fixed your problem? I think I'm having a similar problem with the TensorFlowEvent collector. I'm also using TF 2.0 (Docker image tensorflow/tensorflow:nightly-py3) but with a custom training loop.

I'm writing out my metric scalars via a writer created with tf.summary.create_file_writer and writing each scalar with tf.summary.scalar. The writer gets flushed at the end of each epoch, and I confirmed that the events.out.tfevents... file is written in the trial worker container when that flush happens.
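
For reference, here is a minimal sketch of that pattern (the epoch loop and loss value are placeholders; /job/logs matches the fileSystemPath.path used in my Experiment below):

import tensorflow as tf

EPOCHS = 3
writer = tf.summary.create_file_writer("/job/logs")

for epoch in range(EPOCHS):
    test_loss = 1.0 / (epoch + 1)  # placeholder for the real per-epoch metric
    with writer.as_default():
        tf.summary.scalar("test_loss", test_loss, step=epoch)
    writer.flush()  # flush at epoch end so the events file gets written out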

However, the metrics-collector container was not working with either the tfevent-metrics-collector image for 0.7.0 or the latest image suggested by @hougangliu. The error in the metrics-collector container with tfevent-metrics-collector:latest is:

Traceback (most recent call last): 
  File "main.py", line 39, in <module>
    WaitOtherMainProcesses(completed_marked_dir=opt.dir_path)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1alpha3/common/pns.py", line 45, in WaitOtherMainProcesses 
    return WaitPIDs(GetOtherMainProcesses(), poll_interval_seconds, timeout_seconds, is_wait_all, completed_marked_dir)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1alpha3/common/pns.py", line 31, in WaitPIDs
    with open(mark_file) as file_obj:
IOError: [Errno 2] No such file or directory: '/job/7.pid'
rpc error: code = Unknown desc = Error: No such container: 53a4eb8f10203672b3bd1f83ee6d0275064ff84d770767b6d34f1bfec9360fb1(base) 

When I use tfevent-metrics-collector:v0.7.0 and have the job only save a single tfevents file (no checkpoints, variables, or the model itself), the container returns with:

/job/logs/events.out.tfevents.1575411038.gcp-demo1-tune-l92bc8xm-worker-0.6.384.v2 will be parsed. 
In gcp-demo1-tune-l92bc8xm 1 metrics will be reported.

It appears to be parsing the events file correctly and finding the scalar set in spec.objective.objectiveMetricName, because when I set objectiveMetricName to something random, it returns with ... 0 metrics will be reported.

However, the metrics do not appear to be saved correctly in the experiment, as they're not showing up in the UI (the values are always 0):

[screenshot: Katib UI showing metric values of 0]

My Experiment config is:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: gcp-demo1-tune
spec:
  parallelTrialCount: 5
  maxTrialCount: 20
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.00
    objectiveMetricName: test_loss
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /job
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning-rate
      parameterType: discrete
      feasibleSpace: 
        list: ["0.0001", "0.0005", "0.001", "0.005", "0.01", "0.05", "0.1", "0.5", "1.0"]
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: "TFJob"
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            cleanPodPolicy: None 
            tfReplicaSpecs:
              Worker:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    containers:
                      - name: tensorflow
                        image: gcr.io/PROJECTID/gcp-demo1:training
                        imagePullPolicy: Always
                        command:
                          - "python"
                          - "-m"
                          - "trainer.task"
                          - "task"
                          - "--batch-size=128"
                          - "--epochs=1"
                          - "--chunk-size=5000000"
                          - "--cycle-length=8"
                          - "--job-dir=/job"
                          - "--table-id=finaltaxi_encoded_sampled_small"
                          {{- with .HyperParameters}}
                          {{- range .}}
                          - "{{.Name}}={{.Value}}"
                          {{- end}}
                          {{- end}}
                        resources:
                          limits:
                            cpu: '4'
                            memory: '40G'
                        env:
                        - name: GOOGLE_APPLICATION_CREDENTIALS
                          value: "/etc/secrets/user-gcp-sa.json"
                        volumeMounts:
                        - name: sa
                          mountPath: "/etc/secrets"
                          readOnly: true
                    volumes:
                    - name: sa
                      secret:
                        secretName: user-gcp-sa

I can provide my TF code as well if helpful.

@karlschriek
Author

@eriklincoln I haven't really had a chance to try it yet. It's also unlikely that I'll get to it before January.

@karlschriek
Author

karlschriek commented Jan 14, 2020

So I've finally had the chance to come back to this today. @hougangliu, I edited the katib-config ConfigMap to use gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:latest, but this still results in the exact same error message:

/tensorboard/train/events.out.tfevents.1578998178.tfjob-example-tf-event-wdlgctxj-worker-0.8.520.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1578998181.tfjob-example-tf-event-wdlgctxj-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2020-01-14_10-36-21/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-event-wdlgctxj 0 metrics will be reported.

@swarajoturkar

@karlschriek Hi Karl, I had the exact same issue, which was resolved after using the tfevent-metrics-collector:latest image and setting the metric name (in tf_events.yaml) to batch_accuracy instead of accuracy.
TFEvents summaries apparently contain a prefix based on the update_freq parameter of tf.keras.callbacks.TensorBoard.
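
As an illustration, the objective section of tf_events.yaml would then use the prefixed tag names (a sketch only, assuming update_freq="batch" in the TensorBoard callback):

  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: batch_loss
    additionalMetricNames:
      - batch_accuracy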

@karlschriek
Author

@swarajoturkar thanks for the tip, I will try it out! Do you know whether this is documented somewhere? I take it epoch_{metric_name} would be equally relevant when using a per-epoch update frequency?

@swarajoturkar

swarajoturkar commented Jan 16, 2020

@karlschriek
I don't think so, since I found no mention of it while looking for a solution. I manually read an events file generated locally, which is where I found the available keys for the scalar metrics.

Yes, in fact epoch_{metric_name} was logged even though I specified "batch" in the TensorBoard callback. Since I am pretty new to TensorBoard, I am not sure whether this is the expected behaviour.

Here is some code I used to read the event files:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
# log_file is the path to a locally generated events.out.tfevents file
event_acc = EventAccumulator(log_file)
event_acc.Reload()
print(event_acc.Tags())

{'images': [], 'audio': [], 'histograms': [], 'scalars': ['batch_loss', 'batch_accuracy', 'epoch_loss', 'epoch_accuracy'], 'distributions': [], 'tensors': [], 'graph': True, 'meta_graph': True, 'run_metadata': []}
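
To pull the actual values for one of those tags, the same EventAccumulator can be queried (continuing the snippet above; each entry carries wall_time, step and value):

for event in event_acc.Scalars('batch_loss'):
    print(event.step, event.value)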

@karlschriek
Author

I tried using batch_loss this morning; it looks like I am one step closer, but still not quite there:

/tensorboard/train/events.out.tfevents.1579515747.tfjob-example-tf-event-tlcljq2x-worker-0.8.140488.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1579515749.tfjob-example-tf-event-tlcljq2x-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2020-01-20_10-22-29/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-event-tlcljq2x 30000 metrics will be reported.

Traceback (most recent call last):
  File "main.py", line 54, in <module>
    ), timeout=timeout_in_seconds)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
    self._request_serializer, self._response_deserializer)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.ExpirationError: ExpirationError(code=StatusCode.DEADLINE_EXCEEDED, details="Deadline Exceeded")

If I set this to epoch_loss instead, it at least runs through, so it looks like transmitting all 30,000 batch-level metrics takes too long.
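
For completeness, a minimal sketch of the per-epoch variant (log_dir matching the --log_dir=/tensorboard argument in the trial template; objectiveMetricName in tf_events.yaml would then need to be epoch_loss):

import tensorflow as tf

# Write per-epoch summaries only, so the collector reports a handful of epoch_* values
# instead of one value per batch.
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="/tensorboard", update_freq="epoch")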

@stale

stale bot commented Nov 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Dec 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Dec 19, 2020