CH7: Small Conv Network Training Error - Conv2DCustomBackpropInputOp only supports NHWC. #80

Open
natekester opened this issue Dec 22, 2020 · 10 comments

Comments

@natekester

I'm attempting to run the small convolutional network on macOS Big Sur.

I'm not sure exactly what the issue is; it could be the versions I'm using. Any ideas what I can do to make it work?

tensorflow==2.4.0
Python 3.8.2

```
...
Epoch 1/5
Traceback (most recent call last):
File "training_small.py", line 37, in <module>
model.fit_generator(generator=generator.generate(batch_size, num_classes), epochs=epochs, steps_per_epoch=generator.get_num_samples() / batch_size, validation_data=test_generator.generate(batch_size, num_classes), validation_steps=test_generator.get_num_samples() / batch_size, callbacks=[ ModelCheckpoint('../checkpoints/small_model_epoch_{epoch}.h5')])
File "/Library/Python/3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1847, in fit_generator
return self.fit(
File "/Library/Python/3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
return graph_function._call_flat(
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
outputs = execute.execute(
File "/Library/Python/3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[node gradient_tape/sequential/conv2d_3/Conv2D/Conv2DBackpropInput (defined at training_small.py:37) ]] [Op:__inference_train_function_781]

Function call stack:
train_function
```

@macfergus
Collaborator

Hi Nate, I actually just ran into this same problem myself recently. This is an issue with the channels_first/channels_last options for laying out tensors (also known as NCHW / NHWC; see appendix A for a brief discussion). Unfortunately, TensorFlow has dropped some support for channels_first indexing, I believe since TF 2.0.

The two options are:

  1. Downgrade TensorFlow and Keras to 1.8.x and 2.2.x, respectively -- those are the versions we used while writing the book; or
  2. Search for channels_first and replace it with channels_last in the code (pretty much anywhere you create a Conv2D layer)

Either one ought to fix it -- let us know if that works for you!
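If the encoded boards are still stored in channels_first order after switching the layers to channels_last, one way to bring the data in line is to move the channel axis to the end. This is only a minimal sketch: the array name, batch size, and single feature plane are made up for illustration and are not taken from the book's code.

```python
import numpy as np

# Hypothetical batch of encoded boards in channels_first (NCHW) layout:
# 4 samples, 1 feature plane, 19x19 board.
boards_nchw = np.zeros((4, 1, 19, 19), dtype=np.float32)
boards_nchw[0, 0, 3, 5] = 1.0  # mark one point so the move can be checked

# Move the channel axis to the end to get channels_last (NHWC) layout.
boards_nhwc = np.transpose(boards_nchw, (0, 2, 3, 1))

print(boards_nhwc.shape)
```

With data in this layout, the first Conv2D layer would be given input_shape=(19, 19, 1) together with data_format='channels_last'.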

@natekester
Author

Hi Kevin, That did it! Super fast response. I appreciate it. Great content btw.

@natekester
Author

natekester commented Dec 23, 2020

Hi Kevin,

In attempting to replicate the model training, I ran the 7.3 code (with the change to channels_last) with the small network layers, and I keep getting results similar to the following:

Epoch 1/5
2672/2672 [==============================] - 94s 35ms/step - loss: 5.8858 - accuracy: 0.0041 - val_loss: 5.8737 - val_accuracy: 0.0041
Epoch 2/5
2672/2672 [==============================] - 96s 36ms/step - loss: 5.8662 - accuracy: 0.0039 - val_loss: 5.8437 - val_accuracy: 0.0041
Epoch 3/5
2672/2672 [==============================] - 113s 42ms/step - loss: 5.8413 - accuracy: 0.0039 - val_loss: 5.8327 - val_accuracy: 0.0043
Epoch 4/5
2672/2672 [==============================] - 103s 38ms/step - loss: 5.8230 - accuracy: 0.0042 - val_loss: 5.7696 - val_accuracy: 0.0039
Epoch 5/5
2672/2672 [==============================] - 103s 39ms/step - loss: 5.7610 - accuracy: 0.0046 - val_loss: 5.7244 - val_accuracy: 0.0057


I noticed that the step counts are very different: each epoch shows 2672/2672 instead of 12288/12288. Is that a random factor related to which 100 games (num_games) it selects?

How would I go about getting the accuracy seen in the book?
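For context on the step counts: the fit_generator call in the traceback sets steps_per_epoch to generator.get_num_samples() / batch_size, so the number shown in the progress bar follows directly from how many positions the sampled games contain. A rough sketch of that arithmetic follows. The batch size of 128 and both sample counts are assumptions, chosen only so that the output lines up with the two step counts mentioned above; they are not taken from either run.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


batch_size = 128  # assumed value, for illustration only
for num_samples in (342_016, 1_572_864):  # made-up sample counts
    print(num_samples, "samples ->", steps_per_epoch(num_samples, batch_size), "steps")
```

Rounding up to an integer also avoids handing Keras a fractional step count when the sample count is not a multiple of the batch size.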

@Nkonovalenko

Hi Kevin,
I would like to bump this issue. I've changed channels_first to channels_last, but with num_games=100 and epochs=5, I only get an accuracy of 0.004. Do you have any recommendations for which files to look through? I'm guessing this is due to a typo on my part, but my processor, parallel_processor, and small files are all identical to the originals.
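One quick sanity check for numbers like these is to compare them with what uniform random guessing would score. The sketch below assumes the network predicts one of the 19 x 19 = 361 board points and is trained with categorical cross-entropy; neither detail is stated in this thread, so treat both as assumptions.

```python
import math

num_classes = 19 * 19  # one class per board point (assumed output size)

chance_loss = math.log(num_classes)  # cross-entropy of a uniform prediction
chance_accuracy = 1.0 / num_classes  # expected accuracy of a uniform random guess

print(f"chance loss:     {chance_loss:.4f}")
print(f"chance accuracy: {chance_accuracy:.4f}")
```

A loss of about 5.88 sits almost exactly on this baseline, which would suggest the network has barely moved away from random guessing rather than that the data pipeline is broken.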

@macfergus
Collaborator

Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/

Hopefully that gets you unblocked!

@constant5

I am trying to get this to run on Colab with a TPU. Unfortunately, the generator in the code base is not compatible with distribution across the TPU cluster. I worked around this by loading the whole dataset with generator=False. My problem now is that the network quickly overfits. I guess increasing the number of games should help with this?

@Nkonovalenko

> Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/
>
> Hopefully that gets you unblocked!

Thank you so much, it did!

@macfergus
Collaborator

@constant5 The generator version creates large temporary files on disk, so I suspect that's why it won't work with colab (just guessing though).

As for the overfitting, more games is a good idea. I'd say around 10,000 games is the minimum to train a network that is useful for actual game play. And more is better. Not sure what the memory constraints are in colab, but you may have to modify the code to chunk it up yourself.
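A minimal sketch of what chunking could look like, assuming the features and labels are available as NumPy arrays (for example memory-mapped .npy files). The helper name iter_chunks and the tiny stand-in arrays are hypothetical and not part of the book's code.

```python
import numpy as np


def iter_chunks(features, labels, chunk_size):
    """Yield (features, labels) slices of at most chunk_size samples each."""
    assert len(features) == len(labels)
    for start in range(0, len(features), chunk_size):
        stop = start + chunk_size
        yield features[start:stop], labels[start:stop]


# Tiny stand-in arrays; real data could come from np.load(..., mmap_mode="r").
X = np.arange(10 * 4).reshape(10, 4)
y = np.arange(10)

chunks = list(iter_chunks(X, y, chunk_size=4))
print([len(chunk_x) for chunk_x, _ in chunks])
```

Each chunk could then be passed to model.fit in turn, so that only one chunk needs to be held in memory at a time.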

@constant5

This may not have been the most efficient way to do it, but after I wrote the consolidated NumPy files to disk, I rewrote the data as TFRecords:

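import os

import numpy as np
import tensorflow as tf

# mmap_mode='c' memory-maps each file copy-on-write instead of reading it into RAM up front
X_train = np.load('data/train_features.npy', mmap_mode='c')
y_train = np.load('data/train_labels.npy', mmap_mode='c')

X_test = np.load('data/test_features.npy', mmap_mode='c')
y_test = np.load('data/test_labels.npy', mmap_mode='c')


def int64_feature(value):
    """Returns an int64_list from a sequence of bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def create_example(board, move):
    """Builds a tf.train.Example from one flattened board and one flattened move.

    This helper was not shown in the original post. It is reconstructed here on
    the assumption that it uses the "go_board" and "move" keys that the parsing
    function further down reads back.
    """
    feature = {
        "go_board": int64_feature(board),
        "move": int64_feature(move),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def tf_record_save(X, y, split='train', num_samples=9920):
    """Writes NumPy arrays to TFRecord files of ~100 MB each."""
    os.makedirs('data/tf_records', exist_ok=True)
    num_tfrecords = len(X) // num_samples

    if num_tfrecords == 0:  # for arrays smaller than the default num_samples
        num_samples = len(X)
        num_tfrecords = 1

    for tfrec_num in range(num_tfrecords):
        features = X[(tfrec_num * num_samples):((tfrec_num + 1) * num_samples)]
        labels = y[(tfrec_num * num_samples):((tfrec_num + 1) * num_samples)]
        fname = f"data/tf_records/{split}_file_{tfrec_num}-{num_tfrecords}.tfrec"
        with tf.io.TFRecordWriter(fname) as writer:
            print('Writing', fname, '...')
            # Separate loop names so the X and y arguments are not overwritten,
            # which would corrupt the slices taken for every later file.
            for board, move in zip(features, labels):
                board = np.array(board).flatten().astype(int)
                move = np.array(move).flatten().astype(int)
                example = create_example(board, move)
                writer.write(example.SerializeToString())


tf_record_save(X_train, y_train, split='train')
tf_record_save(X_test, y_test, split='test')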
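One thing to watch in the writer above: it produces len(X) // num_samples files, so any samples past the last full file are left out. If every sample should be written, the file boundaries could be computed along the following lines. This is only a sketch, and shard_bounds is a hypothetical helper name.

```python
def shard_bounds(num_items, shard_size):
    """Return (start, stop) index pairs that together cover every item."""
    return [
        (start, min(start + shard_size, num_items))
        for start in range(0, num_items, shard_size)
    ]


bounds = shard_bounds(num_items=25, shard_size=10)
print(bounds)
```

The last pair is simply shorter than the others, so the final file would hold the leftover samples.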

Then I created a tfrecords data generator:

def data_input_fn(filenames, batch_size=1024):
    """Builds a shuffled, batched tf.data pipeline from a list of TFRecord files."""

    def _parse_tfrecord_fn(example):
        feature_description = {
            "go_board": tf.io.FixedLenFeature((19 * 19,), tf.int64),
            "move": tf.io.FixedLenFeature((19 * 19,), tf.int64),
        }
        return tf.io.parse_single_example(example, feature_description)

    def _prepare_sample(features):
        X = tf.reshape(features["go_board"], (19, 19, 1))  # channels_last layout
        y = tf.reshape(features["move"], (19 * 19,))
        return X, y

    AUTOTUNE = tf.data.experimental.AUTOTUNE
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
        .map(_parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
        .map(_prepare_sample, num_parallel_calls=AUTOTUNE)
        .cache()  # cache parsed samples before shuffling so every epoch is reshuffled
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )
    return dataset


# train_list and test_list are lists of TFRecord paths. They were not shown in the
# original post; these globs assume the file names produced by the writer above.
train_list = tf.io.gfile.glob('data/tf_records/train_file_*.tfrec')
test_list = tf.io.gfile.glob('data/tf_records/test_file_*.tfrec')

train_data = data_input_fn(train_list)
test_data = data_input_fn(test_list)

This works well for colab and GPU training but not for TPU because the TPU does not support local file sharding.

@Aquietzero

> Hello @Nkonovalenko, please see this writeup here: https://kferg.dev/posts/2021/deep-learning-and-the-game-of-go-training-results-from-chapter-7/
>
> Hopefully that gets you unblocked!

Hi, I read the writeup and set num_games = 1000 and epochs = 50, but I still can't get the expected accuracy. The first several epochs show slow improvement, but after about 20 epochs the loss increases until the end. It's hard to figure out what causes the problem. Some of the training logs are below.

1480/1480 [==============================] - 98s 66ms/step - loss: 5.8798 - accuracy: 0.0032 - val_loss: 5.8580 - val_accuracy: 0.0041
Epoch 2/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.8061 - accuracy: 0.0042 - val_loss: 5.7306 - val_accuracy: 0.0057
Epoch 3/50
1480/1480 [==============================] - 99s 67ms/step - loss: 5.6749 - accuracy: 0.0067 - val_loss: 5.6105 - val_accuracy: 0.0081
Epoch 4/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.5971 - accuracy: 0.0081 - val_loss: 5.5554 - val_accuracy: 0.0098
Epoch 5/50
1480/1480 [==============================] - 98s 66ms/step - loss: 5.5572 - accuracy: 0.0097 - val_loss: 5.5232 - val_accuracy: 0.0107
...
...
Epoch 46/50
1480/1480 [==============================] - 101s 68ms/step - loss: 19.7151 - accuracy: 0.0419 - val_loss: 17.9388 - val_accuracy: 0.0496
Epoch 47/50
1480/1480 [==============================] - 99s 67ms/step - loss: 21.2528 - accuracy: 0.0432 - val_loss: 16.9957 - val_accuracy: 0.0521
Epoch 48/50
1480/1480 [==============================] - 102s 69ms/step - loss: 21.7980 - accuracy: 0.0436 - val_loss: 13.3183 - val_accuracy: 0.0501
Epoch 49/50
1480/1480 [==============================] - 104s 70ms/step - loss: 21.2861 - accuracy: 0.0451 - val_loss: 15.7350 - val_accuracy: 0.0518
Epoch 50/50
1480/1480 [==============================] - 99s 67ms/step - loss: 24.7583 - accuracy: 0.0452 - val_loss: 13.1013 - val_accuracy: 0.0525

Reproducing the result isn't a blocker for reading further in the book, but I would still like to get a similar result as a checkpoint. Any hints on what to check?

@Nkonovalenko you mentioned that it worked for you. Did you reproduce the result after changing only num_games and epochs?
