
Add rebatch method for Dataset #393

Merged: 6 commits into tensorflow:master on Jul 31, 2019
Conversation

@yongtang (Member) commented Jul 29, 2019

This PR adds a rebatch method for Dataset where

```
dataset.apply(rebatch(n)) = dataset.unbatch().batch(n)
```

The motivation for rebatch is that there are situations where we read the data in
big batches but then want to adjust the batch size to fit different scenarios.

This is part of #382.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
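For illustration, a minimal sketch of the equivalence above, written with stock tf.data ops only (it assumes TF 2.x eager execution; the dataset and batch sizes here are arbitrary choices, not from the PR):

```python
import tensorflow as tf

# Data arrives in "big" batches of 32 from the reader.
dataset = tf.data.Dataset.range(100).batch(32)

# The behavior dataset.apply(rebatch(5)) is meant to match, expressed with
# stock tf.data ops: flatten the big batches, then regroup into batches of 5.
rebatched = dataset.unbatch().batch(5)

for batch in rebatched.take(3):
    print(batch.numpy())  # [0 1 2 3 4], [5 6 7 8 9], [10 11 12 13 14]
```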
@yongtang (Member Author)

/cc @feihugis to take a look.

@feihugis (Member) left a comment

@yongtang This PR looks great and will improve the performance of rebatching! Just left a few minor comments here.

There is another RebatchDatasetOp in TensorFlow, which uses grappler to update the batch size, but it only allows rebatching by batch_size/num_workers for the distributed scenario. I think these two dataset ops are different, so we should rename the op to avoid potential confusion.

```
const auto& input_shapes = input_->output_shapes();
output_shapes_.reserve(input_shapes.size());
// Always set the first dim as None unless batch_mode is specified.
for (const auto& input_shape : input_shapes) {
```
Member:

Do we need to consider the case with unknown rank, like here?

Member Author:

@feihugis Done. PR updated.
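For context on the "first dim as None" comment in the hunk above, a small sketch (stock tf.data only, assuming TF 2.x) of how the static batch dimension behaves when a shorter remainder batch is possible versus when the remainder is dropped:

```python
import tensorflow as tf

# Remainder kept: the final batch of range(10).batch(4) has only 2 elements,
# so the static batch dimension is unknown (None).
print(tf.data.Dataset.range(10).batch(4).element_spec)
# TensorSpec(shape=(None,), dtype=tf.int64, name=None)

# Remainder dropped: every batch has exactly 4 elements, so the batch
# dimension can be static.
print(tf.data.Dataset.range(10).batch(4, drop_remainder=True).element_spec)
# TensorSpec(shape=(4,), dtype=tf.int64, name=None)
```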

```
int64 chunk_to_read = (current_batch_size_ - current_index_) < (dataset()->batch_size_ - chunk_read) ? (current_batch_size_ - current_index_) : (dataset()->batch_size_ - chunk_read);
for (int i = 0; i < tensors_.size(); ++i) {
  // TODO: concurrent copy?
  for (int64 r = 0; r < chunk_to_read; r++) {
```
Member:

r++ -> ++r

Member Author:

Done.

```
@@ -53,6 +54,13 @@ def test_text_input():
i += 1
assert i == len(lines)

rebatch_dataset = text_dataset.apply(core_io.rebatch(5))
```
@feihugis (Member) commented Jul 29, 2019:

More cases can be tested: new_batch_size > cur_batch_size, new_batch_size == cur_batch_size, new_batch_size < cur_batch_size.

Member Author:

Done. Additional tests added.
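A rough sketch of the three cases listed above, expressed with stock tf.data ops rather than the rebatch op itself (the batch_sizes helper below is hypothetical and just collects batch sizes; assumes TF 2.x eager execution):

```python
import tensorflow as tf

def batch_sizes(dataset):
    # Hypothetical helper: collect the leading-dimension size of every batch.
    return [int(tf.shape(batch)[0]) for batch in dataset]

base = tf.data.Dataset.range(12).batch(4)  # current batch size: 4

# new_batch_size > cur_batch_size
assert batch_sizes(base.unbatch().batch(6)) == [6, 6]
# new_batch_size == cur_batch_size
assert batch_sizes(base.unbatch().batch(4)) == [4, 4, 4]
# new_batch_size < cur_batch_size
assert batch_sizes(base.unbatch().batch(3)) == [3, 3, 3, 3]
```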


```
namespace tensorflow {

REGISTER_OP("RebatchDataset")
```
Member:

Do we need a drop_remainder input, which would align with BatchDataset?

Member Author:

The batch_mode input could take a string to specify the batch mode:

  • keep: leave the remainder as is.
  • drop: drop the remainder.
  • pad: pad the remainder.
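A sketch of what keep and drop mean in terms of stock tf.data behavior (pad has no one-line stock equivalent, since it fills the final batch up to the full batch size inside the kernel); the dataset below is an arbitrary example, not from the PR:

```python
import tensorflow as tf

# 10 scalar elements regrouped into batches of 4 leave a remainder of 2.
ds = tf.data.Dataset.range(10)

# "keep": emit the short final batch as-is -> batch sizes 4, 4, 2
for b in ds.batch(4):
    print(b.numpy())  # [0 1 2 3], [4 5 6 7], [8 9]

# "drop": discard the short final batch -> batch sizes 4, 4
for b in ds.batch(4, drop_remainder=True):
    print(b.numpy())  # [0 1 2 3], [4 5 6 7]
```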

```
}
// Finally, resize if needed
if (chunk_read > 0) {
  if (chunk_read < dataset()->batch_size_) {
```
Member:

If I understand correctly, here we assume the remainder needs to be kept. Maybe we can add a comment about that assumption here. Also, if we add a drop_remainder input, users can decide whether to keep the remainder.

Member Author:

Updated. keep, drop, and pad modes have been added.

```
}
}
if (out_tensors->size() != tensors_.size()) {
return errors::InvalidArgument("number tensors should match previous one, ", tensors_.size(), " vs. ", out_tensors->size());
```
Member:

Do we have a sanity check for C++ style? This line exceeds the 80-character limit.

Member Author:

In TensorFlow, at one point the C++ style was enforced with clang-format. The issue was that different versions of clang-format produce different styles, so it is really not easy to figure out which one is the right one. TensorFlow later dropped the C++ style check.

I think we could leave the C++ style check alone until we find a clang-format version that stabilizes.

Member:

Got it. Thanks!

@yongtang (Member Author)

@feihugis Thanks for the review. The batch_mode input takes keep, drop, and pad modes to decide what to do when a remainder surfaces.

Also, the name of the C++ class has been changed to AdjustBatchDataset.

For the Python function name, I really think "rebatch" makes plenty of sense, so I will leave it as is. If this dataset op is added to the TensorFlow core repo in the future, we could rethink the name.

@feihugis (Member) left a comment

Thanks @yongtang! LGTM. Left one minor comment.

```
errors::InvalidArgument("Batch size must be greater than zero."));

string batch_mode = "";
OP_REQUIRES_OK(ctx,
```
Member:

minor: do we need to check if the input batch_mode is valid?
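For illustration, the suggested check could look something like the following at the Python wrapper level; the wrapper shown here is hypothetical and is not the PR's actual code (the real op does its work in the C++ kernel):

```python
import tensorflow as tf

_VALID_BATCH_MODES = ("keep", "drop", "pad")

def rebatch(batch_size, batch_mode="keep"):
    # Hypothetical wrapper sketching the suggested validity checks.
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than zero")
    if batch_mode not in _VALID_BATCH_MODES:
        raise ValueError(
            "batch_mode must be one of %r, got %r"
            % (_VALID_BATCH_MODES, batch_mode))

    def _apply_fn(dataset):
        # Stand-in behavior using stock tf.data ops; "pad" is omitted here
        # and falls back to "keep" in this sketch.
        return dataset.unbatch().batch(
            batch_size, drop_remainder=(batch_mode == "drop"))

    return _apply_fn

# Usage: dataset.apply(rebatch(5, batch_mode="keep"))
```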

```
*end_of_sequence = true;
return Status::OK();
}
// otherwise "pad" means keep the size
```
Member:

Just a reminder that pad is not implemented yet.

@yongtang merged commit 06b38f8 into tensorflow:master on Jul 31, 2019
@yongtang deleted the rebatch branch on Jul 31, 2019 00:18
i-ony pushed a commit to i-ony/io that referenced this pull request on Feb 8, 2021:

* Add rebatch method for Dataset
* Add additional tests, also add batch_mode = "keep", "drop", "pad" modes
* Rename RebatchDataset to AdjustBatchDataset
* Add additional processing in case shape is unknown
* Address review comments
* Fix failed tests