Rework on ParquetDataset for easy access and better cache size in eager mode (tensorflow#384)

* Rework on ParquetDataset for easy access and better cache size in eager mode

This fix is part of the effort to improve the overall Dataset for
easy access and better cache size in eager mode.
See #382 and #366 for related discussions.

In order to be able to read a file either from a filename or from memory, this PR
adds a SizedRandomAccessFile which accepts an optional memory buffer
as the file content. This could be useful when processing compressed files or
archives, where we could simply read the uncompressed file content into memory.
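
As a rough illustration of the idea (the real SizedRandomAccessFile is a C++
class inside tensorflow-io; the Python names and signature below are invented
for this sketch):

```python
import io
import os

class SizedRandomAccessFile:
    """Sketch: random-access reads from a file on disk or an in-memory buffer."""

    def __init__(self, filename, memory=None):
        if memory is not None:
            # e.g. the decompressed content of an archive entry
            self._file = io.BytesIO(memory)
            self._size = len(memory)
        else:
            self._file = open(filename, "rb")
            self._size = os.path.getsize(filename)

    def size(self):
        # The size is known up front, whether backed by disk or memory.
        return self._size

    def read(self, offset, n):
        self._file.seek(offset)
        return self._file.read(n)
```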

The previous limitation in Dataset was that a Dataset is an iterable, so the
sequence length is unknown until graph runtime. In this PR, we provide a helper
function to read the specs of a parquet file, so the length is known up front.
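
For example, in eager mode the columns can be inspected before anything is read
(a sketch using the final names from this PR; the return value is assumed here
to be a dict of column name to tf.TensorSpec, which should be checked against
the implementation):

```python
import tensorflow_io.parquet as parquet

# Probe the file first: each column's name, dtype, and shape is known
# before any graph runs, so the sequence length is known too.
columns = parquet.list_parquet_columns("example.parquet")
for name, spec in columns.items():
    print(name, spec.dtype, spec.shape)
```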

This could also open other avenues, such as a map-style parquet file with __getitem__ and __len__.
Further, a parquet file could be read into a Tensor and processed easily (e.g., with a pandas-like API).
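
A hypothetical map-style wrapper along those lines (the ParquetColumn class and
the exact read_parquet call are illustrative only, not part of this PR):

```python
import tensorflow_io.parquet as parquet

class ParquetColumn:
    """Hypothetical map-style view over one column of a parquet file."""

    def __init__(self, filename, column):
        self._filename = filename
        self._spec = parquet.list_parquet_columns(filename)[column]

    def __len__(self):
        # The spec read up front carries the number of records.
        return int(self._spec.shape[0])

    def __getitem__(self, key):
        # Reads the column in full and slices; read_parquet can also read
        # slices directly, which would avoid the full read.
        return parquet.read_parquet(self._filename, self._spec)[key]
```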

The read_parquet_specs approach could be applied in the same way to HDF5, where
it matters even more: a single HDF5 file can hold datasets of different sizes.
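
To make that concrete, a hypothetical HDF5 analog might look like the following
(list_hdf5_datasets and read_hdf5 are invented names for this sketch; nothing
like this is added in this PR):

```python
# Hypothetical: each dataset in an HDF5 file reports its own spec, so
# differently sized datasets can be read individually.
specs = list_hdf5_datasets("example.h5")    # name -> TensorSpec
values = {name: read_hdf5("example.h5", spec)
          for name, spec in specs.items()}
```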

Summary:
1) Two basic C++ kernel ops are implemented: read_parquet_specs and read_parquet.
2) One ParquetDataset that is implemented in Python only (no C++ anymore).
3) ParquetDataset supports both eager and graph mode. In graph mode, dtype and
   shape are provided explicitly by the user; in eager mode, only the column
   name is needed (see the sketch after this list).
4) read_parquet works in eager and graph mode, and can read records either in
   full or in slices.
5) read_parquet_specs works in eager mode only (a limitation).
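
A sketch of the two modes (the graph-mode argument names below are assumptions;
check the Python implementation for the exact signature):

```python
import tensorflow as tf
import tensorflow_io.parquet as parquet

# Eager mode: column names are enough; dtype and shape are probed
# from the file via the new kernel ops.
dataset = parquet.ParquetDataset("example.parquet", ["col0", "col1"])

# Graph mode: nothing can be probed eagerly, so dtype and shape are
# supplied explicitly by the user (argument names are illustrative).
dataset = parquet.ParquetDataset(
    "example.parquet", ["col0", "col1"],
    dtype=[tf.int64, tf.float64],
    shape=[tf.TensorShape([None]), tf.TensorShape([None])])
```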

For cache batch vs. batch in tf.keras:
1) Added a hidden `capacity` argument to adjust the cache batch size.
2) The batch size passed to tf.keras is unrelated to `capacity`, but we could
   use `rebatch` to change it at the end of the pipeline, as sketched below.
3) `capacity` could be padded so that `rebatch` only cuts a slice within one
   chunk. If it is not padded to the tf.keras `batch_size`, `rebatch` will
   likely copy across chunk boundaries.
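
A sketch of the resulting pipeline; here `unbatch().batch()` stands in for
`rebatch`, whose actual form may differ:

```python
import tensorflow_io.parquet as parquet

# The dataset internally reads and caches chunks of `capacity` records;
# the tf.keras batch size is chosen independently at the end.
dataset = parquet.ParquetDataset("example.parquet", ["col0", "col1"])
dataset = dataset.unbatch().batch(32)  # stand-in for `rebatch(32)`
# If `capacity` is (padded to) a multiple of 32, each output batch is a
# slice within one cached chunk; otherwise batches copy across chunks.
# model.fit(dataset)  # batches of 32 flow into tf.keras
```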

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Fix build failures

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
yongtang authored Aug 5, 2019
1 parent 5442147 commit f05b3f0
Showing 9 changed files with 504 additions and 507 deletions.
1 change: 1 addition & 0 deletions tensorflow_io/core/BUILD
@@ -135,6 +135,7 @@ cc_binary(
         "//tensorflow_io/json:json_ops",
         "//tensorflow_io/lmdb:lmdb_ops",
         "//tensorflow_io/mnist:mnist_ops",
+        "//tensorflow_io/parquet:parquet_ops",
         "//tensorflow_io/prometheus:prometheus_ops",
         "//tensorflow_io/text:text_ops",
         "@libarchive",
10 changes: 4 additions & 6 deletions tensorflow_io/parquet/BUILD
@@ -7,18 +7,16 @@ load(
     "tf_io_copts",
 )
 
-cc_binary(
-    name = "python/ops/_parquet_ops.so",
+cc_library(
+    name = "parquet_ops",
     srcs = [
-        "kernels/parquet_input.cc",
+        "kernels/parquet_kernels.cc",
         "ops/parquet_ops.cc",
     ],
     copts = tf_io_copts(),
-    linkshared = 1,
+    linkstatic = True,
     deps = [
+        "//tensorflow_io/core:dataset_ops",
         "@arrow",
-        "@local_config_tf//:libtensorflow_framework",
-        "@local_config_tf//:tf_header_lib",
     ],
 )
6 changes: 6 additions & 0 deletions tensorflow_io/parquet/__init__.py
@@ -15,18 +15,24 @@
 """Parquet Dataset.
 @@ParquetDataset
+@@read_parquet
+@@list_parquet_columns
 """
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow_io.parquet.python.ops.parquet_ops import ParquetDataset
+from tensorflow_io.parquet.python.ops.parquet_ops import read_parquet
+from tensorflow_io.parquet.python.ops.parquet_ops import list_parquet_columns
 
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
     "ParquetDataset",
+    "read_parquet",
+    "list_parquet_columns",
 ]
 
 remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)
315 changes: 0 additions & 315 deletions tensorflow_io/parquet/kernels/parquet_input.cc

This file was deleted.
