Rework on ParquetDataset for easy access and better cache size in eager mode (tensorflow#384)

* Rework on ParquetDataset for easy access and better cache size in eager mode

This fix is part of the effort to improve the overall Dataset for
easy access and better cache size in eager mode.
See #382 and #366 for related discussions.

In order to be able to read a file either from a filename or from memory, this PR
adds a SizedRandomAccessFile which accepts an optional memory buffer
as the file content. This could be useful when processing compressed files or
archives, where we could simply read the uncompressed file content into memory.
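
As a rough illustration of the idea (the real SizedRandomAccessFile is a C++
class inside tensorflow-io; the Python names and signature below are invented
for this sketch):

```python
import io
import os

class SizedRandomAccessFile:
    """Sketch: random-access reads from a file on disk or an in-memory buffer."""

    def __init__(self, filename, memory=None):
        if memory is not None:
            # e.g. the decompressed content of an archive entry
            self._file = io.BytesIO(memory)
            self._size = len(memory)
        else:
            self._file = open(filename, "rb")
            self._size = os.path.getsize(filename)

    def size(self):
        # The size is known up front, whether backed by disk or memory.
        return self._size

    def read(self, offset, n):
        self._file.seek(offset)
        return self._file.read(n)
```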

The previous limitation in Dataset was that a Dataset is an iterable, so the
sequence length is unknown until graph runtime. In this PR, we provide a helper
function to read the specs of a parquet file, so the length is known up front.
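
For example, in eager mode the columns can be inspected before anything is read
(a sketch using the final names from this PR; the return value is assumed here
to be a dict of column name to tf.TensorSpec, which should be checked against
the implementation):

```python
import tensorflow_io.parquet as parquet

# Probe the file first: each column's name, dtype, and shape is known
# before any graph runs, so the sequence length is known too.
columns = parquet.list_parquet_columns("example.parquet")
for name, spec in columns.items():
    print(name, spec.dtype, spec.shape)
```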

This could also open other avenues, such as a map-style parquet file with __getitem__ and __len__.
Further, a parquet file could be read into a Tensor and processed easily (e.g., with a pandas-like API).
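
A hypothetical map-style wrapper along those lines (the ParquetColumn class and
the exact read_parquet call are illustrative only, not part of this PR):

```python
import tensorflow_io.parquet as parquet

class ParquetColumn:
    """Hypothetical map-style view over one column of a parquet file."""

    def __init__(self, filename, column):
        self._filename = filename
        self._spec = parquet.list_parquet_columns(filename)[column]

    def __len__(self):
        # The spec read up front carries the number of records.
        return int(self._spec.shape[0])

    def __getitem__(self, key):
        # Reads the column in full and slices; read_parquet can also read
        # slices directly, which would avoid the full read.
        return parquet.read_parquet(self._filename, self._spec)[key]
```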

The read_parquet_specs approach could be applied in the same way to HDF5, where
it matters even more: a single HDF5 file can hold datasets of different sizes.
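
To make that concrete, a hypothetical HDF5 analog might look like the following
(list_hdf5_datasets and read_hdf5 are invented names for this sketch; nothing
like this is added in this PR):

```python
# Hypothetical: each dataset in an HDF5 file reports its own spec, so
# differently sized datasets can be read individually.
specs = list_hdf5_datasets("example.h5")    # name -> TensorSpec
values = {name: read_hdf5("example.h5", spec)
          for name, spec in specs.items()}
```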

Summary:
1) Two basic C++ kernel ops are implemented: read_parquet_specs and read_parquet.
2) One ParquetDataset that is implemented in Python only (no C++ anymore).
3) ParquetDataset supports both eager and graph mode. In graph mode, dtype and
   shape are provided explicitly by the user; in eager mode, only the column
   name is needed (see the sketch after this list).
4) read_parquet works in eager and graph mode, and can read records either in
   full or in slices.
5) read_parquet_specs works in eager mode only (a limitation).
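
A sketch of the two modes (the graph-mode argument names below are assumptions;
check the Python implementation for the exact signature):

```python
import tensorflow as tf
import tensorflow_io.parquet as parquet

# Eager mode: column names are enough; dtype and shape are probed
# from the file via the new kernel ops.
dataset = parquet.ParquetDataset("example.parquet", ["col0", "col1"])

# Graph mode: nothing can be probed eagerly, so dtype and shape are
# supplied explicitly by the user (argument names are illustrative).
dataset = parquet.ParquetDataset(
    "example.parquet", ["col0", "col1"],
    dtype=[tf.int64, tf.float64],
    shape=[tf.TensorShape([None]), tf.TensorShape([None])])
```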

For cache batch vs. batch in tf.keras:
1) Added a hidden `capacity` argument to adjust the cache batch size.
2) The batch size passed to tf.keras is unrelated to `capacity`, but we could
   use `rebatch` to change it at the end of the pipeline, as sketched below.
3) `capacity` could be padded so that `rebatch` only cuts a slice within one
   chunk. If it is not padded to the tf.keras `batch_size`, `rebatch` will
   likely copy across chunk boundaries.
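
A sketch of the resulting pipeline; here `unbatch().batch()` stands in for
`rebatch`, whose actual form may differ:

```python
import tensorflow_io.parquet as parquet

# The dataset internally reads and caches chunks of `capacity` records;
# the tf.keras batch size is chosen independently at the end.
dataset = parquet.ParquetDataset("example.parquet", ["col0", "col1"])
dataset = dataset.unbatch().batch(32)  # stand-in for `rebatch(32)`
# If `capacity` is (padded to) a multiple of 32, each output batch is a
# slice within one cached chunk; otherwise batches copy across chunks.
# model.fit(dataset)  # batches of 32 flow into tf.keras
```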

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Fix build failures

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
yongtang authored Aug 5, 2019
1 parent 5442147 commit f05b3f0
Showing 9 changed files with 504 additions and 507 deletions.
1 change: 1 addition & 0 deletions tensorflow_io/core/BUILD
@@ -135,6 +135,7 @@ cc_binary(
         "//tensorflow_io/json:json_ops",
         "//tensorflow_io/lmdb:lmdb_ops",
         "//tensorflow_io/mnist:mnist_ops",
+        "//tensorflow_io/parquet:parquet_ops",
         "//tensorflow_io/prometheus:prometheus_ops",
         "//tensorflow_io/text:text_ops",
         "@libarchive",
10 changes: 4 additions & 6 deletions tensorflow_io/parquet/BUILD
@@ -7,18 +7,16 @@ load(
     "tf_io_copts",
 )
 
-cc_binary(
-    name = "python/ops/_parquet_ops.so",
+cc_library(
+    name = "parquet_ops",
     srcs = [
-        "kernels/parquet_input.cc",
+        "kernels/parquet_kernels.cc",
         "ops/parquet_ops.cc",
     ],
     copts = tf_io_copts(),
-    linkshared = 1,
+    linkstatic = True,
     deps = [
+        "//tensorflow_io/core:dataset_ops",
         "@arrow",
-        "@local_config_tf//:libtensorflow_framework",
-        "@local_config_tf//:tf_header_lib",
     ],
 )
6 changes: 6 additions & 0 deletions tensorflow_io/parquet/__init__.py
@@ -15,18 +15,24 @@
 """Parquet Dataset.
 @@ParquetDataset
+@@read_parquet
+@@list_parquet_columns
 """
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow_io.parquet.python.ops.parquet_ops import ParquetDataset
+from tensorflow_io.parquet.python.ops.parquet_ops import read_parquet
+from tensorflow_io.parquet.python.ops.parquet_ops import list_parquet_columns
 
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
     "ParquetDataset",
+    "read_parquet",
+    "list_parquet_columns",
 ]
 
 remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)
315 changes: 0 additions & 315 deletions tensorflow_io/parquet/kernels/parquet_input.cc

This file was deleted.
