Discuss Batch Standards in TFIO with Keras #382
To add additional information, tf.data actually has a
I think we only need to focus on the single-filename case (multiple files could be processed at a higher level with concatenate). After the Datasets have been concatenated, batch mode could be reapplied, as in the sketch below:
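A minimal sketch of that idea, under assumptions: `make_dataset` is a hypothetical per-file constructor, and each per-file dataset is assumed to emit chunk-sized elements; `concatenate`, `unbatch`, and `batch` are standard tf.data APIs.

```python
import tensorflow as tf

filenames = ["part-0.h5", "part-1.h5"]  # illustrative file names

# make_dataset is a hypothetical per-file constructor (e.g. an HDF5Dataset
# over a single file, possibly emitting chunk-sized "cache batches").
datasets = [make_dataset(f) for f in filenames]

# Handle multiple files at a higher level: concatenate into one pipeline.
combined = datasets[0]
for ds in datasets[1:]:
    combined = combined.concatenate(ds)

# Assuming each element is a chunk, flatten and re-apply a uniform batch.
combined = combined.unbatch().batch(32)
```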
I do think
but I suspect
This PR is part of the effort to enhance performance and ease of use for the tf.data pipeline, as discussed in tensorflow#382 and tensorflow#366. Previously, HDF5Dataset was relatively manual and the user had to find out the datasets (columns) in the hdf5 file. In this PR, the idea is to allow the user to call list_hdf5_datasets to probe the shape, dtype, and name of the datasets within the HDF5 file. A subsequent call to read_hdf5 brings the content into a shaped Tensor so that it can be used later in TensorFlow. read_hdf5 has the option to specify a slice (or a sub-block) of the dataset. This should open up the possibility in the future of binding a class to an hdf5 file by implementing `__len__` and `__getitem__`. With the list_hdf5_datasets and read_hdf5 ops, it is also possible to ease the use of HDF5Dataset in eager mode: HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the `batch_size` to be fed into tf.keras. The limitation is in graph mode, where the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work. This PR has not changed the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops, but that could easily be done; see #384 for similar changes. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
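A hedged usage sketch of the two ops described above. Only the op names come from the PR description; the module path and exact signatures (including the `start`/`count` arguments, mentioned in a follow-up commit) are assumptions.

```python
import tensorflow_io.hdf5 as hdf5_io  # assumed module path

# Probe the datasets (columns) stored in the file: names, dtypes, shapes.
datasets = hdf5_io.list_hdf5_datasets("data.h5")

# Read a slice (sub-block) of one dataset into a shaped Tensor; start/count
# follow the "Process default value of count and start" commit noted below.
features = hdf5_io.read_hdf5("data.h5", dataset="/features", start=0, count=1024)
```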
Added a PR (#393) to introduce this. @terrytangyuan @BryanCutler Here is my thinking on the tf.keras batch vs. cache batch issue:
So overall I think we could do something along the following lines:
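One way to realize the cache-batch/keras-batch split, sketched under assumptions (`read_chunk`, `total_rows`, and the chunk width are all illustrative): read large chunks for I/O efficiency, then rebatch to the size tf.keras expects.

```python
import tensorflow as tf

CACHE_BATCH = 4096  # internal cache batch, tuned for I/O throughput
KERAS_BATCH = 32    # the batch size the training loop actually sees
total_rows = 1_000_000  # illustrative

def chunks():
    # read_chunk is a hypothetical reader returning CACHE_BATCH rows at a time.
    for start in range(0, total_rows, CACHE_BATCH):
        yield read_chunk(start, CACHE_BATCH)

dataset = tf.data.Dataset.from_generator(
    chunks, output_signature=tf.TensorSpec(shape=[None, 16], dtype=tf.float32))

# Flatten the cache batches, then re-batch at the size tf.keras expects.
dataset = dataset.unbatch().batch(KERAS_BATCH)
```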
This sounds like a good paradigm to me. Do you plan some way for the user to set the cache batch size, or is it determined by the Dataset? For the Arrow datasets, I'll have to think a bit more about the best way to handle this. Currently, records are read as a batch, which is effectively the "cache batch", just not in the form of tensors. If this outputs the entire batch as tensors, then it would be a second copy of all the batched data, which might not be good. Still, it might be useful to cache the batch as tensors and release the original record batch. Then the user could call a
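A small sketch of the copy tradeoff being described, using standard pyarrow calls (the data here is illustrative): converting a record batch to tensors makes a second copy, so releasing the original batch keeps only one.

```python
import pyarrow as pa
import tensorflow as tf

batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0, 6.0])], names=["x", "y"])

# Materializing the "cache batch" as tensors creates a second copy...
tensors = {name: tf.constant(batch.column(i).to_numpy())
           for i, name in enumerate(batch.schema.names)}

# ...so releasing the original record batch keeps only the tensor copy.
del batch
```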
@BryanCutler For the internal cache size, I am thinking "try to fine-tune automatically" unless the user overrides it. The capacity could be overridden in
@BryanCutler I haven't spent much time looking into Arrow yet, though it looks like there are two different types: the Feather file format and the Arrow streaming format. The file format probably fits the Tensor case we discussed here, as in theory we could easily distribute a list of cached tensors to different nodes and improve performance by concurrently running ops on the list of chunked (cached) tensors. The streaming format, depending on whether it is replayable or not, may not fit the file-format handling.
@BryanCutler @terrytangyuan Thinking about it again, I think the problem probably comes down to whether tf.data should be closer to an iterable or to an indexable collection of Tensors. The overall tf.data pipeline works well with tf.keras (closer to the iterable usage).

This was very obvious when I played with the pandas API in PR #356: I could not achieve many operations with tf.data, as tf.data is just an iterable, but I could easily implement anything with a plain Tensor and ops. (I am still actively working on #356, but I do need tools beyond tf.data.)

That is actually why I want to rework quite a few file formats (hdf5, parquet, Avro, text in #399, #392, #384): I would like to be able to read a file into a Tensor so that I can just add ops for feature engineering. (The PRs have been done in a way that allows the user to read data into both tf.data and a Tensor with the same code base.)

On the other hand, there are cases where a Dataset is closer to what is needed. @BryanCutler For the batch in Arrow, I tend to think we could pick one default but allow the user to override it and optimize the other way (depending on which usage the workload is closer to).
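The iterable-vs-Tensor point can be seen directly: a tf.data.Dataset supports only sequential iteration, while a Tensor supports random access and the full op set.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)
# ds[3]  # not possible: a Dataset is an iterable, not indexable

t = tf.range(10)
print(t[3])                                    # random access just works
print(tf.reduce_mean(tf.cast(t, tf.float32)))  # and so does any other op
```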
@BryanCutler Overall, my experience reworking some of the formats (hdf5, parquet, Avro) is that if a file is splittable by nature, then we can just write primitive ops to read the file. The primitive ops can be used in a normal graph to read data into a Tensor (which is easily accessible to many powerful operations), or they can be pieced together into a tf.data pipeline where memory could be limited (TBs of data vs. available memory). If a file is not splittable by nature (e.g., a PCAP file, which is just a concatenation of variable-length packets), then it probably fits a C++ implementation of Dataset. But even when a file is not splittable, we should still support reading the whole file into a Tensor in addition to a C++ dataset implementation. The benefit is that manipulating a Tensor is easy, while manipulating tf.data is limited.
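A sketch of the dual use described above, assuming a hypothetical primitive read op `read_block(filename, start, count)` that returns a Tensor: the same op backs both a one-shot Tensor read and a memory-bounded tf.data pipeline.

```python
import tensorflow as tf

FILENAME, TOTAL, CHUNK = "data.h5", 1_000_000, 4096  # illustrative values

# (a) Read the whole file into a single Tensor for ad-hoc manipulation.
everything = read_block(FILENAME, start=0, count=TOTAL)

# (b) Stream chunk by chunk so memory stays bounded (TBs of data vs. memory).
pipeline = tf.data.Dataset.range(0, TOTAL, CHUNK).map(
    lambda start: read_block(FILENAME, start=start, count=CHUNK))
```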
@BryanCutler @terrytangyuan Some additional discussion about the ways to process input formats. It seems we are mostly dealing with the following:
I came up with this list because, during the Feather PR #404, I noticed that the Feather file format was designed to support zero-copy and expects everything to be in memory (at least for each column). I still expect future Arrow Feather format versions may support splittable chunks (multiple record batches). We discussed the limitations of tf.data.Dataset, as it is an iterable (no `__getitem__`). Given the above scenarios, I am thinking we could do the following:
Any comments or suggestions?
Thanks @yongtang for the very detailed list! For Arrow, there are two memory formats: Arrow Stream and Arrow File. Both support chunking as record batches. Arrow Stream is a strict stream. Arrow File is random access, but does not necessarily have to be read entirely into memory. Feather files use the Arrow File format on disk, and the file can be chunked as well; I think reading the file one chunk at a time is just not currently implemented, but it could be in the future.
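For concreteness, a sketch of the two formats with pyarrow's IPC readers (standard pyarrow API; file names and the per-batch handler are illustrative):

```python
import pyarrow as pa

# Arrow Stream: a strict stream; record batches can only be read in order.
reader = pa.ipc.open_stream(pa.OSFile("data.arrows", "rb"))
for batch in reader:
    handle(batch)  # hypothetical per-batch handler

# Arrow File: random access; a batch can be read without loading the rest.
file_reader = pa.ipc.open_file(pa.OSFile("data.arrow", "rb"))
last = file_reader.get_batch(file_reader.num_record_batches - 1)
```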
The above proposal sounds pretty good to me. I would just want to be careful that, for splittable (or chunked) files, we keep a code path available that keeps memory usage at a minimum, but also support reading everything into memory if the user wants.
* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops (PR description as above)
* Process default value of count and start
* Support HDF5Dataset in graph mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Following the discussion on #366, batching can serve different purposes, and optimizing for each is not always done the same way.