[WIP] Add FileDataset to read whole content of file into tf.data pipeline #366
Conversation
@yongtang This PR looks great! Just added some minor comments.
@@ -0,0 +1,92 @@
/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
2018 -> 2019
// First try to find out the size of the file, depending on if size is available, we will set the chunk to read.
uint64 file_size = 0;
if (sized_stream->GetFileSize(&file_size) == Status::OK()) {
chunk_size = file_size;
Would it be helpful if we add a warning log here when file_size is big?
} else {
string buffer;
int64 total_size = 0;
for (size_t i = 0; i < entries.size(); i++) {
nit: If file_size is known, will total_size be equal to file_size? Could we calculate total_size during reading the file:
int64 total_size = 0;
while (status.ok()) {
string buffer;
status = s->ReadNBytes(chunk_size, &buffer);
if (status.ok() || errors::IsOutOfRange(status)) {
entries.emplace_back(std::move(buffer));
total_size += buffer.size();
}
}
total_size += entries[i].size();
}
buffer.reserve(total_size);
for (size_t i = 0; i < entries.size(); i++) {
++i may be better.
bool DecodeAttributes(const VariantTensorData& data) override {
return true;
}
protected:
protected: is empty. We can delete it here.
tensorflow_io/core/ops/file_ops.cc
Outdated
@@ -0,0 +1,46 @@
/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
2018 -> 2019
@@ -60,3 +61,18 @@ def __init__(self, fn, data_input, batch, dtypes, shapes):
self._batch,
output_types=self._dtypes,
output_shapes=self._shapes), self._batch, self._dtypes, self._shapes)

class FileDataset(BaseDataset):
"""A FileDataset that read file content as string"""
nit: read -> reads
@@ -21,39 +21,9 @@
from tensorflow import dtypes
from tensorflow.compat.v1 import data
from tensorflow_io import _load_library
from tensorflow_io.core.python.ops import data_ops as data_ops
nit: could we just use from tensorflow_io.core.python.ops import data_ops without as data_ops?
When we started the project, we thought data read through the tf.data pipeline would always be record-like. That is, each file would have multiple records. This is the case for many situations like text, video, etc., where the natural boundaries within the file are new lines or individual frames. But for many image formats such as webp, jpeg, etc., each image is already a boundary and there is no need to further partition. Further, people still prefer a `decode_xxx` call in many situations. This PR adds a FileDataset which could take files (and potentially compressed files) and feed the whole file content as a string into tf.data. This FileDataset is not a fit where files are really large (e.g., 100GB of text records). However, it is good enough for image file usage. As an example, this PR also converts WebPDataset to use FileDataset + decode_webp. Note: a TIFF file could have multiple images with different shapes, so it is not a fit for FileDataset either. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
…+decode_webp Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
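A minimal sketch of the intended usage, assuming FileDataset lives in data_ops and decode_webp is importable from tensorflow_io.image; both import paths and the call signature are guesses for illustration, not confirmed by this PR:

from tensorflow_io.core.python.ops.data_ops import FileDataset  # assumed path
from tensorflow_io.image import decode_webp                     # assumed path

filenames = ["sample1.webp", "sample2.webp"]
dataset = FileDataset(filenames)    # each element: the whole file content as a string
dataset = dataset.map(decode_webp)  # decode each string into an image tensor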
How is this different from using

ds = tf.data.Dataset.list_files('file/pattern/*')
ds = ds.map(tf.io.read_file)

or

ds = tf.data.Dataset.from_tensor_slices('file/path').map(tf.io.read_file)
@damienpontifex That is a good question. When the project was started, it was a collection of several C++ implementations of tf.data.Dataset (Kafka, Parquet, Ignite, Kinesis) and the IGFS file system that used to be in tensorflow's core repo (and were spun off due to the deprecation of tf.contrib). Some of them are file formats; some of them are not file formats at all, but really just streaming IO processing (e.g., Kafka/Kinesis). Because of that legacy, we used to favor C++ implementations of tf.data.Dataset over high-level Python implementations. We thought that could be advantageous as it could be part of TensorFlow's graph (save and load), and should perform well without the limitations of a pure Python implementation. As we move forward, there are several limitations:
We made some efforts to streamline the C++ code so that contributions could be easier. For example, if you look into some recent additions of Dataset such as Pcap (PR #303), you probably noticed that you don't even need to understand TensorFlow's Graph concept in C++. You only need to implement one read method. That helps a little. But then we realized that some of the early additions of tf.data.Dataset are not really that great due to the repetitive C++ code. For example, as you might see from this PR, we used to implement both a WebPDataset kernel and a decode_webp op in C++. So for this reason, I was hoping this PR could at least remove the need to have both.
Now back to the FileDataset discussion. I think your suggestion is a good point. I am in favor of reducing the repetitive C++ code (I think @terrytangyuan also agreed even before that), and replacing it with a read-then-decode pipeline. For example, image files could be very straightforward, so let's do all image files this way: read the whole file content, then map a decode op (a hedged sketch follows below). That is also why I am in favor of dicom's 'decode_dicom_image/decode_dicom_data' approach rather than reworking it as a C++ Dataset. That was the initial motivation for FileDataset. (Apparently I forgot we could use tf.io.read_file.)
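A hedged sketch of that read-then-decode pattern with stock TF ops; the commented decoder line is an assumption about where a webp decoder would come from:

import tensorflow as tf

# List the image files, read each file's full content as a scalar string tensor,
# then decode each string with a per-format decoder.
dataset = tf.data.Dataset.list_files("images/*.webp")
dataset = dataset.map(tf.io.read_file)
# dataset = dataset.map(tfio.image.decode_webp)  # assumed decoder from tensorflow_io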
@damienpontifex I am thinking of closing this PR, or just replacing WebPDataset with a read_file + decode_webp pipeline. But there are also some additional items we may have to be concerned with.
@damienpontifex For one, even though most image files are straightforward as they will most likely fit in memory, there are special files that may have multiple image frames. For example, GIF may have multiple frames of the same size, and TIFF may have multiple images of even different sizes. How to come up with a good abstraction for those is still open. (Note the TIFF issue is also one of the reasons PR #186 has been in Work-in-Progress state for so long without any progress.)
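For the GIF case specifically, a small sketch of how the read-then-decode pattern can still work, since tf.io.decode_gif returns all frames in one tensor (the filename is a placeholder):

import tensorflow as tf

contents = tf.io.read_file("animation.gif")             # placeholder file
frames = tf.io.decode_gif(contents)                     # shape: [num_frames, height, width, 3]
frame_ds = tf.data.Dataset.from_tensor_slices(frames)   # one dataset element per frame

Core TensorFlow has no TIFF decoder, which is part of why TIFF remains the harder case.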
@damienpontifex Another issue is compression (note compression is different from archive: compression consists of one component, such as gzip; archive consists of multiple components, such as zip). In TensorFlow's core repo, you probably noticed that file formats such as TFRecord and CSV actually manually repeat the C++ implementation of compression in both cases. At one point I wanted to avoid the repeated C++ code for compression, so I added a C++ template to handle that. For example, in pcap you can add another compression by just adding a filter.

Now, with the map-based approach we are discussing, I think we could add a generic decompression op to extract any layers of compression filters (a hedged sketch of the compression-only case follows below). But archive is still a problem, as we could not use the same generic filter for it. We could have a ZipDataset to allow rendering multiple components. But how are we going to chain them together with map? Because tf.data.Dataset seems to not allow recursive levels of tf.data.Dataset itself to iterate.
@damienpontifex Finally, there are also files that just do not fit into memory. For example, a video file naturally should be considered as a collection of individual image frames. We could not just read GBs of video into memory and decode it in one batch. There are also some other formats like hdf5 that could be big, where users want to extract one record at a time (instead of reading GBs of data into memory in one batch). At the moment we just use manual C++ implementations to process them, one format after another. I haven't come up with a better way to handle them yet.
@damienpontifex Thanks for bringing this issue up. Let me move this PR to Work-in-Progress for now.
Thank you @yongtang! That was a very comprehensive insight. I love the thinking process and I'd love to be able to work on this type of thing more/full time. Another spanner in all of this work is the different language bindings (well outside this discussion). I know C++ helps in this domain, so having things down lower opens opportunities to use them across these, as I'm seeing also in the Swift for TF repo now. Agreed on there being a lot to keep track of, and I'd used the …
Thanks @damienpontifex. Yes, one advantage of C++ primitive ops is that they could be part of the TF graph. And as long as it is part of the TF graph, it is always possible to bind with another language such as Golang or R or Swift. (I have lots of interest in Swift too, just haven't had a chance to play with it yet.) That is one of the reasons we still prefer C++ for the low level (but we should restrict C++ to only low-level primitive ops, not composite ops, for the ease of developing with more contributors). On another note about language binding & mixture: it is not even necessarily one way around (other language => C++). In fact, it is even possible to build code from other C-family languages into C++ as well. For example, in tensorflow-io there is already a prometheus dataset (PR #265) which actually builds from prometheus' Golang SDK and is embedded into the C++ tf.data.Dataset as a shared library. I was hoping it could serve as an example to get more interest from Golang developers and the docker/kubernetes community for the tensorflow-io project. (Documentation is missing but the feature has been there. I probably should spend more time on documentation.) Swift could be similar, either through cdecl or a callback, I believe. Anyway, I think there is plenty of room to make this an interesting project.
@yongtang @damienpontifex These discussions are great and inspiring!
If I remember correctly, HDF5 and TIFF could support a chunk/tile structure. The dataset op could read and decompress the chunks one by one instead of reading the whole big file. To process different data formats, I'm wondering if something like below would work:
struct Metadata {
  string path;
  long byte_offset;
  long byte_length;
  CompressionType compression_type;
};

class Reader {
 public:
  Reader(Env* env, const Metadata& metadata) {
    // Open the file once; subsequent reads seek within it.
    TF_CHECK_OK(env->NewRandomAccessFile(metadata.path, &file_));
  }

  Status Read(const Metadata& metadata, Tensor* tensor) {
    string read;
    RandomAccessInputStream in(file_.get());
    TF_RETURN_IF_ERROR(in.Seek(metadata.byte_offset));
    TF_RETURN_IF_ERROR(in.ReadNBytes(metadata.byte_length, &read));
    switch (metadata.compression_type) {
      case CompressionType::ZLIB:
        TF_RETURN_IF_ERROR(Decompression::Decode_compression_zlib(read, tensor));
        break;
      case CompressionType::GZIP:
        TF_RETURN_IF_ERROR(Decompression::Decode_compression_gzip(read, tensor));
        break;
      ...
    }
    return Status::OK();
  }

 private:
  std::unique_ptr<RandomAccessFile> file_;
};

class HDF5DatasetOp {
  ...
  Status ReadMetadata(const string& path) {
    // populate metadata_vec_
  }
  ...
  // This method can be shared between different data formats.
  Status GetNextInternal(IteratorContext* ctx,
                         std::vector<Tensor>* out_tensors,
                         bool* end_of_sequence) override {
    if (index_ == metadata_vec_.size()) {
      *end_of_sequence = true;
      return Status::OK();
    }
    Tensor output;
    TF_RETURN_IF_ERROR(reader_.Read(metadata_vec_[index_], &output));
    out_tensors->emplace_back(std::move(output));
    ++index_;
    return Status::OK();
  }

 private:
  std::vector<Metadata> metadata_vec_;
  size_t index_ = 0;
  Reader reader_;
};

This approach supports different data formats stored in different storage systems and can also share the code between different data formats (e.g. compression and decompression, reading and writing).
@feihugis Yes, metadata definitely works in archives (e.g., zip). In fact, to go one step further, we could even store the buffer in the metadata itself, so that the next round of reads does not need to touch the file again. The metadata could be wrapped into a tensor and passed along the pipeline. If we allow an optional input of metadata, we could do it in this way:

struct Metadata {
  string path;
  long byte_offset;
  long byte_length;
  CompressionType compression_type;
  void *buffer;  // optional
};

dataset = tf.data.Dataset.from_tensor_slices('sample.zip')
# unarchive: filename => (filename, metadata)
# NOTE: might produce multiple (filename, metadata) pairs:
#   (sample.zip, sample1.data buffer)
#   (sample.zip, sample2.data buffer)
#   (sample.zip, sample3.data buffer)
dataset = dataset.map(unarchive)
# read_object: (filename, metadata) => buffer
# NOTE: if the metadata already has the buffer in memory, just pass it to the output:
#   (sample1.data buffer)
#   (sample2.data buffer)
#   (sample3.data buffer)
dataset = dataset.map(read_object)

In case there is no middle step, then map(read_object) simply reads the whole file content directly.
@feihugis @damienpontifex Overall I think we have several things to fix:
Also adding @BryanCutler here about the batch. Previously, we thought we would reuse the same batch concept for serving two purposes: 1) the chunk of records read at a time for performance, and 2) the batch_size to be fed into tf.keras for training.
But those two batch concepts are different. @BryanCutler I am wondering if it makes sense to split those things out. In other words, for 1) we should "read as many records as possible, as long as it fits in memory", and for 2) we should re-batch to the batch_size the user asked for.
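A hedged sketch of that split using stock tf.data ops; the chunked source below is a stand-in for a real file-backed dataset:

import tensorflow as tf

# Pretend each element is a large chunk of records read for I/O efficiency.
chunks = tf.data.Dataset.from_tensor_slices([[1, 2, 3, 4], [5, 6, 7, 8]])
records = chunks.unbatch()  # back to individual records, independent of the read chunk size
batches = records.batch(3)  # the batch_size chosen for tf.keras, not for I/O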
@yongtang Yeah, Zip file is a good example. Do you think we can apply the metadata-based approach to other data formats? It may be challenging for some data formats as the libraries may not expose this kind of API to extract the metadata info from the files.
If we want to directly feed datasets to tf.keras for model training, do we also need to consider data shuffling?
@BryanCutler There are already quite a lot of discussions on this thread, though I want to add one more. Previously, when we created a dataset, we created it in one shot with every column. For example, for CsvDataset, we have to specify EVERY column beforehand, before it even runs. The requirement was actually because we used to implement against 1.13/1.14, where the TF graph is static. So we needed to know EVERYTHING beforehand in order to run the graph (or pass it to tf.keras). Now as we move to TF 2.0, knowing everything beforehand is not necessary anymore. We could just parse the file and find the metadata in eager mode, then build the dataset to pass to tf.keras. In this situation, I am wondering if it makes sense to focus on "building a dataset with one column at a time"? Something like the per-column sketch shown below.
The reason is that, when we try to build a dataset from ALL columns, we assume all columns have the same number of records. But this is not the case for many files such as HDF5 or Feather (if I understand correctly). I noticed this issue when I tried to play with pandas. I just realized that in our current implementation it is hard to handle an NA or null field. But with TF 2.0 and eager execution, we actually have more freedom to handle those situations. For example, we could do additional bit masking before merging different columns. From that standpoint, maybe it makes more sense to focus on building a dataset with only one column at a time?
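A hedged sketch of the one-column-at-a-time idea using stock tf.data ops; the column values are made-up placeholders:

import tensorflow as tf

# Build one dataset per column in eager mode, combining them only when needed.
col_a = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])
col_b = tf.data.Dataset.from_tensor_slices([10, 20, 30])
combined = tf.data.Dataset.zip((col_a, col_b))  # assumes equal lengths; uneven columns
                                                # would need masking/padding first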
@feihugis The good thing is that, in TensorFlow's core repo, there are not a lot of formats supported anyway (only TFRecord, FixedLengthRecord, and CsvDataset). Actually, tensorflow-io has a lot more formats supported than tensorflow's core repo, so I think that may not be a big issue for us.
@feihugis Shuffle could be kind of tricky, as tf.data.Dataset's pipeline is naturally not a good fit for shuffle. tf.data.Dataset is essentially an iterable, so shuffle is very expensive unless the dataset is small. The provided shuffle only shuffles within a fixed-size buffer. I think true shuffle may have to be done outside of the tf.data pipeline.
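For reference, a small sketch of the two shuffle knobs tf.data offers today; the file pattern and buffer size are arbitrary:

import tensorflow as tf

files = tf.data.Dataset.list_files("data/*.webp", shuffle=True)  # shuffle the file list up front
records = files.map(tf.io.read_file)
records = records.shuffle(buffer_size=256)  # only shuffles within a 256-element window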
To reiterate the problems here, we actually have two issues:
Overall, tf.data.Dataset has been designed in graph mode. With the upcoming TF 2.0 in mind, we really should read everything into memory (unless forced to chunk). /cc @BryanCutler for the column dataset, also @CaptainDuke as it may impact HDF5. I think I have some ideas, and will come up with a PoC soon.
@yongtang there is a lot of good discussion here and I agree with both your points on batching and handling NULL values differently for eager mode. I'll make separate issues where we can continue to discuss.
* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops. This PR is part of the effort in enhancing performance and ease of use for the tf.data pipeline, as was discussed in #382 and #366. Previously, HDF5Dataset was relatively manual and the user had to find out the datasets (columns) in the hdf5 file. In this PR, the idea is to allow the user to use list_hdf5_datasets to probe the shape, dtype, and name of the datasets within HDF5. A subsequent call to read_hdf5 will bring the content into a shaped Tensor so that it could be used later in TensorFlow. read_hdf5 has the option to specify a slice (or a subblock) of the dataset. This should open up the possibility in the future of binding a class to an hdf5 file by implementing `__len__` and `__getitem__`. With the list_hdf5_datasets and read_hdf5 ops, it is also possible to ease HDF5Dataset usage in eager mode. In eager mode, HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the `batch_size` to be fed into tf.keras. The limitation would be in graph mode, as in graph mode the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work. This PR has not changed the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops. But this could be easily done; see #384 for similar changes. Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Process default value of count and start Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Support HDF5Dataset in graph mode Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
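A hedged sketch of the eager-mode flow that commit describes; the exact signatures of list_hdf5_datasets and read_hdf5 are assumptions for illustration:

import tensorflow_io as tfio  # the ops live in tensorflow_io per the commit; alias is assumed

filename = "sample.h5"
# Probe the datasets (columns) inside the HDF5 file: names, dtypes, shapes.
specs = tfio.list_hdf5_datasets(filename)             # hypothetical call shape
# Read one dataset (optionally a slice) into a shaped tensor.
values = tfio.read_hdf5(filename, dataset="/column")  # hypothetical arguments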
With the upcoming PR #437, I think this PR is not needed anymore.