
Add list_feather_columns function in eager mode #404

Merged
merged 5 commits into from
Aug 5, 2019

Conversation

yongtang
Member

@yongtang yongtang commented Aug 1, 2019

This PR adds list_feather_columns function in eager mode,
so that it is possible to get the column name and spec
information for feather format.

This PR implements an ::arrow::io::RandomAccessFile interface
so it is possible to read files through scheme file system,
e.g., s3, gcs, azfs, etc.

The ::arrow::io::RandomAccessFile is the same as in Parquet PR #384
so they could be combined.

Also see related discussion in #382.

Signed-off-by: Yong Tang yong.tang.github@outlook.com

Member

@BryanCutler BryanCutler left a comment


@yongtang this looks great, but if I understand correctly this is reading the whole file into memory? I don't think that's necessary if the goal is to list the column names/dtypes.

continue;
}
string dtype = "";
switch (data_type) {
Member


The Arrow DataType has a ToString method, which I think should give the same output and not require the switch here. WDYT?

@yongtang
Member Author

yongtang commented Aug 2, 2019

Thanks @BryanCutler. Yes, it seems like it will read the whole file into memory unless supports_zero_copy is set, which will not work with scheme files such as s3 or gcs. I'm surprised that the feather format doesn't offer reading the metadata only.

But let me take a look again and see if there are other workarounds.

@yongtang
Member Author

yongtang commented Aug 2, 2019

Looks like the feather metadata is actually a flatbuffer, let me update the PR shortly.

@yongtang
Member Author

yongtang commented Aug 2, 2019

@BryanCutler The PR has been updated, now list_feather_columns will only read the metadata at the end of the file. Please take a look.
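For reference, the Feather V1 layout keeps a flatbuffer metadata block at the end of the file: the file starts and ends with the magic bytes FEA1, and the 4 bytes just before the trailing magic hold the metadata size. A minimal Python sketch of locating that metadata without touching the column data (the helper name is mine, and the layout details are my reading of the Feather V1 format, not code from this PR):

```python
import struct

FEA1 = b"FEA1"

def feather_metadata_span(f):
    """Return (offset, size) of the flatbuffer metadata block in a
    Feather V1 file, reading only the leading magic and the 8-byte footer."""
    f.seek(0)
    if f.read(4) != FEA1:
        raise ValueError("not a Feather V1 file")
    file_size = f.seek(0, 2)   # seeking to the end returns the file size
    f.seek(-8, 2)              # footer is <uint32 metadata size><magic>
    size_bytes, magic = f.read(4), f.read(4)
    if magic != FEA1:
        raise ValueError("missing trailing magic")
    (size,) = struct.unpack("<I", size_bytes)
    return file_size - 8 - size, size
```

Only the first 4 and last 8 bytes of the file are ever read, which is why this avoids pulling column data into memory.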

Member

@BryanCutler BryanCutler left a comment


LGTM, I just had a couple of questions. It's unfortunate that it's not easier to get the metadata from the file, but the Feather format is still not mature; after the Arrow 1.0 release I think it will start to get improved upon. If we were just getting the column names, I think it would have been ok, but to get the dtypes as well, it looks like this is the only way without reading each column.


const Tensor& memory_tensor = context->input(1);
const string& memory = memory_tensor.scalar<string>()();
std::unique_ptr<SizedRandomAccessFile> file(new SizedRandomAccessFile(env_, filename, memory));
Member


Would you mind explaining why we want to create a SizedRandomAccessFile instead of just using the built-in interface?

std::shared_ptr<arrow::io::ReadableFile> in_file;
arrow::io::ReadableFile::Open(filename, &in_file);

I believe this just uses standard system calls.

break;
}
if (dtype == "") {
continue;
Member


Would it be better to say "Unsupported dtype" for the default case and still add the column to the list rather than skipping it?

Member Author


Thanks @BryanCutler. Updated the PR to change the dtype to INVALID and propagate it to the python layer (so that it is possible to process it at a higher level).

@yongtang
Member Author

yongtang commented Aug 4, 2019

@BryanCutler For SizedRandomAccessFile, here is some additional background.

In tensorflow's file system, it supports additional scheme-prefixed file paths (e.g. s3://bucket/object), mostly with cloud vendors, such as gcs (Google Cloud), s3 (AWS), azfs (Microsoft Azure), oss (Alibaba Cloud), igfs (Apache Ignite).

Most of the implementations are actually in tensorflow-io (azfs/oss/igfs) now.

In order to support those cloud file systems, the file path s3://bucket/object has to be translated into a callback-style API (a C++ class).

The API that is exposed is ::tensorflow::RandomAccessFile. As long as this RandomAccessFile C++ interface is implemented, any registered scheme (s3://, gcs://, azfs://) will be supported.

In other words, if file open and file read are done through ::tensorflow::RandomAccessFile, then cloud file paths are supported. In feather's case, wrapping the Open with ::tensorflow::RandomAccessFile means we can open s3://bucket/mytest.feather.file.

You may notice in:

# test single file
# prefix "file://" to test scheme file system (e.g., s3, gcs, azfs, ignite)
columns = arrow_io.list_feather_columns("file://" + f.name)
for name, dtype in list(zip(batch.schema.names, batch.schema.types)):

I prefixed the file:// scheme in the path, so that it becomes file:///mylocal/path.

The file:// scheme is a special scheme alias for local unix files. With file:// it goes through tensorflow's API, so as long as the file:// path is supported, all cloud file paths (s3://, gcs://) should in theory be supported already.

There is only one issue with ::tensorflow::RandomAccessFile: it does not expose a GetFileSize API (which honestly is a little silly). Without GetFileSize, for many file formats that store metadata at the end (e.g., zip or even feather files), you have to keep calling Read() until you hit an OutOfRange error.
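That read-until-OutOfRange workaround looks roughly like this, as a conceptual Python sketch (the real logic is C++; here `read(offset, n)` stands in for ::tensorflow::RandomAccessFile::Read, with a short or empty read playing the role of the OutOfRange error, and all names are illustrative):

```python
def probe_file_size(read, chunk=1 << 20):
    """Discover a file's size given only a positional read callback.

    read(offset, n) returns up to n bytes; an empty or short read marks
    end of file (the stand-in for Read() raising OutOfRange). We keep
    reading chunks until we fall short of a full chunk."""
    offset = 0
    while True:
        data = read(offset, chunk)
        offset += len(data)
        if len(data) < chunk:
            return offset

def make_read(buf):
    """Toy positional reader over an in-memory buffer."""
    return lambda offset, n: buf[offset:offset + n]
```

The obvious cost: every byte has to be read just to learn the size, which is exactly what storing the size up front avoids.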

There are some discussions about adding GetFileSize to future tensorflow C API (tensorflow/community#101) but that will be much later.

For that reason, I subclassed ::tensorflow::RandomAccessFile into SizedRandomAccessFile, which also stores the file size and exposes a GetFileSize method (more convenient).

The SizedRandomAccessFile also takes an optional memory buffer and buffer size, in case the file is passed in as memory. So SizedRandomAccessFile is dual purpose: it can read through tensorflow's scheme file systems, or it can treat a buffer as the whole content of a file. The latter case is useful when the user has already read a file into a string tensor (e.g., decode_wav expects a string tensor, and decode_csv in tf expects a string tensor as well).

(We could also wrap the buffer into a separate subclass of ::tensorflow::RandomAccessFile; I did that in tensorflow's core repo at one point. But I think it is easier to wrap everything into one SizedRandomAccessFile in tensorflow-io.)

The SizedRandomAccessFile then needs to be wired into Apache Arrow's callback read interface ::arrow::io::RandomAccessFile in order to be consumed by Apache Arrow. Otherwise, a normal arrow::io::ReadableFile::Open with a filename will not be able to read s3:// directly.

The wiring is in ArrowRandomAccessFile (see line 77). Now anything that consumes an ::arrow::io::RandomAccessFile will be able to read s3 and gcs directly.
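As a conceptual model of the dual-purpose design (the real SizedRandomAccessFile and ArrowRandomAccessFile are C++ classes in tensorflow-io; every name below is illustrative, not the PR's actual code):

```python
import os

class SizedReader:
    """Dual-purpose positional reader: wraps either a filesystem path or an
    in-memory buffer, and always knows its size (the convenience that
    SizedRandomAccessFile adds over tensorflow's RandomAccessFile)."""

    def __init__(self, filename=None, memory=None):
        if memory is not None:
            # file content already in memory (e.g. from a string tensor)
            self._buf, self._file = memory, None
            self._size = len(memory)
        else:
            # in the real class this would go through tensorflow's scheme
            # file systems (s3://, gcs://, ...); a plain open() here
            self._buf, self._file = None, open(filename, "rb")
            self._size = os.path.getsize(filename)

    def get_file_size(self):
        return self._size

    def read(self, offset, n):
        if self._file is None:
            return self._buf[offset:offset + n]
        self._file.seek(offset)
        return self._file.read(n)
```

An adapter like ArrowRandomAccessFile then forwards Arrow's Read/GetSize callbacks to an object of this shape, which is what lets Arrow consume s3:// or gcs:// paths it could not open itself.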

Also /cc @terrytangyuan @jiachengxu in case interested.

This PR was merged with 5 commits, each Signed-off-by: Yong Tang <yong.tang.github@outlook.com>:

* Add list_feather_columns function in eager mode
* Use flatbuffer to read feather metadata, to avoid reading whole file through feather api.
* Keep unsupported datatype so that it is possible to process in python, based on review comment
* Combine .so files into one place to reduce whl package size
* Combine ArrowRandomAccessFile and ParquetRandomAccessFile as they are the same
@BryanCutler
Member

Thanks for the great explanation @yongtang! That makes perfect sense. It also sounds like the ArrowFeatherDataset should be updated to use this interface as well? I could do that as a follow-up too.

@yongtang
Member Author

yongtang commented Aug 5, 2019

Thanks @BryanCutler. If you can create a follow up PR then that would be great 👍

@yongtang yongtang merged commit d0fe60c into tensorflow:master Aug 5, 2019
@yongtang yongtang deleted the feather branch August 5, 2019 23:39
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add list_feather_columns function in eager mode

This PR adds list_feather_columns function in eager mode,
so that it is possible to get the column name and spec
information for feather format.

This PR implements an `::arrow::io::RandomAccessFile` interface
so it is possible to read files through scheme file system,
e.g., s3, gcs, azfs, etc.

The `::arrow::io::RandomAccessFile` is the same as in Parquet PR 384
so they could be combined.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Use flatbuffer to read feather metadata, to avoid reading whole file through feather api.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Keep unsupported datatype so that it is possible to process in python, based on review comment

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Combine .so files into one place to reduce whl package size

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Combine ArrowRandomAccessFile and ParquetRandomAccessFile as they are the same

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>