
Add list_feather_columns function in eager mode #404

Merged
merged 5 commits into from
Aug 5, 2019

Conversation

yongtang
Member

@yongtang yongtang commented Aug 1, 2019

This PR adds list_feather_columns function in eager mode,
so that it is possible to get the column name and spec
information for feather format.

This PR implements an ::arrow::io::RandomAccessFile interface
so it is possible to read files through scheme file system,
e.g., s3, gcs, azfs, etc.

The ::arrow::io::RandomAccessFile is the same as in Parquet PR #384
so they could be combined.

Also see related discussion in #382.

Signed-off-by: Yong Tang yong.tang.github@outlook.com

Member

@BryanCutler BryanCutler left a comment


@yongtang this looks great, but if I understand correctly this is reading the whole file into memory? I don't think that's necessary if the goal is to list the column names/dtypes.

continue;
}
string dtype = "";
switch (data_type) {
Member


The Arrow DataType has a ToString method, which I think should give the same output and not require the switch here. WDYT?

@yongtang
Member Author

yongtang commented Aug 2, 2019

Thanks @BryanCutler. Yes, it seems like it will read the whole file into memory unless supports_zero_copy is set, which will not work with scheme files such as s3 or gcs. I'm surprised that the feather format doesn't offer reading the metadata only.

But let me take a look again and see if there are other workarounds.

@yongtang
Member Author

yongtang commented Aug 2, 2019

Looks like the feather metadata is actually a flatbuffer, let me update the PR shortly.

@yongtang
Member Author

yongtang commented Aug 2, 2019

@BryanCutler The PR has been updated, now list_feather_columns will only read the metadata at the end of the file. Please take a look.
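For reference, the Feather V1 layout keeps a flatbuffer metadata block at the end of the file: the file starts and ends with the magic bytes FEA1, and the 4 bytes just before the trailing magic hold the metadata size. A minimal Python sketch of locating that metadata without touching the column data (the helper name is mine, and the layout details are my reading of the Feather V1 format, not code from this PR):

```python
import struct

FEA1 = b"FEA1"

def feather_metadata_span(f):
    """Return (offset, size) of the flatbuffer metadata block in a
    Feather V1 file, reading only the leading magic and the 8-byte footer."""
    f.seek(0)
    if f.read(4) != FEA1:
        raise ValueError("not a Feather V1 file")
    file_size = f.seek(0, 2)   # seeking to the end returns the file size
    f.seek(-8, 2)              # footer is <uint32 metadata size><magic>
    size_bytes, magic = f.read(4), f.read(4)
    if magic != FEA1:
        raise ValueError("missing trailing magic")
    (size,) = struct.unpack("<I", size_bytes)
    return file_size - 8 - size, size
```

Only the first 4 and last 8 bytes of the file are ever read, which is why this avoids pulling column data into memory.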

Member

@BryanCutler BryanCutler left a comment


LGTM, I just had a couple of questions. It's unfortunate that it's not easier to get the metadata from the file, but the Feather format is still not mature; after the Arrow 1.0 release I think it will start to get improved upon. If we were just getting the column names, I think it would have been ok, but to get the dtypes as well, it looks like this is the only way without reading each column.


const Tensor& memory_tensor = context->input(1);
const string& memory = memory_tensor.scalar<string>()();
std::unique_ptr<SizedRandomAccessFile> file(new SizedRandomAccessFile(env_, filename, memory));
Member


Would you mind explaining why we want to create a SizedRandomAccessFile instead of just using the built-in interface?

std::shared_ptr<arrow::io::ReadableFile> in_file;
arrow::io::ReadableFile::Open(filename, &in_file);

I believe this just uses standard system calls.

break;
}
if (dtype == "") {
continue;
Member


Would it be better to say "Unsupported dtype" for the default case and still add the column to the list rather than skipping it?

Member Author


Thanks @BryanCutler. Updated the PR to change the dtype to INVALID and propagate it to the python layer (so that it is possible to process it at a higher level).

@yongtang
Member Author

yongtang commented Aug 4, 2019

@BryanCutler For SizedRandomAccessFile, here is some additional background.

In tensorflow's file system, it supports additional scheme-prefixed file paths (e.g. s3://bucket/object), mostly with cloud vendors, such as gcs (Google Cloud), s3 (AWS), azfs (Microsoft Azure), oss (Alibaba Cloud), igfs (Apache Ignite).

Most of the implementations are actually in tensorflow-io (azfs/oss/igfs) now.

In order to support those cloud file systems, the file path s3://bucket/object has to be translated into a callback-style API (a C++ class).

The API that is exposed is ::tensorflow::RandomAccessFile. As long as this RandomAccessFile C++ interface is implemented, any registered scheme (s3://, gcs://, azfs://) will be supported.

In other words, if file open and file read are done through ::tensorflow::RandomAccessFile, then cloud file paths are supported. In feather's case, wrapping the Open with ::tensorflow::RandomAccessFile means we can open s3://bucket/mytest.feather.file.

You may notice in:

# test single file
# prefix "file://" to test scheme file system (e.g., s3, gcs, azfs, ignite)
columns = arrow_io.list_feather_columns("file://" + f.name)
for name, dtype in list(zip(batch.schema.names, batch.schema.types)):

I prefixed the file:// scheme in the path, so that it becomes file:///mylocal/path.

The file:// scheme is a special scheme alias for local unix files. With file:// it goes through tensorflow's API, so as long as the file:// path is supported, all cloud file paths (s3://, gcs://) should in theory be supported already.

There is only one issue with ::tensorflow::RandomAccessFile: it does not expose a GetFileSize API (which honestly is a little silly). Without GetFileSize, for many file formats that store metadata at the end (e.g., zip or even feather files), you have to keep calling Read() until you hit an OutOfRange error.
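That read-until-OutOfRange workaround looks roughly like this, as a conceptual Python sketch (the real logic is C++; here `read(offset, n)` stands in for ::tensorflow::RandomAccessFile::Read, with a short or empty read playing the role of the OutOfRange error, and all names are illustrative):

```python
def probe_file_size(read, chunk=1 << 20):
    """Discover a file's size given only a positional read callback.

    read(offset, n) returns up to n bytes; an empty or short read marks
    end of file (the stand-in for Read() raising OutOfRange). We keep
    reading chunks until we fall short of a full chunk."""
    offset = 0
    while True:
        data = read(offset, chunk)
        offset += len(data)
        if len(data) < chunk:
            return offset

def make_read(buf):
    """Toy positional reader over an in-memory buffer."""
    return lambda offset, n: buf[offset:offset + n]
```

The obvious cost: every byte has to be read just to learn the size, which is exactly what storing the size up front avoids.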

There are some discussions about adding GetFileSize to future tensorflow C API (tensorflow/community#101) but that will be much later.

For that reason, I subclassed ::tensorflow::RandomAccessFile into SizedRandomAccessFile, which also stores the file size and exposes a GetFileSize method (more convenient).

The SizedRandomAccessFile also takes an optional memory buffer and buffer size, in case the file is passed in as memory. So SizedRandomAccessFile is dual purpose: it can read through tensorflow's scheme file systems, or it can treat a buffer as the whole content of a file. The latter case is useful when the user has already read a file into a string tensor (e.g., decode_wav expects a string tensor, and decode_csv in tf expects a string tensor as well).

(We could also wrap the buffer into a separate subclass of ::tensorflow::RandomAccessFile; I did that in tensorflow's core repo at one point. But I think it is easier to wrap everything into one SizedRandomAccessFile in tensorflow-io.)

The SizedRandomAccessFile then needs to be wired into Apache Arrow's callback read interface ::arrow::io::RandomAccessFile in order to be consumed by Apache Arrow. Otherwise, a normal arrow::io::ReadableFile::Open with a filename will not be able to read s3:// directly.

The wiring is in ArrowRandomAccessFile (see line 77). Now anything that consumes an ::arrow::io::RandomAccessFile will be able to read s3 and gcs directly.
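As a conceptual model of the dual-purpose design (the real SizedRandomAccessFile and ArrowRandomAccessFile are C++ classes in tensorflow-io; every name below is illustrative, not the PR's actual code):

```python
import os

class SizedReader:
    """Dual-purpose positional reader: wraps either a filesystem path or an
    in-memory buffer, and always knows its size (the convenience that
    SizedRandomAccessFile adds over tensorflow's RandomAccessFile)."""

    def __init__(self, filename=None, memory=None):
        if memory is not None:
            # file content already in memory (e.g. from a string tensor)
            self._buf, self._file = memory, None
            self._size = len(memory)
        else:
            # in the real class this would go through tensorflow's scheme
            # file systems (s3://, gcs://, ...); a plain open() here
            self._buf, self._file = None, open(filename, "rb")
            self._size = os.path.getsize(filename)

    def get_file_size(self):
        return self._size

    def read(self, offset, n):
        if self._file is None:
            return self._buf[offset:offset + n]
        self._file.seek(offset)
        return self._file.read(n)
```

An adapter like ArrowRandomAccessFile then forwards Arrow's Read/GetSize callbacks to an object of this shape, which is what lets Arrow consume s3:// or gcs:// paths it could not open itself.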

Also /cc @terrytangyuan @jiachengxu in case interested.

This PR was merged with 5 commits, each Signed-off-by: Yong Tang <yong.tang.github@outlook.com>:

* Add list_feather_columns function in eager mode
* Use flatbuffer to read feather metadata, to avoid reading whole file through feather api.
* Keep unsupported datatype so that it is possible to process in python, based on review comment
* Combine .so files into one place to reduce whl package size
* Combine ArrowRandomAccessFile and ParquetRandomAccessFile as they are the same
@BryanCutler
Member

Thanks for the great explanation @yongtang! That makes perfect sense. It also sounds like the ArrowFeatherDataset should be updated to use this interface as well? I could do that as a follow-up too.

@yongtang
Member Author

yongtang commented Aug 5, 2019

Thanks @BryanCutler. If you can create a follow up PR then that would be great 👍

@yongtang yongtang merged commit d0fe60c into tensorflow:master Aug 5, 2019
@yongtang yongtang deleted the feather branch August 5, 2019 23:39
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add list_feather_columns function in eager mode

This PR adds list_feather_columns function in eager mode,
so that it is possible to get the column name and spec
information for feather format.

This PR implements an `::arrow::io::RandomAccessFile` interface
so it is possible to read files through scheme file system,
e.g., s3, gcs, azfs, etc.

The `::arrow::io::RandomAccessFile` is the same as in Parquet PR 384
so they could be combined.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Use flatbuffer to read feather metadata, to avoid reading whole file through feather api.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Keep unsupported datatype so that it is possible to process in python, based on review comment

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Combine .so files into one place to reduce whl package size

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Combine ArrowRandomAccessFile and ParquetRandomAccessFile as they are the same

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>