-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add list_feather_columns function in eager mode #404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yongtang this looks great, but if I understand correctly this is reading the whole file into memory? I don't think that's necessary if the goal is to list the column names/dtypes.
continue; | ||
} | ||
string dtype = ""; | ||
switch (data_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Arrow DataType
has a ToString
method which I think should give the same output and not require the switch here. WDYT?
Thanks @BryanCutler. Yes it seems like it will read the whole file into memory unless But let me take a look again and see if there are other workarounds. |
Looks like the feather metadata are actually flatbuffers, let me update the PR shortly. |
@BryanCutler The PR has been updated, now list_feather_columns will only read the metadata at the end of the file. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I just had a couple questions. It's unfortunate that it's not easier to get the metadata from the file, but I think the Feather format is still not mature. After the Arrow 1.0 release I think it will start to get improved upon. If we were just getting the column names, I think it would have been ok, but to get the dtypes also it looks like this is the only way without reading each column.
|
||
const Tensor& memory_tensor = context->input(1); | ||
const string& memory = memory_tensor.scalar<string>()(); | ||
std::unique_ptr<SizedRandomAccessFile> file(new SizedRandomAccessFile(env_, filename, memory)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind explaining why we want to create a SizedRandomAccessFile
instead of just using the built-in interface?
std::shared_ptr<arrow::io::ReadableFile> in_file;
arrow::io::ReadableFile::Open(filename, &in_file);
I believe this just uses standard system calls..
break; | ||
} | ||
if (dtype == "") { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to say "Unsupport dtype" for the default case and still add the column to the list rather than skipping it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler. Updated the PR to change the dtype to INVALID
and populate to the python process (so that it is possible to process at the higher level).
@BryanCutler For In tensorflow's file system, it supports additional scheme prefixed file paths (e.g. Most of the implementations are actually in tensorflow-io (azfs/oss/igfs) now. In order to support those cloud file systems, the file path The API that is exposed is the In other words, if file open and file read is done through You may notice in: Lines 910 to 913 in c484af7
I prefixed the The There is only one issue with There are some discussions about adding For that reason, I patched The (We could also wrap buffer into a separate subclass of The The wiring is in
Also /cc @terrytangyuan @jiachengxu in case interested. |
This PR adds list_feather_columns function in eager mode, so that it is possible to get the column name and spec information for feather format. This PR implements an `::arrow::io::RandomAccessFile` interface so it is possible to read files through scheme file system, e.g., s3, gcs, azfs, etc. The `::arrow::io::RandomAccessFile` is the same as in Parquet PR 384 so they could be combined. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
…through feather api. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
…, based on review comment Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
… the same Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Thanks for the great explanation @yongtang ! That makes perfect sense. It also sounds like the ArrowFeatherDataset should be updated to use this interface as well? I could do that as a followup too |
Thanks @BryanCutler. If you can create a follow up PR then that would be great 👍 |
* Add list_feather_columns function in eager mode This PR adds list_feather_columns function in eager mode, so that it is possible to get the column name and spec information for feather format. This PR implements an `::arrow::io::RandomAccessFile` interface so it is possible to read files through scheme file system, e.g., s3, gcs, azfs, etc. The `::arrow::io::RandomAccessFile` is the same as in Parquet PR 384 so they could be combined. Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Use flatbuffer to read feather metadata, to avoid reading whole file through feather api. Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Keep unsupported datatype so that it is possible to process in python, based on review comment Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Combine .so files into one place to reduce whl package size Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Combine ArrowRandomAccessFile and ParquetRandomAccessFile as they are the same Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This PR adds list_feather_columns function in eager mode,
so that it is possible to get the column name and spec
information for feather format.
This PR implements an
::arrow::io::RandomAccessFile
interfaceso it is possible to read files through scheme file system,
e.g., s3, gcs, azfs, etc.
The
::arrow::io::RandomAccessFile
is the same as in Parquet PR #384so they could be combined.
Also see related discussion in #382.
Signed-off-by: Yong Tang yong.tang.github@outlook.com