Memory => File Layout Compatibility between Feather and Arrow #159
Comments
Using the Arrow format for data frames in Julia seems like a reasonable idea to me. @wesm can probably speak to the file layout issues better than I can, but I think the main priority is making sure that the data (not the metadata) is in the same format, so it's always a trivial mmap to work with data on disk.
@quinnj if you can reasonably adopt the Arrow memory layout in DataFrames.jl internals, that seems like a good idea to me. It also provides you a path to supporting more complex data types. The downside is that for nested data types, mutability is tricky to provide as a feature (though I frankly would prefer immutable data frames in general).
Thanks @hadley, @wesm for chiming in. I've made some progress here, though it's just the beginning stages for sure. One thing I ran into is that currently in Julia we don't have a clean way (that I can tell) to not only mmap a file, but also point all the values arrays at their various offsets in the mmap and have it all play nicely with GC, at least not without keeping a reference to the original mmap. For example, right now I'm defining:

    abstract ArrowColumn{T} <: AbstractVector{T}

    immutable PrimitiveColumn{T} <: ArrowColumn{T}
        buffer::Vector{UInt8} # potential reference to mmap
        length::Int32
        null_count::Int32
        nulls::BitVector # null == 0 == false, not-null == 1 == true; always padded to 64-byte alignments
        data::Vector{T} # always padded to 64-byte alignments
    end

I did also run into the nested data mutability thing this morning as well. I'm still developing thoughts, but I don't think it'd be a deal-breaker to seal them off as immutable (with APIs to construct new columns, obviously).
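A rough sketch of the mmap concern in the comment above, written in Python rather than Julia so that it runs with only the standard library. Everything here is an assumption made for illustration: the toy file layout (a validity bitmap followed by native-endian int32 values, each padded to 64 bytes) and the helper names `pad`, `write_column`, `open_column` and `is_valid` are hypothetical, and this is not the actual Feather or Arrow file format.

```python
import mmap
import os
import struct
import tempfile

ALIGN = 64  # padding boundary taken from the struct comments above


def pad(buf, align=ALIGN):
    """Zero-pad buf so that its length is a multiple of align."""
    return buf + b"\x00" * (-len(buf) % align)


def write_column(path, values):
    """Write [validity bitmap | int32 values] to path, each part padded.

    None marks a null. Bit i of the bitmap (least-significant bit first
    within each byte) is 1 when values[i] is not null.
    Returns (length, null_count, values_offset) for the reader.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    # Nulls get a placeholder 0 in the values buffer.
    data = struct.pack(f"{len(values)}i", *(0 if v is None else v for v in values))
    bitmap_bytes = pad(bytes(bitmap))
    with open(path, "wb") as f:
        f.write(bitmap_bytes)
        f.write(pad(data))
    null_count = sum(v is None for v in values)
    return len(values), null_count, len(bitmap_bytes)


def open_column(path, length, values_offset):
    """Map the file and return (mm, nulls, data).

    nulls and data are zero-copy views at different offsets of one mapping.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    whole = memoryview(mm)
    nulls = whole[: (length + 7) // 8]
    data = whole[values_offset : values_offset + 4 * length].cast("i")
    return mm, nulls, data


def is_valid(nulls, i):
    """True when element i is not null according to the bitmap."""
    return bool((nulls[i // 8] >> (i % 8)) & 1)


path = os.path.join(tempfile.mkdtemp(), "column.bin")
length, null_count, values_offset = write_column(path, [10, None, 30, 40])
mm, nulls, data = open_column(path, length, values_offset)
result = [data[i] if is_valid(nulls, i) else None for i in range(length)]
```

In CPython the views keep the mapping alive, and calling `mm.close()` while `nulls` or `data` still exist raises `BufferError`. That is the same "keep a reference to the original mmap" constraint the `buffer` field in the Julia struct is meant to handle.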
Feather V2, which is part of the upcoming Arrow 0.17.0 release, is exactly the Arrow IPC "file" format.
Do forgive me if this isn't the right forum (I also considered the Arrow mailing list, but this seemed more appropriate given the question around feather).
In one of @wesm's recent talks (at PyData Berlin I think?), one issue he mentioned was the current inefficiency of having to convert between the feather file format and the internal R/pandas dataframe representation (see this slide), with a seeming assumption that it would be more efficient if you could read a feather file directly into an in-memory Arrow structure (and then have data processing abilities on that structure, such as those provided by pandas/dplyr/etc).
After digging into both feather & Arrow, I've been contemplating a setup in Julia where we not only leverage @dmbates's initial work on reading feather files, but also have a corresponding Julia structure following the Arrow memory layout specifications (perhaps eventually replacing the internal representation of Julia's DataFrame itself; see the extensive discussion on DataFrames.jl "soul" here).
My main questions at this point are:
As a specific question related to the second point, the feather format currently includes the layout of a primitive array like:
whereas Arrow specifies:
The difference is that Arrow seems to require the length and null_count to be laid out before the null bitmask and values array, whereas in a feather file these are stored in the "metadata" section for each column.
I'm just wondering if little differences in layout like these are to be expected going forward where it makes sense, or if there will be efforts to match the layouts more exactly.
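To make the difference described above concrete, here is a toy sketch based only on that description. It does not reproduce the real byte layout of either format: the function names `pack_inline` and `pack_split`, the little-endian int32 counts and the dictionary used for metadata are all assumptions, and alignment padding is left out.

```python
import struct


def pack_inline(length, null_count, bitmap, values):
    """Counts stored in-band, ahead of the null bitmask and values."""
    header = struct.pack("<ii", length, null_count)
    return header + bitmap + values


def pack_split(length, null_count, bitmap, values):
    """Counts kept in separate metadata; the data region holds only
    the null bitmask and values."""
    metadata = {"length": length, "null_count": null_count}
    return metadata, bitmap + values


# Four int32 values with element 1 null (bit 1 of the bitmask is 0).
bitmap = bytes([0b00001101])
values = struct.pack("<4i", 10, 0, 30, 40)

inline = pack_inline(4, 1, bitmap, values)
metadata, body = pack_split(4, 1, bitmap, values)

header_size = struct.calcsize("<ii")
# Past the in-band header, both variants carry identical bytes.
same_payload = inline[header_size:] == body
```

Under these assumptions the two variants differ only by a fixed-size prefix, so a reader that already knows the counts from metadata could map the same bitmask and values bytes at a shifted offset.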