Memory => File Layout Compatibility between Feather and Arrow #159

Closed
quinnj opened this issue May 24, 2016 · 4 comments

@quinnj

quinnj commented May 24, 2016

Please forgive me if this isn't the right forum; I also considered the Arrow mailing list, but this seemed more appropriate given that the question concerns Feather.

In one of @wesm's recent talks (at PyData Berlin, I think), one issue he mentioned was the inefficiency of converting between the Feather file format and the internal R/pandas data frame representation (see this slide). The implied assumption was that it would be more efficient to read a Feather file directly into an in-memory Arrow structure, and then run data processing on that structure with tools such as pandas or dplyr.

After digging into both Feather and Arrow, I've been contemplating a setup in Julia where we not only leverage @dmbates's initial work on reading Feather files, but also define a corresponding Julia structure that follows the Arrow memory layout specification. This could perhaps eventually replace the internal representation of Julia's DataFrame itself (see the extensive discussion of the DataFrames.jl "soul" here).

My main questions at this point are:

  • Does this seem like a good or bad idea overall?
  • Are the Feather file format and the Arrow memory layout expected to stay in sync over the long term?

As a specific question related to the second point, the Feather format currently lays out a primitive array as:

<null bitmask, optional> <values>

whereas Arrow specifies:

* Length: 5, Null count: 1
* Null bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  |00011011                 | 0 (padding)           |

* Value Buffer:

  |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
  |------------|-------------|-------------|-------------|-------------|-------------|
  | 1          | 2           | unspecified | 4           | 8           | unspecified |

The difference is that Arrow seems to require the length and null_count to be laid out before the null bitmask and values array, whereas Feather stores them in the "metadata" section of the file for each column.

I'm just wondering whether small layout differences like these are to be expected going forward where they make sense, or whether there will be efforts to match the layouts more exactly.
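To make the Arrow example above concrete, here is a minimal Python sketch (standard library only) that builds the validity bitmap and the value buffer for the int32 array [1, 2, null, 4, 8]. The helper names `pad` and `build_primitive_int32` are hypothetical, the 64-byte padding is taken from the example above, and writing 0 into the null slot is an arbitrary choice because the layout leaves that slot unspecified.

```python
import struct

ALIGNMENT = 64  # buffer padding used in the example above


def pad(buf: bytes, alignment: int = ALIGNMENT) -> bytes:
    """Zero-pad buf so that its length is a multiple of alignment."""
    remainder = len(buf) % alignment
    return buf if remainder == 0 else buf + b"\x00" * (alignment - remainder)


def build_primitive_int32(values):
    """Return (length, null_count, validity_buffer, value_buffer)."""
    length = len(values)
    null_count = sum(v is None for v in values)

    # One bit per slot, least-significant bit first; a set bit means "not null".
    bitmap = bytearray((length + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)

    # Little-endian int32 values; null slots are filled with 0 (assumption).
    data = b"".join(struct.pack("<i", 0 if v is None else v) for v in values)
    return length, null_count, pad(bytes(bitmap)), pad(data)


length, null_count, validity, data = build_primitive_int32([1, 2, None, 4, 8])
print(length, null_count, format(validity[0], "08b"), len(validity), len(data))
# 5 1 00011011 64 64
```

In this sketch the length and null count are plain return values rather than bytes inside either buffer, which is one way to read the question of where that metadata should live.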

@hadley
Collaborator

hadley commented May 24, 2016

Using the arrow format for data frames in Julia seems like a reasonable idea to me.

@wesm can probably speak to the file layout issues better than I can, but I think the main priority is making sure that the data (not the metadata) is in the same format so it's always a trivial mmap to work with data on disk.
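As a rough illustration of the "trivial mmap" point, the following Python sketch (standard library only) writes five int32 values to a temporary file, maps the file, and reinterprets the mapped bytes as integers without a conversion step. The file contents are made up for the example, and the sketch assumes a little-endian host with a 4-byte C int, so that the native `"i"` cast matches the little-endian bytes on disk.

```python
import mmap
import os
import struct
import tempfile

# Write five little-endian int32 values, standing in for a column's
# value buffer stored on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<5i", 1, 2, 0, 4, 8))
    path = f.name

try:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The cast reinterprets the mapped bytes in place; nothing is parsed.
        with memoryview(mm) as raw, raw.cast("i") as view:
            result = list(view)  # copied out only so it can be printed
finally:
    os.remove(path)

print(result)
# [1, 2, 0, 4, 8] on a little-endian host
```

If the on-disk bytes differed from the in-memory layout, the cast would have to be replaced by a per-value conversion, which is the cost the comment above is trying to avoid.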

@wesm
Owner

wesm commented May 25, 2016

@quinnj if you can reasonably adopt the Arrow memory layout in DataFrames.jl internals that seems like a good idea to me. It also provides you a path to supporting more complex data types. The downside is that for nested data types, mutability is tricky to provide as a feature (though I frankly would prefer immutable data frames in general).

@quinnj
Author

quinnj commented May 25, 2016

Thanks @hadley, @wesm for chiming in. I've made some progress here, though it's just the beginning stages for sure.

One thing I ran into is that Julia currently has no clean way (that I can tell) to mmap a file and also point all the values arrays at their various offsets in the mmap so that everything plays nicely with GC, at least not without keeping a reference to the original mmap. For example, right now I'm defining:

abstract type ArrowColumn{T} <: AbstractVector{T} end

struct PrimitiveColumn{T} <: ArrowColumn{T}
    buffer::Vector{UInt8} # potential reference to mmap
    length::Int32
    null_count::Int32
    nulls::BitVector # validity bits: 0 (false) == null, 1 (true) == not-null; always padded to 64-byte alignment
    data::Vector{T} # always padded to 64-byte alignment
end

The buffer mmap reference is really only relevant when reading a Feather file (though it might be useful in other contexts as well). What I'm wondering is whether this is some kind of "violation" of the Arrow format, since I'm including that additional mmap reference in the column type (in the buffer field).

I also ran into the nested-data mutability issue this morning. I'm still developing my thoughts, but I don't think it'd be a deal-breaker to seal nested columns off as immutable (with APIs to construct new columns, obviously).
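For comparison, here is a hedged Python analogue of the column type above: a frozen dataclass whose `nulls` and `data` fields are views into one shared buffer at given offsets, with the owning buffer kept as an extra field. The function `column_from_buffer`, its offset parameters, and the in-memory `bytes` object standing in for an mmap are all assumptions made for illustration; the sketch also assumes a little-endian host with a 4-byte C int.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimitiveColumn:
    buffer: object     # owner of the memory, e.g. an mmap; not part of the layout
    length: int
    null_count: int
    nulls: memoryview  # validity bitmap: a set bit means "not null"
    data: memoryview   # int32 values viewed in place

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        if not (self.nulls[i // 8] >> (i % 8)) & 1:
            return None
        return self.data[i]


def column_from_buffer(buffer, length, null_count, nulls_offset, data_offset):
    """Build a column whose fields are views into buffer (no copying)."""
    raw = memoryview(buffer)
    nulls = raw[nulls_offset:nulls_offset + (length + 7) // 8]
    data = raw[data_offset:data_offset + 4 * length].cast("i")
    return PrimitiveColumn(buffer, length, null_count, nulls, data)


# A 64-byte validity region followed by a 64-byte value region.
buf = bytes([0b00011011]) + bytes(63) + struct.pack("<5i", 1, 2, 0, 4, 8) + bytes(44)
col = column_from_buffer(buf, 5, 1, nulls_offset=0, data_offset=64)
print([col[i] for i in range(len(col))])
# [1, 2, None, 4, 8] on a little-endian host
```

The extra `buffer` field only affects object lifetime: the bytes that the `nulls` and `data` views expose are unchanged by it. In Python the views already keep their underlying object alive, so the field is redundant here and is kept only to mirror the Julia definition.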

@wesm
Owner

wesm commented Apr 10, 2020

Feather V2, which is part of the upcoming Arrow 0.17.0 release, is exactly the Arrow IPC "file" format.

@wesm closed this as completed Apr 10, 2020