Memory => File Layout Compatibility between Feather and Arrow #159

quinnj · 2016-05-24T04:25:15Z

Do forgive me if this isn't the right forum (I also considered the Arrow mailing list, but this seemed more appropriate given that the question concerns feather).

In one of @wesm's recent talks (at PyData Berlin, I think), one issue he mentioned was the current inefficiency of having to convert between the feather file format and the internal R/pandas dataframe representation (see this slide). The seeming assumption was that it would be more efficient to read a feather file directly into an in-memory Arrow structure, and then have data processing abilities on that structure, such as those provided by pandas, dplyr, etc.

After digging into both feather and Arrow, I've been contemplating a setup in Julia where we not only leverage @dmbates's initial work on reading feather files, but also have a corresponding Julia structure that follows the Arrow memory layout specification. This could perhaps eventually replace the internal representation of Julia's DataFrame itself (see the extensive discussion on the DataFrames.jl "soul" here).

My main questions at this point are:

1. Does this seem like a good or bad idea overall?
2. Are the feather file format and the Arrow memory layout expected to stay in sync over the long term?

As a specific question related to the second point, the feather format currently includes the layout of a primitive array like:

<null bitmask, optional> <values>

whereas Arrow specifies:

* Length: 5, Null count: 1
* Null bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  |00011011                 | 0 (padding)           |

* Value Buffer:

  |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
  |------------|-------------|-------------|-------------|-------------|-------------|
  | 1          | 2           | unspecified | 4           | 8           | unspecified |
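
To make the Arrow example concrete, here is a minimal Julia sketch of how a reader might decode those two buffers. The helper name `is_valid_slot` is hypothetical, the `0` stored in the unspecified slot is a placeholder chosen for illustration, and the padding bytes are omitted; the bit order (least-significant bit first within each byte) follows the Arrow layout specification.

```julia
# Buffers from the example above: 5 Int32 slots, slot 3 is null.
bitmap = UInt8[0b00011011]      # validity byte 0; padding bytes omitted
vals   = Int32[1, 2, 0, 4, 8]   # 0 stands in for the "unspecified" slot

# Hypothetical helper: slot i (1-based) is valid when its bit is set.
# Bits are numbered from the least-significant bit of each byte.
function is_valid_slot(bitmap::Vector{UInt8}, i::Integer)
    byte = bitmap[((i - 1) >> 3) + 1]
    return ((byte >> ((i - 1) & 7)) & 0x01) == 0x01
end

# Null slots become `missing`; valid slots keep their value.
decoded = [is_valid_slot(bitmap, i) ? vals[i] : missing for i in 1:length(vals)]
```

Here `decoded` holds `1, 2, missing, 4, 8`, matching "Length: 5, Null count: 1".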

The difference is that Arrow seems to require the length and null_count to be laid out before the null bitmask and values array, whereas feather stores these in the "metadata" section of the file for each column.
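
As a side note on why the placement of these two fields may matter less than the buffers themselves: the null count can be recomputed from the validity bitmap and the length. A small Julia sketch, using the bitmap byte from the Arrow example and a hypothetical helper name `null_count_from_bitmap`:

```julia
# Hypothetical helper: derive the null count from the validity bitmap.
# Only the first `len` bits are meaningful; the remaining bits are padding.
function null_count_from_bitmap(bitmap::Vector{UInt8}, len::Integer)
    valid = 0
    for i in 0:(len - 1)
        valid += (bitmap[(i >> 3) + 1] >> (i & 7)) & 0x01
    end
    return len - valid
end

n = null_count_from_bitmap(UInt8[0b00011011], 5)   # 5 slots, 4 valid bits
```

For the example bitmap this gives a null count of 1.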

I'm just wondering whether little differences in layout like these are to be expected going forward where they make sense, or whether there will be efforts to match the layouts more exactly.


hadley · 2016-05-24T13:02:42Z

Using the arrow format for data frames in Julia seems like a reasonable idea to me.

@wesm can probably speak to the file layout issues better than I can, but I think the main priority is making sure that the data (not the metadata) is in the same format so it's always a trivial mmap to work with data on disk.
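
A rough illustration of what "a trivial mmap" could look like from Julia, using only the `Mmap` standard library. The file written here is a toy stand-in with a made-up 64-byte header, not the actual Feather layout, and the sketch assumes the values are stored in the host's byte order.

```julia
using Mmap

# Write a toy file: a 64-byte header followed by five Int32 values.
# This is NOT the Feather layout, only a stand-in with a known offset.
path, io = mktemp()
write(io, zeros(UInt8, 64))
write(io, Int32[1, 2, 0, 4, 8])
close(io)

# Map the values region directly as a typed array: no parsing, no copying.
# Arguments: element/array type, dimensions, byte offset into the file.
col = open(path) do f
    Mmap.mmap(f, Vector{Int32}, (5,), 64)
end
```

If the on-disk values buffer has the same layout as the in-memory one, `col` can be used as the column's data without any conversion step.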

wesm · 2016-05-25T15:50:58Z

@quinnj if you can reasonably adopt the Arrow memory layout in DataFrames.jl internals, that seems like a good idea to me. It also gives you a path to supporting more complex data types. The downside is that for nested data types, mutability is tricky to provide as a feature (though I frankly would prefer immutable data frames in general).

quinnj · 2016-05-25T16:10:47Z

Thanks @hadley, @wesm for chiming in. I've made some progress here, though it's just the beginning stages for sure.

One thing I ran into is that Julia currently doesn't have a clean way (that I can tell) to mmap a file and then point each values array at its offset within the mmap while having it all play nicely with GC, at least not without keeping a reference to the original mmap. For example, right now I'm defining

abstract ArrowColumn{T} <: AbstractVector{T}

immutable PrimitiveColumn{T} <: ArrowColumn{T}
    buffer::Vector{UInt8} # potential reference to mmap
    length::Int32
    null_count::Int32
    nulls::BitVector # null == 0 == false, not-null == 1 == true; always padded to 64-byte alignments
    data::Vector{T} # always padded to 64-byte alignments
end

The buffer mmap reference is really only relevant when reading in a Feather file (though it might be useful in other contexts as well). What I'm wondering is whether this is some kind of "violation" of the Arrow format, because I'm including that additional mmap reference in the Column type (in the buffer field).
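
For comparison, here is one possible alternative, sketched in present-day Julia syntax (where the 2016 `immutable` keyword has become `struct`). It is an assumption-laden illustration rather than a recommendation from this thread: the type name `MappedColumn` is hypothetical, and a plain byte vector stands in for the mmapped file. The idea is that a reinterpreted view already holds a reference to its parent buffer, so the parent stays reachable for GC without a separate `buffer` field.

```julia
# Hypothetical column type: `data` is a view into the backing bytes.
struct MappedColumn{A<:AbstractVector}
    data::A
end

# A plain byte vector stands in for the mmapped file contents.
buffer = zeros(UInt8, 128)
buffer[65:84] .= reinterpret(UInt8, Int32[1, 2, 0, 4, 8])

# Reinterpret bytes 65-84 (a 64-byte offset) as Int32 values, without copying.
col = MappedColumn(reinterpret(Int32, view(buffer, 65:84)))

# The view chain leads back to the original buffer, which keeps it alive.
backing = parent(parent(col.data))
```

The trade-off is that `data` is no longer a plain `Vector{T}`, so code that requires a concrete `Vector` would need to accept an `AbstractVector` instead.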

I also ran into the nested data mutability issue this morning. I'm still developing my thoughts, but I don't think it'd be a deal-breaker to seal them off as immutable (with APIs to construct new columns, obviously).

wesm · 2020-04-10T01:19:46Z

Feather V2, which is part of the upcoming Arrow 0.17.0 release, is exactly the Arrow IPC "file" format.

wesm closed this as completed Apr 10, 2020
