Missing type gets lost when writing partitions of DataFrame #403

svilupp · 2023-03-12T18:29:19Z

This is an odd one and likely to be a PICNIC...

Problem: Missingness in a string column is lost after saving and loading an Arrow file

When it happens: When a column in my dataset has type Union{Missing,String}, I partition it, and the missing item appears only in the later partitions. It's easily reproducible (see below).

Debugging:

It happens only to DataFrames (not Tables.rowtable when created from a namedtuple)
Only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df, 2), which is available from version >1.5.0, it is fine
If missing type appears in the first partition, it's fine
Validity bitmap is written correctly
But field is marked as not-nullable (!)

┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486

MWE

using Arrow, Tables, Random, DataFramesMeta
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)

# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame

# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}

# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:

# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}

# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}

Versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

Arrow: 2.4.3 on main branch

svilupp · 2023-03-12T19:04:02Z

I think I know where it's coming from.

The issue happens here

Only the first partition is scanned to determine the schema
Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through Tables.columns
It does however keep the reference to the parent (and its schema)

In other words, we do partition |> Tables.columns |> Tables.schema, which loses the missingness.
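
The narrowing itself can be reproduced with plain Base Julia, without Arrow or DataFrames. This is only a sketch of the general mechanism: rebuilding a column from its values infers the element type from what is actually present. It assumes something similar happens when the row partition is turned into columns, and it is not the actual code path in Tables.jl or DataFrames.jl.

```julia
# Column declared to allow missing values; only the second half contains one.
col = Union{Missing,String}["a", "b", missing, "c"]

# Split into chunks of two elements, similar in spirit to partitioning rows.
parts = collect(Iterators.partition(col, 2))

# A chunk still reports the declared element type of the full column...
declared = eltype(parts[1])

# ...but rebuilding the values one by one infers the type from what is present.
inferred_first = eltype(map(identity, parts[1]))
inferred_second = eltype(map(identity, parts[2]))

@show declared inferred_first inferred_second
```

If a writer then takes its schema from the first rebuilt chunk only, the column is recorded as non-nullable even though later chunks contain missing values.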

I don't know enough about the Tables API/contract to know whether this is an Arrow problem, Tables problem, or DataFrames problem. Does this issue belong somewhere else?

It would be an easy fix to get schema info from the parent object, but are all Tables-compatible sources required to keep that?

Eg,

change from partition |> Tables.columns |> Tables.schema
to partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema
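
For comparison, an approach that does not depend on a parent field would be to merge the element types seen across all partitions. The sketch below uses plain vectors and a hypothetical helper named merged_eltype; it is not part of Arrow.jl or Tables.jl, and it only illustrates the idea.

```julia
# Hypothetical helper: take the union of the element types seen in every chunk.
merged_eltype(chunks) = mapreduce(eltype, (S, T) -> Union{S,T}, chunks)

# Two chunks of one column; a tuple keeps each chunk's own element type intact.
chunks = (["a", "b"], [missing, "c"])

first_only = eltype(first(chunks))   # what a first-partition-only scan would see
all_chunks = merged_eltype(chunks)   # what a scan over every partition would see

@show first_only all_chunks
```

The trade-off is that this needs a full pass over the partitions before anything is written, which may not suit streaming sources.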

Should I open a PR?

Illustration

# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  String
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

EDIT: I suspect this will affect other partitioners that rely on Iterators over Tables.rows(), eg, TableOperations.makepartition()

evetion · 2023-11-10T15:33:47Z

At the moment, a similar thing is blocking #477.

This was referenced Mar 12, 2023:

When partitioned, partition might lose the missingness eltype (in Tables.schema) JuliaData/DataFrames.jl#3298 (Closed)

add Iterators.partition for DataFrameRows JuliaData/DataFrames.jl#3299 (Merged)
