Missing type gets lost when writing partitions of DataFrame #403

Open
svilupp opened this issue Mar 12, 2023 · 2 comments
Comments

@svilupp
Contributor

svilupp commented Mar 12, 2023

This is an odd one and likely to be a PICNIC...

Problem: Missingness in a string column is lost after saving and loading an Arrow file

When it happens: A column in my dataset has type Union{Missing,String}, I write the dataset in partitions, and the missing value appears only in a later partition (not the first). It's easily reproducible (see below).

Debugging:

  • It happens only with DataFrames (not with a Tables.rowtable created from a NamedTuple)
  • It happens only when partitioned as Iterators.partition(Tables.rows(df), 2). Partitioning as Iterators.partition(df, 2), available in DataFrames >1.5.0, is fine
  • If a missing value appears in the first partition, it's fine
  • The validity bitmap is written correctly
  • But the field is marked as not nullable (!)

┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
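
The same flag can also be checked from the reading side, without debug logging, by looking at the schema Arrow reports after a round trip. A minimal sketch, assuming an in-memory buffer instead of a file; `roundtrip_schema` is a hypothetical helper name, not part of Arrow.jl:

```julia
using Arrow, Tables

# Hypothetical helper: write any Tables.jl source (or an iterator of
# partitions) to an in-memory buffer and return the schema read back.
function roundtrip_schema(source)
    io = IOBuffer()
    Arrow.write(io, source)
    seekstart(io)
    return Tables.schema(Arrow.Table(io))
end

# A plain column table, written as a single partition
sch = roundtrip_schema((x1 = ["a", "b", missing, "c"], x2 = collect(1:4)))

# A column that kept its missingness shows Union{Missing, String} here;
# a column that lost it shows String.
sch.types
```

Passing each of the partitioned sources from the MWE to the helper gives a quick side-by-side comparison of which ones keep the nullable type.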

MWE

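using Arrow, Tables, DataFramesMeta  # DataFramesMeta re-exports DataFrames
using Logging
# To see the "building field" debug output shown above, wrap a write call in
# with_logger(debuglogger) do ... end
debuglogger = ConsoleLogger(stderr, Logging.Debug)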

# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame

# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}

# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:

# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}

# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}

Versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

Arrow: 2.4.3 on main branch

@svilupp
Contributor Author

svilupp commented Mar 12, 2023

I think I know where it's coming from.

The issue happens here

  • Only the first partition is scanned to determine the schema
  • Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through Tables.columns
  • It does however keep the reference to the parent (and its schema)

In other words, we do partition |> Tables.columns |> Tables.schema, which loses the missingness.
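
The mechanism can be sketched in plain Julia, with no packages involved. This is only an analogy under the assumption that a schema-less row source forces the consumer to infer column types from the values it sees; it is not the actual code path in Tables.jl or Arrow.jl, and `infer_eltype` is a hypothetical helper:

```julia
col = Union{Missing, String}["a", "b", missing, "c"]
parts = collect(Iterators.partition(col, 2))

# Hypothetical stand-in for value-based inference: join the types of the
# values actually present in one partition.
infer_eltype(part) = mapreduce(typeof, Base.promote_typejoin, part)

infer_eltype(parts[1])  # String (no missing value seen yet)
infer_eltype(parts[2])  # Union{Missing, String}

# The declared element type of the view still carries the missingness
eltype(parts[1])        # Union{Missing, String}
```

If only the first partition decides the schema, the value-based answer for that partition (String) is what ends up in the file, even though the declared type was available all along.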

I don't know enough about the Tables API/contract to know whether this is an Arrow problem, Tables problem, or DataFrames problem. Does this issue belong somewhere else?

It would be an easy fix to get schema info from the parent object, but are all Tables-compatible sources required to keep that?

Eg,

  • change from partition |> Tables.columns |> Tables.schema
  • to partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema

Should I open a PR?
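
Until that is settled, a possible user-side workaround is to partition row indices and hand Arrow views of the DataFrame, so that each partition reports the parent's column types. This is a sketch under assumptions: it relies on `Tables.partitioner(f, itr)` and on `view(df, rows, :)` returning a SubDataFrame, and it has only been reasoned through against the MWE above, not tested across versions:

```julia
using Arrow, DataFrames, Tables

df = DataFrame(x1 = ["a", "b", missing, "c"], x2 = 1:4)

# Each partition is a view into df, so its schema comes from the parent's
# column eltypes rather than from the values present in that slice.
parts = Tables.partitioner(idx -> view(df, idx, :),
                           Iterators.partition(1:nrow(df), 2))

io = IOBuffer()
Arrow.write(io, parts)
seekstart(io)
t = Arrow.Table(io)

# Expected to keep the missingness, like Iterators.partition(df, 2) does
eltype(t.x1)
```

This sidesteps the problem rather than fixing it: any source that partitions Tables.rows(df) directly would still lose the type.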

Illustration

# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  String
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

EDIT: I suspect this will also affect other partitioners that rely on Iterators over Tables.rows(), e.g., TableOperations.makepartitions()

@evetion

evetion commented Nov 10, 2023

At the moment, a similar thing is blocking #477.
