When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

svilupp · 2023-03-12T19:20:58Z

Problem: If a user partitions a DataFrame with Iterators.partition(Tables.rows(df), 2), the corresponding Tables.schema for each partition will not be type-stable
Eg, if the first partition does not have any missing information, it would not have the missing type in its schema despite the overall vector having missingness allowed.

Why it's a problem: Arrow.write determines the schema from the first record batch, because Tables-compatible sources retain parents' schema information even for partitions. In effect, missing fields could be lost and replaced by empty fields of the concrete type (eg, "" for String).

Example

# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  String
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

It might be expected, I'm not sure what's required by Tables interface. But I'm opening for visibility, the original issue is in apache/arrow-julia#403 , because that's where the downstream error happens.

Versioninfo:

DataFrames v1.5.0

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

The text was updated successfully, but these errors were encountered:

bkamins · 2023-03-12T19:56:53Z

I will add it. Note that currently it is assumed that you use Iterators.partition on a data frame, and this works as expected:

julia> for part in Iterators.partition(df, 2)
           @info "Parent type: $(part|>Tables.schema)"
           @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
           end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

bkamins · 2023-03-12T20:12:11Z

I will open #3299 in a few minutes fixing this:

julia> for part in Iterators.partition(Tables.rows(df), 2)
           @info "Parent type: $(parent(part) |> Tables.schema)"
               @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
               end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

(note that you should not use part.parent but rather parent(part) in Arrow.jl (if you use information about parent). But in general, it should be enough to just write:

julia> for part in Iterators.partition(Tables.rows(df), 2)
           @info "Parent type: $(part |> Tables.schema)"
               @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
               end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

(at least in DataFrames.jl - without checking parent)

svilupp · 2023-03-13T07:29:00Z

Thanks for the tip! The parent was merely an observation / quick fix.

The call Tables.partitions(df) |> Tables.columns(part) |> Tables.schema happens within Arrow.write as it needs the collection of columns to serialize into Arrow format.

I think the design relies on this statement in the docs for Tables.partitions:

?Tables.partitions(x)
Request a "table" iterator from x. Each iterated element must be a "table" in the sense that one may call Tables.rows or
Tables.columns to get a row-iterator or collection of columns. All iterated elements must have an identical schema, so that users may
call Tables.schema(first_element) on the first iterated element and know that each subsequent iteration will match the same schema

But, unfortunately, the information gets lost if users pre-partition their DataFrame based on row chunks (in the pre-1.5.0 world) with Tables.rows

EDIT: Basically, the intermediate call to "Tables.columns" before Arrow serializes data is what is causing all this pain. Because when DataFrameRows are passed to it, it just materializes them as they are (and the parent schema is lost)

bkamins added the feature label Mar 12, 2023

bkamins added this to the patch milestone Mar 12, 2023

bkamins mentioned this issue Mar 12, 2023

add Iterators.partition for DataFrameRows #3299

Merged

bkamins closed this as completed in #3299 Apr 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

svilupp commented Mar 12, 2023

bkamins commented Mar 12, 2023

bkamins commented Mar 12, 2023

svilupp commented Mar 13, 2023 •

edited

Loading

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

Comments

svilupp commented Mar 12, 2023

bkamins commented Mar 12, 2023

bkamins commented Mar 12, 2023

svilupp commented Mar 13, 2023 • edited Loading

svilupp commented Mar 13, 2023 •

edited

Loading