Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

Closed
svilupp opened this issue Mar 12, 2023 · 3 comments · Fixed by #3299
Closed

When partitioned, partition might lose the missingness eltype (in Tables.schema) #3298

svilupp opened this issue Mar 12, 2023 · 3 comments · Fixed by #3299
Labels
Milestone

Comments

@svilupp
Copy link
Contributor

svilupp commented Mar 12, 2023

Problem: If a user partitions a DataFrame with Iterators.partition(Tables.rows(df), 2), the corresponding Tables.schema for each partition will not be type-stable
Eg, if the first partition does not have any missing information, it would not have the missing type in its schema despite the overall vector having missingness allowed.

Why it's a problem: Arrow.write determines the schema from the first record batch, because Tables-compatible sources retain parents' schema information even for partitions. In effect, missing fields could be lost and replaced by empty fields of the concrete type (eg, "" for String).

Example

# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  String
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

It might be expected, I'm not sure what's required by Tables interface. But I'm opening for visibility, the original issue is in apache/arrow-julia#403 , because that's where the downstream error happens.

Versioninfo:

DataFrames v1.5.0

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

@bkamins bkamins added this to the patch milestone Mar 12, 2023
@bkamins
Copy link
Member

bkamins commented Mar 12, 2023

I will add it. Note that currently it is assumed that you use Iterators.partition on a data frame, and this works as expected:

julia> for part in Iterators.partition(df, 2)
           @info "Parent type: $(part|>Tables.schema)"
           @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
           end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

@bkamins
Copy link
Member

bkamins commented Mar 12, 2023

I will open #3299 in a few minutes fixing this:

julia> for part in Iterators.partition(Tables.rows(df), 2)
           @info "Parent type: $(parent(part) |> Tables.schema)"
               @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
               end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

(note that you should not use part.parent but rather parent(part) in Arrow.jl (if you use information about parent). But in general, it should be enough to just write:

julia> for part in Iterators.partition(Tables.rows(df), 2)
           @info "Parent type: $(part |> Tables.schema)"
               @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
               end
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Parent type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64
┌ Info: Columns type: Tables.Schema:
│  :x1  Union{Missing, String}
└  :x2  Int64

(at least in DataFrames.jl - without checking parent)

@svilupp
Copy link
Contributor Author

svilupp commented Mar 13, 2023

Thanks for the tip! The parent was merely an observation / quick fix.

The call Tables.partitions(df) |> Tables.columns(part) |> Tables.schema happens within Arrow.write as it needs the collection of columns to serialize into Arrow format.

I think the design relies on this statement in the docs for Tables.partitions:

?Tables.partitions(x)
Request a "table" iterator from x. Each iterated element must be a "table" in the sense that one may call Tables.rows or
Tables.columns to get a row-iterator or collection of columns. All iterated elements must have an identical schema, so that users may
call Tables.schema(first_element) on the first iterated element and know that each subsequent iteration will match the same schema

But, unfortunately, the information gets lost if users pre-partition their DataFrame based on row chunks (in the pre-1.5.0 world) with Tables.rows

EDIT: Basically, the intermediate call to "Tables.columns" before Arrow serializes data is what is causing all this pain. Because when DataFrameRows are passed to it, it just materializes them as they are (and the parent schema is lost)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants