Nullable fields don't always need Union{Missing, T} #384

Open
evetion opened this issue Jan 27, 2023 · 2 comments · May be fixed by #477

Comments

evetion commented Jan 27, 2023

I'm trying to implement the GeoArrow spec, which gives back coordinates in a deeply nested list of a FixedList (a point). Because these lists are theoretically nullable, in Julia we get a deeply nested list with Unions of Missing, even though these vectors contain no missings. An example for a column of LineStrings (there are geometry types that require two more levels of nesting):

2-element Arrow.List{Vector{Union{Missing, Vector{Union{Missing, Tuple{Float64, Float64}}}}}

It's pretty hard to convert these elements to a concrete Vector{Vector{NTuple{2, Float64}}} without allocating. Is there a way to edit the view to be non-missing? An alternative would be to pass all(validitybitmap) in build to juliaeltype, so that we only include Missing in the element type when there are actual missing values.
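
A minimal sketch of that second idea, assuming a hypothetical helper `eltype_given_validity` (not an Arrow.jl function) and a plain `Vector{Bool}` standing in for the validity bitmap:

```julia
# Hypothetical helper, not part of Arrow.jl: choose the element type from
# the validity information instead of from the schema's nullable flag alone.
# `validity` is a plain Bool vector standing in for Arrow's validity bitmap.
function eltype_given_validity(::Type{T}, validity::AbstractVector{Bool}) where {T}
    # Only widen to Union{Missing, T} when at least one slot is actually null.
    return all(validity) ? T : Union{Missing, T}
end

const Point = Tuple{Float64, Float64}

eltype_given_validity(Point, [true, true, true])   # Tuple{Float64, Float64}
eltype_given_validity(Point, [true, false, true])  # Union{Missing, Tuple{Float64, Float64}}
```

Note that the resulting element type then depends on the data rather than on the schema alone, so the choice cannot be inferred at compile time.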

I'm happy to make a PR if there's consensus on what to do.

Might be related to #373.

Member

quinnj commented Jun 13, 2023

We recently updated the Arrow.List type to return a SubArray into the underlying data array; does that help your overall issue here w/ the allocations?

Yeah, we could potentially check the validitybitmap to see if there are any missings before building the eltype, but it does make me a tad nervous about unrelated side effects it might introduce.

I'd say let's go for a PR and then we can take a look at how much work this would actually be.

Contributor

Moelf commented Jun 13, 2023

I don't think it's fixed:

julia> col1 = Vector{Union{Int64, String}}[
        ["one", 2],
        ["one", 2, 3],
        ["one", 2, 3, 4],
        ["one", 2, 3, 4, 5]];

julia> df = DataFrame(;col1)
4×1 DataFrame
 Row │ col1
     │ Array
─────┼───────────────────────────────────
   1 │ Union{Int64, String}["one", 2]
   2 │ Union{Int64, String}["one", 2, 3]
   3 │ Union{Int64, String}["one", 2, 3
   4 │ Union{Int64, String}["one", 2, 3

julia> a = tempname()
"/tmp/jl_IngNyJwngp"

julia> Arrow.write(a, df)
"/tmp/jl_IngNyJwngp"

julia> Arrow.Table(a)
Arrow.Table with 4 rows, 1 columns, and schema:
 :col1    SubArray{Union{Missing, Int64, String}, 1, Arrow.DenseUnion{Union{Missing, Int64, String}, Arrow.UnionT{Arrow.Flatbuf.UnionMode.Dense, nothing, Tuple{Union{Missing, Int64}, String}}, Tuple{Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}, Arrow.List{String, Int32, Vector{UInt8}}}}, Tuple{UnitRange{Int64}}, true}

julia> Arrow.Table(a).col1[1]
2-element view(::Arrow.DenseUnion{Union{Missing, Int64, String}, Arrow.UnionT{Arrow.Flatbuf.UnionMode.Dense, nothing, Tuple{Union{Missing, Int64}, String}}, Tuple{Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}, Arrow.List{String, Int32, Vector{UInt8}}}}, 1:2) with eltype Union{Missing, Int64, String}:
  "one"
 2
