Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve tables support #276

Merged
merged 3 commits into from
Feb 23, 2024
Merged

improve tables support #276

merged 3 commits into from
Feb 23, 2024

Conversation

aplavin
Copy link
Member

@aplavin aplavin commented Jun 17, 2023

StructArrays constructor fundamentally cannot convert an arbitrary Tables.jl table to a StructArray. This would interfere with current, and totally sensible, collection handling: Tables.jl don't provide a common abstract type to dispatch.

However, I think it makes sense to support tables when dispatching is straightforward. Here, I add direct construction support for AbstractColumns. Its subtypes include numerous columnar tables defined in Tables.jl and in other packages. As a concrete motivating example, this enables CSV.read("file.csv", StructArray).

UPD: also, this PR adds a separate function StructArrays.fromtable to support constructing from arbitrary Tables objects, and Tables.materializer support.

@aplavin
Copy link
Member Author

aplavin commented Jun 17, 2023

Test failures seem irrelevant.

@aplavin
Copy link
Member Author

aplavin commented Jul 13, 2023

Bump

@aplavin
Copy link
Member Author

aplavin commented Aug 16, 2023

bump!

@piever
Copy link
Collaborator

piever commented Aug 24, 2023

Sorry for the delay in the review merge of the various PRs! So, here I agree that it makes sense to have a simpler way to do this, without asking users to do StructArray(Tables.columntable(x)), but somehow dispatching on AbstractColumns seems not ideal (or at least the docs of AbstractColumns suggest against it). OTOH, using the Tables machinery on StructArray(x::Any) seems like it could be confusing for a variety of reasons (for instance, a StructArray has more structure than a table as it preserves the eltype).

Possible options:

  1. See if it is possible to have a generic wrapper type Table to simply declare that an object is a table and change CSV.read(input, sink) = sink(Table(CSV.File(input))) in CSV.jl. Then, StructArray(t::Table) can do the right thing. Scratched, as the materializer approach mentioned below seems easier.
  2. Simply create a StructArrays.fromtable(x) = StructArray(columntable(x)) method. CSV.read(input, StructArrays.fromtable) seems like a reasonable interface to me.

cc: @quinnj

@aplavin
Copy link
Member Author

aplavin commented Aug 24, 2023

As I understand, dispatching on AbstractColumns is not harmful in any case. It's just that not all columnar tables are <: AbstractColumns, so this dispatch won't trigger for all tables. However, if something does subtype AbstractColumns, it surely is intended to be a table by the author of the type.

@aplavin
Copy link
Member Author

aplavin commented Aug 24, 2023

using the Tables machinery on StructArray(x::Any) seems like it could be confusing

100%, please never do that :) Tables and arrays can easily conflict, and this is manifested as bugs in functions on arrays that shouldn't be affected on Tables at all - see #278 (because append! uses Tables machinery in StructArrays).

@piever
Copy link
Collaborator

piever commented Aug 24, 2023

I'm just a little concerned that it could create odd expectations that StructArray(x) works from tables only some of the time. I see how in practice that is already the case, but adding dispatches one by one may not be a fully satisfactory solution.

I'm adding a link to a similar discussion in CSV.jl JuliaData/CSV.jl#941 (comment). I would agree with the proposal linked above to instead have a simple fromtable function and maybe also a materializer(::Type{<:StructArray}) = fromtable. Then users could do

CSV.read(input, StructArrays.fromtable) # or even
CSV.read(input, materializer(StructArray))

If that is unintuitive, there is also the option of using materializer(sink) rather than sink in the default definition of CSV.read, but it seems like there was no consensus for that at the time (see above link).

@aplavin
Copy link
Member Author

aplavin commented Dec 1, 2023

There are more and more cases when I wish this conversion was defined... AbstractColumns is not just for CSV, it's for many types defined both in Tables.jl and in other table-reading packages. Most recently, I encountered Arrow.jl: defining this method here would allow Arrow.Table(file) |> StructArray (now requires columntable inbetween).

StructArray is basically the type-stable columnar array + table in Julia, would be great to make interop as straightforward as possible. Yes, we cannot support all tables because there's nothing to dispatch on (and using Tables stuff without dispatch is quite bad, #278 for a specific bug it causes). But supporting more tables, those where it can be done cleanly, is a strict improvement IMO.

@aplavin
Copy link
Member Author

aplavin commented Dec 27, 2023

Bump...
fromtable() and/or materializer() are also nice to have, but kinda orthogonal to this. Here, we can support conversion from many table types directly, without calling extra functions explicitly.

@aplavin
Copy link
Member Author

aplavin commented Jan 12, 2024

gentle bump...

@aplavin
Copy link
Member Author

aplavin commented Feb 14, 2024

bump

@aplavin
Copy link
Member Author

aplavin commented Feb 18, 2024

Added fromtable and materializer following #291. Are there any remaining concerns regarding Tables support, or this can be merged?

@oschulz
Copy link
Contributor

oschulz commented Feb 18, 2024

Thanks for this @aplavin !

@aplavin aplavin merged commit 7522dfd into JuliaArrays:master Feb 23, 2024
7 checks passed
@aplavin
Copy link
Member Author

aplavin commented Feb 23, 2024

I'll take the liberty to merge this, following an advice on slack #arrays. The PR didn't get a strong "no" from maintainers, changes are nonbreaking, and quite local (ie not an overhaul).
Please let me know if anything is wrong, it's the first time I do this in JuliArrays.

@aplavin aplavin deleted the tables branch June 15, 2024 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants