CSV.read not working properly with StructArray #941

Open

robsmith11 opened this issue Nov 13, 2021 · 5 comments

Labels
documentation improvement Improve an existing feature/functionality

Comments

@robsmith11

It works as expected when reading into a NamedTuple first, but not when reading directly into a StructArray:

julia> write("/tmp/a.csv", "a,b,c\n1,2,3\n4,5,6\n");

julia> CSV.read("/tmp/a.csv", StructArrays.StructArray)
0-element StructArray() with eltype Vector{Int64} with indices 1:0

julia> StructArrays.StructArray(CSV.read("/tmp/a.csv", NamedTuple))
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
 (a = 1, b = 2, c = 3)
 (a = 4, b = 5, c = 6)

julia> Pkg.status("CSV")
Status `/me/.julia/environments/v1.8/Project.toml`
  [336ed68f] CSV v0.9.10
@quinnj
Member

quinnj commented Nov 13, 2021

Pinging @piever to chime in, but as I understand it, a StructArray is a valid Tables.jl source, but it doesn't currently have a sink interface; i.e. the generic fallback StructArray(x) definition expects x to be an iterator of structs, not necessarily a Tables.jl-related source.

My guess is it would be trivial to add some kind of StructArrays.fromtable function that would take any valid Tables.jl source and create a StructArray, if you wanted to take a stab at a PR.
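
For illustration, a minimal sketch of the kind of helper described above, assuming the hypothetical name fromtable (it is not part of StructArrays.jl at the time of this discussion):

using Tables, StructArrays

# Hypothetical helper: convert any Tables.jl source into a StructArray
# by first collecting it into a named tuple of column vectors.
fromtable(source) = StructArray(Tables.columntable(source))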

@piever

piever commented Nov 15, 2021

Yes, that is correct. You can actually already do

CSV.read(fn, StructArray∘Tables.columntable)

which is probably close to optimal (the data structure backing a StructArray is a named tuple of vectors).
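
For illustration of that last point, a StructArray can be built directly from a named tuple of column vectors, which is its backing storage (a minimal sketch):

using StructArrays

cols = (a = [1, 4], b = [2, 5], c = [3, 6])  # named tuple of vectors
sa = StructArray(cols)                        # wraps the columns directly
sa[1]                                         # (a = 1, b = 2, c = 3)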

Should CSV.read call materializer(sink) rather than sink to process the data? That way, I can just define the correct

materializer(::Type{<:StructArray})

method, and CSV.read(fn, StructArray) would just work.
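
For illustration, a minimal sketch of such a method, using the composition mentioned above; this definition is hypothetical and not part of StructArrays.jl at the time of this discussion:

using Tables, StructArrays

# Hypothetical method: make materializer(StructArray) return a sink
# function that collects the columns and wraps them in a StructArray.
Tables.materializer(::Type{<:StructArray}) = StructArray ∘ Tables.columntable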

@quinnj
Member

quinnj commented Nov 16, 2021

StructArray∘Tables.columntable seems like the way to go to me. CSV.read is pretty well documented: you need to pass a valid sink function; materializer is usually just for when you pass in an instance of a table and want to get its materializing function.

@robsmith11
Author

I think most users who aren't familiar with the implementation details are still going to expect StructArray to work as a sink without going through an intermediate step. It's a table-like data structure just like DataFrame.

@piever

piever commented Nov 16, 2021

CSV.read is pretty well documented that you need to pass a valid sink function;

Makes sense. I can probably still define materializer for a StructArray type, so that users can do

CSV.read(fn, materializer(StructArray))

Rather than defining CSV.read(fn, sink::Type) as CSV.File(fn) |> materializer(sink) (that may be a step too far), maybe the docs of CSV.read could briefly mention the materializer helper, which gets a valid sink function from an instance or a type of table (I think both work).
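
For illustration, assuming a materializer method like the sketch above were defined for StructArray, the call would look like this (hypothetical until such a method exists):

using CSV, StructArrays, Tables

# materializer(StructArray) would return a valid sink function,
# so it can be passed straight to CSV.read.
sa = CSV.read("/tmp/a.csv", Tables.materializer(StructArray))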
