DataFrame constructors #1599

Closed
bkamins opened this issue Nov 14, 2018 · 12 comments

Comments

@bkamins
Member

bkamins commented Nov 14, 2018

I am planning a small cleanup of the DataFrame constructors. One significant change I am considering is adding a constructor DataFrame(::AbstractVector{Pair{Symbol, T}}; makeunique) where T<:AbstractVector or DataFrame(::AbstractVector{Pair{Symbol}}; makeunique). I am not 100% sure which would be better; in the second we would convert the second element of each pair to an AbstractVector inside the constructor.

This change would complement the existing constructors:

  • DataFrame(pairs::Pair{Symbol,<:Any}...; makeunique)
  • DataFrame(; kwargs...) (actually this constructor would be made more efficient as we would avoid splatting inside it)

and would work nicely with the fact that now eachcol returns such an AbstractVector.

Similarly, do we think that a constructor accepting AbstractVector{<:DataFrameRow} would be good? (This would complement eachrow and is similar to an AbstractVector of NamedTuples.)

CC @nalimilan
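As a side note on the two candidate signatures above, here is a minimal sketch of how Julia's dispatch would treat them. The `accepts_*` functions are hypothetical helpers invented for illustration (they are not part of DataFrames.jl), and the `<:Pair{Symbol}` variant is an assumption added for comparison. Because parametric types are invariant, the signature written with a bare `Pair{Symbol}` element type would match only vectors whose element type is exactly `Pair{Symbol}`.

```julia
# Hypothetical helpers (not DataFrames.jl API): each one only reports whether
# a method with the given signature would accept the argument.
accepts_typed(::AbstractVector{Pair{Symbol, T}}) where {T<:AbstractVector} = true
accepts_typed(::Any) = false

accepts_exact(::AbstractVector{Pair{Symbol}}) = true
accepts_exact(::Any) = false

# Assumed alternative: a covariant element type.
accepts_loose(::AbstractVector{<:Pair{Symbol}}) = true
accepts_loose(::Any) = false

# Element type is Pair{Symbol, Vector{Int64}} (all columns share one type).
homogeneous = [:a => [1, 2, 3], :b => [3, 4, 5]]

# Element type is forced to the abstract Pair{Symbol} (columns differ in type).
mixed = Pair{Symbol}[:a => [1, 2, 3], :b => ["x", "y", "z"]]

for (label, v) in (("homogeneous", homogeneous), ("mixed", mixed))
    println(label, ": typed=", accepts_typed(v),
            " exact=", accepts_exact(v),
            " loose=", accepts_loose(v))
end
```

Under these assumptions the first signature accepts only the homogeneous vector, the second only the vector with the abstract element type, and the `<:` variant accepts both.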

@nalimilan
Member

Makes sense, but couldn't the Tables.jl default constructor handle this automatically? Cc: @quinnj

@bkamins
Member Author

bkamins commented Nov 14, 2018

We could add appropriate methods following Tables.jl interface to handle them.

The only place where this is not possible is AbstractVector{<:DataFrameRow} where we would have to check if the rows come from the same data frame and if yes materialize a SubDataFrame to avoid allocations (selecting only appropriate rows from the source). Such a behavior needs a custom method.

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins
Member Author

bkamins commented Feb 10, 2019

Given the discussion in #1646 (comment), maybe we should consider the following list of possible additional DataFrame constructors, accepting:

  1. a collection of AbstractVectors (now we only accept a Vector, this would allow e.g. a tuple of vectors).
  2. a collection of DataFrameRows (now we accept a collection of NamedTuples via Tables.jl and this would complement it).
  3. a collection of Pair{Symbol, T} where T<:AbstractVector, this would complement the constructors we have for Pairs by not requiring pairs to be splatted.

@nalimilan - this is what I have on a long-list of possible new constructors. None of them is crucial (so I am OK with not adding any of them - then I would close this issue), but please comment on all of them so that we can stabilize the API here.

@nalimilan
Member

1 and 2 sound OK. Though Tables.jl should ideally support a collection of DataFrameRow objects too (maybe via a trait), since it allows tables to return this kind of object on iteration.

I'm more hesitant about 3 since that creates yet another kind of named collection, but why not.

@bkamins
Member Author

bkamins commented Feb 11, 2019

Regarding 1.

Whether we want to go ahead with this is still to be decided (given the comments below).

The problem I see now is that Tables.jl by default assumes that an iterable is accessed row-wise, not column-wise, and in general it is impossible to tell whether an iterable contains only vectors without consuming it.

So I am not sure; maybe we should drop this and keep only the vector of vectors that we have now?
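To illustrate the point about not being able to tell what an iterable contains, here is a small sketch in plain Julia (no DataFrames.jl or Tables.jl involved; the variable names are made up for the example). A tuple carries its element types in its own type, while a lazy generator reports an unknown element type until it is consumed.

```julia
# A tuple of vectors: the element types are part of the tuple's type,
# so a constructor can dispatch on it without iterating.
tuple_of_vectors = ([1, 2, 3], [4.0, 5.0, 6.0])
is_tuple_of_vectors = tuple_of_vectors isa Tuple{Vararg{AbstractVector}}

# A lazy generator of vectors: its element type is not known up front.
lazy_columns = (fill(i, 3) for i in 1:2)
declared_eltype = eltype(lazy_columns)
eltype_trait = Base.IteratorEltype(lazy_columns)

# Only after consuming the iterator can we check what it produced.
materialized = collect(lazy_columns)
only_vectors = all(v -> v isa AbstractVector, materialized)

println("tuple known to hold vectors: ", is_tuple_of_vectors)
println("generator eltype: ", declared_eltype, " (", eltype_trait, ")")
println("vectors only, after collect: ", only_vectors)
```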

Regarding 2.

Actually we have it (for free) via Tables.jl, so this is a non-issue.

Regarding 3.

Actually now I have noticed we have a problem here:

julia> DataFrame([:a=>[1,2,3],:b=>[3,4,5]])
2×2 DataFrame
│ Row │ first  │ second    │
│     │ Symbol │ Array…    │
├─────┼────────┼───────────┤
│ 1   │ a      │ [1, 2, 3] │
│ 2   │ b      │ [3, 4, 5] │

This behavior is inherited from Queryverse. So I would drop it.

@bkamins
Member Author

bkamins commented Feb 11, 2019

By "So I would drop it." I mean that I would leave it as is for now.

@davidanthoff
Contributor

"This behavior is inherited from Queryverse."

I think that is a Tables.jl thing, not Queryverse.jl/TableTraits.jl. TableTraits.jl only considers iterators of named tuples to be tables, so it wouldn't treat an array of Pairs as a table. Tables.jl treats an iterator of elements that implement the getproperty interface as a table, so all iterators of composite types are tables under that definition, and I think that explains the constructor behavior you see in your point 3.
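The explanation above can be sketched in plain Julia. This is only an illustration of the idea, not the Tables.jl implementation: a `Pair` exposes the properties `first` and `second`, so a fallback that treats every element as a row and every property as a column ends up with two columns named `first` and `second`. The variable names are invented for the example.

```julia
# A vector of Pairs, as in the constructor call discussed in point 3.
rows = [:a => [1, 2, 3], :b => [3, 4, 5]]

# A Pair is a composite type, so its fields are visible as properties.
props = propertynames(first(rows))

# Row-wise reading: one column per property, one entry per element.
columns = NamedTuple{props}(Tuple([getproperty(r, p) for r in rows] for p in props))

println(props)
println(columns.first)
println(columns.second)
```

Read this way, each pair becomes one row, which matches the 2×2 result with columns `first` and `second` shown earlier in the thread.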

@bkamins
Member Author

bkamins commented Feb 11, 2019

@davidanthoff - thank you for the explanation.

Summing up this discussion + #1646 (comment):

the only thing I will add is a constructor taking a tuple of vectors to create a data frame (this currently throws an error so it will be non-breaking).

@nalimilan
Member

Case 3 is indeed annoying if we wanted to use it for column-wise operations later. It is very similar to the equivalent varargs constructor, which is column-wise. Maybe it is better to support it to avoid the inconsistency (or at least have it throw an error).

Basically we just need to decide on a list of exceptions to the fallback Tables.jl constructor, which treats everything as row-wise. Are there any other cases which are not covered here?

@bkamins
Member Author

bkamins commented Feb 11, 2019

Regarding Case 1 I have opened #1717.

Regarding Case 3: This would be breaking (that is why I hesitated), but for sure what we have now makes no sense, and we should handle AbstractVector{Pair{Symbol, <:AbstractVector}} and NamedTuple{N, Pair{Symbol, <:AbstractVector}} as column oriented. If we all agree on this behavior I will open a separate PR for this (now it will be a deprecation warning).

@bkamins
Member Author

bkamins commented Feb 11, 2019

Actually rule 3 kicks in in a natural operation:

julia> DataFrame(eachcol(df, true)...)
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4        │
│     │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.122186 │ 0.1158   │ 0.953394 │ 0.0612821 │
│ 2   │ 0.369088 │ 0.974503 │ 0.670019 │ 0.186125  │
│ 3   │ 0.949404 │ 0.75961  │ 0.887188 │ 0.151146  │

while

julia> DataFrame(eachcol(df, true))
4×2 DataFrame
│ Row │ first  │ second                          │
│     │ Symbol │ Array{Float64,1}                │
├─────┼────────┼─────────────────────────────────┤
│ 1   │ x1     │ [0.122186, 0.369088, 0.949404]  │
│ 2   │ x2     │ [0.1158, 0.974503, 0.75961]     │
│ 3   │ x3     │ [0.953394, 0.670019, 0.887188]  │
│ 4   │ x4     │ [0.0612821, 0.186125, 0.151146] │

@bkamins
Member Author

bkamins commented Aug 26, 2019

I am closing this as we have all I have requested now via Tables.jl:

julia> df = DataFrame(rand(3,4), [:a, :b, :c, :d])
3×4 DataFrame
│ Row │ a        │ b        │ c        │ d        │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.971139 │ 0.722241 │ 0.25368  │ 0.541724 │
│ 2   │ 0.702116 │ 0.520817 │ 0.318176 │ 0.730486 │
│ 3   │ 0.363914 │ 0.541946 │ 0.341261 │ 0.134014 │

julia> DataFrame(eachrow(df))
3×4 DataFrame
│ Row │ a        │ b        │ c        │ d        │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.971139 │ 0.722241 │ 0.25368  │ 0.541724 │
│ 2   │ 0.702116 │ 0.520817 │ 0.318176 │ 0.730486 │
│ 3   │ 0.363914 │ 0.541946 │ 0.341261 │ 0.134014 │

julia> DataFrame(eachcol(df, true))
3×4 DataFrame
│ Row │ a        │ b        │ c        │ d        │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.971139 │ 0.722241 │ 0.25368  │ 0.541724 │
│ 2   │ 0.702116 │ 0.520817 │ 0.318176 │ 0.730486 │
│ 3   │ 0.363914 │ 0.541946 │ 0.341261 │ 0.134014 │

julia> DataFrame(eachcol(df))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.971139 │ 0.722241 │ 0.25368  │ 0.541724 │
│ 2   │ 0.702116 │ 0.520817 │ 0.318176 │ 0.730486 │
│ 3   │ 0.363914 │ 0.541946 │ 0.341261 │ 0.134014 │

@quinnj (if I am missing something here please reopen otherwise please comment if it is working as expected - thank you!)

@bkamins bkamins closed this as completed Aug 26, 2019