Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add missing columns when push! ing? #2150

Closed
kleinschmidt opened this issue Mar 11, 2020 · 10 comments · Fixed by #2152
Closed

add missing columns when push! ing? #2150

kleinschmidt opened this issue Mar 11, 2020 · 10 comments · Fixed by #2152
Labels
decision non-breaking The proposed change is not breaking
Milestone

Comments

@kleinschmidt
Copy link
Contributor

I'm trying to build up a data frame one row at a time, except that not every row has all the columns. I'd like to be able to push! the rows on one at a time, filling in the missing columns with missing. using cols=:subset gets me halfway there:

julia> df = DataFrame()
0×0 DataFrame

julia> push!(df, (a=1, b=2))
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 112     │

julia> allowmissing!(df)
1×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 112      │
julia> push!(df, (a=1, c=3), cols=:subset)
2×2 DataFrame
│ Row │ a      │ b       │
│     │ Int64⍰ │ Int64⍰  │
├─────┼────────┼─────────┤
│ 112       │
│ 21missing

But notice that no column for c is added. Having to call allowmissing! is also potentially annoying since I'll have to call that for every row or have to do the bookkeeping of keeping track of which columns have already missings in them in the loop over rows. But I think that's more related to #1716 .

I'd like either an additional argument like fill, or another value for cols which opts into adding columns as they are encountered.

@bkamins
Copy link
Member

bkamins commented Mar 11, 2020

I think it is a duplicate (actually a subset) of #2032. Could you please have a look there and if yes this can be closed.

Just as a comment - things are not easy here as you have to decide about widening rule. If you have a column of Ints and want to push a float to it - should the column be converted to floats then?

In particular - maybe you want something simpler, which could be implemented before #2032. Something like cols=:union where, as you proposed missing columns would be added filled with missing in existing rows and the new entry given type it has.

Now having to call allowmissing! is another issue, and potentially another kwarg could be added, like widen, again it is tricky, do we want to allow promotion:

  • only to Missing union
  • using promote_type (could lead to conversion of existing data)
  • using typeunion (will produce abstract types)
  • using Union of encountered types (this is not what normally is done)

In short - we know about the problem, but the decision what is a robust and future proof design here is hard 😭, so for now we disallowed this.

If you would have an idea of a minimal API extension that would do what is expected most of the time then it would be great (my fear is that having fully flexible API here would be an overkill for a regular user, but maybe I am wrong).

@bkamins bkamins added the non-breaking The proposed change is not breaking label Mar 11, 2020
@bkamins bkamins added this to the 1.x milestone Mar 11, 2020
@pdeffebach
Copy link
Contributor

We could expand vcat to work with anything that is property accessible.

@bkamins
Copy link
Member

bkamins commented Mar 13, 2020

The problem is how we should determine the return type? Would you want to dispatch on a first argument to vcat (i.e. if the first argument is a AbstractDataFrame then we return a DataFrame)? This is probably doable (I have just checked that there should not be dispatch ambiguities - which was a high risk given vcat in Base is pretty flexible). Then would you want to make anything that is Tables.jl compliant to be accepted?

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

@pdeffebach
Copy link
Contributor

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

Yes I think it is best to keep vcat for DataFrame to DataFrame, given that any making a DataFrame is cheap.

This could also be handled at the Tables.jl level, which could provide a function to fill in columns and expand types as needed. For example

julia> t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
2-element Array{NamedTuple,1}:
 (a = 1, b = 2)
 (a = 3, b = 4, c = 5)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

@bkamins
Copy link
Member

bkamins commented Mar 13, 2020

I did not know about this. So Tables.columns effectively uses the :subset strategy of push!, which is also visible here:

julia> t = [(a = 1, b = 2), (b=3,a=4)]
2-element Array{NamedTuple{names,Tuple{Int64,Int64}} where names,1}:
 (a = 1, b = 2)
 (b = 3, a = 4)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 4     │ 3     │

@quinnj - is this intended, or this is just a consequence of performance considerations?

@quinnj
Copy link
Member

quinnj commented Apr 3, 2020

For any Tables.jl source without a well-defined schema (i.e. returns nothing from Tables.schema(tbl)), which in this case, includes Arrays of non-homogenous NamedTuples, it uses the Tables.columnnames of the first row. So I don't think that's necessarily :subset because I think it's actually an error if a subsequent row doesn't have one of the column names from the first row.

@nalimilan
Copy link
Member

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

@bkamins
Copy link
Member

bkamins commented Apr 3, 2020

Ah - right, so it is :intersect option then with promote=true (so that we promote column types if needed, which is added in #2152 PR) - right?

@quinnj
Copy link
Member

quinnj commented Apr 6, 2020

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

Yes, extra columns not in the first row will be ignored; what I meant in my comment is that if the first row has extra columns that aren't present in subsequent rows, that would be an error (since it would be trying to get a column that didn't exist on later rows).

@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

Yes - this is exactly :intersect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
decision non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants