add missing columns when push! ing? #2150

kleinschmidt · 2020-03-11T22:01:33Z

I'm trying to build up a data frame one row at a time, except that not every row has all the columns. I'd like to be able to push! the rows on one at a time, filling in the missing columns with missing. using cols=:subset gets me halfway there:

julia> df = DataFrame()
0×0 DataFrame

julia> push!(df, (a=1, b=2))
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> allowmissing!(df)
1×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 1   │ 1      │ 2      │
julia> push!(df, (a=1, c=3), cols=:subset)
2×2 DataFrame
│ Row │ a      │ b       │
│     │ Int64⍰ │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ 1      │ 2       │
│ 2   │ 1      │ missing │

But notice that no column for c is added. Having to call allowmissing! is also potentially annoying since I'll have to call that for every row or have to do the bookkeeping of keeping track of which columns have already missings in them in the loop over rows. But I think that's more related to #1716 .

I'd like either an additional argument like fill, or another value for cols which opts into adding columns as they are encountered.

The text was updated successfully, but these errors were encountered:

bkamins · 2020-03-11T22:30:14Z

I think it is a duplicate (actually a subset) of #2032. Could you please have a look there and if yes this can be closed.

Just as a comment - things are not easy here as you have to decide about widening rule. If you have a column of Ints and want to push a float to it - should the column be converted to floats then?

In particular - maybe you want something simpler, which could be implemented before #2032. Something like cols=:union where, as you proposed missing columns would be added filled with missing in existing rows and the new entry given type it has.

Now having to call allowmissing! is another issue, and potentially another kwarg could be added, like widen, again it is tricky, do we want to allow promotion:

only to Missing union
using promote_type (could lead to conversion of existing data)
using typeunion (will produce abstract types)
using Union of encountered types (this is not what normally is done)

In short - we know about the problem, but the decision what is a robust and future proof design here is hard 😭, so for now we disallowed this.

If you would have an idea of a minimal API extension that would do what is expected most of the time then it would be great (my fear is that having fully flexible API here would be an overkill for a regular user, but maybe I am wrong).

pdeffebach · 2020-03-13T17:37:10Z

We could expand vcat to work with anything that is property accessible.

bkamins · 2020-03-13T18:26:21Z

The problem is how we should determine the return type? Would you want to dispatch on a first argument to vcat (i.e. if the first argument is a AbstractDataFrame then we return a DataFrame)? This is probably doable (I have just checked that there should not be dispatch ambiguities - which was a high risk given vcat in Base is pretty flexible). Then would you want to make anything that is Tables.jl compliant to be accepted?

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

pdeffebach · 2020-03-13T18:52:12Z

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

Yes I think it is best to keep vcat for DataFrame to DataFrame, given that any making a DataFrame is cheap.

This could also be handled at the Tables.jl level, which could provide a function to fill in columns and expand types as needed. For example

julia> t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
2-element Array{NamedTuple,1}:
 (a = 1, b = 2)
 (a = 3, b = 4, c = 5)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

bkamins · 2020-03-13T19:51:56Z

I did not know about this. So Tables.columns effectively uses the :subset strategy of push!, which is also visible here:

julia> t = [(a = 1, b = 2), (b=3,a=4)]
2-element Array{NamedTuple{names,Tuple{Int64,Int64}} where names,1}:
 (a = 1, b = 2)
 (b = 3, a = 4)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 4     │ 3     │

@quinnj - is this intended, or this is just a consequence of performance considerations?

quinnj · 2020-04-03T17:34:40Z

For any Tables.jl source without a well-defined schema (i.e. returns nothing from Tables.schema(tbl)), which in this case, includes Arrays of non-homogenous NamedTuples, it uses the Tables.columnnames of the first row. So I don't think that's necessarily :subset because I think it's actually an error if a subsequent row doesn't have one of the column names from the first row.

nalimilan · 2020-04-03T17:41:03Z

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

bkamins · 2020-04-03T18:14:57Z

Ah - right, so it is :intersect option then with promote=true (so that we promote column types if needed, which is added in #2152 PR) - right?

quinnj · 2020-04-06T15:02:08Z

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

Yes, extra columns not in the first row will be ignored; what I meant in my comment is that if the first row has extra columns that aren't present in subsequent rows, that would be an error (since it would be trying to get a column that didn't exist on later rows).

bkamins · 2020-04-06T15:47:21Z

Yes - this is exactly :intersect.

bkamins added the non-breaking The proposed change is not breaking label Mar 11, 2020

bkamins added this to the 1.x milestone Mar 11, 2020

bkamins added the decision label Mar 11, 2020

bkamins mentioned this issue Mar 13, 2020

allow :union as cols kwarg in push! and append! #2152

Merged

bkamins closed this as completed in #2152 Apr 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add missing columns when push! ing? #2150

add missing columns when push! ing? #2150

kleinschmidt commented Mar 11, 2020

bkamins commented Mar 11, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020 •

edited

Loading

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

quinnj commented Apr 3, 2020

nalimilan commented Apr 3, 2020

bkamins commented Apr 3, 2020

quinnj commented Apr 6, 2020

bkamins commented Apr 6, 2020

add missing columns when push! ing? #2150

add missing columns when push! ing? #2150

Comments

kleinschmidt commented Mar 11, 2020

bkamins commented Mar 11, 2020

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020 • edited Loading

pdeffebach commented Mar 13, 2020

bkamins commented Mar 13, 2020

quinnj commented Apr 3, 2020

nalimilan commented Apr 3, 2020

bkamins commented Apr 3, 2020

quinnj commented Apr 6, 2020

bkamins commented Apr 6, 2020

bkamins commented Mar 13, 2020 •

edited

Loading