
a new method of the flatten function in DataFrames #2890

Closed
sprmnt21 opened this issue Sep 24, 2021 · 21 comments
@sprmnt21 commented Sep 24, 2021

What do you think of the utility of having a method of the flatten function in DataFrames that expands a data frame like df2 below into the form dfexp?

using DataFrames, CSV

df = DataFrame(a=1:3)
df1 = DataFrame(b=11:15)

CSV.write("tmp1.csv", df)
CSV.write("tmp2.csv", df)
CSV.write("tmp3.csv", df1)

julia> df2
3×3 DataFrame
 Row │ files     other_cols  subdf                 
     │ String    Int64       DataFrame             
─────┼─────────────────────────────────────        
   1 │ tmp1.csv           1  3×1 DataFrame         
   2 │ tmp2.csv           2  3×1 DataFrame         
   3 │ tmp3.csv           3  5×1 DataFrame         


julia> dfexp=flatten(df2,:subdf)
11×4 DataFrame
 Row │ files     other_cols  a        b            
     │ String    Int64       Int64?   Int64?       
─────┼────────────────────────────────────────     
   1 │ tmp1.csv           1        1  missing      
   2 │ tmp1.csv           1        2  missing      
   3 │ tmp1.csv           1        3  missing      
   4 │ tmp2.csv           2        1  missing      
   5 │ tmp2.csv           2        2  missing      
   6 │ tmp2.csv           2        3  missing      
   7 │ tmp3.csv           3  missing       11      
   8 │ tmp3.csv           3  missing       12      
   9 │ tmp3.csv           3  missing       13      
  10 │ tmp3.csv           3  missing       14      
  11 │ tmp3.csv           3  missing       15    

Here is a function that does what is described, just to give an idea of what is required, not to suggest how it should be implemented.

mapreduce(r->crossjoin(DataFrame(r[Not(:subdf)]),r.subdf), (x, y) -> vcat(x, y; cols = :union), eachrow(df2))
@bkamins (Member) commented Sep 24, 2021

As commented on Slack, the operation you request is vcat. The only difference from standard vcat is that vcat currently allows you to create only one indicator column, while you want more than one. We could consider extending the source keyword API to allow what you ask for.

Let us wait for opinions.
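For context, a minimal sketch of the existing source keyword in vcat that the comment above refers to (available in recent DataFrames.jl releases); it creates a single indicator column, whereas the request here would need several:

```julia
using DataFrames

d1 = DataFrame(a=1:2)
d2 = DataFrame(a=3:4)

# `source` adds one column recording which input each row came from;
# the pair form lets you choose the column name and the labels.
vcat(d1, d2; source = :origin => ["first", "second"])
```

The feature request would generalize this so that several metadata columns (here: files and other_cols) could be carried along instead of one.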

@bkamins bkamins added this to the 1.x milestone Sep 24, 2021
@sprmnt21 (Author):

Thanks for your attention.
Another way of looking at it could be the following:
think of the starting table df2 as if it were the result of a groupby, and the table dfexp as the result of some kind of un-group.

@bkamins (Member) commented Sep 24, 2021

But under your thinking, all data frames in the :subdf column would have to have exactly the same column names and eltypes. This is clearly not the case in your example. That is why I point to vcat: the crucial difference between vcat and un-group is that vcat allows you to decide how you want to handle cases where column names or eltypes do not match.
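The point about vcat letting you decide how mismatches are handled can be sketched with its existing cols keyword, which fills absent columns with missing:

```julia
using DataFrames

d1 = DataFrame(a=1:3)        # only column :a
d2 = DataFrame(b=11:12)      # only column :b

# :union keeps all columns from both inputs and pads with missing,
# which is exactly the behavior dfexp in the opening post relies on.
vcat(d1, d2; cols = :union)
```

With cols = :setequal (the default) the same call would error, which is why plain un-group semantics would not cover the opening example.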

@pdeffebach (Contributor):

Additionally, I think that allowing ByRow functions in combine to re-size would solve part of this problem.

It would not work if the data frames have different keys, and you need to convert to Tables.columntable (i.e. a NamedTuple) because data frames don't support keys. But other than that,

combine(df, :c => ByRow(Tables.columntable) => AsTable)

would maybe work.

I'm not sure this is desirable though. This might not be consistent with how combine actually works / should be thought about.

@bkamins (Member) commented Sep 24, 2021

This is what I also thought, but I have exactly the same reservations.

Now I realize that, assuming the columns match across all the data frames and the non-data-frame columns form unique keys (which I guess is expected), the following actually works:

julia> df = DataFrame(x=[1,2,3], y=[DataFrame(p=1:2,q=11:12), DataFrame(p=3:4,q=13:14), DataFrame(p=5:6,q=15:16)])
3×2 DataFrame
 Row │ x      y
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame
   3 │     3  2×2 DataFrame

julia> combine(x -> only(x).y, groupby(df, Not(:y)))
6×3 DataFrame
 Row │ x      p      q
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      2     12
   3 │     2      3     13
   4 │     2      4     14
   5 │     3      5     15
   6 │     3      6     16

@bkamins (Member) commented Sep 24, 2021

and if it is not unique:

combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
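A self-contained version of that one-liner, with a deliberately non-unique key to show why the reduce(vcat, ...) form is needed (this variation on the example above is mine, not from the thread):

```julia
using DataFrames

# x = 1 appears twice, so the group for x = 1 holds two nested data frames
df = DataFrame(x = [1, 1, 2],
               y = [DataFrame(p=1:2), DataFrame(p=3:4), DataFrame(p=5:6)])

# Within each group, concatenate all nested tables; the grouping columns
# are repeated automatically by combine.
combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
```

The only(x).y form from the previous comment would throw here, because only requires exactly one row per group.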

@sprmnt21 (Author) commented Sep 24, 2021

Here it is.
This solution demonstrates the potential of DataFrames to deal with situations for which it was perhaps not specifically designed.

@bkamins (Member) commented Feb 20, 2022

Would Tables.dictrowtable meet your requirements?

@sprmnt21 (Author):

Maybe.
But you should help me figure out how to fit it to this case.

@sprmnt21 (Author) commented Feb 20, 2022

This is what I can do:

transform(df2, :sdf => Tables.dictrowtable(c->mapreduce(x->copy.(eachrow(x)), vcat, c)) => AsTable)

but I get:
LoadError: ArgumentError: no default Tables.rows implementation for type: var"#83#85"

@bkamins (Member) commented Feb 20, 2022

I meant this:

julia> df = DataFrame(id=1:2, a=[DataFrame(x=1:2), DataFrame(y=3:4)])
2×2 DataFrame
 Row │ id     a
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×1 DataFrame
   2 │     2  2×1 DataFrame

julia> select(df, :id, :a => Tables.dictrowtable => AsTable)
2×3 DataFrame
 Row │ id     x        y
     │ Int64  Array…?  Array…?
─────┼─────────────────────────
   1 │     1  [1, 2]   missing
   2 │     2  missing  [3, 4]

and then ideally you would do:

flatten(select(df, :id, :a => Tables.dictrowtable => AsTable), [:x, :y])

but now I see that I need to fix flatten to add support for missing to it. I will open a PR for this.

@bkamins (Member) commented Feb 20, 2022

Ah - this is not as easy as I have thought.

@nalimilan - what do you think? Currently flatten errors when it sees missing, as missing is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements, flexibly adjusting to the remaining columns that define the length. This is doable but a bit hard to implement, and the question is whether this would be intuitive behavior.
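A minimal reproduction of the error being discussed (my illustration; the exact exception type may vary by DataFrames.jl version, hence the try/catch):

```julia
using DataFrames

# Column :v mixes an iterable with missing, which flatten cannot iterate
df = DataFrame(id = 1:2, v = [[1, 2], missing])

try
    flatten(df, :v)
catch err
    # At the time of this discussion this path is always taken,
    # because missing does not support length/iteration.
    @show typeof(err)
end
```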

@bkamins (Member) commented Feb 20, 2022

As a second thought - maybe adding something like unnest from dplyr would be useful?

@sprmnt21 (Author) commented Feb 20, 2022

Another way to get the result, using only "table functions". This way flatten is used before dictrowtable, so flatten doesn't have to handle missing.
Perhaps :-)
transform(flatten(transform(df2,:sdf=>ByRow(Tables.rowtable), renamecols=false),:sdf), :sdf=>Tables.dictrowtable=>AsTable)

@sprmnt21 (Author):

The fastest form I have found, at least for this small example, is the following (although it is less elegant).

vcat([[repeat(df2[r:r,Not(:sdf)],nrow(df2.sdf[r])) df2[r,:sdf]] for r in 1:nrow(df2)]..., cols=:union)

@nalimilan (Member):

@nalimilan - what do you think. Currently in flatten when we see missing we error as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements - flexibly adjusting to the remaining columns that define length. This is a bit hard to implement (of course doable), but the question is if this would be an intuitive behavior.

Wouldn't it make sense to treat any values that do not support length as scalars, and broadcast/repeat them as needed? It would have been logical to use Base.Broadcast.broadcastable for that; unfortunately we haven't done that, and the result differs notably for strings (though we could special-case them as a temporary solution).

Regarding unnest, would #3005 more or less implement the same features?

@bkamins bkamins modified the milestones: 1.x, 1.4 Mar 3, 2022
@bkamins (Member) commented Mar 3, 2022

We did not use broadcasting because 1-element collections are automatically resized with broadcasting, which can lead to hard to catch bugs in this context.

What we could do (following your proposal):

  • add kwarg strict::Bool=true, which by default does what we now have;
  • if strict=false then:
    • all values that do not support length are repeated as needed (and if in a row no values support length then the length is taken to be 1 - i.e. no-op)
    • strings are repeated instead of being flattened

What do you think?
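The proposed strict=false semantics in the bullets above can be sketched as a hand-rolled helper (flatten_loose and row_len are hypothetical names of mine; this is not the actual DataFrames.jl API, just an illustration of the rules: values without length, and strings, are repeated to match the row's iterable columns, defaulting to length 1):

```julia
using DataFrames

# A value defines a length only if it supports length and is not a string
row_len(v) = (v isa AbstractString || !applicable(length, v)) ? nothing : length(v)

function flatten_loose(df, cols::Vector{Symbol})
    pieces = DataFrame[]
    for r in eachrow(df)
        # Row length comes from the target columns that support length;
        # if none do, the row length is 1 (i.e. a no-op for that row)
        n = maximum(filter(!isnothing, [row_len(r[c]) for c in cols]); init = 1)
        cells = [c => (c in cols && !isnothing(row_len(r[c])) ?
                       collect(r[c]) : fill(r[c], n))
                 for c in Symbol.(names(df))]
        push!(pieces, DataFrame(cells))
    end
    reduce(vcat, pieces)
end

# missing is treated as a scalar and repeated (here: kept once, since n == 1)
flatten_loose(DataFrame(id = 1:2, v = [[1, 2], missing]), [:v])
```

Under the proposal, passing strict=true to the real flatten would restore the current erroring behavior instead.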

@nalimilan (Member):

Ah right, broadcasting collections would be risky. But maybe we can repeat scalars (including strings) even without requiring strict=false? This sounds quite safe.

@bkamins (Member) commented Mar 4, 2022

We could. It would be mildly breaking but acceptable. We could then make strict::Bool=false the default, and if someone wants safety they can pass strict=true.

@bkamins (Member) commented Dec 28, 2022

I am closing this issue as it is handled in #3258, so let us keep the discussion there. (Somehow I am not able to link this issue to that PR.)

@bkamins bkamins closed this as completed Dec 28, 2022
@bkamins (Member) commented Feb 5, 2023

Now tracked in #3116
