
a new method of the flatten function in DataFrames #2890

Closed
sprmnt21 opened this issue Sep 24, 2021 · 21 comments
@sprmnt21 commented Sep 24, 2021

What do you think of the utility of having a method of the flatten function in DataFrames that expands a data frame like df2 below into the form dfexp?

using DataFrames, CSV

df = DataFrame(a=1:3)
df1 = DataFrame(b=11:15)

CSV.write("tmp1.csv", df)
CSV.write("tmp2.csv", df)
CSV.write("tmp3.csv", df1)

julia> df2
3×3 DataFrame
 Row │ files     other_cols  subdf                 
     │ String    Int64       DataFrame             
─────┼─────────────────────────────────────        
   1 │ tmp1.csv           1  3×1 DataFrame         
   2 │ tmp2.csv           2  3×1 DataFrame         
   3 │ tmp3.csv           3  5×1 DataFrame         


julia> dfexp=flatten(df2,:subdf)
11×4 DataFrame
 Row │ files     other_cols  a        b            
     │ String    Int64       Int64?   Int64?       
─────┼────────────────────────────────────────     
   1 │ tmp1.csv           1        1  missing      
   2 │ tmp1.csv           1        2  missing      
   3 │ tmp1.csv           1        3  missing      
   4 │ tmp2.csv           2        1  missing      
   5 │ tmp2.csv           2        2  missing      
   6 │ tmp2.csv           2        3  missing      
   7 │ tmp3.csv           3  missing       11      
   8 │ tmp3.csv           3  missing       12      
   9 │ tmp3.csv           3  missing       13      
  10 │ tmp3.csv           3  missing       14      
  11 │ tmp3.csv           3  missing       15    

Here is a function that does what is described, just to give an idea of what is required, not to suggest how it should be implemented.

mapreduce(r->crossjoin(DataFrame(r[Not(:subdf)]),r.subdf), (x, y) -> vcat(x, y; cols = :union), eachrow(df2))
@bkamins (Member) commented Sep 24, 2021

As commented on Slack, the operation you request is vcat. The only difference from standard vcat is that vcat currently allows you to create only one indicator column, while you want more than one. We could consider extending the source keyword API to allow what you ask for.

Let us wait for opinions.
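For context, a minimal sketch of the existing source keyword in vcat that the comment above refers to (available in recent DataFrames.jl releases); it creates a single indicator column, whereas the request here would need several:

```julia
using DataFrames

d1 = DataFrame(a=1:2)
d2 = DataFrame(a=3:4)

# `source` adds one column recording which input each row came from;
# the pair form lets you choose the column name and the labels.
vcat(d1, d2; source = :origin => ["first", "second"])
```

The feature request would generalize this so that several metadata columns (here: files and other_cols) could be carried along instead of one.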

@bkamins bkamins added this to the 1.x milestone Sep 24, 2021
@sprmnt21 (Author):

Thanks for your attention.
Another way of looking at it could be the following:
think of the starting table df2 as if it were the result of a groupby, and the table dfexp as the result of some kind of un-group.

@bkamins (Member) commented Sep 24, 2021

But under your thinking, all data frames in the :subdf column would have to have exactly the same column names and eltypes. This is clearly not the case in your example. That is why I point to vcat: the crucial difference between vcat and un-group is that vcat allows you to decide how you want to handle cases where column names or eltypes do not match.
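The point about vcat letting you decide how mismatches are handled can be sketched with its existing cols keyword, which fills absent columns with missing:

```julia
using DataFrames

d1 = DataFrame(a=1:3)        # only column :a
d2 = DataFrame(b=11:12)      # only column :b

# :union keeps all columns from both inputs and pads with missing,
# which is exactly the behavior dfexp in the opening post relies on.
vcat(d1, d2; cols = :union)
```

With cols = :setequal (the default) the same call would error, which is why plain un-group semantics would not cover the opening example.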

@pdeffebach (Contributor):

Additionally, I think that allowing ByRow functions in combine to re-size would solve part of this problem.

It would not work if the data frames have different keys, and you need to convert to Tables.columntable (i.e. a NamedTuple) because data frames don't support keys. But other than that,

combine(df, :c => ByRow(Tables.columntable) => AsTable)

would maybe work.

I'm not sure this is desirable though. This might not be consistent with how combine actually works / should be thought about.

@bkamins (Member) commented Sep 24, 2021

This is what I also thought, but I have exactly the same reservations.

Now I realize that, assuming the columns match across all the data frames and the non-data-frame columns form unique keys (which I guess is expected), the following actually works:

julia> df = DataFrame(x=[1,2,3], y=[DataFrame(p=1:2,q=11:12), DataFrame(p=3:4,q=13:14), DataFrame(p=5:6,q=15:16)])
3×2 DataFrame
 Row │ x      y
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame
   3 │     3  2×2 DataFrame

julia> combine(x -> only(x).y, groupby(df, Not(:y)))
6×3 DataFrame
 Row │ x      p      q
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      2     12
   3 │     2      3     13
   4 │     2      4     14
   5 │     3      5     15
   6 │     3      6     16

@bkamins (Member) commented Sep 24, 2021

and if it is not unique:

combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
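A self-contained version of that one-liner, with a deliberately non-unique key to show why the reduce(vcat, ...) form is needed (this variation on the example above is mine, not from the thread):

```julia
using DataFrames

# x = 1 appears twice, so the group for x = 1 holds two nested data frames
df = DataFrame(x = [1, 1, 2],
               y = [DataFrame(p=1:2), DataFrame(p=3:4), DataFrame(p=5:6)])

# Within each group, concatenate all nested tables; the grouping columns
# are repeated automatically by combine.
combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
```

The only(x).y form from the previous comment would throw here, because only requires exactly one row per group.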

@sprmnt21 (Author) commented Sep 24, 2021

Here it is.
This solution demonstrates the potential of DataFrames to deal with situations for which it was perhaps not specifically designed.

@bkamins (Member) commented Feb 20, 2022

Would Tables.dictrowtable meet your requirements?

@sprmnt21 (Author):

Maybe.
But you should help me figure out how to fit it to this case.

@sprmnt21 (Author) commented Feb 20, 2022

This is what I can do:

transform(df2, :sdf => Tables.dictrowtable(c->mapreduce(x->copy.(eachrow(x)), vcat, c)) => AsTable)

but I get:
LoadError: ArgumentError: no default Tables.rows implementation for type: var"#83#85"

@bkamins (Member) commented Feb 20, 2022

I meant this:

julia> df = DataFrame(id=1:2, a=[DataFrame(x=1:2), DataFrame(y=3:4)])
2×2 DataFrame
 Row │ id     a
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×1 DataFrame
   2 │     2  2×1 DataFrame

julia> select(df, :id, :a => Tables.dictrowtable => AsTable)
2×3 DataFrame
 Row │ id     x        y
     │ Int64  Array…?  Array…?
─────┼─────────────────────────
   1 │     1  [1, 2]   missing
   2 │     2  missing  [3, 4]

and then ideally you would do:

flatten(select(df, :id, :a => Tables.dictrowtable => AsTable), [:x, :y])

but now I see that I need to fix flatten to add support for missing to it. I will open a PR for this.

@bkamins (Member) commented Feb 20, 2022

Ah - this is not as easy as I have thought.

@nalimilan - what do you think? Currently flatten errors when it sees missing, as missing is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements, flexibly adjusting to the remaining columns that define the length. This is doable but a bit hard to implement, and the question is whether this would be intuitive behavior.
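A minimal reproduction of the error being discussed (my illustration; the exact exception type may vary by DataFrames.jl version, hence the try/catch):

```julia
using DataFrames

# Column :v mixes an iterable with missing, which flatten cannot iterate
df = DataFrame(id = 1:2, v = [[1, 2], missing])

try
    flatten(df, :v)
catch err
    # At the time of this discussion this path is always taken,
    # because missing does not support length/iteration.
    @show typeof(err)
end
```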

@bkamins (Member) commented Feb 20, 2022

As a second thought - maybe adding something like unnest from dplyr would be useful?

@sprmnt21 (Author) commented Feb 20, 2022

Another way to get the result, using only "table functions". This way flatten is used before dictrowtable, so flatten doesn't have to handle missing.
Perhaps :-)
transform(flatten(transform(df2,:sdf=>ByRow(Tables.rowtable), renamecols=false),:sdf), :sdf=>Tables.dictrowtable=>AsTable)

@sprmnt21 (Author):

The fastest form I have found, at least for this small example, is the following (although it is less elegant).

vcat([[repeat(df2[r:r,Not(:sdf)],nrow(df2.sdf[r])) df2[r,:sdf]] for r in 1:nrow(df2)]..., cols=:union)

@nalimilan (Member):

@nalimilan - what do you think. Currently in flatten when we see missing we error as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements - flexibly adjusting to the remaining columns that define length. This is a bit hard to implement (of course doable), but the question is if this would be an intuitive behavior.

Wouldn't it make sense to treat any values that do not support length as scalars, and broadcast/repeat them as needed? It would have been logical to use Base.Broadcast.broadcastable for that; unfortunately we haven't done that, and the result differs notably for strings (though we could special-case them as a temporary solution).

Regarding unnest, would #3005 more or less implement the same features?

@bkamins bkamins modified the milestones: 1.x, 1.4 Mar 3, 2022
@bkamins (Member) commented Mar 3, 2022

We did not use broadcasting because 1-element collections are automatically resized with broadcasting, which can lead to hard to catch bugs in this context.

What we could do (following your proposal):

  • add kwarg strict::Bool=true, which by default does what we now have;
  • if strict=false then:
    • all values that do not support length are repeated as needed (and if in a row no values support length then the length is taken to be 1 - i.e. no-op)
    • strings are repeated instead of being flattened

What do you think?
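The proposed strict=false semantics in the bullets above can be sketched as a hand-rolled helper (flatten_loose and row_len are hypothetical names of mine; this is not the actual DataFrames.jl API, just an illustration of the rules: values without length, and strings, are repeated to match the row's iterable columns, defaulting to length 1):

```julia
using DataFrames

# A value defines a length only if it supports length and is not a string
row_len(v) = (v isa AbstractString || !applicable(length, v)) ? nothing : length(v)

function flatten_loose(df, cols::Vector{Symbol})
    pieces = DataFrame[]
    for r in eachrow(df)
        # Row length comes from the target columns that support length;
        # if none do, the row length is 1 (i.e. a no-op for that row)
        n = maximum(filter(!isnothing, [row_len(r[c]) for c in cols]); init = 1)
        cells = [c => (c in cols && !isnothing(row_len(r[c])) ?
                       collect(r[c]) : fill(r[c], n))
                 for c in Symbol.(names(df))]
        push!(pieces, DataFrame(cells))
    end
    reduce(vcat, pieces)
end

# missing is treated as a scalar and repeated (here: kept once, since n == 1)
flatten_loose(DataFrame(id = 1:2, v = [[1, 2], missing]), [:v])
```

Under the proposal, passing strict=true to the real flatten would restore the current erroring behavior instead.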

@nalimilan (Member):

Ah right, broadcasting collections would be risky. But maybe we can repeat scalars (including strings) even without requiring strict=false? This sounds quite safe.

@bkamins (Member) commented Mar 4, 2022

We could. It would be mildly breaking but acceptable. We could then make strict::Bool=false the default, and if someone wants safety they can pass strict=true.

@bkamins (Member) commented Dec 28, 2022

I am closing this issue as it is handled in #3258, so let us keep the discussion there. (Somehow I am not able to link this issue to that PR.)

@bkamins bkamins closed this as completed Dec 28, 2022
@bkamins (Member) commented Feb 5, 2023

Now tracked in #3116
