[BREAKING] add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths #2170

bkamins · 2020-04-05T12:16:33Z

I still need to update the documentation and tests, but everything from #2166 is implemented here.

One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

bkamins · 2020-04-05T12:18:07Z

@nalimilan With this in place, select and transform for GroupedDataFrame should be easy: we have all the functionality and just need to do groupby on the result in an efficient way, as in map.

bkamins · 2020-04-05T20:10:48Z

The PR should be ready for review.

I have added tests, improved performance a bit and written documentation (I have tried hard, but this is probably the weakest point of the PR 😢).

nalimilan

Thanks. This code is getting really complex, I hope I got it right.

> One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

What reasonable alternative behavior could we want to implement?

nalimilan · 2020-04-06T09:35:10Z

src/groupeddataframe/splitapplycombine.jl

@@ -979,8 +1110,15 @@ end
 function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing)
    firstres = fun(gd[1])
    firstmulticol = firstres isa MULTI_COLS_TYPE
+    if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame,


Would there be a way to avoid repeating this in each method?

No. This is part of the optimization you asked for (the earlier TODO comment about performance): we do not want to recompute idx_agg each time a single-row function is encountered, so we have to compute it exactly once, before calling _combine_with_first, and only if it is actually needed.

Sorry, my question was about whether it would be possible to reorganize the code to reduce duplication (not about performance).

OK, I have added an intermediate function that captures the duplicated code (essentially, it handles combining when returning multiple columns is allowed).
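As a rough illustration of the point made in this thread (compute a shared index once, before dispatching to the per-function code, and only when some function needs it), here is a minimal sketch in plain Julia. The names build_index, combine_all and BUILD_COUNT are hypothetical and do not correspond to the actual internals of splitapplycombine.jl; build_index merely stands in for an expensive computation such as idx_agg.

```julia
# Hypothetical sketch; none of these names exist in DataFrames.jl.
# BUILD_COUNT records how many times the "expensive" index is built.
const BUILD_COUNT = Ref(0)

# Stand-in for an expensive index: the row number of the first row of each
# group, in order of first appearance.
function build_index(groups::Vector{Int})
    BUILD_COUNT[] += 1
    seen = Set{Int}()
    idx = Int[]
    for (i, g) in enumerate(groups)
        if !(g in seen)
            push!(seen, g)
            push!(idx, i)
        end
    end
    return idx
end

# needs_index[k] says whether the k-th requested transformation reduces each
# group to a single row (and therefore needs the index).
function combine_all(groups::Vector{Int}, vals::Vector{Int}, needs_index::Vector{Bool})
    # Build the index at most once, and only if some transformation needs it.
    idx = any(needs_index) ? build_index(groups) : nothing
    return [need ? vals[idx] : vals for need in needs_index]
end
```

With three requested transformations, two of which need the index, build_index runs only once; with none that need it, it does not run at all.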

bkamins · 2020-04-06T20:47:41Z

> One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

> What reasonable alternative behavior could we want to implement?

Well, we could "pseudo-broadcast" the contents of a NamedTuple: if someone passes (a=1, b=[1,2,3]), we would expand it to (a=[1,1,1], b=[1,2,3]). Note that this is what happens in DataFrame if you write DataFrame(a=1, b=[1,2,3]). But I came to the conclusion that it is cleaner to treat a NamedTuple "as is" and use directly what the user has passed (in the DataFrame case the expansion is done by the DataFrame(a=1, b=[1,2,3]) constructor call inside the user's function, not within combine).
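To make the alternative concrete, the following is a minimal sketch of what pseudo-broadcasting a NamedTuple could look like. expand_namedtuple is a hypothetical helper written only for illustration; it is not part of DataFrames.jl, and this PR deliberately does not do this. Treating a NamedTuple without any vector field as a single row is also an assumption of the sketch.

```julia
# Hypothetical helper, not part of DataFrames.jl: expand the scalar fields of
# a NamedTuple so that every field is a vector of the same length.
function expand_namedtuple(nt::NamedTuple)
    # Collect the distinct lengths of the vector-valued fields.
    lens = unique([length(v) for v in values(nt) if v isa AbstractVector])
    length(lens) <= 1 || throw(ArgumentError("vector fields have unequal lengths"))
    # Assumption: with no vector field at all, treat the result as one row.
    n = isempty(lens) ? 1 : first(lens)
    # map over a NamedTuple keeps the field names.
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

expand_namedtuple((a=1, b=[1, 2, 3]))  # (a = [1, 1, 1], b = [1, 2, 3])
```

Vector fields of different lengths are rejected rather than recycled, which mirrors the error on unequal column lengths discussed in this PR.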

bkamins · 2020-04-06T23:00:57Z

Comments applied. Here is a benchmark showing that we avoid recomputing idx_agg when it is not needed.

Preparation:

df = DataFrame(g = 10^6:-1:1)

In this PR:

julia> @btime by(df, :g, :g => first => :x1, :g => first => :x2, :g => first => :x3, :g => first => :x4, :g => first => :x5, :g => first => :x6);
  163.177 ms (365 allocations: 182.54 MiB)

on 0.20.2:

julia> @btime by(df, :g, x1 = :g => first, x2 = :g => first, x3 = :g => first, x4 = :g => first, x5 = :g => first, x6 = :g => first);
  186.147 ms (417 allocations: 236.90 MiB)

(note the difference in allocations, which is due to the fact that we now allocate idx_agg only once)

nalimilan · 2020-04-07T09:12:20Z

> Well - we could "pseudo-broadcast" the contents of NamedTuple, that is if someone passes (a=1, b=[1,2,3]) to expand it to (a=[1,1,1], b=[1,2,3]), note that this happens in DataFrame if you write DataFrame(a=1, b=[1,2,3]). But I came to the conclusion that it is cleaner to treat NamedTuple "as is" and directly use what the user has passed (as DataFrame is only constructed using DataFrame(a=1, b=[1,2,3]) but it gets expanded inside a function not within combine).

Actually I don't see in what cases this would make a difference. Currently things like by(df, :x1, :x2 => (x -> (a=[1, 2], b=3))) throw an error, right? With ByRow it's allowed, but it would make no sense to broadcast as the named tuple isn't destructured into columns. Is that what you're referring to?

bkamins · 2020-04-07T15:19:08Z

> Currently things like by(df, :x1, :x2 => (x -> (a=[1, 2], b=3))) throw an error, right?

Right. And it would continue to be an error.

However, by(df, :x2 => (x -> (a=[1, 2], b=3))) also throws an error, while by(df, :x2 => (x -> DataFrame(a=[1, 2], b=3))) does not (and conceptually the two are almost the same).

The issue is that the DataFrame constructor does pseudo-broadcasting while, of course, the NamedTuple constructor does not. My question is whether we should do it after the NamedTuple is constructed (currently we do not and throw an error; I think this is the right approach, but you might disagree, so I prefer to ask).

nalimilan · 2020-04-07T15:49:11Z

OK, let's keep throwing an error for now, that way we're safe to change it later if we want. If we allowed named tuples mixing scalars and vectors, it would indeed make sense to also broadcast in that case.

bkamins · 2020-04-08T09:18:25Z

Thank you!

add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths

052392b

bkamins mentioned this pull request Apr 5, 2020

[BREAKING] Making combine more flexible #2166

Closed

bkamins added the grouping and non-breaking labels Apr 5, 2020

bkamins added this to the 1.0 milestone Apr 5, 2020

bkamins added 2 commits April 5, 2020 17:31

fix Ref and 0-dim arrays

d1394c2

improve performance, add tests, add documentation

e83e25b

fix typo in tests

a01c562

bkamins added the breaking label and removed the non-breaking label Apr 6, 2020

bkamins changed the title ~~add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths~~ [BREAKING] add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths Apr 6, 2020

nalimilan reviewed Apr 6, 2020

bkamins and others added 2 commits April 6, 2020 23:52

Apply suggestions from code review

95128f4

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review

18a26a4

Update src/groupeddataframe/splitapplycombine.jl

85ef67b

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan reviewed Apr 7, 2020

bkamins added 2 commits April 7, 2020 18:26

corrections after the code review

054ff50

fix typo

b1169e2

nalimilan reviewed Apr 7, 2020

Update src/groupeddataframe/splitapplycombine.jl

73b39e7

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Apr 8, 2020

bkamins merged commit bf3110d into JuliaData:master Apr 8, 2020

bkamins deleted the pseudo_broadcast_combine branch April 8, 2020 09:18
