[BREAKING] add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths #2170

Merged
merged 10 commits into from
Apr 8, 2020

Conversation

bkamins
Member

@bkamins bkamins commented Apr 5, 2020

Fixes #2166

I still need to update the documentation and tests, but everything from #2166 is implemented here.

One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

@bkamins bkamins added grouping non-breaking The proposed change is not breaking labels Apr 5, 2020
@bkamins bkamins added this to the 1.0 milestone Apr 5, 2020
@bkamins
Member Author

bkamins commented Apr 5, 2020

@nalimilan With this, select and transform for GroupedDataFrame should be easy (we have all the functionality - we just need to do groupby on the result in an efficient way, like in map).

@bkamins
Member Author

bkamins commented Apr 5, 2020

PR should be good to review.

I have added tests, improved performance a bit and written documentation (I have tried hard, but this is probably the weakest point of the PR 😢).

@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Apr 6, 2020
@bkamins bkamins changed the title add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths [BREAKING] add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths Apr 6, 2020
Member

@nalimilan nalimilan left a comment

Thanks. This code is getting really complex, I hope I got it right.

One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

What reasonable alternative behavior could we want to implement?

@@ -979,8 +1110,15 @@ end
function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing)
firstres = fun(gd[1])
firstmulticol = firstres isa MULTI_COLS_TYPE
if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame,
Member

Would there be a way to avoid repeating this in each method?

Member Author

No - this is part of the optimization you wanted (the earlier TODO comment about performance): we do not want to compute idx_agg each time a single-row function is encountered, so we compute it exactly once, before calling _combine_with_first, and only if it is needed.

Member

Sorry, my question was about whether it would be possible to reorganize the code to reduce duplication (not about performance).

Member Author

OK - I have added an intermediate function that captures duplicate code (essentially it handles combining when returning multiple columns is allowed).

@bkamins
Member Author

bkamins commented Apr 6, 2020

One comment is that we treat NamedTuple as-is (i.e. we do not unpack NamedTuples when pseudo-broadcasting)

What reasonable alternative behavior could we want to implement?

Well - we could "pseudo-broadcast" the contents of a NamedTuple: if someone passes (a=1, b=[1,2,3]), we would expand it to (a=[1,1,1], b=[1,2,3]). Note that this is what happens in DataFrame if you write DataFrame(a=1, b=[1,2,3]). But I came to the conclusion that it is cleaner to treat a NamedTuple "as is" and directly use what the user has passed (a DataFrame is expanded only because the user calls DataFrame(a=1, b=[1,2,3]), so the expansion happens inside the user's function, not within combine).
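For readers following along, here is a minimal sketch of the alternative described above, using only Base Julia. The helper name `pseudo_broadcast` is hypothetical and is not part of DataFrames.jl; it only illustrates what expanding a NamedTuple after construction could look like, assuming scalar fields are repeated to the common length of the vector fields.

```julia
# Hypothetical helper (NOT a DataFrames.jl function): repeat scalar fields of a
# NamedTuple so that every field is a vector of the same length.
function pseudo_broadcast(nt::NamedTuple)
    # Collect the distinct lengths of all vector-valued fields.
    lens = unique(Int[length(v) for v in values(nt) if v isa AbstractVector])
    length(lens) <= 1 ||
        throw(ArgumentError("vector fields have unequal lengths: $lens"))
    # With no vector fields at all, every scalar becomes a 1-element vector.
    n = isempty(lens) ? 1 : first(lens)
    # `map` over a NamedTuple keeps the field names.
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

nt = (a=1, b=[1, 2, 3])
expanded = pseudo_broadcast(nt)   # (a = [1, 1, 1], b = [1, 2, 3])
```

The NamedTuple literal itself does no such expansion, which is why treating it "as is" means a mix of scalars and vectors is left for the caller to resolve.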

bkamins and others added 2 commits April 6, 2020 23:52
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Apr 6, 2020

Comments applied. Here is a benchmark showing that we avoid recomputing idx_agg when it is not needed.

Preparation:

df = DataFrame(g = 10^6:-1:1)

In this PR:

julia> @btime by(df, :g, :g => first => :x1, :g => first => :x2, :g => first => :x3, :g => first => :x4, :g => first => :x5, :g => first => :x6);
  163.177 ms (365 allocations: 182.54 MiB)

on 0.20.2:

julia> @btime by(df, :g, x1 = :g => first, x2 = :g => first, x3 = :g => first, x4 = :g => first, x5 = :g => first, x6 = :g => first);
  186.147 ms (417 allocations: 236.90 MiB)

(note the difference in allocations, which is due to the fact that idx_agg is now allocated only once)
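The "compute once, reuse" idea behind these numbers can be sketched in Base Julia as follows. This is not the DataFrames.jl implementation: `first_row_index` and `combine_all` are hypothetical names, and the sketch assumes that idx_agg plays the role of a per-group index into the parent table (here, the first row of each group).

```julia
# Hypothetical stand-in for the role of `idx_agg`: for each group, the index
# of its first row in the parent table.
function first_row_index(groups::Vector{Int}, ngroups::Int)
    idx = zeros(Int, ngroups)
    for (row, g) in enumerate(groups)
        idx[g] == 0 && (idx[g] = row)   # keep only the first row seen per group
    end
    return idx
end

# Apply several reducing functions to `col`, group by group. The shared index
# is built lazily, at most once, and then reused by every function.
function combine_all(funs, keycol::Vector, groups::Vector{Int}, ngroups::Int, col::Vector)
    idx = nothing                        # not allocated until first needed
    out = []
    for f in funs
        if idx === nothing
            idx = first_row_index(groups, ngroups)
        end
        vals = [f(col[groups .== g]) for g in 1:ngroups]
        push!(out, (key = keycol[idx], value = vals))
    end
    return out
end

out = combine_all([first, last], ["a", "b", "a", "b"], [1, 2, 1, 2], 2, [10, 20, 30, 40])
```

With six aggregations, as in the benchmark, a scheme like this would allocate the index vector once instead of six times.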

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan
Member

Well - we could "pseudo-broadcast" the contents of a NamedTuple: if someone passes (a=1, b=[1,2,3]), we would expand it to (a=[1,1,1], b=[1,2,3]). Note that this is what happens in DataFrame if you write DataFrame(a=1, b=[1,2,3]). But I came to the conclusion that it is cleaner to treat a NamedTuple "as is" and directly use what the user has passed (a DataFrame is expanded only because the user calls DataFrame(a=1, b=[1,2,3]), so the expansion happens inside the user's function, not within combine).

Actually I don't see in what cases this would make a difference. Currently things like by(df, :x1, :x2 => (x -> (a=[1, 2], b=3))) throw an error, right? With ByRow it's allowed, but it would make no sense to broadcast as the named tuple isn't destructured into columns. Is that what you're referring to?

@bkamins
Member Author

bkamins commented Apr 7, 2020

Currently things like by(df, :x1, :x2 => (x -> (a=[1, 2], b=3))) throw an error, right?

Right. And it would continue to be an error.

However, by(df, :x2 => (x -> (a=[1, 2], b=3))) also throws an error, but by(df, :x2 => (x -> DataFrame(a=[1, 2], b=3))) does not (and conceptually it is almost the same).

The issue is that the DataFrame constructor does pseudo-broadcasting while, of course, the NamedTuple constructor does not. My question is whether we should do it after the NamedTuple is constructed (currently we do not and throw an error, and I think this is the right approach, but maybe you would disagree, so I prefer to ask).

@nalimilan
Member

OK, let's keep throwing an error for now, that way we're safe to change it later if we want. If we allowed named tuples mixing scalars and vectors, it would indeed make sense to also broadcast in that case.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit bf3110d into JuliaData:master Apr 8, 2020
@bkamins bkamins deleted the pseudo_broadcast_combine branch April 8, 2020 09:18
@bkamins
Member Author

bkamins commented Apr 8, 2020

Thank you!

Labels
breaking The proposed change is breaking. grouping
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BREAKING] Making combine more flexible
2 participants