-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BREAKING] add to combine and by: column selection, pseudo broadcasting, fix bug with unequal column lengths #2170
Conversation
… with unequal column lengths
@nalimilan With this |
PR should be good to review. I have added tests, improved performance a bit and written documentation (I have tried hard, but this is probably the weakest point of the PR 😢). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This code is getting really complex, I hope I got it right.
One comment is that we treat
NamedTuple
as-is (i.e. we do not unpackNamedTuples
when pseudo-broadcasting)
What reasonable alternative behavior could be want to implement?
@@ -979,8 +1110,15 @@ end | |||
function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing) | |||
firstres = fun(gd[1]) | |||
firstmulticol = firstres isa MULTI_COLS_TYPE | |||
if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be a way to avoid repeating this in each method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - this is a part of optimization you wanted (in the earlier comment as TODO for performance): we do not want to calculate idx_agg
each time as single-row function is encountered, so we have to compute it before we call _combine_with_first
exactly once if it is needed to be computed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my question was about whether it would be possible to reorganize the code to reduce duplication (not about performance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I have added an intermediate function that captures duplicate code (essentially it handles combining when returning multiple columns is allowed).
Well - we could "pseudo-broadcast" the contents of |
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Comments applied. Just to show that we avoid recomputing Preparation:
In this PR:
on 0.20.2:
(note the differences in allocations due to the fact that we allocate |
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@@ -979,8 +1110,15 @@ end | |||
function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing) | |||
firstres = fun(gd[1]) | |||
firstmulticol = firstres isa MULTI_COLS_TYPE | |||
if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my question was about whether it would be possible to reorganize the code to reduce duplication (not about performance).
Actually I don't see in what cases this would make a difference. Currently things like |
Right. And it would keep to be an error. However also The issue is that |
OK, let's keep throwing an error for now, that way we're safe to change it later if we want. If we allowed named tuples mixing scalars and vectors, it would indeed make sense to also broadcast in that case. |
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! |
Fixes #2166
I still need to update the documentation and tests, but all from #2166 is implemented here.
One comment is that we treat
NamedTuple
as-is (i.e. we do not unpackNamedTuples
when pseudo-broadcasting)