Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more splatting and appending of arguments in by #1620

Closed
wants to merge 40 commits into from

Conversation

pdeffebach
Copy link
Contributor

This is a first attempt at resolving #1615

It just gets around this by assuming that if you have many different arguments we can vcat them together and pass it off to combine. combine has a pretty good infrastructure for sorting out arguments so I don't think this needs to happen at by.

@pdeffebach
Copy link
Contributor Author

I added the method for f::Any... rather than write a new method because f::Union{Pair, Vector{Pair}}... choose Pair OR Vector{Pair} but does not allow a collection of both.

Unless my interpretation is mistaken, which is likely is.

@bkamins
Copy link
Member

bkamins commented Dec 8, 2018

A small comment (I have not analyzed the code yet): we want combine and by to have the same API so probably combine should get an additional method also and then by implementation would change.

@nalimilan
Copy link
Member

I think this should be done by combine, as by just passes f to it.

@nalimilan
Copy link
Member

I added the method for f::Any... rather than write a new method because f::Union{Pair, Vector{Pair}}... choose Pair OR Vector{Pair} but does not allow a collection of both.

You need AbstractVector{<:Pair} (classic issue with Pair).

@bkamins
Copy link
Member

bkamins commented Dec 9, 2018

I think this should be done by combine

Just to make sure I understand you correctly. Do you mean what I have recommended - to have combine handle all required cases consistently with by? And then by would just pass what it gets to combine and leave all the work for combine?

@pdeffebach
Copy link
Contributor Author

Okay I added a method for combine to make this work.

I changed the by method to have a splatted ::Union{Pair, AbstractVector{<:Pair}} but I could presume Milan you want a splatted ::Any... in by?

@pdeffebach
Copy link
Contributor Author

This needs more work. I'm going to see if I can get this to work while also allowing (non-vector) keyword arguments.

@@ -338,7 +338,8 @@ function combine(f::Any, gd::GroupedDataFrame)
end
end
combine(gd::GroupedDataFrame, f::Any) = combine(f, gd)
combine(gd::GroupedDataFrame, f::Pair...) = combine(f, gd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this method, I guess you can also drop Tuple{Vararg{Pair}} from the signature of _combine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment still applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. It still works without this. It's deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out it is needed. combine(f, (:c => sum, :d => mean)) breaks without it.

I thought

combine(gd::GroupedDataFrame, f::Union{Pair, AbstractVector{<:Pair}}...) 

would capture it, but it does not. If we wanted to avoid the dispatch we could add a method for combine which collects a tuple of pairs into a Vector.

src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

I changed the by method to have a splatted ::Union{Pair, AbstractVector{<:Pair}} but I could presume Milan you want a splatted ::Any... in by?

I don't think so. Why would it be needed?

@pdeffebach
Copy link
Contributor Author

Okay my strategy for this will be 4 methods for each combine and by.

combine(f::Any, g) # for do syntax with old version
combine(g, f::Any) # for non-do syntax with old version

combine(g, f::{Union{Pair, AbstractVector{<:Pair}, Tuple{<:Pair}}...; kwargs...)  

combine(g, f::NamedTuple) # the one that actually goes to _combine

Here I take the non-keyword arguments and collect them into a named tuple somehow, meaning I move the the autogeneration of variable names to this stage, also making sure I don't overwrite names from the kwargs named tuple.

Does that make sense?

@pdeffebach
Copy link
Contributor Author

I have implemented the above strategy, but only for combine, since we will have to repeat the same logic with by and its easier to focus on just combine while we nail down the strategy.

Rather than splatting a vector of Pair inputs, it uses a reduce(vcat, args) to get a nice vector of Pair arguments regardless of wither you have just a Pair, or a Vector of Pairs or a mixture of the two.

I have a big commented out block that will handle the re-naming in the case of duplicate names, but I need to put more work into understanding the logic of it and what I need to change. The code originally came from the makeunique = true option somewhere.

Any feedback is welcome.

@nalimilan
Copy link
Member

Makes sense, but note you'll have to handle the case where a single Pair is passed separately, since we support returning a NamedTuple with several fields (and then the names can only be known from firstres). Maybe keep special methods for when you already have an homogeneous named tuple, to avoid doing unnecessary work? Also, better base your work on #1651 as it changes the logic a bit.

@pdeffebach
Copy link
Contributor Author

Rebased and revived this PR. However in the process the _combine step doesn't work. Thankfully after this we can delete all the code for _combine that works with anything other than a named tuple. Not sure how to fix this error and how to figure out which _combine code should be deleted.

combine(identity, gd)
else
combine(values(f), gd)
# handle pairs on their own so we can iterate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really too much code for a small feature IMHO. Is there any way to make this simpler? We don't have to allow mixing keyword and positional arguments.

@pdeffebach
Copy link
Contributor Author

That's really too much code for a small feature IMHO. Is there any way to make this simpler? We don't have to allow mixing keyword and positional arguments.

If we don't mix positional and keyword arguments than things become simple again, and I doubt people will miss them. I would prefer to either close this or do a force-push of master to start anew.

It would be nice to allow splatting for multiple arrays in a function call. Then we wouldn't have to worry about supporting vector arguments at all. However this isn't allowed, right?

x = [1, 2, 3]; y = [4, 5, 6]
foo(x..., y...) # foo(1, 2, 3, 4, 5, 6)

@pdeffebach
Copy link
Contributor Author

I have reset this --hard and rebased. Will get a much more minimal implementation together tomorrow. I will

a) make a function with Union{Pair, AbstractVector{<:Pair}... and then call reduce(vcat) on it.
b) delete the method for _combine that this change would make dead code.

@pdeffebach pdeffebach reopened this Feb 13, 2019
@pdeffebach
Copy link
Contributor Author

Okay this PR should be ready for a review! I promise its a lot less to review than last time.

I want to deprecate one method, and thats combine(df, :a, (:b => mean,)). It's only allowed because _combine allows for NTuple{Pair}. But now that using many vectors is flexible, supporting a Tuple argument seems unnecessary.

@bkamins
Copy link
Member

bkamins commented Feb 13, 2019

But now that using many vectors is flexible, supporting a Tuple argument seems unnecessary.

But tuple is lighter than vector so why would we want to prefer vectors over tuples? (but this is a corner case anyway - probably @nalimilan has a stronger opinion here what is right, because he spent many more hours than me on this).

I want to deprecate one method

I could not find the code with the deprecation (but you probably can add it later - first we should decide if we want to drop support for it).

combine(df, :a, (:b => mean,))

Did you rather mean by(df, :a, (:b => mean,))?

Finally - maybe you can add to tests the original reason for this PR, i.e. something like by(df, :x, [:a, :b] .=> mean, :b => std) + as always add negative tests that should fail 😄 (they are really important in long run when we change functionality).

@pdeffebach
Copy link
Contributor Author

thanks, I will add those tests, and try to think of a good failing test.

by and combine should always have the same syntax, I think.

I have not deprecated it and it works now, but is only type-specified at the _combine level, rather than the combine() level. i.e. it fits in as part of combin(gd::AbstractDataFrame, f::Any) method.

I want to deprecate it because if we allow tuple, then we might as well have multiple tuples, along with tuple and scalar...

@bkamins
Copy link
Member

bkamins commented Feb 13, 2019

by and combine should always have the same syntax, I think.

AFAIK by takes AbstractDataFrame and grouping columns to group by and combine takes GroupedDataFrame (and it seems in what you have written you pass a data frame and grouping column).

@pdeffebach
Copy link
Contributor Author

Oh! sorry. yes that was my mistake.

@pdeffebach
Copy link
Contributor Author

I now have the naming working with my (somewhat hacky) solution. This solution seems more elegant than making the type of our Pair vector a small union of a bunch of different things.

I also added some tests that catch what went wrong.

@nalimilan
Copy link
Member

I think you can just replace p isa Pair{<:Union{Symbol,Integer}} with p isa Pair && first(p) isa Union{Integer,Symbol}.

@pdeffebach
Copy link
Contributor Author

That's a good suggestion, and I have incorporated it.

Things should be good now.

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

Union{Integer,Symbol} should be ColumnIndex.

@pdeffebach
Copy link
Contributor Author

Thanks for the catch. I've updated it.

@pdeffebach
Copy link
Contributor Author

Can this be merged even though some of the coverage tests didn't pass?

@bkamins
Copy link
Member

bkamins commented Jun 27, 2019

Coverage passes. The problem is that it needs to be merged with master again (sorry for inconvenience).
Also when you merge the CI failures on nightly will get fixed.

@pdeffebach
Copy link
Contributor Author

I think things are correct now?

@bkamins
Copy link
Member

bkamins commented Jul 16, 2019

Thank you for the update.

Let us wait for @nalimilan for a final review, as he is the "groupby" master 😄 (he is off for a few days, but we can go back to this PR very soon).

@bkamins
Copy link
Member

bkamins commented Sep 2, 2019

@nalimilan - how do you see we should proceed with this (given 1.0 release plans?)

@nalimilan
Copy link
Member

Sorry for the delay. There are still open discussions above. Can you comment on or close them as appropriate? At least the one about the map docstring still applies AFAICT.

incol = gd.parent[!, first(p)]
idx = gd.idx[gd.starts]
outcol = agg(incol, gd)
return idx, outcol
else
fun = do_f(last(p))
if p isa Pair{<:ColumnIndex}

if p isa Pair && first(p) isa ColumnIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if p isa Pair && first(p) isa ColumnIndex
if p isa Pair && first(p) isa ColumnIndex

@pdeffebach
Copy link
Contributor Author

Someone posted on discuss here. I need to finish this PR

@bkamins
Copy link
Member

bkamins commented Oct 18, 2019

Would this PR avoid recompilation? (I have posted an answer how the problem on Discourse can be solved using API we have now)

@pdeffebach
Copy link
Contributor Author

No, but it would have avoided that complicated named tuple expression with repeated pairs to the same functions.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2019

This PR might have to be dropped or significantly reviewed if we go for :oldcol => f => :newcol pattern as discussed with @nalimilan.

@pdeffebach
Copy link
Contributor Author

yeah i dropped the ball on this PR, and I apologize for that.

This PR would fall into place really easily if we implemented #2016 and some infrastructure to handle this kind of thing. For example ideally

by(df, :a, Between(:d, :m) => mean)
by(df, :a, [Between(:d, :m), :q] .=> mean)

Should both work via some generic code that is not unique to by.

One benefit of Milan's suggestion to avoid named tuples is that it reduces the gymnastics with arguments, support for which were dropped from this PR because it would have added a lot of boiler-plate code for getting all the different argument formats to play nice.

@nalimilan
Copy link
Member

Note that I didn't really suggest dropping keyword arguments, rather adding another possible syntax.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2019

Note that I didn't really suggest dropping keyword arguments, rather adding another possible syntax.

This is how I understood this. My point is that it has to be rethought how broadcasting would come into play in :oldcol => f => :newcol syntax as quickly thinking about it @. `oldcols => f => newcols should be used (but I have not analyzed it in detail).

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

I am closing it as it is now implemented (except for broadcasting of Between etc. which is a separate issue)

@bkamins bkamins closed this Apr 4, 2020
@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

@pdeffebach - thank you for raising this functionality. I think it helped a lot to improve the package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants