
sanitized by function #1555

Closed
wants to merge 2 commits

Conversation

ExpandingMan
Contributor

This implements #1554, in particular now

by(identity, df, cols)

returns df up to an ordering ambiguity. You can get the old behavior by doing append_keys=true.

Again, I'm not entirely sure whether this is something everyone agrees on, but if it is, here's the PR.

I'm not at all attached to the keyword append_keys; actually it's kind of horrible, so I'm open to better suggestions.

This could use a few tests, but let's see if there's any consensus first.

@nalimilan
Member

Thanks. Unfortunately that's going to conflict with #1520.

Regarding the solution itself, I'm fine with adding an argument but I'd use true by default, and I'd maybe call it addkeys (append! appends rows).

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2018

So, I realize this is breaking behavior and that it's kind of a high bar to merge this with the new behavior as default, but I really do think this behavior is a better default, so let me give some more compelling reasons.

First, I'm not sure how much we want to emulate pandas, but it is worth noting that the closest pandas equivalent to by does indeed have the behavior I'm suggesting:

df.groupby(cols).apply(lambda x: x)

indeed returns df up to an ordering ambiguity.
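
A minimal pandas sketch of this claim (column names are illustrative; group_keys=False is passed explicitly here, since newer pandas versions changed how apply handles group keys by default):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Applying the identity function per group reproduces the original frame
# up to row ordering; no extra key columns are added, since the grouping
# column is already part of each sub-frame passed to the lambda.
out = df.groupby("g", group_keys=False).apply(lambda x: x)
assert out.sort_index().equals(df)
```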

I'd argue that this behavior is much more intuitive as a default. I'd expect the combine function to perform some sort of reduction or concatenation. Other concatenation functions typically don't add any type of key (whether or not that makes sense in the context). To this day I frequently get tripped up by the extra columns produced by by. I suspect I'm not alone (especially considering the pandas behavior).

I agree with you that wanting to retain the group keys is an extremely common use case, but that's precisely one of the problems I have with this: I wind up having to filter out redundant columns more often than not. The reason this happens is that the group keys are already contained in the dataframe that's passed as an argument to the lambda function. For example, one common use case might be doing a groupby which removes or appends some rows. In those cases you wind up with duplicate columns when the combine step tries to tack the keys back on.
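
The redundancy described here can be sketched in pandas terms (the g_key name is hypothetical; it stands in for whatever column an unconditional "add keys" combine step would prepend):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Simulate a combine step that unconditionally tacks the group key onto
# each returned sub-frame, even though the sub-frame already carries the
# key column; the key information ends up stored twice.
pieces = []
for key, sub in df.groupby("g"):
    keyed = sub.copy()
    keyed.insert(0, "g_key", key)  # hypothetical "always add keys" behavior
    pieces.append(keyed)
out = pd.concat(pieces)

assert list(out.columns) == ["g_key", "g", "x"]
assert (out["g_key"] == out["g"]).all()  # redundant copy of the keys
```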

add_keys sounds a bit strange to me because I'm not sure the word add is appropriate here. How about key_columns?

@ExpandingMan
Contributor Author

Regarding #1520, by all means of course move ahead with it. It should be easy enough to make whatever change we decide on after that is merged.

@nalimilan
Member

OTOH dplyr's group_by keeps grouping columns, even after calling summarize, mutate, transmute and do (which are their equivalents of combine).

The apply example from Pandas you give seems different from our combine: it sounds more similar to map, which in #1520 gives another GroupedDataFrame. Maybe what's needed is a way to remove grouping information from such a GroupedDataFrame, transforming it into a plain DataFrame. I guess we could call that operation ungroup. Just like combine, it could optionally take a transformation function for convenience and to avoid creating a second GroupedDataFrame, and it wouldn't add grouping columns.

The reason this happens is that the group keys are already contained in the dataframe that's passed as an argument to the lambda function. For example, one common use case might be doing a groupby which removes or appends some rows. In those cases you wind up with duplicate columns when the combine step tries to tack the keys back on.

I think the best solution to that is to check whether columns with the same names exist. dplyr's do uses the returned column, overwriting the grouping column in case of conflict. We could also check whether they are equal.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2018

I realize that because this is a breaking change there'd have to be overwhelming agreement that it's a good change. I'd definitely prefer it, but keeping the current behavior as default is obviously far safer. I'll concede this and change it so that the current behavior is the default next time I make a commit to this.

As far as checking whether the columns already exist: I think we should only do that as yet another explicit keyword argument. I'm worried that this would be unexpected and cause things to go horribly wrong if someone intended to transform them somehow. If we check whether they are equal, I'm worried about the performance implications.

I'll work on all these changes in a next commit, but I may just wait for #1520 to be merged before I take this up again.

@bkamins bkamins mentioned this pull request Jan 15, 2019
@nalimilan
Member

#1520 has been merged. Do you still feel the need for this now that we support the pairs syntax which is generally more convenient (and efficient) than the old syntax?

@ExpandingMan
Contributor Author

Great work on #1520 thanks for that.

In my opinion this is still desirable as long as the original by method exists. It's not completely clear to me whether there are still cases where the original method should be used, especially since one could always just iterate over the object returned by groupby. The original method would probably be a good thing to keep around since it's hard to foresee every possible use case.

Regardless, this certainly seems like a less important issue now, so if you really want to close it, I won't object.

@bkamins
Member

bkamins commented Jul 24, 2019

@nalimilan - any opinion on what we should do with this PR (given your overall plans for the groupby family roadmap 😄)?

@nalimilan
Member

I think it's uncontroversial to add an addkeys=true argument, at least as a first step. Then we can discuss whether we should change the default to false. In any case we should also find a strategy to handle more gracefully situations where addkeys=true and the returned data frame contains columns with the same names as the keys.

@nalimilan
Member

@ExpandingMan Would you be willing to rebase this on current master and rename append_keys to addkeys (or something else; "append" is used to append rows in DataFrames)?

@bkamins
Member

bkamins commented Sep 3, 2019

I think it should be added after #1938 is merged (so maybe it will be easier to open a new PR for this). In general I think that by default we should append keys.

@ExpandingMan
Contributor Author

Ok, I'm leaving this for now. If at some point you want me to open another PR, let me know.

@nalimilan
Member

You can rebase it when you want, #1938 has been merged.

FWIW, JuliaDB uses usekey = true.

@bkamins
Member

bkamins commented Dec 1, 2019

@nalimilan do we want to mark adding the addkeys kwarg as a required feature for 1.0? (I think not, since it is non-breaking, but it would be nice to have.)

@nalimilan
Member

What has to be decided is whether we want to change the default behavior or not.

@nalimilan nalimilan added this to the 1.0 milestone Dec 1, 2019
@bkamins
Member

bkamins commented Dec 1, 2019

I think we do not want to change the default behavior for 1.0. What we have now is what people have used for years and there is little benefit in changing the default (also, most of the time you want this behavior). But I think having an option to avoid adding keys by setting addkeys=false is sometimes useful and should be added.

@bkamins bkamins added breaking The proposed change is breaking. non-breaking The proposed change is not breaking and removed breaking The proposed change is breaking. labels Feb 12, 2020
@bkamins
Member

bkamins commented Feb 12, 2020

I am marking this non-breaking and removing the 1.0 label. This feature is nice to have but can be added later (the default will stay the same).

@bkamins bkamins removed this from the 1.0 milestone Feb 12, 2020
@nalimilan nalimilan mentioned this pull request Mar 21, 2020
@bkamins
Member

bkamins commented Mar 21, 2020

This PR will be handled in #2156. So I am closing it. If there are any objections please reopen.

@bkamins bkamins closed this Mar 21, 2020
Labels: grouping, non-breaking