sanitized by function #1555
Thanks. Unfortunately that's going to conflict with #1520. Regarding the solution itself, I'm fine with adding an argument but I'd use
So, I realize this is breaking behavior and that it's kind of a high bar to merge this with the new behavior as default, but I really do think this behavior is a better default, so let me give some more compelling reasons. First, I'm not sure how much we want to emulate pandas, but it is worth noting that the closest pandas equivalent to df.groupby(cols).apply(lambda x: x) indeed returns I'd argue that this behavior is much more intuitive as a default. I'd expect the I agree with you that wanting to retain the group keys is an extremely common use case, but that's precisely one of the problems I have with this: I wind up having to filter out redundant columns more often than not. The reason this happens is that the group keys are already contained in the dataframe that's passed as an argument to the lambda function. For example, one common use case might be doing a groupby which removes or appends some rows. In those cases you are going to wind up with duplicate rows when the
Regarding #1520, by all means of course move ahead with it. It should be easy enough to make whatever change we decide on after that is merged.
OTOH dplyr's The
I think the best solution to that is to check whether columns with the same names exist. dplyr's
I realize that because this is a breaking change there'd have to be overwhelming agreement that it's a good change. I'd definitely prefer it, but keeping the current behavior as default is obviously far safer. I'll concede this and change it so that the current behavior is the default the next time I make a commit to this. As far as checking whether the columns already exist: I think we should only do that as yet another explicit keyword argument. I'm worried that this would be unexpected and cause things to go horribly wrong if someone intended to transform them somehow. If we check whether they are equal, I'm worried about the performance implications. I'll work on all these changes in the next commit, but I may just wait for #1520 to be merged before I take this up again.
#1520 has been merged. Do you still feel the need for this now that we support the pairs syntax which is generally more convenient (and efficient) than the old syntax?
Great work on #1520, thanks for that. In my opinion this is still desirable as long as the original Regardless, this certainly seems like a less important issue now, so if you really want to close it, I won't object.
@nalimilan - any opinion what we should do with this PR (given your overall plans for
I think it's uncontroversial to add an
@ExpandingMan Would you be willing to rebase this on current master and rename
I think it should be added after #1938 is merged (so maybe it will be easier to open a new PR for this). In general I think that by default we should append keys.
Ok, I'm leaving this for now. If at some point you want me to open another PR, let me know.
You can rebase it when you want, #1938 has been merged. FWIW, JuliaDB uses
@nalimilan do we want to mark adding
What has to be decided is whether we want to change the default behavior or not.
I think we do not want to change the default behavior for 1.0. What we have now is what people have used for years and there is little benefit in changing the default (also most of the time you want this behavior). But I think having an option to avoid adding keys by setting
I'll mark it non-breaking and remove the 1.0 label. This feature is nice to have but can be added later (the default will stay the same).
This PR will be handled in #2156, so I am closing it. If there are any objections please reopen.
This implements #1554, in particular now by(identity, df, cols) returns df up to an ordering ambiguity. You can get the old behavior by passing append_keys=true.
Again, I'm not entirely sure whether this is something everyone agrees on, but if it is, here's the PR.
I'm not at all attached to the keyword append_keys, actually it's kind of horrible, so I'm open to better suggestions.
This could use a few tests, but let's see if there's any consensus first.
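For readers skimming the thread, here is a rough sketch of the two behaviors under discussion, written in plain Base Julia rather than against DataFrames.jl. Everything in it is an assumption made for illustration: by_sketch is a hypothetical helper (not the package's by), rows are modelled as NamedTuples, and the _key suffix is only one possible way of naming an appended key column. It is not the code from this PR.

```julia
# Hypothetical helper, NOT the DataFrames.jl API. Rows are NamedTuples and
# `f` maps the rows of one group to a vector of result rows.
function by_sketch(f, rows, key::Symbol; append_keys::Bool=true)
    out = NamedTuple[]
    # Visit groups in order of first appearance of each key value.
    for k in unique(r[key] for r in rows)
        group = [r for r in rows if r[key] == k]
        for res in f(group)
            if append_keys
                # Add the group key as an extra column. The `_key` suffix is an
                # assumption chosen here so the redundant column is visible.
                keycol = NamedTuple{(Symbol(key, :_key),)}((k,))
                push!(out, merge(keycol, res))
            else
                # Return exactly what `f` produced, with no extra column.
                push!(out, res)
            end
        end
    end
    return out
end

rows = [(g = 1, x = 10), (g = 2, x = 20), (g = 1, x = 30)]

# With identity, the group key is already part of every row that `f` returns.
without_keys = by_sketch(identity, rows, :g; append_keys=false)
with_keys = by_sketch(identity, rows, :g; append_keys=true)
```

In this sketch without_keys contains the same rows as the input, only reordered by group, while every row of with_keys carries the key twice (g_key and g). That redundancy is the column that has to be filtered out afterwards when the function already returns the grouping columns.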