
sanitized by function #1555

Closed
wants to merge 2 commits

Conversation

ExpandingMan
Contributor

This implements #1554, in particular now

by(identity, df, cols)

returns df up to an ordering ambiguity. You can get the old behavior by doing append_keys=true.

Again, I'm not entirely sure whether this is something everyone agrees on, but if it is, here's the PR.

I'm not at all attached to the keyword append_keys; actually it's kind of horrible, so I'm open to better suggestions.

This could use a few tests, but let's see if there's any consensus first.

@nalimilan
Member

Thanks. Unfortunately that's going to conflict with #1520.

Regarding the solution itself, I'm fine with adding an argument but I'd use true by default, and I'd maybe call it addkeys (append! appends rows).

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2018

So, I realize this is breaking behavior and that it's kind of a high bar to merge this with the new behavior as default, but I really do think this behavior is a better default, so let me give some more compelling reasons.

First, I'm not sure how much we want to emulate pandas, but it is worth noting that the closest pandas equivalent to by does indeed have the behavior I'm suggesting:

df.groupby(cols).apply(lambda x: x)

indeed returns df up to an ordering ambiguity.
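
A minimal pandas sketch of this claim (column names are illustrative; group_keys=False is passed explicitly here, since newer pandas versions changed how apply handles group keys by default):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Applying the identity function per group reproduces the original frame
# up to row ordering; no extra key columns are added, since the grouping
# column is already part of each sub-frame passed to the lambda.
out = df.groupby("g", group_keys=False).apply(lambda x: x)
assert out.sort_index().equals(df)
```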

I'd argue that this behavior is much more intuitive as a default. I'd expect the combine function to perform some sort of reduction or concatenation. Other concatenation functions typically don't add any type of key (whether or not that makes sense in the context). To this day I frequently get tripped up by the extra columns produced by by. I suspect I'm not alone (especially considering the pandas behavior).

I agree with you that wanting to retain the group keys is an extremely common use case, but that's precisely one of the problems I have with this: I wind up having to filter out redundant columns more often than not. The reason this happens is that the group keys are already contained in the dataframe that's passed as an argument to the lambda function. For example, one common use case might be doing a groupby which removes or appends some rows. In those cases you wind up with duplicate columns when the combine step tries to tack the keys back on.
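
The redundancy described here can be sketched in pandas terms (the g_key name is hypothetical; it stands in for whatever column an unconditional "add keys" combine step would prepend):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Simulate a combine step that unconditionally tacks the group key onto
# each returned sub-frame, even though the sub-frame already carries the
# key column; the key information ends up stored twice.
pieces = []
for key, sub in df.groupby("g"):
    keyed = sub.copy()
    keyed.insert(0, "g_key", key)  # hypothetical "always add keys" behavior
    pieces.append(keyed)
out = pd.concat(pieces)

assert list(out.columns) == ["g_key", "g", "x"]
assert (out["g_key"] == out["g"]).all()  # redundant copy of the keys
```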

add_keys sounds a bit strange to me because I'm not sure the word add is appropriate here. How about key_columns?

@ExpandingMan
Contributor Author

Regarding #1520, by all means of course move ahead with it. It should be easy enough to make whatever change we decide on after that is merged.

@nalimilan
Member

OTOH dplyr's group_by keeps grouping columns, even after calling summarize, mutate, transmute and do (which are their equivalents of combine).

The apply example from Pandas you give seems different from our combine: it sounds more similar to map, which in #1520 gives another GroupedDataFrame. Maybe what's needed is a way to remove grouping information from such a GroupedDataFrame, transforming it into a plain DataFrame. I guess we could call that operation ungroup. Just like combine, it could optionally take a transformation function for convenience and to avoid creating a second GroupedDataFrame, and it wouldn't add grouping columns.

The reason this happens is that the group keys are already contained in the dataframe that's passed as an argument to the lambda function. For example, one common use case might be doing a groupby which removes or appends some rows. In those cases you wind up with duplicate columns when the combine step tries to tack the keys back on.

I think the best solution to that is to check whether columns with the same names exist. dplyr's do uses the returned column, overwriting the grouping column in case of conflict. We could also check whether they are equal.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2018

I realize that because this is a breaking change there'd have to be overwhelming agreement that it's a good change. I'd definitely prefer it, but keeping the current behavior as default is obviously far safer. I'll concede this and change it so that the current behavior is the default next time I make a commit to this.

As far as checking whether the columns already exist: I think we should only do that as yet another explicit keyword argument. I'm worried that this would be unexpected and cause things to go horribly wrong if someone intended to transform them somehow. If we check whether they are equal, I'm worried about the performance implications.

I'll work on all these changes in a next commit, but I may just wait for #1520 to be merged before I take this up again.

@bkamins bkamins mentioned this pull request Jan 15, 2019
@nalimilan
Member

#1520 has been merged. Do you still feel the need for this now that we support the pairs syntax which is generally more convenient (and efficient) than the old syntax?

@ExpandingMan
Contributor Author

Great work on #1520 thanks for that.

In my opinion this is still desirable as long as the original by method exists. It's not completely clear to me whether there are still cases where the original method should be used, especially since one could always just iterate over the object returned by groupby. The original method would probably be a good thing to keep around since it's hard to foresee every possible use case.

Regardless, this certainly seems like a less important issue now, so if you really want to close it, I won't object.

@bkamins
Member

bkamins commented Jul 24, 2019

@nalimilan - any opinion on what we should do with this PR (given your overall plans for the groupby family roadmap 😄)?

@nalimilan
Member

I think it's uncontroversial to add an addkeys=true argument, at least as a first step. Then we can discuss whether we should change the default to false. In any case we should also find a strategy to handle more gracefully situations where addkeys=true and the returned data frame contains columns with the same names as the keys.

@nalimilan
Member

@ExpandingMan Would you be willing to rebase this on current master and rename append_keys to addkeys (or something else; "append" is used to append rows in DataFrames)?

@bkamins
Member

bkamins commented Sep 3, 2019

I think it should be added after #1938 is merged (so maybe it will be easier to open a new PR for this). In general I think that by default we should append keys.

@ExpandingMan
Contributor Author

Ok, I'm leaving this for now. If at some point you want me to open another PR, let me know.

@nalimilan
Member

You can rebase it when you want, #1938 has been merged.

FWIW, JuliaDB uses usekey = true.

@bkamins
Member

bkamins commented Dec 1, 2019

@nalimilan do we want to mark adding the addkeys kwarg as a required feature for 1.0? (I think not, since it is non-breaking, but it would be nice to have.)

@nalimilan
Member

What has to be decided is whether we want to change the default behavior or not.

@nalimilan nalimilan added this to the 1.0 milestone Dec 1, 2019
@bkamins
Member

bkamins commented Dec 1, 2019

I think we do not want to change the default behavior for 1.0. What we have now is what people have used for years and there is little benefit in changing the default (also, most of the time you want this behavior). But I think having an option to avoid adding keys by setting addkeys=false is sometimes useful and should be added.

@bkamins bkamins added breaking The proposed change is breaking. non-breaking The proposed change is not breaking and removed breaking The proposed change is breaking. labels Feb 12, 2020
@bkamins
Member

bkamins commented Feb 12, 2020

I am marking this non-breaking and removing the 1.0 label. This feature is nice to have but can be added later (the default will stay the same).

@bkamins bkamins removed this from the 1.0 milestone Feb 12, 2020
@nalimilan nalimilan mentioned this pull request Mar 21, 2020
@bkamins
Member

bkamins commented Mar 21, 2020

This PR will be handled in #2156. So I am closing it. If there are any objections please reopen.

@bkamins bkamins closed this Mar 21, 2020
Labels: grouping, non-breaking