check for preserved index column during combine #1460

kleinschmidt · 2018-07-19T20:14:48Z

When using by with a function that returns a data frame with the grouping columns still present, combine interprets each such column as a duplicate and renames it with a warning. It would be nicer if we could check whether the grouping column is present and all of its elements are equal to the group value, and if so ignore it.

current behavior:

julia> by(d, :c1) do df
           filter(row -> row[:c2] == maximum(df[:c2]), df)
       end
WARNING: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
...
2×3 DataFrames.DataFrame
│ Row │ c1 │ c1_1 │ c2 │
├─────┼────┼──────┼────┤
│ 1   │ a  │ a    │ 3  │
│ 2   │ b  │ b    │ 4  │

desired behavior:

2×2 DataFrames.DataFrame
│ Row │ c1 │ c2 │
├─────┼────┼────┤
│ 1   │ a  │ 3  │
│ 2   │ b  │ 4  │

bkamins · 2018-07-27T20:28:49Z

MWE would even be:

by(identity, d, :c1)

with the same result.

Maybe a lesser evil (at a cost of performance) would be to do what you propose only if the columns have identical contents. If they do not, maybe we should throw an error (in the future; for now it would be a warning).

bkamins · 2019-01-21T09:52:58Z

@nalimilan I see two approaches here:

1. do what you propose (silently overwrite the columns)
2. add a keepgrouping keyword argument to combine and by; if it is set to true (the default) we keep the current behavior; if it is set to false we always drop them (this might be useful in some cases in general)

If we picked option 2., then at the same time we should add a skipmissing keyword argument to by to clean up the API. What would you prefer?

bkamins · 2019-01-21T09:55:29Z

By "drop them" I mean that we do not append grouping columns to the result.

nalimilan · 2019-01-21T10:07:42Z

Option 2. is #1555 with a different default. I'm not completely sure which option is best. Behavior 1. sounds the most useful in practice, with the possible drawback that one could overwrite grouping columns accidentally and get incorrect results. We had discussed the option of checking whether columns are equal and throwing an error if they aren't (possibly with an argument to allow overwriting even if different).

bkamins · 2019-01-21T10:37:00Z

Ah - right. We also should keep column order in mind (when we add grouping columns they come first).

Given the new split-apply-combine API (not using a SubDataFrame by default) I think that in practice it is better to have a keyword argument like in #1555, but set to keep the grouping columns by default (as I have proposed in option 2).

Also if we agree on this scenario I would not overwrite the columns if the user wants to keep grouping columns and there are duplicate column names, but keep the current behavior. But I am open to other opinions.

CC @ExpandingMan

nalimilan · 2019-01-21T12:41:24Z

I'm fine with adding an argument, but I don't really like the current behavior with duplicate columns.

A safer approach with a view to releasing 1.0 would be to throw an error by default when there are duplicate columns that aren't equal to the grouping columns, so that we can switch to any behavior later if we want (or keep that behavior). One argument in favor of this is that I don't think adding the new columns with names generated by makeunique is useful in practice: you are more likely to either ignore these columns, or rename them (which can be done beforehand). Also, the current behavior doesn't protect you from mistakes, since if you try to access the newly generated columns using their names, you will silently get the grouping columns.

bkamins · 2019-01-21T12:46:30Z

OK - would you be willing to make a PR? (I guess it is better if you do it, as you know the split-apply-combine internals best.)

#1555 would need heavy rebasing anyway.

And it would be great if the skipmissing keyword argument were also made consistent in by.

nalimilan · 2019-09-03T09:36:03Z

#1938 implements the discussed solution: stop adding grouping columns with makeunique=true, and throw an error if columns are not equal.
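
For readers following the thread, the rule can be sketched in plain Julia. This is only an illustration under stated assumptions, not the code from #1938: the helper name attach_group_key is hypothetical, and one group's output is modeled as a NamedTuple of equal-length column vectors rather than a data frame.

```julia
# Hypothetical helper, NOT the DataFrames.jl implementation.
# `key` holds the grouping values for one group, e.g. (c1 = "a",).
# `result` is that group's output: a NamedTuple of equal-length column vectors.
function attach_group_key(key::NamedTuple, result::NamedTuple)
    for name in keys(key)
        if haskey(result, name)
            # A returned column may share a grouping column's name only if
            # it repeats the group's key value on every row.
            if !all(x -> isequal(x, key[name]), result[name])
                throw(ArgumentError("column :$name differs from the grouping key"))
            end
        end
    end
    nrows = isempty(result) ? 0 : length(first(result))
    # Drop the redundant copies, then put the grouping columns first.
    rest = Base.structdiff(result, key)
    keycols = map(v -> fill(v, nrows), key)
    return merge(keycols, rest)
end

# A group whose output still carries a matching c1 column yields c1 only once.
attach_group_key((c1 = "a",), (c1 = ["a"], c2 = [3]))
```

With a mismatching column, for example (c1 = ["b"], c2 = [3]) for the group keyed by "a", the sketch throws an ArgumentError instead of producing a c1_1 column.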

nalimilan added the intro issue label Sep 20, 2018

nalimilan added the Hacktoberfest label Oct 2, 2018

This was referenced Oct 4, 2018

standardized return values for groupby #1554

Closed

Improve performance of by() using NamedTuples #1520

Merged

bkamins mentioned this issue Nov 24, 2018

Support type-stable map and combine on GroupedDataFrame #1601

Merged

bkamins mentioned this issue Dec 4, 2018

Split-apply-combine todo #1616

Closed

8 tasks

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

nalimilan mentioned this issue Jan 21, 2019

DataFrame for GroupedDataFrame #1689

Merged

nalimilan mentioned this issue Sep 3, 2019

Stop using makeunique=true for grouping keys in combine #1938

Merged

nalimilan closed this as completed in #1938 Sep 3, 2019
