Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for preserved index column during combine #1460

Closed
kleinschmidt opened this issue Jul 19, 2018 · 8 comments · Fixed by #1938
Closed

check for preserved index column during combine #1460

kleinschmidt opened this issue Jul 19, 2018 · 8 comments · Fixed by #1938

Comments

@kleinschmidt
Copy link
Contributor

When using by with a function that returns a dataframe with the grouping columns still present, combine interprets it as a duplicate column and renames it with a warning. It would be nicer if we could check for whether the grouping column is present and all elements equal to the group values, and if so ignore it.

current behavior:

julia> by(d, :c1) do df
           filter(row -> row[:c2] == maximum(df[:c2]), df)
       end
WARNING: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
...
2×3 DataFrames.DataFrame
│ Row │ c1 │ c1_1 │ c2 │
├─────┼────┼──────┼────┤
│ 1   │ a  │ a    │ 3  │
│ 2   │ b  │ b    │ 4  │

desired behavior:

2×2 DataFrames.DataFrame
│ Row │ c1 │ c2 │
├─────┼────┼────┤
│ 1   │ a  │ 3  │
│ 2   │ b  │ 4  │
@bkamins
Copy link
Member

bkamins commented Jul 27, 2018

MWE would even be:

by(identity, d, :c1)

with the same result.

Maybe a lesser evil (at a cost of performance) would be to do what you propose only if the columns have identical contents. If they would not maybe we should throw an error (this will be in the future - and a warning now).

@bkamins
Copy link
Member

bkamins commented Jan 21, 2019

@nalimilan I see two approaches here:

  1. do what you propose (silently overwrite the columns)
  2. add a keepgrouping keyword argument to combine and by; if it is set to true (the default) we keep the current behavior; if it is set to false we always drop them (this might be useful in some cases in general)

If we picked option 2., then at the same time we should add skipmissing keyword argument to by to clean up the API. What would you prefer?

@bkamins
Copy link
Member

bkamins commented Jan 21, 2019

By "drop them" I mean that we do not append grouping columns to the result.

@nalimilan
Copy link
Member

Option 2. is #1555 with a different default. I'm not completely sure which option is best. Behavior 1. sounds the most useful in practice, with the possible drawback that one could overwrite grouping columns accidentally and get incorrect results. We had discussed the option of checking whether columns are equal and throw an error if they aren't (possibly with an argument to allow overwriting even if different).

@bkamins
Copy link
Member

bkamins commented Jan 21, 2019

Ah - right. We also should keep column order in mind (when we add grouping columns they come first).

Given the new split-apply-combine API (not using a SubDataFrame by default) I think that in practice it is better to have a keyword argument like in #1555, but set to keep the grouping columns by default (as I have proposed in option 2).

Also if we agree on this scenario I would not overwrite the columns if the user wants to keep grouping columns and there are duplicate column names, but keep the current behavior. But I am open to other opinions.

CC @ExpandingMan

@nalimilan
Copy link
Member

I'm fine with adding an argument, but I don't really like the current behavior with duplicate columns.

A safer approach in the perspective of releasing 1.0 would be to throw an error by default when there are duplicate columns that aren't equal to the grouping columns, so that we can switch to any behavior later if we want (or keep that behavior). One argument in favor of this is that I don't think adding the new columns with names generated by makeunique is useful in practice: you are more likely to either ignore these columns, or rename them (which can be done beforehand). Also, the current behavior doesn't protect you from mistakes since if you try to access the newly generated columns using their names, you will silently get the grouping columns.

@bkamins
Copy link
Member

bkamins commented Jan 21, 2019

OK - would you be willing to make a PR (I guess it is better if you do it, as you know the split-apply-combine internals best).

#1555 would need heavy rebasing anyway.

And it would be great if skipmissing keyword agrument were made also consistent in by.

@nalimilan
Copy link
Member

#1938 implements the discussed solution: stop adding grouping columns with makeunique=true, and throw an error if columns are not equal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants