Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of duplicate column names by join #1333

Closed
bkamins opened this issue Dec 29, 2017 · 11 comments · Fixed by #2313
Closed

Handling of duplicate column names by join #1333

bkamins opened this issue Dec 29, 2017 · 11 comments · Fixed by #2313
Labels
non-breaking The proposed change is not breaking priority
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Dec 29, 2017

Following the discussion in #1308 (comment) here is the satus what other packages do:

  • Pandas has lsuffix and rsuffix arguments
  • R base has suffixes argument
  • dplyr has suffix

I guess we should add something similar. Any comments?

@bkamins
Copy link
Member Author

bkamins commented Dec 29, 2017

As a side note base R and Pandas may produce duplicates by adding a suffix, whereas dplyr is careful enough to avoid them.

@ararslan
Copy link
Member

How annoying would it be to simply throw an error for duplicate column names to force the user to handle them manually? We could also prefix duplicate column names with the name of the table they came from, e.g. joining A and B where both have a column c will result in a table with A_c and B_c or something like that.

@bkamins
Copy link
Member Author

bkamins commented Dec 29, 2017

Now we have makeunique that allows or disallows duplicates. I guess it is annoying to disallow it as every other framework has this functionality.

Here is a snippet that could be considered to fix the names by suffixing if makeunique is set to true (it seems to be a standard to use suffix not a prefix):

function fixnames(left, right; lsuffix="_left", rsuffix="_right")
    lsuffix == rsuffix && throw(ArgumentError("left and right suffix must be different"))
    common = Set(intersect(left, right))
    for (x, suffix) in [(left, lsuffix), (right, rsuffix)]
        for i in eachindex(x)
            if x[i] in common
                x[i] = Symbol(string(x[i], suffix))
            end
        end
    end
    pool = Set{Symbol}()
    for (x, suffix) in [(left, lsuffix), (right, rsuffix)]
        for (i, s) in enumerate(x)
            while s in pool
                s = Symbol(string(s, suffix))
            end
            x[i] = s
            push!(pool, s)
        end
    end
end

@nalimilan
Copy link
Member

@ararslan I think it's annoying to handle manually, because you have to find which columns are present in the two inputs, skipping columns you're joining on.

Regarding the API, why not use suffix as in dplyr, and make it a pair? That would be consistent with what we do with on (#1297).

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins
Copy link
Member Author

bkamins commented Feb 12, 2020

CC @oxinabox - this is directly relevant to what you have asked. Do you have a preference for API?

@bkamins
Copy link
Member Author

bkamins commented Feb 12, 2020

Note that in some joins we allow more than 2 data frames.

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 1.x milestone Apr 4, 2020
@bkamins
Copy link
Member Author

bkamins commented Jun 8, 2020

When working on this the following case should be tested against as it is potentially confusing I think:

julia> df1 = DataFrame(a=[1,2], b=["x","y"])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │

julia> df2 = DataFrame(b=[1,2], c=["x","y"])
2×2 DataFrame
│ Row │ b     │ c      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │

julia> leftjoin(df1, df2, on=:a => :b)
2×3 DataFrame
│ Row │ a     │ b      │ c       │
│     │ Int64 │ String │ String? │
├─────┼───────┼────────┼─────────┤
│ 1   │ 1     │ x      │ x       │
│ 2   │ 2     │ y      │ y       │

Also a general question is if we rename columns with suffixes should on columns be also altered.

@bkamins
Copy link
Member Author

bkamins commented Jun 9, 2020

I think we should go for suffix with a vector of suffixes following the rules:

  • on columns are not suffixed
  • all other columns are suffixed if suffix is passed
  • deduplication is made after adding of the suffix
  • technical columns, if generated are not suffixed

Any comments on this?

@oxinabox
Copy link
Contributor

oxinabox commented Jun 9, 2020

that sounds right to me.
Though a few points:

Having to give a vector, of 1 suffix per joined table.

This seems like it is going to get annoying in the cases where you are just joining 2 things.
I suggest that leftjoin and rightjoin also accept a single suffix, which they only apply to the right/left columns.
but also allow 2 names for that case.
and if you have more than 2 tables being joined, and you want to rename then you need to use the 1 rename per column including original

Suffix only:

The choice of suffix only seems off to me.
For a few reasons:
firstly, I have hundreds of lines of code right now that does prefix, and the fact that people i work with have kind of on there own just decided to use prefix makes me think that it is no less natural than suffix.

Further depending what your table names are prefix or suffix may be more natural.
Example,
lets say I have produce info for a number of resturants
Say: McDannys, BurgerDuke, Wendals
with columns: product, price, calories
then I will join on product e.g. burger, fries, cola.
and I would like my columns to end-up as McDannys_price, BurgerDuke_price, Wendal_price etc

I suggest instead we should accept a (vector of) function like rename does.
Arguably we should also take a list/dict of pairs like rename does
We could consider defining prefix(pre)=x->pre*x and suffix(post)=x*pos
then we could have prefix("McDannys_)

also means we can use identity for "leave it alone"
which probably matters for bigger vectors.

technical columns, if generated are not suffixed

I don't know what a technical column is.

@bkamins
Copy link
Member Author

bkamins commented Jun 9, 2020

Technical columns: currently it is only column specified by indicator kwarg in some of joins.

Regarding suffix - I agree with your point. I am a bit afraid to allow such flexibility as rename does. It seems to me that going for renaming function like in unstack should be enough. OK?

@oxinabox
Copy link
Contributor

oxinabox commented Jun 9, 2020

Regarding suffix - I agree with your point. I am a bit afraid to allow such flexibility as rename does. It seems to me that going for renaming function like in unstack should be enough. OK?

Yes, i think symmetry with unstack is desirable, and indeed after unstacking one is likely to join and so it kind of makes sense that the same piece of information could be renamed either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants