Add all keyword argument to nonunique #2238

Closed
CameronBieganek opened this issue May 8, 2020 · 8 comments
Labels
feature non-breaking The proposed change is not breaking

@CameronBieganek

Often when you use nonunique(df, cols), you want to be able to look at the rows that are non-unique according to cols to see if there are differences in the columns other than cols. It would be handy if there were an all keyword argument to nonunique that returns all the duplicates. (Right now the first occurrence of a row is not included in the output.) If all is true, it might actually make more sense to return a vector of vectors of indices, like this:

[
    [2, 3],
    [7, 10, 11],
    [15, 21]
]
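
A minimal sketch of how such index groups could be computed with functions that already exist (groupby and groupindices). The helper name duplicate_index_groups is hypothetical and not part of DataFrames.jl:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): collect the row indices of
# every group that has more than one row when grouping by `cols`.
function duplicate_index_groups(df::AbstractDataFrame, cols)
    gidx = groupindices(groupby(df, cols))  # group number of each row of df
    groups = Dict{Int,Vector{Int}}()
    for (row, g) in enumerate(gidx)
        push!(get!(groups, g, Int[]), row)
    end
    # Keep only groups with at least two rows, ordered by their first row.
    dups = Vector{Int}[rows for rows in values(groups) if length(rows) > 1]
    return sort!(dups; by=first)
end

df = DataFrame(a = [1, 2, 2, 3, 3], b = [4, 5, 5, 6, 7])
duplicate_index_groups(df, :a)        # [[2, 3], [4, 5]]
duplicate_index_groups(df, [:a, :b])  # [[2, 3]]
```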
@bkamins
Member

bkamins commented May 8, 2020

Makes sense. I have often missed this functionality.

bkamins added the feature and non-breaking (the proposed change is not breaking) labels on May 8, 2020
bkamins added this to the 1.x milestone on May 8, 2020
@bkamins
Member

bkamins commented May 8, 2020

The only thing is that you can already write:

@pipe df |> groupby(_, cols) |> combine(x -> nrow(x) > 1 ? x : DataFrame(), _, ungroup=false)

This gives you exactly what you ask for (and lets you immediately see the duplicate rows as separate data frames).

@nalimilan
Member

nonunique returns a Boolean vector with one element per row. It would be weird to me to have it return a vector with one Vector{Int} element per unique row just because you change an argument. This sounds more like a grouping operation to me as @bkamins showed.

@CameronBieganek
Author

Yeah, I agree it makes sense to maintain the type of the output as a vector of Bools. What could be done is the following:

julia> df = DataFrame(a = [1, 2, 2], b = [3, 4, 4]);

julia> nonunique(df; all=true)
3-element Array{Bool,1}:
 0
 1
 1

@bkamins
Member

bkamins commented May 18, 2020

Now that I think of it, I would rather do:

groupindices(groupby(df, cols))

This should give you exactly what you want, right? (The result structure is different, but you get all the information you require.)

@CameronBieganek
Author

Well, your solution above that returns a grouped data frame actually worked well for me. :)
I think groupindices(groupby(df, cols)) doesn't work as nicely, because it returns the indices for groups you don't care about:

julia> df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 2     │ 5     │
│ 4   │ 3     │ 6     │

julia> groupindices(groupby(df, :a))
4-element Array{Union{Missing, Int64},1}:
 1
 2
 2
 3

If one were to use nonunique, one would probably want this to work:

julia> df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6]);

julia> df[nonunique(df; all=true), :]
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │
│ 2   │ 2     │ 5     │
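
For illustration, a mask with this behaviour can be assembled from groupindices by counting how many rows fall into each group. This is only a sketch; all_duplicates_mask is a hypothetical name, not a DataFrames.jl function:

```julia
using DataFrames

# Hypothetical helper: `true` for every row whose values in `cols` occur
# more than once in `df`, including the first occurrence.
function all_duplicates_mask(df::AbstractDataFrame, cols)
    gidx = groupindices(groupby(df, cols))  # group number of each row
    counts = Dict{Int,Int}()
    for g in gidx
        counts[g] = get(counts, g, 0) + 1
    end
    return [counts[g] > 1 for g in gidx]
end

df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])
mask = all_duplicates_mask(df, [:a, :b])  # Bool[0, 1, 1, 0]
df[mask, :]                               # rows 2 and 3 of df
```

Unlike the raw groupindices output, the mask has one Bool per row, so it can be used directly for row indexing.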

bkamins modified the milestones: 1.x, 1.5 on Sep 14, 2022
@bkamins
Member

bkamins commented Dec 2, 2022

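We could define a duplicates function as follows:

function duplicates(df::AbstractDataFrame, cols, duplicatecolname=:duplicates)
    # keep only the key columns; copy them unless df is already a DataFrame
    tmp_df = select(df, cols, copycols=!(df isa DataFrame))
    # remember the original row number of every row
    insertcols!(tmp_df, duplicatecolname => axes(df, 1))
    # group by the key columns; groups with a single row produce no output row
    return combine(groupby(tmp_df, 1:ncol(tmp_df)-1), duplicatecolname => (x -> length(x) == 1 ? typeof(x)[] : [x]) => duplicatecolname)
end

Example: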

julia> df = DataFrame(a = [1, 2, 2, 3, 3], b = [4, 5, 5, 6, 7])
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     2      5
   4 │     3      6
   5 │     3      7

julia> duplicates(df, :a)
2×2 DataFrame
 Row │ a      duplicates
     │ Int64  SubArray…
─────┼───────────────────
   1 │     2  [2, 3]
   2 │     3  [4, 5]

julia> duplicates(df, :)
1×3 DataFrame
 Row │ a      b      duplicates
     │ Int64  Int64  SubArray…
─────┼──────────────────────────
   1 │     2      5  [2, 3]

Do we think it is a useful addition?

@bkamins
Member

bkamins commented Feb 5, 2023

Fixed in #3260
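
A short usage sketch, assuming DataFrames.jl 1.5 or later, where nonunique accepts a keep keyword argument and keep=:noduplicates flags every row that has a duplicate (the first occurrence included). That the installed version provides this keyword is an assumption here, so check the nonunique docstring of your version:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])

# Assumes the `keep` keyword is available (DataFrames.jl 1.5+).
mask = nonunique(df; keep=:noduplicates)  # Bool[0, 1, 1, 0]
df[mask, :]                               # all rows that have a duplicate
```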

bkamins closed this as completed on Feb 5, 2023