Add `all` keyword argument to `nonunique` #2238

CameronBieganek · 2020-05-08T15:31:19Z

Often when you use `nonunique(df, cols)`, you want to look at the rows that are non-unique according to `cols` to see whether they differ in the columns other than `cols`. It would be handy if `nonunique` had an `all` keyword argument that returns all the duplicates. (Right now the first occurrence of a row is not included in the output.) If `all` is true, it might actually make more sense to return a vector of vectors of indices, like this:

[
    [2, 3],
    [7, 10, 11],
    [15, 21]
]

bkamins · 2020-05-08T15:33:06Z

Makes sense - I often missed this functionality

bkamins · 2020-05-08T15:35:02Z

The only thing is that now you can write:

@pipe df |> groupby(_, cols) |> combine(x -> nrow(x) > 1 ? x : DataFrame(), _, ungroup=false)

To get exactly what you ask for (and immediately see the duplicate rows as separate data frames)

nalimilan · 2020-05-18T20:32:48Z

`nonunique` returns a Boolean vector with one element per row. It would be weird to me to have it return a vector with one `Vector{Int}` element per unique row just because you change an argument. This sounds more like a grouping operation to me, as @bkamins showed.

CameronBieganek · 2020-05-18T20:52:49Z

Yeah, I agree it makes sense to maintain the type of the output as a vector of Bools. What could be done is the following:

julia> df = DataFrame(a = [1, 2, 2], b = [3, 4, 4]);

julia> nonunique(df; all=true)
3-element Array{Bool,1}:
 0
 1
 1

bkamins · 2020-05-18T21:21:16Z

Now that I think of it, I would rather do:

groupindices(groupby(df, cols))

this should give you exactly what you want. Right? (the result structure is different, but you get all the information you require)

CameronBieganek · 2020-05-18T21:47:01Z

Well, your solution above that returns a grouped data frame actually worked well for me. :)
I think `groupindices(groupby(df, cols))` doesn't work as nicely, because it returns the indices for groups you don't care about:

julia> df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 2     │ 5     │
│ 4   │ 3     │ 6     │

julia> groupindices(groupby(df, :a))
4-element Array{Union{Missing, Int64},1}:
 1
 2
 2
 3

If one were to use nonunique, one would probably want this to work:

julia> df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6]);

julia> df[nonunique(df; all=true), :]
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │
│ 2   │ 2     │ 5     │
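
For reference, a mask like this can already be built from the existing grouping functions. The following is only a minimal sketch, assuming duplicates are defined over both columns `:a` and `:b`; the names `counts` and `mask` are illustrative and not part of DataFrames.jl:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])

# Attach the size of each row's group to every row.
# transform keeps the rows in their original order.
counts = transform(groupby(df, [:a, :b]), nrow => :n).n

# A row has a duplicate exactly when its group contains more than one row.
mask = counts .> 1

# Select every duplicated row, including the first occurrence.
df[mask, :]
```

Unlike the `groupindices` approach, this yields a `Bool` vector with one element per row, so it can be used directly for row indexing.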

bkamins · 2022-12-02T21:58:20Z

We could define a `duplicates` function as follows:

function duplicates(df::AbstractDataFrame, cols, duplicatecolname=:duplicates)
    tmp_df = select(df, cols, copycols=!(df isa DataFrame))
    insertcols!(tmp_df, duplicatecolname => axes(df, 1))
    return combine(groupby(tmp_df, 1:ncol(tmp_df)-1), duplicatecolname => (x -> length(x) == 1 ? typeof(x)[] : [x]) => duplicatecolname)
end

example:

julia> df = DataFrame(a = [1, 2, 2, 3, 3], b = [4, 5, 5, 6, 7])
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     2      5
   4 │     3      6
   5 │     3      7

julia> duplicates(df, :a)
2×2 DataFrame
 Row │ a      duplicates
     │ Int64  SubArray…
─────┼───────────────────
   1 │     2  [2, 3]
   2 │     3  [4, 5]

julia> duplicates(df, :)
1×3 DataFrame
 Row │ a      b      duplicates
     │ Int64  Int64  SubArray…
─────┼──────────────────────────
   1 │     2      5  [2, 3]

Do we think it is a useful addition?

bkamins · 2023-02-05T08:44:48Z

Fixed in #3260
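
Assuming the change referenced here is the `keep` keyword argument that `nonunique` accepts in DataFrames.jl 1.5 and later, the mask requested at the top of this issue could be sketched as follows. With `keep=:noduplicates`, only rows without any duplicates are marked `false`, so every occurrence of a duplicated row, including the first, is marked `true`:

```julia
using DataFrames  # assumes DataFrames.jl 1.5 or later

df = DataFrame(a = [1, 2, 2, 3], b = [4, 5, 5, 6])

# Assumption: keep=:noduplicates flags every row that has a duplicate,
# including its first occurrence.
mask = nonunique(df; keep=:noduplicates)

# Inspect all duplicated rows together.
df[mask, :]
```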

bkamins added the feature and non-breaking labels May 8, 2020

bkamins added this to the 1.x milestone May 8, 2020

bkamins modified the milestones: 1.x, 1.5 Sep 14, 2022

bkamins closed this as completed Feb 5, 2023
