Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple uniqueness checks for sorting-related functions #3312

Merged
merged 31 commits into from
Jun 19, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
376b569
Basic support for checkunique kwarg in sorting-related functions
alonsoC1s Apr 7, 2023
b9fc81d
Added tests for the checkunique kwarg
alonsoC1s Apr 8, 2023
c35b338
Fixing Tests + Changing Error to ArgumentError
alonsoC1s Apr 8, 2023
9de0421
Accounting for the case with order(...) clauses
alonsoC1s Apr 9, 2023
01761ee
Improved tests for selectors with checkunique
alonsoC1s Apr 11, 2023
add0ad6
Fixed broken uniqueness testing
alonsoC1s Apr 11, 2023
e310687
Merge branch 'JuliaData:main' into simple_uniqueness_sorting
alonsoC1s Apr 12, 2023
7de6e70
Added missing tests for sort
alonsoC1s Apr 12, 2023
262db54
Fixed Docstrings + Replace == with ===
alonsoC1s Apr 13, 2023
d433cf1
Apply suggestions from code review
alonsoC1s Apr 19, 2023
f7d2ad9
Fixed sorting with order kwarg
alonsoC1s Apr 21, 2023
bdff445
Fixed complex sorting order detection
alonsoC1s Apr 21, 2023
d68dda9
Improved tests for complex order clauses
alonsoC1s Apr 22, 2023
651eab6
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Apr 22, 2023
f49023d
Fixed logic error in is_complex + added tests
alonsoC1s Apr 24, 2023
73a31bb
Apply suggestions from code review
alonsoC1s Apr 28, 2023
5ef1e4d
More robust complexity checks for Order.Orderings + tests
alonsoC1s May 18, 2023
e5bd016
Update src/abstractdataframe/sort.jl
alonsoC1s May 18, 2023
bac221b
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s May 18, 2023
b672135
Added error message for unsupported Order types
alonsoC1s May 18, 2023
8746370
Added complexity checks and tests for other subtypes of Order
alonsoC1s May 21, 2023
1576c07
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 1, 2023
2a71db8
Finalizing checks for multiple Ordering subtypes + tests
alonsoC1s Jun 6, 2023
ce79d07
Update src/abstractdataframe/sort.jl
bkamins Jun 12, 2023
536926a
Update src/abstractdataframe/sort.jl
alonsoC1s Jun 13, 2023
cf59d15
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 13, 2023
d333474
Stylistic changes
alonsoC1s Jun 13, 2023
ac25bb5
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 16, 2023
e002b7b
Update src/abstractdataframe/sort.jl
bkamins Jun 16, 2023
56c228e
Removed explanation from sort docstring
alonsoC1s Jun 16, 2023
929cacd
Apply suggestions from code review
bkamins Jun 18, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 47 additions & 34 deletions src/abstractdataframe/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -341,10 +341,15 @@ If `rev` is `true`, reverse sorting is performed. To enable reverse sorting only
for some columns, pass `order(c, rev=true)` in `cols`, with `c` the
corresponding column index (see example below).

Since having repeated elements makes multiple sorting orders valid, if `checkunique`
is `true` some basic uniqueness checks are made. If duplicate elements are
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
found, an `ArgumentError` will be thrown.

The `by` keyword allows providing a function that will be applied to each
cell before comparison; the `lt` keyword allows providing a custom "less
than" function. If both `by` and `lt` are specified, the `lt` function is
applied to the result of the `by` function.
applied to the result of the `by` function. Neither `by` nor `lt` can be used
if `checkunique` is `true`, including inside `order(...)` clauses.

All the keyword arguments can be either a single value, which is applied to
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
all columns, or a vector of length equal to the number of columns that the
Expand Down Expand Up @@ -407,17 +412,7 @@ function Base.issorted(df::AbstractDataFrame, cols=All();
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
if checkunique
newcols = Int[]

for col in cols
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
checkunique && _perform_uniqueness_checks(df, cols, lt, by)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling this _check_unique?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that name feels a bit similar to allunique in the sense that, without context, it sounds like they do the exact same thing

if cols isa ColumnIndex
return issorted(df[!, cols], lt=to_scalar(lt), by=to_scalar(by),
rev=to_scalar(rev), order=to_scalar(order))
Expand Down Expand Up @@ -603,17 +598,7 @@ function Base.sortperm(df::AbstractDataFrame, cols=All();
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
if checkunique
newcols = Int[]

for col in cols
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
checkunique && _perform_uniqueness_checks(df, cols, lt, by)
return _sortperm(df, _alg, ord)
end

Expand Down Expand Up @@ -720,17 +705,7 @@ function Base.sort!(df::AbstractDataFrame, cols=All();
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
if checkunique
newcols = Int[]

for col in cols
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
checkunique && _perform_uniqueness_checks(df, cols, lt, by)
return sort!(df, _alg, ord)
end

Expand All @@ -744,3 +719,41 @@ function Base.sort!(df::AbstractDataFrame, a::Base.Sort.Algorithm,
end
permute!(df, _sortperm(df, a, o))
end

# Internal function that aids in uniqueness checks
function _perform_uniqueness_checks(df, cols, lt, by)
if !(lt == isless && by == identity)
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
throw(ArgumentError("Passing either lt or by along with checkunique=" *
"true is not supported."))
end
# Easiest case, cols contains numeric indexes already
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
if cols isa AbstractVector{<:ColumnIndex}
by_or_lt_set = false
col_idxs = cols
# Second easiest, multicol index (no vector with orders clauses mixed in)
elseif cols isa MultiColumnIndex && !(cols isa AbstractVector) || cols isa ColumnIndex
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
by_or_lt_set = false
col_idxs = index(df)[cols]
elseif cols isa UserColOrdering
by_or_lt_set = any(haskey(cols.kwargs, key) for key in [:by, :lt])
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
col_idxs = index(df)[(_getcol(cols))]
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
# Multicol indexes mixed in
elseif cols isa AbstractVector
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
newcols = Int[]
by_or_lt_set = false
for col in cols
if col isa UserColOrdering
by_or_lt_set = any(haskey(col.kwargs, key) for key in [:by, :lt])
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
end

alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
push!(newcols, index(df)[(_getcol(col))])
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
end
col_idxs = newcols
end
if by_or_lt_set
throw(ArgumentError("Order clauses with either by or lt set in combination " *
"with checkunique=true are not supported"))
end
!allunique(df, col_idxs) && throw(ArgumentError("Non-unique elements found. " *
"Multiple orders are valid"))
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
end
64 changes: 60 additions & 4 deletions test/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -6,18 +6,17 @@ using DataFrames, Random, Test, CategoricalArrays
dv1 = [9, 1, 8, missing, 3, 3, 7, missing]
dv2 = [9, 1, 8, missing, 3, 3, 7, missing]
dv3 = Vector{Union{Int, Missing}}(1:8)
dv4 = 8:-1:1
cv1 = CategoricalArray(dv1, ordered=true)

d = DataFrame(dv1=dv1, dv2=dv2, dv3=dv3, cv1=cv1)
d = DataFrame(dv1=dv1, dv2=dv2, dv3=dv3, dv4=dv4, cv1=cv1)

@test sort(DataFrame()) == DataFrame()
@test sort!(DataFrame()) == DataFrame()
@test isempty(sortperm(DataFrame()))
@test issorted(DataFrame())
@test sortperm(d) == sortperm(dv1)
@test sortperm(d[:, [:dv3, :dv1]]) == sortperm(dv3)
@test_throws ArgumentError sortperm(d, :cv1, checkunique=true)
@test_throws ArgumentError sortperm(d, [:cv1, :dv1], checkunique=true)
@test sort(d, :dv1)[!, :dv3] == sort(d, "dv1")[!, "dv3"] == sortperm(dv1)
@test sort(d, :dv2)[!, :dv3] == sortperm(dv1)
@test sort(d, :cv1)[!, :dv3] == sortperm(dv1)
Expand All @@ -32,7 +31,6 @@ using DataFrames, Random, Test, CategoricalArrays
@test issorted(sort(df, rev=true), rev=true)
@test issorted(sort(df, [:chrom, :pos])[:, [:chrom, :pos]])
@test issorted(sort(df, ["chrom", "pos"])[:, ["chrom", "pos"]])
@test_throws ArgumentError issorted(sort(df), :rank, checkunique=true)

ds = sort(df, [order(:rank, rev=true), :chrom, :pos])
@test issorted(ds, [order(:rank, rev=true), :chrom, :pos])
Expand Down Expand Up @@ -145,6 +143,64 @@ using DataFrames, Random, Test, CategoricalArrays
end
end

@testset "correctness of checkunique keyword" begin
dv1 = [9, 1, 8, missing, 3, 3, 7, missing]
dv2 = [9, 1, 8, missing, 3, 3, 7, missing]
dv3 = Vector{Union{Int, Missing}}(1:8)
dv4 = 8:-1:1
cv1 = CategoricalArray(dv1, ordered=true)

d = DataFrame(dv1=dv1, dv2=dv2, dv3=dv3, dv4=dv4, cv1=cv1)

## logic:
### Test each every selector in the following order:
### Symbol, String, Vect{ColumnIndex}, Order, Vect{ColIndex, Order}

# issorted
@test_throws ArgumentError issorted(d, :dv1, checkunique=true)
@test issorted(d, :dv3, checkunique=true)
@test issorted(d, "dv3", checkunique=true)
@test issorted(d, ["dv3", "dv4"], checkunique=true)
@test issorted(d, :dv4, rev = true, checkunique=true)
@test issorted(d, order(:dv4, rev=true), checkunique=true)
@test_throws ArgumentError issorted(d, order(:dv4, by=x -> -x), checkunique=true)
@test_throws ArgumentError issorted(d, order(:dv4, lt= >), checkunique=true)
@test issorted(d, [:dv3, order(:dv4, rev=true)], checkunique=true)
@test issorted(d, [:dv3, :dv4], rev = [false, true], checkunique=true)
@test_throws ArgumentError issorted(d, :dv3, by=x-> -x, checkunique=true)
@test_throws ArgumentError issorted(d, :dv3, lt = >, checkunique=true)
@test issorted(d, [order(:dv3, rev=false), order(:dv4, rev=true)], checkunique=true)
@test issorted(d, [order(:dv3, by=identity), order(:dv4, rev=true)], checkunique=true)

# sort
bkamins marked this conversation as resolved.
Show resolved Hide resolved
@test_throws ArgumentError sort(d, :dv1, checkunique=true)
@test_throws ArgumentError sort(d, "dv1", checkunique=true)
@test_throws ArgumentError sort(d, 1, checkunique=true)
@test_throws ArgumentError sort(d, [:dv1, :dv2], checkunique=true)
@test_throws ArgumentError sort(d, ["dv1", "dv2"], checkunique=true)
@test_throws ArgumentError sort(d, order(:dv1, rev=true), checkunique=true)
@test_throws ArgumentError sort(d, order(:dv1, by=x -> -x), checkunique=true)
@test_throws ArgumentError sort(d, order(:dv1, lt= >), checkunique=true)
@test_throws ArgumentError sort(d, [:dv2, order(:dv1, rev=true)], checkunique=true)
@test_throws ArgumentError sort(d, [:dv1, :dv2], rev = [false, false], checkunique=true)
@test_throws ArgumentError sort(d, :dv1, by = x -> -x, checkunique=true)
@test_throws ArgumentError sort(d, :dv1, lt = >, checkunique=true)

# sortperm
@test_throws ArgumentError sortperm(d, :dv1, checkunique=true)
@test_throws ArgumentError sortperm(d, "dv1", checkunique=true)
@test_throws ArgumentError sortperm(d, 1, checkunique=true)
@test_throws ArgumentError sortperm(d, [:dv1, :dv2], checkunique=true)
@test_throws ArgumentError sortperm(d, ["dv1", "dv2"], checkunique=true)
@test_throws ArgumentError sortperm(d, order(:dv1, rev=true), checkunique=true)
@test_throws ArgumentError sortperm(d, order(:dv1, by=x -> -x), checkunique=true)
@test_throws ArgumentError sortperm(d, order(:dv1, lt= >), checkunique=true)
@test_throws ArgumentError sortperm(d, [:dv2, order(:dv1, rev=true)], checkunique=true)
@test_throws ArgumentError sortperm(d, [:dv1, :dv2], rev = [false, false], checkunique=true)
@test_throws ArgumentError sortperm(d, :dv1, by = x -> -x, checkunique=true)
@test_throws ArgumentError sortperm(d, :dv1, lt = >, checkunique=true)
end

@testset "non standard selectors" begin
Random.seed!(1234)
df = DataFrame(rand(1:2, 1000, 4), :auto)
Expand Down