Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple uniqueness checks for sorting-related functions #3312

Merged
merged 31 commits into from
Jun 19, 2023

Conversation

alonsoC1s
Copy link
Contributor

As discussed in #2159, this PR implements basic uniqueness checks for the functions issorted, sortperm, sort and sort! via a boolean kwarg set to false by default. The uniqueness checks do not yet take into the account the possible effects of the by and lt keyword args.

This PR also adds tests for the new functionality, modifies the relevant docstrings, and adds a short description of the new functionality to NEWS.md

NEWS.md Outdated Show resolved Hide resolved
test/sort.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Apr 8, 2023

Thank you for working on it.

The PR should be implemented in a way that if non standard by or lt is passed (either as kwarg or in order) and checkunique=true then error should be thrown.

@alonsoC1s
Copy link
Contributor Author

Thanks for the comments, I'll get to work on them :)

@alonsoC1s
Copy link
Contributor Author

... implemented in a way that if non standard by or lt is passed (either as kwarg or in order) and checkunique=true then error should be thrown.

I'm sorry, I don't quite get what you mean by that. What would be an example of non-standard by or lt?

@bkamins
Copy link
Member

bkamins commented Apr 8, 2023

What would be an example of non-standard by or lt?

as commented in the issue e.g. [order(:x, by = x->-x), order(:y, rev=true)]

@alonsoC1s
Copy link
Contributor Author

alonsoC1s commented Apr 9, 2023

I've addressed the case where order(...) clauses are part of cols. Is this enought to cover the case of non-standard by and lt?

I'm aware much better tests are needed, I'm waiting to implement them after the discussion about the other thing

@bkamins
Copy link
Member

bkamins commented Apr 10, 2023

I'm aware much better tests are needed,

In general, as you probably noticed, we care a lot about test coverage. The issues like the ones I raised in my comments would be caught by proper tests.
We do not insist on applying test driven development process, but in general the end outcome should be thoroughly tested.

Thank you!

@bkamins
Copy link
Member

bkamins commented Apr 12, 2023

also can you please sync the PR with main branch (I can sync it if you prefer).

@bkamins bkamins added this to the 1.6 milestone Apr 13, 2023
@alonsoC1s
Copy link
Contributor Author

I've just added the checks for the subtypes we had discussed + a fallback for other subtypes of Order that might appear. Currently, the only one not covered is Perm

@alonsoC1s
Copy link
Contributor Author

Also, I wanted to comment on something I came across while reading for the PR. I think we should mention that the detection of complex order clauses will not realize that for instance, x -> x is in essence the same as identity, given that the === comparison cannot pick on those kinds of differences. Would it be worth it to mention this on the documentation, or does it go too deep into implementation details that might be superfluous to the user?

@bkamins
Copy link
Member

bkamins commented May 25, 2023

Would it be worth it to mention this on the documentation, or does it go too deep into implementation details that might be superfluous to the user?

I think we can skip this. Users that would understand the difference will anyway assume that this will work this way.

@alonsoC1s
Copy link
Contributor Author

I think all thats left is to get the coverage back up, I did miss some lines in the coverage report. Other than that, is there something else you would like addressed?

@bkamins
Copy link
Member

bkamins commented May 30, 2023

Looks good. Just please update the tests and push them to GitHub and I will do a final review.

@alonsoC1s
Copy link
Contributor Author

Status update: While testing I noticed something that could be problematic and I'm still working on it

Namely: Whenever sort!(df, Alg, Order) is called directly by the user (which correct me if I'm wrong, but isn't documented in the main page so it's probably uncommon) we run into the same issues about not having access to which cols are to be sorted and checked for uniqueness. For instance, using it like

d = DataFrame(x = fill(1, 4))
ord = Perm(ForwardOrdering(), [1, 2, 3])
alg = Sort.defalg(d, or; alg=nothing, cols=:x)

leaves no opportunity for the uniqueness checks inside the function:

sort!(d, alg, ord)

I thought it might be uncommon for the end user to create a Perm object manually, but this is what happens behind the scenes when calling sort!(df, cols, alg, lt, by, ...) on a single column of df. This should not be a problem in theory given that by that point the other sort! function is called with checkunique=false, but I hesitate to leave it like that and give an error like "uniqueness check not supported when using Perm" because I don't know if I'm missing a situation where this can happen after calling sort!(df, alg, ord) manually with some other hand-made order and not seem obvious why the error message mentions Perm. I don't know if I'm conveying properly my thoughts on this

@bkamins
Copy link
Member

bkamins commented Jun 5, 2023

Can we make sort!(df, Alg, Order) not accept the uniqueness check? (i.e. make all uniqueness checks in other functions calling it, and disallow passing the kwarg for uniqueness check to this function - and document it?)

@alonsoC1s
Copy link
Contributor Author

Yeah, thats a possibility. It would simplify the current changes in fact
Where would we document this? I was checking and I think the sort!(df, Alg, Order) method is not documented (at all), so perhaps this it not an issue and most users don't call that particular method, only sort!(df, cols, lt, ...)?

@bkamins
Copy link
Member

bkamins commented Jun 6, 2023

Where would we document this?

Probalby adding a comment above the function definition is enough. In the docstring we do not describe this method.

@alonsoC1s
Copy link
Contributor Author

@bkamins I reverted the uniqueness checks for sort!(df, Alg, Order) as planned, added the capability of checking complexity of Perm objects, and covered all added lines (I think). Let me know if there is anything in my code that can be improved

@bkamins
Copy link
Member

bkamins commented Jun 12, 2023

Looks good. Thank you! Let us wait for a few days to see if there are any more comments on this PR. If not I will merge it.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few stylistics comments.

Comment on lines 351 to 352
`lt` keywords are being used, as their application can create duplicate items
inadvertently. Similarly, the use of `order(...)` clauses that specify either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the reasoning behind "as their application can create duplicate items inadvertently". Doesn't that make that argument precisely useful to protect against inadvertent introduction of duplicate elements? We don't have to support this right away, but no need to provide misleading reasons for that. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments!
My goal was to explain that the disallowed combination is not unsupported because we ran out of time, but rather than the process of sorting after the use of the by and lt functions is a bit unintuitive, as the actual sorting happens on numbers that the user never sees. This would make it harder(er) to reason about why a particular dataframe and an exact copy look different even after being sorted the same way (which I think isn't supposed to happen because sorting is stable, but was something brought up on the issue that motivates this PR)

In other words, the message tries to say "we assume you are concerned about duplicates, and it isn't always obvious why the use of by and lt can create "invisible duplicates" at the time of sorting, and that is why we still don't support it at the moment"

I don't know if I convey that properly in the current docstring, I'd love to hear your take

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just propose to remove the justification part. It is not strictly needed in the docstring (it needs to describe precisely what is being supported). It is just not supported now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. I'll remove them and rework the docstrings a bit

src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
test/sort.jl Outdated Show resolved Hide resolved
@bkamins bkamins merged commit a3faa1e into JuliaData:main Jun 19, 2023
@bkamins
Copy link
Member

bkamins commented Jun 19, 2023

Thank you! I hope you enjoyed it!

@alonsoC1s alonsoC1s deleted the simple_uniqueness_sorting branch June 20, 2023 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants