Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort! to give warning if resulting sorting order is not fully determined #2159

Closed
jmboehm opened this issue Mar 21, 2020 · 39 comments
Closed
Labels
feature help wanted This issue is up for grabs if someone is interested in working on it non-breaking The proposed change is not breaking
Milestone

Comments

@jmboehm
Copy link

jmboehm commented Mar 21, 2020

I would suggest that sort and sort! give warnings or an error if the resulting sorting order is not uniquely determined. I think the current behavior may be a source of bugs if the user sort the DataFrame in a not fully determined way and then proceeds to do operations that depend on the order of the rows. See also the discussion here.

Main disadvantage: it's costly to check whether the requested sorting operation fully determines the row order.

I would propose to do this via a keyword argument, for instance require_unique, which could take values true (in which case sort/sort! throws an error in the row order is not fully determined), :warn (which triggers a warning), or false, which disables checking altogether.

(I'm not entirely happy with this being a Union{Bool, Symbol}, perhaps there is a nicer solution.)

I'm not sure what the default should be: either :warn or false. I'm leaning towards :warn, but if you think that the cost is not worth it, I'd be okay with false.

@bkamins bkamins added decision non-breaking The proposed change is not breaking labels Mar 21, 2020
@bkamins
Copy link
Member

bkamins commented Mar 21, 2020

Thank you for raising this point.

By default, unless you override the sorting algorithm by some of your choice, sort in DataFrames.jl uses a stable sorting algorithm, so the order after sorting is determined "as if" the last sorting column were row index for each row. So essentially this is equivalent to stable option of sort in Stata.

Also if the order of rows determines the result of some computations it is normal to assume that all columns that are relevant should be included in the sort specification (and if the order is determined by the original row order then exactly because of this DataFrames.jl uses stable sorting algorithms to preserve this invariant).

The kind of check you ask for is not a standard functionality provided sort that I am aware of in any of the software packages. In DataFrames.jl normally you check for what you have asked for with any(nonunique(df, cols)), where cols are columns you sort on, which is easy enough to write I think before sorting if you need to check it.

I have looked at the discussion in Discourse, but I have not found there information why would people expect sort to perform uniqueness check as a part of this function. I am asking, because it is not a problem to add such a kwarg to sort, but if the functionality would be hardly ever used in practice it is cleaner to have two specialized functions that "do one thing right" (one for checking uniqueness and the other for sorting).

There could be one argument in favour of adding such a kwarg to sort/sort! that such a check can be performed faster in this function in comparison to the cost of any(nonunique(df, cols)). However, the time difference will be anyway dwarfed by the cost of sort. Also this benefit would be present only if we do not error. If you want to error then you have to perform this check before sorting in sort! (e.g. we cannot error half-way of sorting or when we are done sorting).

So in summary - could you please give some deeper rationale behind this issue?

@pdeffebach
Copy link
Contributor

By default, unless you override the sorting algorithm by some of your choice, sort in DataFrames.jl uses a stable sorting algorithm, so the order after sorting is determined "as if" the last sorting column were row index for each row.

Even if the algorithm is stable, the underlying data may change in unexpected ways, say, a sort is performed somewhere before this is called or new data is added.

However I broadly agree with Bogumil. In Stata I would generally perform some sort of assert that _N == 1.

@bkamins
Copy link
Member

bkamins commented Mar 22, 2020

Yes - I agree that stability of an algorithm does not solve all problems, but I just wanted to be explicit what we provide now.

Just to make clear why I question the proposal: in the long term it is much better if all functionalities that we have in DataFrames.jl are actively used by the users. The best way to achieve this is to have functions that do one thing and compose them. What you ask for is very easy to achieve in one line of code already so the question is if it is worth to add it as a special feature of sort/sort!.

@jmboehm
Copy link
Author

jmboehm commented Mar 23, 2020

Perhaps I've been misunderstanding the purpose of the DataFrames.jl package. If the purpose is to implement structures of in-memory tabular data and provide efficient low-level access to them, then please feel free to disregard my proposal. If, however, the purpose is to provide functions that the user is expected to use in their daily data wrangling tasks, then I would ask to give the proposal another thought.

The problem that I'm describing is not a conceptual one, but a practical one: namely that the user may believe that the column names fully determines the row order, but that in practice it does not, and the user is unaware of that. Of course the check whether the sorting keywords fully determine the row order could always be performed prior or subsequent to the sorting operation, but in a situation such as the one described above it's unlikely that the user will actually perform the check. In my view, high-level functions should help the user to avoid such mistakes.

The stability of the algorithm is nice, but would only be a solution to the problem understands what determined the original row order-- which is not necessarily the case.

@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Mar 23, 2020
@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

I would ask to give the proposal another thought.

The proposal is considered (I am not closing it but just challenge it to make sure we end up with the right decision).

Therefore to rephrase my concerns. Can you give an example of another package that:

  1. (weaker condition) allows doing this check;
  2. (stronger condition) does this check by default.

I do not say that DataFrames.jl must just blindly follow what is provided elsewhere - we of course can introduce new features other people did not think about - however, the above is a kind of test if this functionality is "a common practice" (then the threshold for adoption is lowered) or "a new feature" (then we should think twice before accepting it).

Also - can you give an example when the user:

  1. reasonably might sort on several columns that do not determine the order uniquely
  2. then perform a computation that relies on some "orderedness" of the rows (other than stable sorting order from the source)

(as I have noted earlier - for me such a situation means that the user has omitted a column in sorting specification, but maybe I am missing something here)

The reason I am asking is that we have to weigh your statement:

namely that the user may believe that the column names fully determines the row order, but that in practice it does not, and the user is unaware of that

vs what I would normally expect (but I might be wrong in my thinking here)

doing a sort that does not uniquely specify the order of elements is a common operation in every programming language and it is performed without a warning and this is something to be normally expected.

The issue is that making a warning/error as an opt-in option (i.e. by default nothing is communicated) does not solve the problem you raise (then one can just use nonunique as I have written).

So we would have to make it an error/warning by default to have any benefit from this feature. However, normally people do not consider duplicates to be a warning/error situation (unless we learn - per my question above - that other packages behave this way) when sorting but a normal and expected scenario (in particular this change then would be breaking as it would affect old codes that were working in the past without a warning/error).

@pdeffebach
Copy link
Contributor

@jmboehm can you just show a Stata package that does this for you? DataFrames is still playing "catch up" with existing data libraries so we want to be extra confident before we add features that other packages don't have.

@jmboehm
Copy link
Author

jmboehm commented Mar 25, 2020

Thanks to both of you for your thoughts on this.

I've looked at several packages and couldn't find implementations of sorting (either by itself or as part of a split-apply-combine) that allowed for simultaneous checking whether the sorting keys uniquely determine the row order. It may well be just me who thinks that this could be useful.

Also - can you give an example when the user:

  1. reasonably might sort on several columns that do not determine the order uniquely
  2. then perform a computation that relies on some "orderedness" of the rows (other than stable > sorting order from the source)

Here is an example that I was facing a few days ago. I have GPS tracker data from a panel of trucks. Each observation consists of a date-time (day/hour/minute/second), a truck id, and a location. These observations are recorded when certain events take place (speeding, turning, time elapsed since the last observation, etc). I want to compute the distance travelled since the last observation.

Now I may think that the truck id and date-time uniquely determine the observation, and run something like (in Stata pseudo-code)

bysort truckid (datetime): gen distance_travelled = distance(location[_n],location[_n-1])

Unfortunately, the same truck may have more than one observation within the same second, so truckid and datetime are not uniquely identifying observations-- instead I should use the original ordering of the dataset, which happens to be chronological within truckid.

You are, of course, right that I have omitted an important variable in the sorting here, namely the row number of the original dataset. But without being aware of the multiplicity of observations within truckid and datetime, you won't know. Now, experienced researchers are aware of such issues, but you see such problems often in code written by research assistants.

I understand that there's a tradeoff here, and I leave it up to you to decide which way to go. No offense if you think the cost is higher than the benefit.

@bkamins
Copy link
Member

bkamins commented Mar 25, 2020

Thank you for a detailed investigation. This is much appreciated.

Unfortunately, the same truck may have more than one observation within the same second, so truckid and datetime are not uniquely identifying observations

An interesting point is that last week I had an identical situation (also GPS data and also with second granularity timestamp) and everyone was surprised that we have duplicates if we do not include an additional tracking column.

So in summary - for sure I do not want to close this issue, as it is a valid one, given the discussion.
However, as there are pros and cons of it I would keep it open for some time to wait for opinions on the preferred behavior.

@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

@nalimilan - there is no rush with this issue, but can you please have a look at it at some moment and give us your opinion. Thank you.

@nalimilan
Copy link
Member

I'm hesitant. AFAICT the main interest of adding such an argument to sort! is to make the check more convenient, in particular you wouldn't have to repeat the names of the columns. So you would write sort!(df, cols, checkunique=true) instead of sort!(df, cols); @assert !any(nonunique(df[!, cols])).

Given that adding an argument isn't costly, why not.

@bkamins bkamins added non-breaking The proposed change is not breaking and removed breaking The proposed change is breaking. labels Apr 6, 2020
@bkamins bkamins added this to the 1.x milestone Apr 6, 2020
@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

@jmboehm - so the recommendation is:

  1. We can add checkunique::Bool kwarg do sort and sort!
  2. Its default value should be false (to be non breaking), but it will be in the documentation so at least users will be warned about this possibility

Please feel free to go ahead and make a PR adding this feature if you have time to spend on it. Typically PR should include the following elements:

  1. changes to the functionality
  2. tests of the new functionality (including "normal" and "hard" cases)
  3. update of the docstrings
  4. update of the manual

Thank you!

(if you do not feel like making this PR it is queued as "nice to have but not urgent" and I will implement it at some time in the future)

@bkamins bkamins added feature help wanted This issue is up for grabs if someone is interested in working on it and removed decision labels Apr 6, 2020
@alonsoC1s
Copy link
Contributor

@bkamins I think I might be able to submit a PR for this. I only have one question: The warning would be given by @warn which requires Logging to be in scope, or is there a prefered approach?

@bkamins
Copy link
Member

bkamins commented Apr 4, 2023

@alonsoC1s - thank you for looking into this issue. Before proceeding could you please comment what solution you propose to implement?

The reason why Logging is not in scope is that in DataFrames.jl we never print warnings normally. This would be the first functionality where we would opt-in for such a behavior (see @nalimilan comment above). That is why it is important to first understand what you propose to implement.

@alonsoC1s
Copy link
Contributor

Of course
Then plan is to add the keyword argument checkunique::Bool to both sort and sort! as implemented in src/sort.jl. The default value of the argument would be false. The plan is to warn (either by using Logging or some other method) whenever there are duplicate values in the columns being sorted by evaluating any(nonunique(df[!, cols])).

Then, adding tests, updating documentation, etc...

@bkamins
Copy link
Member

bkamins commented Apr 4, 2023

  1. You need to be careful when determining cols in sorting as the API allowing for specifying sorting order is rich.
  2. You should cover sort, sort!, sortperm, and issorted for completeness.
  3. I understand that the consensus were that we need three values for the keyword argument:
    • do not run uniqueness check;
    • run uniqueness check and error in case of duplicates;
    • run uniqueness check and warn in case of duplicates;
  4. There is a allunique function, so you can run allunique(df, cols) to perform the check. However, in general it is not enough as uniqeness depends on the by and lt arguments in sorting. Consider:
julia> df = DataFrame(x=1.3:0.1:2.2)
10×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │     1.3
   2 │     1.4
   3 │     1.5
   4 │     1.6
   5 │     1.7
   6 │     1.8
   7 │     1.9
   8 │     2.0
   9 │     2.1
  10 │     2.2

julia> sort(shuffle(df), by=round)
10×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │     1.3
   2 │     1.4
   3 │     1.9
   4 │     2.0
   5 │     1.5
   6 │     1.7
   7 │     2.1
   8 │     2.2
   9 │     1.8
  10 │     1.6

julia> sort(shuffle(df), by=round)
10×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │     1.3
   2 │     1.4
   3 │     2.0
   4 │     1.7
   5 │     1.6
   6 │     1.9
   7 │     1.8
   8 │     1.5
   9 │     2.1
  10 │     2.2

These are my considerations I think we need to settle in the design discussion before going for implementation.

@alonsoC1s
Copy link
Contributor

  1. Understood. Then perhaps column checking should be done at the lowest level of called functions (_sortperm if I understand correctly)
  2. Totally agree
  3. We could use a symbol as the keyword argument instead of a Bool to allow for the other possibilities
  4. I'm still thinking of a way to avoid the pitfalls you mentioned

@bkamins
Copy link
Member

bkamins commented Apr 4, 2023

Thank you for looking into it. (the reason why this issue is open for so long is exactly that it is not clear how to make it useful and at the same time ensure that it is fully correct in all cases)

@alonsoC1s
Copy link
Contributor

Regarding point 4, the only way I can think of right now to check for repeated elements while taking into account the effects of the by kwarg is to apply the function specified in by, and check uniqueness there. Unfortunately, as a side effect, we would have to make a copy of the columns, which can be expensive. Is this acceptable, assuming the copy only happens whenever we're asked to check for uniqueness?

If so, I can get to work on some initial proposal. Or should I consider anything else before writing any code?

@bkamins
Copy link
Member

bkamins commented Apr 4, 2023

If so, I can get to work on some initial proposal.

This is the issue. I am OK that you start working on a proposal given that you accept that the proposal might be rejected.
I usually prefer to settle the design before writing code so that I am sure that we implement things that we want to have.

In general I see three options:

  1. easy: support the kwarg only for "simple" sorts, i.e. not doing any transformations (then you would throw an error if e.g. user passed by and wanted duplicate checking)
  2. complex: correctly handle duplicates that would be caused by by and lt (warning: this might turn out to be quite complex)
  3. staged: start with option 1. and later, in the next PR work on option 2. (i.e. get the functionality quick and later think of improving it)

@alonsoC1s
Copy link
Contributor

Sounds like the staged approach is the most sensible. The basic support for the keyword can be added without much fanfarre, and would allow us to evaluate other potential points of friction. As that initial PR is incorporated, tested, etc... I can happily do some tests on handling duplicates resulting from the use of by and lt, and comment on those

@nalimilan
Copy link
Member

As you know I'm not a fan of warnings as either there's a problem and we should throw an error so that it cannot go unnoticed, or there isn't and we shouldn't annoy users with messages. Do we really have a use case for warnings? Otherwise I'd rather just have a Boolean argument which would enable throwing an error.

@bkamins
Copy link
Member

bkamins commented Apr 6, 2023

Do we really have a use case for warnings?

I know and that is why I am hesitant with this functionality. However, if someone explicitly opts-in for checkunique=:warn keyword argument then probably this is exactly what user wants.

The question to users (@alonsoC1s @jmboehm ) is indeed they envision that they would use this functionality (the thing I want to avoid - as stated above - is adding functionality that in the end would not be used).

@jmboehm
Copy link
Author

jmboehm commented Apr 6, 2023

I agree with the argument that @warn may not be noticed when not run "interactively", and that throwing an error might be therefore preferable.

@alonsoC1s
Copy link
Contributor

As a user I can't remember trying to use sort in a way which duplicate elements caused some unintuitive error. But I agree with the discussion that happened before I joined the thread, if it's opt-in and there is no extra cost to be incurred by default, it is nice to have

Regarding the use of warnings, I think it would be perfectly acceptable to plain error-out whenever the kwarg is set to true (or some symbol for that matter), since in this case the existence of duplicates is a problem big enough that the user had to do some extra reading on ways to handle them. If the user then decides it's not that big of a deal, they can handle the error as they choose on their end

@bkamins
Copy link
Member

bkamins commented Apr 6, 2023

OK - so I understand the conclusion is that we start with Bool argument which defaults to false, and if set to true throws an error.

@alonsoC1s
Copy link
Contributor

I think we're all on the same page. If so, I can start with the basic implementation

@alonsoC1s
Copy link
Contributor

I just opened PR #3312. I already have some ideas as to how the enhanced uniqueness check could be implemented. Should we discuss it in this thread or open a second issue?

@bkamins
Copy link
Member

bkamins commented Apr 8, 2023

we can discuss it here

@alonsoC1s
Copy link
Contributor

I think an elegant solution would be to extend allunique. The signature of this new method would be

allunique(df::AbstractDataFrame,
	  cols=:,
	  lt::Union{Function, AbstractVector{<:Function}}=isless,
	  by::Union{Function, AbstractVector{<:Function}}=identity,
	  rev::Union{Bool, AbstractVector{Bool}}=false,
	  order::Union{Ordering, AbstractVector{<:Ordering}}=Forward
)

That way sort.jl would stay mostly as it is in PR #3312 (specifically, as of commit b9fc81d). I think that approach does have some challenges though. Namely, how to repurpose row_group_slots! (I understand hashing is utilized to check for duplicates in a very efficient way) to work with the possible effects of by withouth having to store a copy of the dataframe. Please share your thoughts on this, I realize this might be a terribly misguided approach

@bkamins
Copy link
Member

bkamins commented Apr 8, 2023

  1. Changing allunique will be hard. row_group_slots! is complex and should not be touched in general.
  2. Also, as I have commented above, you would have to handle cases like cols being [order(:x, by = x->-x), order(:y, rev=true)] which are even more complex

@alonsoC1s
Copy link
Contributor

  1. I was hoping to get away without modifying row_group_slots!, I agree we shouldn't touch it. Maybe changing this line:
    return row_group_slots!(ntuple(i -> udf[!, i], ncol(udf)), Val(false), nothing, false, nothing, true)[1] == nrow(df)

applying the by function efficiently. I have no idea of this can be done in a smart way though

  1. I might be mistaken, but by the time allunique is called we are already dealing with column indices, right? That might make it a bit easier

@bkamins
Copy link
Member

bkamins commented Apr 8, 2023

I might be mistaken, but by the time allunique is called we are already dealing with column indices, right?

Have you checked this? I looked at your code and allunique(df, cols) is called eg in line https://github.com/JuliaData/DataFrames.jl/pull/3312/files#diff-ff50299f4c5fb3277058f514863b25153f68ae16ca3c63bd8b225e11c46de838R410 where it is not guaranteed that cols is a plain columns selector.

@alonsoC1s
Copy link
Contributor

You're right, I hadn't considered that. Thank you for bringing it up

@alonsoC1s
Copy link
Contributor

While working on the fix I noticed something. issorted(df, [Not(:x), order(:y, rev=true)]) throws a MethodError. Maybe it would be best to catch those cases and let the user know that is specificaly not allowed?

@bkamins
Copy link
Member

bkamins commented Apr 10, 2023

[Not(:x), order(:y, rev=true)]

There is a much wider class of disallowed sorting column specifications. Error message can be improved, but I would do it in a separate PR.

@alonsoC1s
Copy link
Contributor

Using the current changes in the proposed PR as a baseline, I think the implementation of uniqueness checks taking into account the effects of by and lt could be done by transforming the dataframe in the _perform_uniqueness_checks function. Naturally this means we would have to make a copy of ths subdataframe, which is not ideal. But, since this is an opt-in check, I think the added computational cost is justifiable

@bkamins
Copy link
Member

bkamins commented Apr 14, 2023

Yes. This would be acceptable I think.

As I have commented - my only hesitation is for adding functionalities that no one would use anyway. Do you think you would ever consider to use it?

@alonsoC1s
Copy link
Contributor

For me personally, the need for something like this hasn't come up in my workflows. This might be useful in some cases like the ones discussed above, but I feel like those are not very common. Concern for duplicates and the need for more complex sorting functionality sounds very very niche. That being said, I couldn't speak for most users, perhaps it would be useful and people aren't using it more as most other major data-wrangling packages don't provide the option either

Maybe asking on Discourse would help gauge the usefulness beyond our own experiences

@bkamins
Copy link
Member

bkamins commented Jun 19, 2023

Fixed in #3312

@bkamins bkamins closed this as completed Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature help wanted This issue is up for grabs if someone is interested in working on it non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

5 participants