sort! to give warning if resulting sorting order is not fully determined #2159
Comments
Thank you for raising this point. By default, unless you override the sorting algorithm by some of your choice, Also if the order of rows determines the result of some computations it is normal to assume that all columns that are relevant should be included in the sort specification (and if the order is determined by the original row order, then it is exactly because of this that DataFrames.jl uses stable sorting algorithms to preserve this invariant). The kind of check you ask for is not a standard functionality provided I have looked at the discussion in Discourse, but I have not found there any information on why people would expect sort to perform a uniqueness check as part of this function. I am asking because it is not a problem to add such a kwarg to sort, but if the functionality would hardly ever be used in practice, it is cleaner to have two specialized functions that "do one thing right" (one for checking uniqueness and the other for sorting). There could be one argument in favour of adding such a kwarg to So in summary - could you please give some deeper rationale behind this issue? |
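To make the "two specialized functions" idea concrete, here is a minimal, language-neutral sketch in plain Python. It is not DataFrames.jl code: the helper `keys_are_unique` and the sample rows are hypothetical, and Python's built-in `sorted` stands in for a stable sorting algorithm. It shows that a stable sort keeps tied rows in their input order, and that a separate check can report whether the chosen columns determine the order on their own.

```python
from collections import Counter

# Made-up sample rows; "id" has ties, ("id", "value") does not.
rows = [
    {"id": 2, "value": "a"},
    {"id": 1, "value": "b"},
    {"id": 2, "value": "c"},
    {"id": 1, "value": "d"},
]

def keys_are_unique(rows, cols):
    """Return True if no two rows share the same values in `cols` (hypothetical helper)."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return all(n == 1 for n in counts.values())

# sorted() is stable: rows with equal "id" keep their relative input order.
by_id = sorted(rows, key=lambda row: row["id"])
values_in_order = [row["value"] for row in by_id]

# The uniqueness check is a separate step that composes with the sort.
unique_on_id = keys_are_unique(rows, ["id"])
unique_on_both = keys_are_unique(rows, ["id", "value"])
```

Here the order within each `id` is decided by the input order, not by the sort key, which is exactly the situation the check is meant to flag.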
Even if the algorithm is stable, the underlying data may change in unexpected ways, say, a However I broadly agree with Bogumil. In Stata I would generally perform some sort of |
Yes - I agree that stability of an algorithm does not solve all problems, but I just wanted to be explicit about what we provide now. Just to make clear why I question the proposal: in the long term it is much better if all functionality that we have in DataFrames.jl is actively used by the users. The best way to achieve this is to have functions that do one thing and compose them. What you ask for is very easy to achieve in one line of code already, so the question is whether it is worth adding it as a special feature of |
Perhaps I've been misunderstanding the purpose of the DataFrames.jl package. If the purpose is to implement structures of in-memory tabular data and provide efficient low-level access to them, then please feel free to disregard my proposal. If, however, the purpose is to provide functions that the user is expected to use in their daily data wrangling tasks, then I would ask to give the proposal another thought. The problem that I'm describing is not a conceptual one, but a practical one: namely that the user may believe that the column names fully determine the row order, but that in practice they do not, and the user is unaware of that. Of course the check whether the sorting keywords fully determine the row order could always be performed prior or subsequent to the sorting operation, but in a situation such as the one described above it's unlikely that the user will actually perform the check. In my view, high-level functions should help the user to avoid such mistakes. The stability of the algorithm is nice, but would only be a solution to the problem if the user understands what determined the original row order, which is not necessarily the case. |
The proposal is being considered (I am not closing it, just challenging it to make sure we end up with the right decision). So, to rephrase my concerns: can you give an example of another package that:
I do not say that DataFrames.jl must just blindly follow what is provided elsewhere - we of course can introduce new features other people did not think about - however, the above is a kind of test if this functionality is "a common practice" (then the threshold for adoption is lowered) or "a new feature" (then we should think twice before accepting it). Also - can you give an example when the user:
(as I have noted earlier - for me such a situation means that the user has omitted a column in sorting specification, but maybe I am missing something here) The reason I am asking is that we have to weigh your statement:
vs what I would normally expect (but I might be wrong in my thinking here)
The issue is that making the warning/error an opt-in option (i.e. by default nothing is communicated) does not solve the problem you raise (then one can just use So we would have to make it an error/warning by default to have any benefit from this feature. However, people do not normally consider duplicates when sorting to be a warning/error situation (unless we learn - per my question above - that other packages behave this way), but a normal and expected scenario (in particular, this change would then be breaking, as it would affect old code that worked in the past without a warning/error). |
@jmboehm can you just show a Stata package that does this for you? DataFrames is still playing "catch up" with existing data libraries so we want to be extra confident before we add features that other packages don't have. |
Thanks to both of you for your thoughts on this. I've looked at several packages and couldn't find implementations of sorting (either by itself or as part of a split-apply-combine) that allowed for simultaneous checking whether the sorting keys uniquely determine the row order. It may well be just me who thinks that this could be useful.
Here is an example that I was facing a few days ago. I have GPS tracker data from a panel of trucks. Each observation consists of a date-time (day/hour/minute/second), a truck id, and a location. These observations are recorded when certain events take place (speeding, turning, time elapsed since the last observation, etc). I want to compute the distance travelled since the last observation. Now I may think that the truck id and date-time uniquely determine the observation, and run something like (in Stata pseudo-code)
Unfortunately, the same truck may have more than one observation within the same second, so You are, of course, right that I have omitted an important variable in the sorting here, namely the row number of the original dataset. But without being aware of the multiplicity of observations within I understand that there's a tradeoff here, and I leave it up to you to decide which way to go. No offense if you think the cost is higher than the benefit. |
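The truck example above can be reproduced with a small, self-contained sketch. This is plain Python rather than Stata or DataFrames.jl, the data values are made up, and the helpers `duplicated_keys` and `step_distances` are hypothetical names introduced only for illustration. Positions are simplified to a single odometer-style number. The sketch shows that when two observations share a truck id and timestamp, the computed "distance since the last observation" depends on the order in which the tied rows happened to arrive.

```python
from collections import defaultdict

# (truck_id, timestamp_in_seconds, position_km); all values are made up.
observations = [
    ("T1", 100, 0.0),
    ("T1", 105, 0.5),
    ("T1", 105, 0.9),  # second event recorded within the same second
    ("T1", 110, 1.2),
    ("T2", 100, 3.0),
]

def duplicated_keys(rows):
    """Return the (truck_id, timestamp) pairs that occur more than once."""
    counts = defaultdict(int)
    for truck_id, timestamp, _ in rows:
        counts[(truck_id, timestamp)] += 1
    return sorted(key for key, n in counts.items() if n > 1)

def step_distances(rows):
    """Sort by (truck_id, timestamp); return distance to the previous row of the same truck."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # stable sort
    previous = {}
    result = []
    for truck_id, _, position in ordered:
        last = previous.get(truck_id)
        result.append(None if last is None else round(abs(position - last), 3))
        previous[truck_id] = position
    return result

dupes = duplicated_keys(observations)
original = step_distances(observations)

# Same data, but the two tied rows arrive in the opposite order.
swapped_input = [observations[0], observations[2], observations[1]] + observations[3:]
swapped = step_distances(swapped_input)
```

Both runs sort on the same keys without any error, yet the per-row distances differ. Running a check like `duplicated_keys` before the sort is what would surface the problem.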
Thank you for a detailed investigation. This is much appreciated.
An interesting point is that last week I had an identical situation (also GPS data and also with second granularity timestamp) and everyone was surprised that we have duplicates if we do not include an additional tracking column. So in summary - for sure I do not want to close this issue, as it is a valid one, given the discussion. |
@nalimilan - there is no rush with this issue, but can you please have a look at it at some moment and give us your opinion. Thank you. |
I'm hesitant. AFAICT the main interest of adding such an argument to Given that adding an argument isn't costly, why not. |
@jmboehm - so the recommendation is:
Please feel free to go ahead and make a PR adding this feature if you have time to spend on it. Typically a PR should include the following elements:
Thank you! (if you do not feel like making this PR it is queued as "nice to have but not urgent" and I will implement it at some time in the future) |
@bkamins I think I might be able to submit a PR for this. I only have one question: The warning would be given by |
@alonsoC1s - thank you for looking into this issue. Before proceeding could you please comment what solution you propose to implement? The reason why |
Of course Then, adding tests, updating documentation, etc... |
These are my considerations I think we need to settle in the design discussion before going for implementation. |
|
Thank you for looking into it. (the reason why this issue is open for so long is exactly that it is not clear how to make it useful and at the same time ensure that it is fully correct in all cases) |
Regarding point 4, the only way I can think of right now to check for repeated elements while taking into account the effects of the If so, I can get to work on some initial proposal. Or should I consider anything else before writing any code? |
This is the issue. I am OK that you start working on a proposal given that you accept that the proposal might be rejected. In general I see three options:
|
Sounds like the staged approach is the most sensible. The basic support for the keyword can be added without much fanfare, and would allow us to evaluate other potential points of friction. Once that initial PR is incorporated, tested, etc., I can happily do some tests on handling duplicates resulting from the use of |
As you know I'm not a fan of warnings as either there's a problem and we should throw an error so that it cannot go unnoticed, or there isn't and we shouldn't annoy users with messages. Do we really have a use case for warnings? Otherwise I'd rather just have a Boolean argument which would enable throwing an error. |
I know, and that is why I am hesitant about this functionality. However, if someone explicitly opts in for The question to users (@alonsoC1s @jmboehm) is whether they indeed envision using this functionality (the thing I want to avoid - as stated above - is adding functionality that in the end would not be used). |
I agree with the argument that |
As a user I can't remember trying to use Regarding the use of warnings, I think it would be perfectly acceptable to simply error out whenever the kwarg is set to |
OK - so I understand the conclusion is that we start with |
I think we're all on the same page. If so, I can start with the basic implementation |
I just opened PR #3312. I already have some ideas as to how the enhanced uniqueness check could be implemented. Should we discuss it in this thread or open a second issue? |
we can discuss it here |
I think an elegant solution would be to extend
That way |
|
applying the
|
Have you checked this? I looked at your code and |
You're right, I hadn't considered that. Thank you for bringing it up |
While working on the fix I noticed something. |
There is a much wider class of disallowed sorting column specifications. The error message can be improved, but I would do it in a separate PR. |
Using the current changes in the proposed PR as a baseline, I think the implementation of uniqueness checks taking into account the effects of |
Yes. This would be acceptable I think. As I have commented - my only hesitation is about adding functionality that no one would use anyway. Do you think you would ever consider using it? |
For me personally, the need for something like this hasn't come up in my workflows. This might be useful in some cases like the ones discussed above, but I feel like those are not very common. Concern for duplicates and the need for more complex sorting functionality sounds very niche. That being said, I couldn't speak for most users; perhaps it would be useful and people aren't using it more because most other major data-wrangling packages don't provide the option either. Maybe asking on Discourse would help gauge the usefulness beyond our own experiences. |
Fixed in #3312 |
I would suggest that sort and sort! give warnings or an error if the resulting sorting order is not uniquely determined. I think the current behavior may be a source of bugs if the user sorts the DataFrame in a not fully determined way and then proceeds to do operations that depend on the order of the rows. See also the discussion here.
Main disadvantage: it's costly to check whether the requested sorting operation fully determines the row order.
I would propose to do this via a keyword argument, for instance require_unique, which could take values true (in which case sort/sort! throws an error if the row order is not fully determined), :warn (which triggers a warning), or false, which disables checking altogether. (I'm not entirely happy with this being a Union{Bool, Symbol}, perhaps there is a nicer solution.)
I'm not sure what the default should be: either :warn or false. I'm leaning towards :warn, but if you think that the cost is not worth it, I'd be okay with false.