Fix overflows in quantile #145

Merged: nalimilan merged 3 commits into master from nl/quantile on Jul 29, 2023

nalimilan
Member

The `a + γ*(b-a)` formula introduced by JuliaLang/julia#16572 has the advantage that it increases with `γ` even when `a` and `b` are very close, but it is not robust to overflow, because the difference `b-a` can fall outside the range of the element type. This is likely to happen in practice with small integer and floating-point types.

Conversely, the `(1-γ)*a + γ*b` formula, which is currently used only for non-finite quantities, is robust to overflow but may not always increase with `γ`: when `a` and `b` are very close or (more frequently) equal, precision loss can give a slightly smaller value for a larger `γ`. This can be problematic as it breaks an expected invariant.

So keep using the `a + γ*(b-a)` formula when `a ≈ b`, in which case it's almost like returning either `a` or `b` but less arbitrary, and use `(1-γ)*a + γ*b` otherwise.

Fixes #144.
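
As a rough illustration of the trade-off described above, the sketch below writes both formulas and the selection rule as standalone functions. The names `interp_diff`, `interp_weighted`, and `interp_hybrid` are hypothetical and chosen for this example; this is not the actual code in `src/Statistics.jl`, and the exact condition used there may differ.

```julia
# Hypothetical helpers for illustration; not the actual Statistics.jl source.

# a + γ*(b-a): non-decreasing in γ, but b - a can overflow.
interp_diff(a, b, γ) = a + γ * (b - a)

# (1-γ)*a + γ*b: robust to overflow, but rounding can break monotonicity in γ.
interp_weighted(a, b, γ) = (1 - γ) * a + γ * b

# Assumed selection rule: use the difference form only when a ≈ b.
function interp_hybrid(a, b, γ)
    if isfinite(a) && isfinite(b) && a ≈ b
        return interp_diff(a, b, γ)
    else
        return interp_weighted(a, b, γ)
    end
end

# Float16: b - a exceeds floatmax(Float16), so the difference form returns Inf16.
lo16, hi16 = Float16(-60000), Float16(60000)
@show interp_diff(lo16, hi16, Float16(0.5))
@show interp_hybrid(lo16, hi16, Float16(0.5))

# Int8: b - a wraps around, so the difference form returns a wrong value.
lo8, hi8 = Int8(-100), Int8(100)
@show interp_diff(lo8, hi8, 0.5)
@show interp_hybrid(lo8, hi8, 0.5)
```

In both cases the midpoint should be zero, which is what the hybrid rule returns because the endpoints are far apart and the weighted form is selected.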

@codecov-commenter
codecov-commenter commented Jul 1, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01% 🎉

Comparison is base (bb7063d) 96.98% compared to head (000d4c1) 96.99%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #145      +/-   ##
==========================================
+ Coverage   96.98%   96.99%   +0.01%     
==========================================
  Files           1        1              
  Lines         431      433       +2     
==========================================
+ Hits          418      420       +2     
  Misses         13       13              
Files Changed        Coverage Δ
src/Statistics.jl    96.99% <100.00%> (+0.01%) ⬆️


Contributor

@bkamins bkamins left a comment

This also required testing that the function stays non-decreasing when we increase `b` and the formula switches; I tested it and it holds.

@nalimilan
Member Author

nalimilan commented Jul 1, 2023

> This also required testing that the function stays non-decreasing when we increase `b` and the formula switches; I tested it and it holds.

It's already covered by the test added a long time ago by JuliaLang/julia#16572. That's how I realized the problem. ;-)

EDIT: You mean γ, not b?

@bkamins
Contributor

bkamins commented Jul 2, 2023

In general, I mean that it should be monotonic in `a`, `b` and `γ`, and I checked all three. The tests did not cover all cases fully: for example, you need to test the case where `float(a) ≈ float(b)` but `!(float(a) ≈ nextfloat(float(b)))` (and similarly `!(prevfloat(float(a)) ≈ float(b))`) for monotonicity across various `γ`. I think this is not covered, but I checked it.
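
One way to probe the boundary mentioned here is to locate the last float `b` for which `a ≈ b` still holds and compare the interpolated values on either side of the switch. The following is only a sketch under assumptions: `interp` is a hypothetical stand-in for the selection rule from the PR description (not the actual `quantile` internals), `last_approx` is a helper invented for this example, and `Float32` is used only to keep the search loop short.

```julia
# Hypothetical stand-in for the rule in the PR description; not Statistics.jl code.
interp(a, b, γ) = (isfinite(a) && isfinite(b) && a ≈ b) ? a + γ * (b - a) : (1 - γ) * a + γ * b

# Walk upward from `a` to the last float `b` for which `a ≈ b` still holds.
function last_approx(a::AbstractFloat)
    b = a
    while a ≈ nextfloat(b)
        b = nextfloat(b)
    end
    return b
end

a = 1.0f0
b = last_approx(a)        # a ≈ b, so the a + γ*(b-a) branch is used
b_next = nextfloat(b)     # !(a ≈ b_next), so the (1-γ)*a + γ*b branch is used

γs = range(0, 1, length=100)
# Monotonicity in b across the switch: increasing b must not decrease the result.
crosses_ok = all(interp(a, b, γ) <= interp(a, b_next, γ) for γ in γs)
# Monotonicity in γ on each side of the switch.
sorted_ok = issorted(interp.(a, b, γs)) && issorted(interp.(a, b_next, γs))
@show crosses_ok sorted_ok
```

With `Float64` weights and `Float32` endpoints, the rounding error is far smaller than the gap between adjacent `b` values, so this check is lenient; a stricter variant would use `γ` of the same type as `a` and `b`, and would also walk downward with `prevfloat`.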

@nalimilan
Member Author

OK. So you mean two tests like this are needed?

    using Statistics, Test

    # Neighboring values spaced by eps() and by 2eps()
    @test issorted(quantile([1.0, 1.0+eps(), 1.0+2eps(), 1.0+3eps()], range(0, 1, length=100)))
    @test issorted(quantile([1.0, 1.0+2eps(), 1.0+4eps(), 1.0+6eps()], range(0, 1, length=100)))

@bkamins
Contributor

bkamins commented Jul 28, 2023

Yes - something like this (this is not strictly needed 😄, but I ran such tests and they were OK).

@nalimilan nalimilan merged commit 35ca0a0 into master Jul 29, 2023
11 checks passed
@nalimilan nalimilan deleted the nl/quantile branch July 29, 2023 21:32
nalimilan added a commit that referenced this pull request on Oct 2, 2023
Before #145, `Date` and `DateTime` were supported with `quantile` as long
as the cut point fell between two equal values. Restore this behavior,
as some code may rely on it given that this is the most common situation
with large datasets.
nalimilan added three more commits that referenced this pull request, on Oct 2, Oct 3, and Nov 5, 2023, each with the same message.

Successfully merging this pull request may close these issues.

Incorrect quantiles for floating-point and integer arrays
3 participants