Upgraded to t-digest 3.3. #3634
Conversation
Signed-off-by: dblock <dblock@dblock.org>
Ok, doesn't look so simple...
@tdunning Would you mind assisting here a bit please? The upgrade from 3.2 to 3.3 produces different percentiles in some scenarios given very simple data. Neither the old data nor the new data is "correct", but I imagine that is expected given that we use t-digest. You can see the raw data in https://github.com/opensearch-project/OpenSearch/blob/37651e9b5fe914a99f0abe0a36e10bd46d958691/rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/180_percentiles_tdigest_metric.yml and the diff in this PR for how those values changed. The data is just 4 values: 1, 51, 101 and 151; Google Sheets results below.
I didn't expect such a big difference in a dot release. At the very least I'd like to understand whether this is expected, and whether this is going to have to be released as a major breaking change for OpenSearch users. More differing results with smaller tests: bb9e8f2.
I am not entirely clear about how to read the difference here. I think that what you are saying is that given samples [1, 51, 101, 151], the 25th, 50th and 75th percentiles changed from [26, 76, 126] to [51, 101, 151]. Is that correct?
If so, this was a bug fix involved in cleaning up the behavior of the system in small-count cases. The problem is that with just four data points, we don't need to summarize the data at all. As such, any quantiles that involve interpolation between observed values are simply wrong.
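For concreteness, here is a minimal sketch (plain Python, hypothetical function name, ignoring t-digest's compression entirely) of a quantile that never interpolates: it only ever returns observed values, stepping to the upper neighbour exactly at a boundary, which reproduces the [51, 101, 151] values seen in this PR:

```python
import math

def quantile_no_interp(samples, q):
    """Inverse empirical CDF returning only observed values.

    Steps to the upper neighbour when q lands exactly on a
    boundary between two samples' probability mass.
    """
    s = sorted(samples)
    n = len(s)
    # Each sample carries mass 1/n; index by flooring q*n, clamped.
    return s[min(n - 1, math.floor(q * n))]

data = [1, 51, 101, 151]
print([quantile_no_interp(data, q) for q in (0.25, 0.5, 0.75)])
# [51, 101, 151]
```

Note this is only one of the defensible conventions at a discontinuity; the discussion below covers the alternatives.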
Sorry... I missed your very fine explanation. I understand now that you were doing a regression test against previous behavior and were surprised at a change in this behavior. The fact is, however, this old behavior was a bug. That bug was fixed. If we look at the quantile curve for your data, we see this: The circles indicate an open boundary and the filled dots indicate a closed one. Because we retain all of the data, we can't in good faith interpolate. The only question is whether the quantile at exactly 0.25 should be 1 or 51. In t-digest, I settled on the lower value. The old code was interpolating and was just wrong.
In case you are curious, a similar issue arises with the cdf function. There, the graph for your data looks like this: Here, what I have chosen is to use the midpoint when you ask for the CDF at exactly a sample point. This gets a bit fancier when there are multiple samples at exactly the same point. In general, I take the CDF to be the midpoint of the left and right limits of the step function.
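The midpoint convention described above can be sketched as follows (a hypothetical helper, ignoring t-digest's compression and treating the data exactly):

```python
def empirical_cdf(samples, x):
    """CDF of the data as given; at an exact sample point, return
    the midpoint of the left and right limits of the step function.
    Repeated samples each contribute half their mass at that point.
    """
    n = len(samples)
    below = sum(1 for s in samples if s < x)
    equal = sum(1 for s in samples if s == x)
    return (below + 0.5 * equal) / n

data = [1, 51, 101, 151]
print(empirical_cdf(data, 1))   # 0.125 (midway between 0 and 0.25)
print(empirical_cdf(data, 76))  # 0.5
```

This convention keeps the CDF symmetric under negating the data (up to the complement), which is the motivation given later in the thread.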
@tdunning Thank you! This is super clear.
@kartg I do care about users more than merge conflicts, but I hear you. Any feelings about user impact?
One of the functional tests from OpenSearch Dashboards displayed the incorrect value [link to issue]. We will update the value.
Origin: opensearch-project/OpenSearch#3634. The previous value was actually incorrect; after OpenSearch bumped t-digest, the value is now correct. Issue: opensearch-project#1821. Signed-off-by: Kawika Avilla <kavilla414@gmail.com>
Then shouldn't the result be [1, 51, 101]? [1, 51, 101] is the result I get from Mathematica as well:

vec = {1, 51, 101, 151};
Quantile[vec, #] & /@ {1/4, 1/2, 3/4}
{1, 51, 101}
* [Tests] update expected value for percentile ranks. Origin: opensearch-project/OpenSearch#3634. The previous value was actually incorrect; after OpenSearch bumped t-digest, the value is now correct. Issue: #1821. Signed-off-by: Kawika Avilla <kavilla414@gmail.com>
* skip inconsistent values. Signed-off-by: Kawika Avilla <kavilla414@gmail.com>
* use slice. Signed-off-by: Kawika Avilla <kavilla414@gmail.com>
So when I updated the values for this test, I seemed to get inconsistent values for the 50th, 75th, and 95th percentiles. For example: https://github.com/opensearch-project/OpenSearch-Dashboards/runs/7120932868?check_suite_focus=true
@kavilla Are you sure? I felt like I was getting something similar, but it turned out the tests were seeded with some random value. In any case, if you are sure, open a new issue?
I am happy to comment on the t-digest side of things if somebody can say what the test is actually doing. |
@tdunning Could you check out the above, please? |
@kavilla Want to open an issue in t-digest re: ^ ? |
The problem here is that the inverse CDF (aka quantile) is not a function. For the example you give with observations at [1, 51, 101, 151], the CDF is well behaved and looks like this:
![image](https://user-images.githubusercontent.com/250490/199615932-8508dffb-6df0-462e-9b2a-3843cfdd804c.png)
This does have discontinuities at each observed value, of course, and we could adopt a variety of conventions to define the value of the CDF exactly at each sampled value. One convention defines the CDF as segments that are open on the left but closed on the right; this would make the value of CDF(1) be 0. The opposite convention has all segments closed on the left and open on the right, so CDF(1) would be 0.25. These conventions are asymmetric, however, which can lead to surprising results if, for instance, you negate all sampled values. I would prefer that CDF(samples, x) == CDF(-samples, -x), so t-digest uses the convention that the CDF is midway between these alternative conventions, so that CDF(1) = 0.125 instead of either 0 or 0.25.
Resolving the value at these discontinuities does not solve the problem you are talking about. As you can see, the value of the CDF is constant for all values in the open interval (51, 101). This means that there is no unique value for the inverse CDF at 0.5; any value in (51, 101) could be used. Mathematica uses 51. Julia and R use 76. T-digest uses 101.
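A small sketch (hypothetical helper names; the conventions are labeled by the tool that happens to agree with them on this data, not by those tools' actual implementations) that reproduces all three answers at q = 0.5:

```python
import math

def quantile_lower(samples, q):
    # Step down to the lower observed value at a discontinuity.
    s = sorted(samples)
    return s[max(0, math.ceil(q * len(s)) - 1)]

def quantile_upper(samples, q):
    # Step up to the upper observed value at a discontinuity.
    s = sorted(samples)
    return s[min(len(s) - 1, math.floor(q * len(s)))]

def quantile_interp(samples, q):
    # Linear interpolation between order statistics (R type-7 style).
    s = sorted(samples)
    h = (len(s) - 1) * q
    i = min(int(h), len(s) - 2)
    return s[i] + (h - i) * (s[i + 1] - s[i])

data = [1, 51, 101, 151]
print(quantile_lower(data, 0.5))   # 51   (agrees with Mathematica here)
print(quantile_interp(data, 0.5))  # 76.0 (agrees with Julia and R here)
print(quantile_upper(data, 0.5))   # 101  (agrees with t-digest here)
```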
@tdunning The images didn't make it to GitHub, if you care to edit, but thanks for your explanation!
From https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile and https://mathworld.wolfram.com/Quantile.html, we could choose between 9 standardized definitions. To make sure we are comparing the same things - especially in unit tests - we should probably decide on a default one, and optionally enable choosing the other types. Mathematica uses type 1 by default, R uses type 7 by default, but both provide the option to choose.
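As a concrete illustration of how much the choice of type matters, Python's standard library exposes two of the nine estimators through `statistics.quantiles` (`'exclusive'` corresponds to R type 6, `'inclusive'` to R type 7):

```python
from statistics import quantiles

data = [1, 51, 101, 151]
# Cut points at 0.25, 0.5 and 0.75 under two different estimator types.
print(quantiles(data, n=4, method='inclusive'))  # [38.5, 76.0, 113.5] (R type 7)
print(quantiles(data, n=4, method='exclusive'))  # [13.5, 76.0, 138.5] (R type 6)
```

Both agree on 76 at the median, and neither matches t-digest's 101, because (as noted below) they estimate a population quantile rather than the empirical inverse CDF.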
Sorry about that. The image is very similar to what I posted in an earlier comment. I have edited my reply, but in editing the response I had difficulty getting the image to show up.
So I have been experimenting a fair bit with the Julia implementation (easier than playing with the Java version because it is interactive). I have changed the problem in question a tiny bit to make it clearer what is happening. I am using points at [...]. My first experiment was to verify that the [...]. The point of real interest, however, was to determine how the [...]. I would contend that it is hard to do better than this due to inevitable floating point limits. @dblock, @kavilla, @sharp-pixel what do you think? Also, I looked into the R and Julia implementations of the quantile function. In fact, they are trying to estimate the theoretical distribution rather than the empirical inverse cdf. This is a different problem entirely. Adding the Julia [...] gives a result far from the empirical inverse cdf function.
Thanks @tdunning. It seems Julia uses type 7 (from https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics.quantile!), so we just need to pick one quantile function type as the default, document it, and optionally have the choice to override the type.
I am not so sure of that. The different types of quantile estimation are all geared toward estimating a population quantile function, assuming that the data we have is only a sample of that population. That's an important problem. But it isn't what t-digest is intended to do. Instead, t-digest is intended to estimate the cdf and inverse cdf of the data we are given as it actually is. This refers to the empirical distribution as opposed to the population CDF. This is much simpler in many ways than trying to estimate the population, but it can be confusing because of the collision on the name "quantile". There is clearly a problem here (user confusion is indisputably a problem), but I really think that the correct action here is to fix the documentation on both sides.
I opened #5115 to discuss the user-facing aspects of this. Because there's no one correct result at the edges, I think we could support multiple strategies to everybody's satisfaction.
Description
The upgrade to t-digest 3.3 fixes a number of bugs in calculating percentiles.
Looking at sample output, the old version 3.2 was interpolating data (see #3634 (comment) for an explanation) and producing different (wrong) results, especially with small sample sizes. For example, given the input [1, 51, 101, 151], the 25th, 50th and 75th percentiles changed from [26, 76, 126] to [51, 101, 151] with this upgrade.
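One way to see where the old [26, 76, 126] values plausibly came from (a reconstruction for illustration, not the actual 3.2 code): treat each sorted sample as a unit of probability mass centered at (i + 0.5)/n and interpolate linearly between those centers, which yields exactly those numbers:

```python
def midpoint_interp_quantile(samples, q):
    """Hypothetical reconstruction of the old interpolating behavior:
    mass of sample i is centered at (i + 0.5)/n; interpolate linearly
    between adjacent centers, clamping at the extremes."""
    s = sorted(samples)
    n = len(s)
    pos = q * n - 0.5
    if pos <= 0:
        return s[0]
    if pos >= n - 1:
        return s[-1]
    i = int(pos)
    return s[i] + (pos - i) * (s[i + 1] - s[i])

data = [1, 51, 101, 151]
print([midpoint_interp_quantile(data, q) for q in (0.25, 0.5, 0.75)])
# [26.0, 76.0, 126.0]
```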
The tests in this PR have been adjusted to reflect the new expected percentiles. I added both a 2.x mixed-cluster test, and made all the other tests select a 3.x node, to preserve a trail of this change. I also corrected the assumption that the number of centroids is <= the number of data points, not ==.
Because the results change significantly, I think this is a 3.x change and should not be back-ported, but I am open to other considerations.
There are many changes between t-digest 3.2 and 3.3, see tdunning/t-digest#194.
Issues Resolved
Closes #1756.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.