-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
median absolute deviation agg #34482
Conversation
This commit adds a new single value metric aggregation that calculates the statistic called median absolute deviation, which is a measure of variability that works on more types of data than standard deviation Our calculation of MAD is approximated using t-digests. In the collect phase, we collect each value visited into a t-digest. In the reduce phase, we merge all value t-digests, then create a t-digest of deviations using the first t-digest's median and centroids
Pinging @elastic/es-search-aggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andyb-elastic I left some comments but I think this is a good start
for (InternalAggregation aggregation : aggregations) { | ||
final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation; | ||
if (valueMerged == null) { | ||
valueMerged = new TDigestState(magAgg.valuesSketch.compression()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid this check I think we can just initialise the valueMerged
to an empty T-Digest outside the loop and merge each response into that?
Releasables.close(valueSketches); | ||
} | ||
|
||
public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think this static method should live in InternalMedianAbsoluteDeviation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ Seems more like an Internal
'y method, even though it's used in multiple places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
@Override | ||
public double getMedianAbsoluteDeviation() { | ||
return computeMedianAbsoluteDeviation(valuesSketch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a relatively expensive operation, I think we should calculate this value once and store it rather than calculating it each time the getter is called, especially as the value cannot change. Maybe we should calculate this value in the constructor and store it along side the sketch or calculate it in the reduce method and store it alongside the sketch at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree, I had a couple ways of doing it before but nixed them because they were gross. Don't know why doing it in the constructor didn't occur to me, that's much cleaner
public InternalAggregation doReduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) { | ||
TDigestState valueMerged = null; | ||
for (InternalAggregation aggregation : aggregations) { | ||
final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the code in The aggregator factory below it looks like magAgg.valuesSketch
can be null if the agg is unmapped, I think we need to guard against NPEs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, I think it gets protected against the valuesSketch being null in MedianAbsoluteDeviationAggregator#buildAggregation
because if we never built a sketch for the bucket it's requesting, we just create an empty one. Is there another way the internal aggregation gets constructed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The typo definitely needs fixing though
|
||
public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) { | ||
|
||
if (valuesSketch.size() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since unmapped aggs have a null
sketch I think we need a null check in this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the unmapped and partially unmapped tests in the integration test fail at the moment because of this? If not, we should work out why and ensure they are correct and have a test that tests for NPEs here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method actually has the same behavior (returning NaN) without this check, because we create an empty approximatedDeviationsSketch
, and calling #quantile
on an empty sketch returns NaN. I added this check to make it really obvious what happens without data points, since I don't think that #quantile
returning NaN is documented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explained why I don't think valuesSketch
can be null here in another comment, the unmapped and partially unmapped tests were passing when I committed this
* Set the compression factor of the t-digest sketches used | ||
*/ | ||
public MedianAbsoluteDeviationAggregationBuilder compression(double compression) { | ||
if (compression < 0.0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really a MAD issue just something that caught my eye. I know we allow zero as a valid compression in TDigest percentiles... but I'm wondering if we shouldn't allow it?
A quick skim of AVLTreeDigest shows compression
used as a denominator and in a multiplication. The agg still "works" with 0
compression, but I'm thinking that will cause a compression on each new point? Should we allow that sort of behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to not allowing compression of 0, it seems dangerous and trappy to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really this agg's accuracy is pretty bad with any compression less than 20 or so, but it kind of feels like restricting it that high is a little heavy handed. But there's no reason why anyone should be using compression 0
* move mad calculation method to internal agg * make more explicit claims about nullability (?) in internal agg and aggregator * make more explicit how we determine the aggregator can produce a result for a given bucket * disallow compression = 0
I made some changes to make more clear what is expected to be null or non null in the internal agg + aggregator |
|
||
@Override | ||
protected int doHashCode() { | ||
return Objects.hash(valuesSketch, medianAbsoluteDeviation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the MAD should be included here and equals 🤔
I believe equal values sketches implies equal MAD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
Also looks like we have some yaml rest tests for percentiles, would it be appropriate to add some for this agg? |
Jenkins run gradle build tests please |
Yes I think we should have YAML tests for this aggregation to test that the aggregation works end to end as expected |
I added a yaml test Another change I pushed worth noting - before I opened this I changed all of the class names to use the full "median absolute deviation" rather than the abbreviation "mad". Today I realized I'd forgotten to do it with the dsl name for the agg too, so it's now |
It looks like the failures for aeaf714 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/707/ have been occurring in master so I think they're unrelated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some tiny nits. I think this looks good! Ping me when the docs are up and I'll take a final look :)
...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java
Outdated
Show resolved
Hide resolved
...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java
Outdated
Show resolved
Hide resolved
...org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregationBuilder.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregator.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregator.java
Outdated
Show resolved
Hide resolved
.../org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregatorFactory.java
Show resolved
Hide resolved
...r/src/test/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationIT.java
Show resolved
Hide resolved
...ain/resources/rest-api-spec/test/search.aggregation/270_median_absolute_deviation_metric.yml
Show resolved
Hide resolved
@polyfractal thanks for the comments, I think I've addressed all of them. I'll ping again after pushing the docs |
@polyfractal docs are pushed. I cut a significant amount of content that I'd originally intended, mostly
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look good! Left a small comment, but feel free to ignore it if it doesn't fit in, or too hard to describe, etc.
I think this is good on my end. 👍
docs/reference/aggregations/metrics/median-absolute-deviation-aggregation.asciidoc
Show resolved
Hide resolved
treated. By default they will be ignored but it is also possible to treat them | ||
as if they had a value. | ||
|
||
Let's be optimistic and assume some reviewers loved the product so much that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:D
Just a comment on this; I think it's fine to give some indication of how approximate the algo is, just so the user has some notion of what to expect. Even if we don't continually run the benchmarks it might be useful for them to know the ballpark. "1-5%", etc is about as specific as we get for the other approximate aggs (and disclaimers about being dependent on data) |
Thanks, I wrote essentially that. In my benchmarking we always got within 1% with compression = 1000 (what I set the default as) but I noted 5% to be safe |
Jenkins run gradle build tests please |
This commit adds a new single value metric aggregation that calculates the statistic called median absolute deviation, which is a measure of variability that works on more types of data than standard deviation Our calculation of MAD is approximated using t-digests. In the collect phase, we collect each value visited into a t-digest. In the reduce phase, we merge all value t-digests, then create a t-digest of deviations using the first t-digest's median and centroids
This commit adds a new single value metric aggregation that calculates the statistic called median absolute deviation, which is a measure of variability that works on more types of data than standard deviation Our calculation of MAD is approximated using t-digests. In the collect phase, we collect each value visited into a t-digest. In the reduce phase, we merge all value t-digests, then create a t-digest of deviations using the first t-digest's median and centroids
This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation
Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids
For #26681
Still writing the docs, but wanted to get feedback on the code