median absolute deviation agg #34482

andyb-elastic · 2018-10-15T18:52:08Z

This commit adds a new single-value metric aggregation that calculates
the median absolute deviation (MAD), a measure of variability that
works on more types of data than standard deviation.

Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the merged value t-digest's median and centroids.
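
For readers who want the shape of the calculation, here is a minimal sketch of the two-step idea described above. It is an illustration under assumptions, not the code in this PR: a plain list of weighted centroids stands in for a t-digest, `MadSketch`, `Centroid` and `weightedMedian` are hypothetical names, and the median is taken at a centroid without the interpolation a real t-digest performs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: a list of weighted centroids stands in for a t-digest.
// This is not TDigestState and does not use the t-digest library.
final class MadSketch {

    /** A centroid summarises `count` collected values near `mean`. */
    static final class Centroid {
        final double mean;
        final long count;

        Centroid(double mean, long count) {
            this.mean = mean;
            this.count = count;
        }
    }

    /** Returns the mean of the centroid where cumulative weight first reaches half the total. */
    static double weightedMedian(List<Centroid> centroids) {
        if (centroids.isEmpty()) {
            return Double.NaN; // no data points, so no median
        }
        List<Centroid> sorted = new ArrayList<>(centroids);
        sorted.sort(Comparator.comparingDouble((Centroid c) -> c.mean));
        long total = 0;
        for (Centroid c : sorted) {
            total += c.count;
        }
        long seen = 0;
        for (Centroid c : sorted) {
            seen += c.count;
            if (2 * seen >= total) {
                return c.mean;
            }
        }
        return sorted.get(sorted.size() - 1).mean;
    }

    /** Builds a second set of centroids holding |mean - median| and takes its median. */
    static double medianAbsoluteDeviation(List<Centroid> values) {
        double median = weightedMedian(values);
        List<Centroid> deviations = new ArrayList<>();
        for (Centroid c : values) {
            deviations.add(new Centroid(Math.abs(c.mean - median), c.count));
        }
        return weightedMedian(deviations); // NaN when `values` is empty
    }
}
```

With the values 1, 2, 3, 4 and 100 as single-point centroids, the median is 3 and the deviations are 2, 1, 0, 1 and 97, so the result is 1. The outlier at 100 barely moves it.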

For #26681

Still writing the docs, but wanted to get feedback on the code


elasticmachine · 2018-10-15T18:52:10Z

Pinging @elastic/es-search-aggs

colings86

@andyb-elastic I left some comments but I think this is a good start

colings86 · 2018-10-16T08:26:28Z

...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java

+        for (InternalAggregation aggregation : aggregations) {
+            final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation;
+            if (valueMerged == null) {
+                valueMerged = new TDigestState(magAgg.valuesSketch.compression());


To avoid this check I think we can just initialise the valueMerged to an empty T-Digest outside the loop and merge each response into that?
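
A rough sketch of what that suggestion could look like. The `Sketch` class here is a hypothetical stand-in and not the real TDigestState API. The sketch also assumes the list passed to the reduce step is non-empty, and that taking the compression from the first entry is acceptable.

```java
import java.util.ArrayList;
import java.util.List;

final class ReduceSketch {

    /** Hypothetical stand-in for a mergeable sketch; not the real TDigestState API. */
    static final class Sketch {
        final double compression;
        final List<Double> values = new ArrayList<>();

        Sketch(double compression) {
            this.compression = compression;
        }

        void add(double value) {
            values.add(value);
        }

        void add(Sketch other) {
            values.addAll(other.values);
        }

        int size() {
            return values.size();
        }
    }

    /** Merges per-shard sketches into one; assumes the list is non-empty. */
    static Sketch reduce(List<Sketch> shardSketches) {
        // Created once, outside the loop, so the loop body needs no null check.
        Sketch merged = new Sketch(shardSketches.get(0).compression);
        for (Sketch shard : shardSketches) {
            merged.add(shard);
        }
        return merged;
    }
}
```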

colings86 · 2018-10-16T08:29:15Z

...in/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregator.java

+        Releasables.close(valueSketches);
+    }
+
+    public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) {


Personally I think this static method should live in InternalMedianAbsoluteDeviation.

++ Seems more like an Internal'y method, even though it's used in multiple places

colings86 · 2018-10-16T08:31:36Z

...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java

+
+    @Override
+    public double getMedianAbsoluteDeviation() {
+        return computeMedianAbsoluteDeviation(valuesSketch);


This is a relatively expensive operation. I think we should calculate this value once and store it rather than calculating it each time the getter is called, especially as the value cannot change. Maybe we should calculate it in the constructor, or in the reduce method, and store it alongside the sketch?

Definitely agree, I had a couple ways of doing it before but nixed them because they were gross. Don't know why doing it in the constructor didn't occur to me, that's much cleaner
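
As a sketch of the constructor approach, assuming an immutable result object: `CachedMad` is a hypothetical class that computes an exact MAD over a plain array, standing in for the sketch-based calculation. The point is only that the getter returns a stored field.

```java
import java.util.Arrays;

final class CachedMad {
    private final double[] values;                // stand-in for the values sketch
    private final double medianAbsoluteDeviation; // computed once, never recomputed

    CachedMad(double[] values) {
        this.values = values.clone();
        this.medianAbsoluteDeviation = compute(this.values);
    }

    /** Cheap: returns the value stored by the constructor. */
    double getMedianAbsoluteDeviation() {
        return medianAbsoluteDeviation;
    }

    /** Exact MAD: the median of absolute deviations from the median. */
    static double compute(double[] values) {
        if (values.length == 0) {
            return Double.NaN; // made explicit rather than relying on downstream behaviour
        }
        double median = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - median);
        }
        return median(deviations);
    }

    private static double median(double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```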

colings86 · 2018-10-16T08:35:01Z

...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java

+    public InternalAggregation doReduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) {
+        TDigestState valueMerged = null;
+        for (InternalAggregation aggregation : aggregations) {
+            final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation;


From the code in the aggregator factory below, it looks like magAgg.valuesSketch can be null if the agg is unmapped. I think we need to guard against NPEs here?

I'm not sure. I think we're protected against valuesSketch being null in MedianAbsoluteDeviationAggregator#buildAggregation: if we never built a sketch for the bucket it's requesting, we just create an empty one. Is there another way the internal aggregation gets constructed?

The typo definitely needs fixing though

colings86 · 2018-10-16T08:35:58Z

...in/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregator.java

+
+    public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) {
+
+        if (valuesSketch.size() == 0) {


Since unmapped aggs have a null sketch I think we need a null check in this method?

I assume the unmapped and partially unmapped tests in the integration test fail at the moment because of this? If not, we should work out why and ensure they are correct and have a test that tests for NPEs here

This method actually has the same behavior (returning NaN) without this check, because we create an empty approximatedDeviationsSketch, and calling #quantile on an empty sketch returns NaN. I added this check to make it really obvious what happens without data points, since I don't think that #quantile returning NaN is documented

I explained in another comment why I don't think valuesSketch can be null here; the unmapped and partially unmapped tests were passing when I committed this

polyfractal · 2018-10-18T15:56:46Z

...org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregationBuilder.java

+     * Set the compression factor of the t-digest sketches used
+     */
+    public MedianAbsoluteDeviationAggregationBuilder compression(double compression) {
+        if (compression < 0.0) {


Not really a MAD issue just something that caught my eye. I know we allow zero as a valid compression in TDigest percentiles... but I'm wondering if we shouldn't allow it?

A quick skim of AVLTreeDigest shows compression used as a denominator and in a multiplication. The agg still "works" with 0 compression, but I'm thinking that will cause a compression on each new point? Should we allow that sort of behavior?

+1 to not allowing compression of 0, it seems dangerous and trappy to me

Definitely agree

Really this agg's accuracy is pretty bad with any compression less than 20 or so, but it kind of feels like restricting it that high is a little heavy handed. But there's no reason why anyone should be using compression 0
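
For illustration, the stricter check could look something like the following. `CompressionSetting` is a hypothetical stand-in for the builder, and the exception type, message text and starting value are assumptions rather than what was committed.

```java
final class CompressionSetting {
    private double compression = 1000.0; // illustrative starting value only

    /** Sets the compression factor, rejecting zero as well as negative values. */
    CompressionSetting compression(double compression) {
        if (compression <= 0.0) {
            throw new IllegalArgumentException(
                "[compression] must be greater than 0. Found [" + compression + "]");
        }
        this.compression = compression;
        return this;
    }

    double compression() {
        return compression;
    }
}
```

Changing `< 0.0` to `<= 0.0` is the whole fix. A higher floor (say 20) would be a separate policy decision.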

* move mad calculation method to internal agg
* make more explicit claims about nullability (?) in internal agg and aggregator
* make more explicit how we determine the aggregator can produce a result for a given bucket
* disallow compression = 0

andyb-elastic · 2018-10-19T22:32:20Z

I made some changes to make more clear what is expected to be null or non null in the internal agg + aggregator

andyb-elastic · 2018-10-19T22:33:16Z

...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java

+
+    @Override
+    protected int doHashCode() {
+        return Objects.hash(valuesSketch, medianAbsoluteDeviation);


Not sure if the MAD should be included here and in equals 🤔

I believe equal value sketches imply equal MAD
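
If that holds, equality and hashing can rely on the sketch alone. A minimal sketch of the idea, using a hypothetical `SketchBackedResult` class with a plain list standing in for the values sketch. The derived value is passed in here for brevity, whereas a real class would compute it from the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class SketchBackedResult {
    private final List<Double> valuesSketch;      // stand-in for the values sketch
    private final double medianAbsoluteDeviation; // derived from valuesSketch in practice

    SketchBackedResult(List<Double> valuesSketch, double medianAbsoluteDeviation) {
        this.valuesSketch = new ArrayList<>(valuesSketch);
        this.medianAbsoluteDeviation = medianAbsoluteDeviation;
    }

    double getMedianAbsoluteDeviation() {
        return medianAbsoluteDeviation;
    }

    // The derived field is deliberately left out of equals and hashCode:
    // it is a pure function of the sketch, so comparing it adds nothing.
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        SketchBackedResult other = (SketchBackedResult) obj;
        return Objects.equals(valuesSketch, other.valuesSketch);
    }

    @Override
    public int hashCode() {
        return Objects.hash(valuesSketch);
    }
}
```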

andyb-elastic · 2018-10-19T22:37:27Z

Also looks like we have some yaml rest tests for percentiles, would it be appropriate to add some for this agg?

andyb-elastic · 2018-10-19T22:49:08Z

Jenkins run gradle build tests please

colings86 · 2018-10-22T16:14:33Z

> Also looks like we have some yaml rest tests for percentiles, would it be appropriate to add some for this agg?

Yes I think we should have YAML tests for this aggregation to test that the aggregation works end to end as expected

andyb-elastic · 2018-10-22T22:39:07Z

I added a yaml test

Another change I pushed worth noting - before I opened this I changed all of the class names to use the full "median absolute deviation" rather than the abbreviation "mad". Today I realized I'd forgotten to do it with the dsl name for the agg too, so it's now median_absolute_deviation. I like that for consistency but it is a little more verbose than the other aggs

andyb-elastic · 2018-10-23T18:10:47Z

It looks like the failures for aeaf714 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/707/ have been occurring in master so I think they're unrelated

polyfractal

Left some tiny nits. I think this looks good! Ping me when the docs are up and I'll take a final look :)

...main/java/org/elasticsearch/search/aggregations/metrics/InternalMedianAbsoluteDeviation.java

...org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregationBuilder.java

...in/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregator.java

.../org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationAggregatorFactory.java

...r/src/test/java/org/elasticsearch/search/aggregations/metrics/MedianAbsoluteDeviationIT.java

...ain/resources/rest-api-spec/test/search.aggregation/270_median_absolute_deviation_metric.yml

andyb-elastic · 2018-10-24T22:02:06Z

@polyfractal thanks for the comments, I think I've addressed all of them. I'll ping again after pushing the docs

andyb-elastic · 2018-10-26T21:22:49Z

@polyfractal docs are pushed. I cut a significant amount of content that I'd originally intended, mostly:

* More details about the algorithm, since it felt like it was getting too far into implementation details
* Rehashing of details about tdigest approximation, which seemed like it was better to link to the sections in the percentiles agg
* Discussion of the approximation accuracy resulting from the benchmarking I did - while I'd like to give the user a sense of how it performs on the distributions we tested it seems irresponsible to make claims about it given that we won't be retesting it over time

polyfractal

Docs look good! Left a small comment, but feel free to ignore it if it doesn't fit in, or too hard to describe, etc.

I think this is good on my end. 👍

docs/reference/aggregations/metrics/median-absolute-deviation-aggregation.asciidoc

polyfractal · 2018-10-29T22:05:01Z

docs/reference/aggregations/metrics/median-absolute-deviation-aggregation.asciidoc

+treated. By default they will be ignored but it is also possible to treat them
+as if they had a value.
+
+Let's be optimistic and assume some reviewers loved the product so much that


polyfractal · 2018-10-29T22:20:46Z

> Discussion of the approximation accuracy resulting from the benchmarking I did - while I'd like to give the user a sense of how it performs on the distributions we tested it seems irresponsible to make claims about it given that we won't be retesting it over time

Just a comment on this; I think it's fine to give some indication of how approximate the algo is, just so the user has some notion of what to expect. Even if we don't continually run the benchmarks it might be useful for them to know the ballpark. "1-5%", etc is about as specific as we get for the other approximate aggs (and disclaimers about being dependent on data)

andyb-elastic · 2018-10-29T23:05:39Z

Thanks, I wrote essentially that. In my benchmarking we always got within 1% with compression = 1000 (what I set the default as) but I noted 5% to be safe

andyb-elastic · 2018-10-30T04:10:13Z

Jenkins run gradle build tests please


andyb-elastic · 2018-10-30T19:57:22Z

master: b8280ea
6.x: 3fd9dd4

andyb-elastic added >feature :Analytics/Aggregations Aggregations v7.0.0 v6.5.0 labels Oct 15, 2018

andyb-elastic requested a review from polyfractal October 15, 2018 18:52

fix testAllAggsAreBeingTested

689db76

colings86 reviewed Oct 16, 2018


polyfractal reviewed Oct 18, 2018


code review changes

1720716


andyb-elastic commented Oct 19, 2018


andyb-elastic added 4 commits October 22, 2018 13:38

equality/hashcode only uses value sketch

b32f875

yaml test

53a4ce3

use full name in dsl

90f21c0

Merge branch 'master' into feature-mad-agg

01f7e81

skip rest test on 6.x for now

aeaf714

Merge branch 'master' into feature-mad-agg

22deb45


polyfractal reviewed Oct 23, 2018


andyb-elastic added 4 commits October 24, 2018 14:22

reduce visibility of some classes

8229c63

use parsefield in builder

7ef661e

remove extra whitespace

d2af85c

yml test case where we match no docs

036fc0f

colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018

andyb-elastic added 4 commits October 26, 2018 13:18

mad docs

c80b025

Merge branch 'master' into feature-mad-agg

2f2f3ce

fix callout

c6427d9

bump version skip

496f27b


polyfractal approved these changes Oct 29, 2018


andyb-elastic added 3 commits October 29, 2018 16:00

explain robustness and add note about accuracy

14636c2

Merge branch 'master' into feature-mad-agg

cb4a574

disclaimer

1287036

andyb-elastic merged commit b8280ea into elastic:master Oct 30, 2018

andyb-elastic mentioned this pull request Oct 30, 2018

backport to 6.x: median absolute deviation agg #35090

Merged

andyb-elastic mentioned this pull request Oct 30, 2018

Median Absolute Deviation #26681

Closed

codebrain mentioned this pull request Jan 25, 2019

[meta] 6.6.0 Release elastic/elasticsearch-net#3552

Closed

48 tasks

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

codebrain mentioned this pull request Mar 19, 2019

[meta] 6.7.0 Release elastic/elasticsearch-net#3615

Closed

24 tasks

