Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

median absolute deviation agg #34482

Merged
merged 20 commits into from
Oct 30, 2018
Merged

Conversation

andyb-elastic
Copy link
Contributor

This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation

Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids

For #26681

Still writing the docs, but wanted to get feedback on the code

This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation

Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyb-elastic I left some comments but I think this is a good start

for (InternalAggregation aggregation : aggregations) {
final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation;
if (valueMerged == null) {
valueMerged = new TDigestState(magAgg.valuesSketch.compression());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid this check I think we can just initialise the valueMerged to an empty T-Digest outside the loop and merge each response into that?

Releasables.close(valueSketches);
}

public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think this static method should live in InternalMedianAbsoluteDeviation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ Seems more like an Internal'y method, even though it's used in multiple places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


@Override
public double getMedianAbsoluteDeviation() {
return computeMedianAbsoluteDeviation(valuesSketch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a relatively expensive operation, I think we should calculate this value once and store it rather than calculating it each time the getter is called, especially as the value cannot change. Maybe we should calculate this value in the constructor and store it along side the sketch or calculate it in the reduce method and store it alongside the sketch at that point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree, I had a couple ways of doing it before but nixed them because they were gross. Don't know why doing it in the constructor didn't occur to me, that's much cleaner

public InternalAggregation doReduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) {
TDigestState valueMerged = null;
for (InternalAggregation aggregation : aggregations) {
final InternalMedianAbsoluteDeviation magAgg = (InternalMedianAbsoluteDeviation) aggregation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code in The aggregator factory below it looks like magAgg.valuesSketch can be null if the agg is unmapped, I think we need to guard against NPEs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, I think it gets protected against the valuesSketch being null in MedianAbsoluteDeviationAggregator#buildAggregation because if we never built a sketch for the bucket it's requesting, we just create an empty one. Is there another way the internal aggregation gets constructed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typo definitely needs fixing though


public static double computeMedianAbsoluteDeviation(TDigestState valuesSketch) {

if (valuesSketch.size() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since unmapped aggs have a null sketch I think we need a null check in this method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the unmapped and partially unmapped tests in the integration test fail at the moment because of this? If not, we should work out why and ensure they are correct and have a test that tests for NPEs here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method actually has the same behavior (returning NaN) without this check, because we create an empty approximatedDeviationsSketch, and calling #quantile on an empty sketch returns NaN. I added this check to make it really obvious what happens without data points, since I don't think that #quantile returning NaN is documented

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explained why I don't think valuesSketch can be null here in another comment, the unmapped and partially unmapped tests were passing when I committed this

* Set the compression factor of the t-digest sketches used
*/
public MedianAbsoluteDeviationAggregationBuilder compression(double compression) {
if (compression < 0.0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a MAD issue just something that caught my eye. I know we allow zero as a valid compression in TDigest percentiles... but I'm wondering if we shouldn't allow it?

A quick skim of AVLTreeDigest shows compression used as a denominator and in a multiplication. The agg still "works" with 0 compression, but I'm thinking that will cause a compression on each new point? Should we allow that sort of behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to not allowing compression of 0, it seems dangerous and trappy to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really this agg's accuracy is pretty bad with any compression less than 20 or so, but it kind of feels like restricting it that high is a little heavy handed. But there's no reason why anyone should be using compression 0

* move mad calculation method to internal agg
* make more explicit claims about nullability (?) in internal agg and
aggregator
* make more explicit how we determine the aggregator can produce a
result for a given bucket
* disallow compression = 0
@andyb-elastic
Copy link
Contributor Author

I made some changes to make more clear what is expected to be null or non null in the internal agg + aggregator


@Override
protected int doHashCode() {
return Objects.hash(valuesSketch, medianAbsoluteDeviation);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the MAD should be included here and equals 🤔

I believe equal values sketches implies equal MAD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@andyb-elastic
Copy link
Contributor Author

Also looks like we have some yaml rest tests for percentiles, would it be appropriate to add some for this agg?

@andyb-elastic
Copy link
Contributor Author

Jenkins run gradle build tests please

@colings86
Copy link
Contributor

Also looks like we have some yaml rest tests for percentiles, would it be appropriate to add some for this agg?

Yes I think we should have YAML tests for this aggregation to test that the aggregation works end to end as expected

@andyb-elastic
Copy link
Contributor Author

I added a yaml test

Another change I pushed worth noting - before I opened this I changed all of the class names to use the full "median absolute deviation" rather than the abbreviation "mad". Today I realized I'd forgotten to do it with the dsl name for the agg too, so it's now median_absolute_deviation. I like that for consistency but it is a little more verbose than the other aggs

@andyb-elastic
Copy link
Contributor Author

It looks like the failures for aeaf714 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/707/ have been occurring in master so I think they're unrelated

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some tiny nits. I think this looks good! Ping me when the docs are up and I'll take a final look :)

@andyb-elastic
Copy link
Contributor Author

@polyfractal thanks for the comments, I think I've addressed all of them. I'll ping again after pushing the docs

@colings86 colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018
@andyb-elastic
Copy link
Contributor Author

@polyfractal docs are pushed. I cut a significant amount of content that I'd originally intended, mostly

  • More details about the algorithm, since it felt like it was getting too far into implementation details
  • Rehashing of details about tdigest approximation, which seemed like it was better to link to the sections in the percentiles agg
  • Discussion of the approximation accuracy resulting from the benchmarking I did - while I'd like to give the user a sense of how it performs on the distributions we tested it seems irresponsible to make claims about it given that we won't be retesting it over time

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look good! Left a small comment, but feel free to ignore it if it doesn't fit in, or too hard to describe, etc.

I think this is good on my end. 👍

treated. By default they will be ignored but it is also possible to treat them
as if they had a value.

Let's be optimistic and assume some reviewers loved the product so much that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

@polyfractal
Copy link
Contributor

Discussion of the approximation accuracy resulting from the benchmarking I did - while I'd like to give the user a sense of how it performs on the distributions we tested it seems irresponsible to make claims about it given that we won't be retesting it over time

Just a comment on this; I think it's fine to give some indication of how approximate the algo is, just so the user has some notion of what to expect. Even if we don't continually run the benchmarks it might be useful for them to know the ballpark. "1-5%", etc is about as specific as we get for the other approximate aggs (and disclaimers about being dependent on data)

@andyb-elastic
Copy link
Contributor Author

Thanks, I wrote essentially that. In my benchmarking we always got within 1% with compression = 1000 (what I set the default as) but I noted 5% to be safe

@andyb-elastic
Copy link
Contributor Author

Jenkins run gradle build tests please

@andyb-elastic andyb-elastic merged commit b8280ea into elastic:master Oct 30, 2018
andyb-elastic added a commit to andyb-elastic/elasticsearch that referenced this pull request Oct 30, 2018
This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation

Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids
andyb-elastic added a commit that referenced this pull request Oct 30, 2018
This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation

Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids
@andyb-elastic
Copy link
Contributor Author

master: b8280ea
6.x: 3fd9dd4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants