CCS: sibling pipeline aggregations don't work when minimizing roundtrips #40059

chrisronline · 2019-03-14T16:57:40Z

Elasticsearch version (bin/elasticsearch --version): latest master snapshot (hash: 8839a72)

Plugins installed: []

JVM version (java -version): java version "11.0.1" 2018-10-16 LTS

OS version (uname -a if on a Unix-like system): Darwin Kernel Version 18.2.0

Description of the problem including expected versus actual behavior:

When performing a CCS search request that targets both local and remote indices, an error occurs if the query contains a sibling pipeline aggregation (here, max_bucket).

The exact error is:

    java.lang.UnsupportedOperationException: Not supported
        at org.elasticsearch.search.aggregations.pipeline.InternalSimpleValue.doReduce(InternalSimpleValue.java:80) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.search.aggregations.pipeline.InternalSimpleValue.doReduce(InternalSimpleValue.java:34) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:135) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:90) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.search.SearchResponseMerger.getMergedResponse(SearchResponseMerger.java:193) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.search.TransportSearchAction$3.createFinalResponse(TransportSearchAction.java:389) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.search.TransportSearchAction$3.createFinalResponse(TransportSearchAction.java:379) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

Steps to reproduce:

Start two ES instances, ensuring they have distinct cluster.name values.
Bulk index documents into both:

POST _bulk
{ "index": { "_index": "sales" } }
{ "price": 10, "payment_type": "credit_card" }
{ "index": { "_index": "sales" } }
{ "price": 20, "payment_type": "cash" }

Configure one ES instance to connect to the other as a remote cluster for CCS:

POST _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "other": {
          "seeds": [
            "{other_es_instance_address}"
          ]
        }
      }
    }
  }
}

From the same ES instance, perform a CCS search that uses a pipeline aggregation:

POST *:sales,sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_type": {
      "terms": {
        "field": "payment_type.keyword"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "max_per_type": {
      "max_bucket": {
        "buckets_path": "sales_per_type>sales"
      }
    }
  }
}
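For reference, assuming the two documents from the _bulk step above were indexed into the sales index of both clusters, the request should produce per-type sums of 20 (credit_card) and 40 (cash), and max_per_type should pick the largest of them, 40, from the cash bucket. The sketch below only redoes that arithmetic in plain Python to show what max_bucket computes over sales_per_type>sales; it models the aggregations and does not talk to Elasticsearch.

```python
# Plain-Python arithmetic for the request above; it does not call Elasticsearch.
# Assumption: both clusters hold the same two documents from the _bulk step.
from collections import defaultdict

DOCS = [
    {"price": 10, "payment_type": "credit_card"},
    {"price": 20, "payment_type": "cash"},
]
clusters = {"local": DOCS, "other": DOCS}

# "sales_per_type": terms on payment_type.keyword, with a sum of price per bucket
sales_per_type = defaultdict(int)
for docs in clusters.values():
    for doc in docs:
        sales_per_type[doc["payment_type"]] += doc["price"]

# "max_per_type": max_bucket over sales_per_type>sales, i.e. the largest
# per-bucket sum together with the key(s) of the bucket(s) holding it
max_value = max(sales_per_type.values())
max_keys = sorted(k for k, v in sales_per_type.items() if v == max_value)

print(dict(sales_per_type))  # {'credit_card': 20, 'cash': 40}
print(max_value, max_keys)   # 40 ['cash']
```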

It's worth noting that the same query does not trigger an error when used in either of these contexts:
POST sales/_search
POST *:sales/_search

The error only occurs when doing them together: POST *:sales,sales/_search.

It's also worth noting that the query does not trigger an error when you remove the pipeline aggregation:

POST *:sales,sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_type": {
      "terms": {
        "field": "payment_type.keyword"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

This is currently BREAKING the Kibana Stack Monitoring app, as it runs this kind of pipeline aggregation, which is currently broken with CCS enabled.

The work done in this PR might be related, as its description suggests that it touches the code around this area.

cachedout · 2019-03-14T17:06:05Z

@javanna and @jimczi This is a pretty serious bug for @elastic/stack-monitoring and is potentially impacting ~~customers~~ (edit: we found this in master but aren't sure which versions it affects). We'd love some help tracking this down ASAP if you have a little bit of time. Thanks.

elasticmachine · 2019-03-14T17:22:16Z

Pinging @elastic/es-search

cachedout · 2019-03-14T19:18:56Z

Potentially related: #40067

javanna · 2019-03-14T19:43:25Z

This manifests when trying to minimize roundtrips (see #32125). I tested pipeline aggregations and they worked fine in my tests, but this is clearly a bug that was missed and needs fixing. I can see how a number of pipeline aggs are affected by this. Until it gets fixed, please set the minimize_roundtrips parameter to false on the search request.

chrisronline · 2019-03-14T19:45:52Z

@javanna I think the parameter is named ccs_minimize_roundtrips

javanna · 2019-03-14T19:46:20Z

yes it is thanks ;)
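The workaround is a query-string parameter on the search request, e.g. POST *:sales,sales/_search?ccs_minimize_roundtrips=false. As a small illustration, here is a hypothetical Python helper (not part of any Elasticsearch client library) that builds that request path from the index expression:

```python
# Hypothetical helper (not part of any Elasticsearch client) that builds the
# path for the workaround request with ccs_minimize_roundtrips disabled.
from urllib.parse import quote, urlencode


def ccs_search_path(indices, minimize_roundtrips):
    """Return '/<indices>/_search?ccs_minimize_roundtrips=<true|false>'."""
    target = ",".join(indices)
    query = urlencode({"ccs_minimize_roundtrips": str(minimize_roundtrips).lower()})
    # Keep '*', ':' and ',' unescaped so the index expression stays readable.
    return "/" + quote(target, safe="*:,") + "/_search?" + query


print(ccs_search_path(["*:sales", "sales"], minimize_roundtrips=False))
# /*:sales,sales/_search?ccs_minimize_roundtrips=false
```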

Today a coordinating node forces a final reduction of sibling pipeline aggregators whenever it reduces aggs, unless it is reducing them incrementally. This works well for incremental reduction of aggs, but it breaks CCS when minimizing roundtrips: each cluster ends up reducing its own pipeline aggregators locally, whereas that should only be done later by the CCS coordinating node. Once reduced, pipeline aggs cannot be reduced any further, yet that is exactly what happens with CCS, causing errors like "java.lang.UnsupportedOperationException: Not supported" to be returned. Each coordinating node should instead honour the reduce context flag that indicates whether it is executing a final reduce; if it is not, it should leave the sibling pipeline aggregations alone. Note that this bug only affects pipeline aggs that don't have a parent in the aggs tree; all the others work well. Relates to elastic#40059 but does not fix it yet, as the CCS coordinating node also needs to be adapted to recreate sibling pipeline aggregators from the request.
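The failure mode and the reduce-flag fix can be illustrated with a toy Python model. This is a hypothetical sketch, not Elasticsearch code: SimpleValue stands in for a materialized sibling pipeline result such as InternalSimpleValue, Python's NotImplementedError stands in for UnsupportedOperationException, and reduce_aggs stands in for a coordinating node's reduction. Each of the two clusters is assumed to hold the two sample documents in a single shard.

```python
# Toy model of the reduce behaviour described above; not Elasticsearch code.
# All names are hypothetical. NotImplementedError plays the role of
# java.lang.UnsupportedOperationException.


class SimpleValue:
    """Stands in for a materialized sibling pipeline result (cf. InternalSimpleValue)."""

    def __init__(self, value):
        self.value = value

    def reduce(self, others):
        # Already-computed pipeline output cannot be reduced again.
        raise NotImplementedError("Not supported")


def reduce_aggs(partials, is_final_reduce, honour_final_flag):
    """Merge per-term sums and, depending on the flags, run max_bucket."""
    sums, pipeline_outputs = {}, []
    for partial in partials:
        for term, value in partial["sales_per_type"].items():
            sums[term] = sums.get(term, 0) + value
        if "max_per_type" in partial:
            pipeline_outputs.append(partial["max_per_type"])
    if pipeline_outputs:
        # Merging results that already contain pipeline output fails.
        pipeline_outputs[0].reduce(pipeline_outputs[1:])
    result = {"sales_per_type": sums}
    # Before the fix (honour_final_flag=False): the sibling agg runs on every
    # reduce (none of the reduces in this model are incremental).
    # After the fix: only when the reduce context says it is the final reduce.
    if is_final_reduce or not honour_final_flag:
        result["max_per_type"] = SimpleValue(max(sums.values()))
    return result


def ccs_search(honour_final_flag):
    shard = {"sales_per_type": {"credit_card": 10, "cash": 20}}
    # Minimizing roundtrips: each cluster does a non-final reduce of its shards...
    per_cluster = [reduce_aggs([shard], False, honour_final_flag) for _ in range(2)]
    # ...and the CCS coordinating node does the final reduce of their responses.
    return reduce_aggs(per_cluster, True, honour_final_flag)


try:
    ccs_search(honour_final_flag=False)
except NotImplementedError as err:
    print("before the fix:", err)  # before the fix: Not supported

print("after the fix:", ccs_search(honour_final_flag=True)["max_per_type"].value)  # after the fix: 40
```

Note that the model hard-codes the max_bucket step inside reduce_aggs, so its final reduce always knows which pipeline to run. That glosses over the second half of the problem mentioned above: in the real code the CCS coordinating node still needs the sibling pipeline aggregators to be made available to it.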

We currently convert pipeline aggregators to their corresponding InternalAggregation instances as part of the final reduction phase. They arrive at the coordinating node as part of the QuerySearchResult objects from the shards and, although aggs may be reduced incrementally (so there may be some non-final reductions followed by the final one), all the reduction phases happen on the same node. With CCS minimizing roundtrips, though, each cluster performs its own non-final reduction and then serializes the results back to the CCS coordinating node, which performs the final reduction. This breaks the assumption made so far that all reductions happen on the same node. With elastic#40101 we made sure that top-level pipeline aggs are not reduced as part of the non-final reduction. The next step is to make sure that they don't get lost, meaning that each coordinating node needs to send them back to the CCS coordinating node as part of the top-level `InternalAggregations` object. Closes elastic#40059
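A toy Python model (hypothetical names; not the actual Elasticsearch implementation) shows why the top-level pipeline aggregators have to travel with the partially reduced aggregations: if a cluster's non-final reduce neither applies nor forwards them, the CCS coordinating node has nothing to run, and in this model max_per_type is simply missing from the final result. Pipelines are represented only by their names, and only a max_bucket over sales_per_type>sales is modeled.

```python
# Toy model of shipping top-level pipeline aggregators with partial results;
# not Elasticsearch code, all names are hypothetical. Only a max_bucket over
# sales_per_type>sales is modeled.


def max_bucket(sums):
    top = max(sums.values())
    return {"value": top, "keys": sorted(k for k, v in sums.items() if v == top)}


def reduce_aggs(partials, is_final_reduce, ship_pipelines=True):
    sums, pipelines = {}, []
    for partial in partials:
        for term, value in partial["sales_per_type"].items():
            sums[term] = sums.get(term, 0) + value
        # Collect the (not yet applied) top-level pipeline aggregators.
        for name in partial.get("pipelines", []):
            if name not in pipelines:
                pipelines.append(name)
    result = {"sales_per_type": sums}
    if is_final_reduce:
        # Only the node doing the final reduce materializes pipeline output.
        for name in pipelines:
            result[name] = max_bucket(sums)
    elif ship_pipelines:
        # Non-final reduce: send the pipelines along to the next reducer.
        result["pipelines"] = pipelines
    return result


# One shard per cluster; shard results carry the pipeline aggregators.
SHARD = {"sales_per_type": {"credit_card": 10, "cash": 20}, "pipelines": ["max_per_type"]}


def ccs_search(ship_pipelines):
    per_cluster = [reduce_aggs([SHARD], False, ship_pipelines) for _ in range(2)]
    return reduce_aggs(per_cluster, True)


print(ccs_search(ship_pipelines=False))
# {'sales_per_type': {'credit_card': 20, 'cash': 40}}
print(ccs_search(ship_pipelines=True))
# {'sales_per_type': {'credit_card': 20, 'cash': 40}, 'max_per_type': {'value': 40, 'keys': ['cash']}}
```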
