sorted cardinality results don't include the largest bucket #67782

Closed
LeeDr opened this issue Jan 20, 2021 · 5 comments · Fixed by #67839
Assignees: nik9000
Labels: :Analytics/Aggregations, blocker, >bug, >regression, Team:Analytics (meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

LeeDr commented Jan 20, 2021

Elasticsearch version (bin/elasticsearch --version): 8.0.0, 7.12.0, 7.11.0

Plugins installed: [] none, default distribution

JVM version (java -version): built-in JDK

OS version (uname -a if on a Unix-like system): all (this is my current master source running but this impacts 7.x and 7.11 branches as well)

{
  "name" : "LEEDR-XPS",
  "cluster_name" : "es-test-cluster",
  "cluster_uuid" : "s-GANDlNSZ2nNdr00SQw3g",
  "version" : {
    "number" : "8.0.0-SNAPSHOT",
    "build_flavor" : "oss",
    "build_type" : "zip",
    "build_hash" : "3454a094f73e7696446dbd2c0525041293dd4460",
    "build_date" : "2021-01-19T19:31:16.897887417Z",
    "build_snapshot" : true,
    "lucene_version" : "8.8.0",
    "minimum_wire_compatibility_version" : "7.12.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

Description of the problem including expected versus actual behavior: A terms aggregation ordered by a cardinality sub-aggregation no longer returns the term with the largest distinct count. Results vary based on the terms "size".

Almost 3 years ago we automated the Shakespeare Kibana getting started tutorial https://www.elastic.co/guide/en/kibana/6.8/tutorial-load-dataset.html
The test had been passing with the same expected results until about Oct 29, 2020, when the results returned by the aggregation changed. Unfortunately the test was skipped to allow Kibana to take the new Elasticsearch snapshot, and it wasn't investigated until now.

Steps to reproduce:

Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query, etc. The easier
you make it for us to reproduce, the more likely somebody will take the
time to look at it.

  1. Download this data: https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json
  2. Create this mapping:
PUT /shakespeare
{
 "mappings": {
   "properties": {
     "speaker": {
       "type": "keyword"
     },
     "play_name": {
       "type": "keyword"
     },
     "line_id": {
       "type": "integer"
     },
     "speech_number": {
       "type": "integer"
     }
   }
 }
}
  3. Load the data:
    curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json
  4. Count the docs to make sure we have the same data: curl -XGET 'localhost:9200/shakespeare/_count' returns "count":111396
  5. Run the same query as the Kibana visualization test:
GET /shakespeare/_search
{
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "speaker"
          }
        }
      }
    }
  },
  "size": 0,
  "fields": [],
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "_source": {
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

The results I get on latest master are incorrect:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "2" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 94454,
      "buckets" : [
        {
          "key" : "Henry VI Part 2",
          "doc_count" : 3334,
          "1" : {
            "value" : 65
          }
        },
        {
          "key" : "Coriolanus",
          "doc_count" : 3992,
          "1" : {
            "value" : 62
          }
        },
        {
          "key" : "Antony and Cleopatra",
          "doc_count" : 3862,
          "1" : {
            "value" : 55
          }
        },
        {
          "key" : "Henry VI Part 1",
          "doc_count" : 2983,
          "1" : {
            "value" : 53
          }
        },
        {
          "key" : "Julius Caesar",
          "doc_count" : 2771,
          "1" : {
            "value" : 51
          }
        }
      ]
    }
  }
}

If we increase the terms agg size to 12, we get results that include the largest bucket value of 71, which is what the Kibana test has expected since it was written almost 3 years ago and is what 7.10 returns:

      "buckets" : [
        {
          "key" : "Richard III",
          "doc_count" : 3941,
          "1" : {
            "value" : 71
          }
        },
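The size-12 variant can be replayed as a self-contained snippet. This is a sketch that assumes a local cluster at localhost:9200; the curl call is left commented out so the request body can be checked offline first:

```shell
# Same query as the report, but with the terms "size" raised to 12 so the
# agg keeps enough candidate buckets to surface "Richard III" (71 distinct
# speakers) at the top.
BODY='{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": { "1": "desc" },
        "size": 12
      },
      "aggs": {
        "1": { "cardinality": { "field": "speaker" } }
      }
    }
  }
}'

# Validate the request body locally before sending it.
printf '%s' "$BODY" | python3 -m json.tool > /dev/null && echo "valid query"

# Against a live cluster (uncomment to run):
# curl -s -XPOST -H "Content-Type: application/json" \
#   "localhost:9200/shakespeare/_search?pretty" -d"$BODY"
```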

Provide logs (if relevant):

LeeDr added the >bug, >regression, :Analytics/Aggregations, Team:Analytics, and needs:triage (requires assignment of a team area label) labels Jan 20, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 self-assigned this Jan 20, 2021
nik9000 removed the needs:triage label Jan 20, 2021
LeeDr added the blocker label Jan 20, 2021

nik9000 commented Jan 21, 2021

Here is a copy-and-paste-able bash reproduction:

[ ! -f shakespeare_6.0.json ] && wget https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json
curl -XDELETE -uelastic:password localhost:9200/shakespeare?pretty
curl -XPUT -uelastic:password -HContent-Type:application/json localhost:9200/shakespeare?pretty -d'{
  "mappings": {
    "properties": {
      "speaker": {
        "type": "keyword"
      },
      "play_name": {
        "type": "keyword"
      },
      "line_id": {
        "type": "integer"
      },
      "speech_number": {
        "type": "integer"
      }
    }
  }
}'

curl -H 'Content-Type: application/x-ndjson' -XPOST -uelastic:password 'localhost:9200/shakespeare/_bulk?pretty&refresh' --data-binary @shakespeare_6.0.json


curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/shakespeare/_search?pretty -d'{
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "speaker"
          }
        }
      }
    }
  },
  "size": 0
}'

For those following along at home, the test finds the play with the most distinct speakers in Shakespeare's corpus.
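To eyeball just the (play, distinct-speaker-count) pairs from a search response, the aggregation buckets can be pulled out with a few lines of Python from the shell. A sketch, with an abridged sample of the response shape from this thread inlined so it runs without a cluster:

```shell
# Abridged sample of the search response shape from this thread.
RESPONSE='{"aggregations":{"2":{"buckets":[
  {"key":"Richard III","doc_count":3941,"1":{"value":71}},
  {"key":"Henry VI Part 2","doc_count":3334,"1":{"value":65}}]}}}'

# Print each play with its distinct-speaker count.
# prints "Richard III: 71" then "Henry VI Part 2: 65"
printf '%s' "$RESPONSE" | python3 -c '
import json, sys
for b in json.load(sys.stdin)["aggregations"]["2"]["buckets"]:
    print(b["key"], b["1"]["value"], sep=": ")
'
```

Against a live cluster, the same extraction works on the output of the search curl above, piped in place of `$RESPONSE`.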


nik9000 commented Jan 21, 2021

For what it's worth, you could sometimes see this fail even without a bug if documents wind up in the wrong spot. But on a single-shard index it should always pass, because cardinality is only an estimate when you have more than a bunch of distinct values. There is a bug and I'm tracking it down now.
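The "only an estimate" point can be probed with the cardinality agg's `precision_threshold` option: below the threshold (default 3000, max 40000), counts are expected to be close to exact. A sketch assuming a local cluster at localhost:9200, with the curl call commented out so the body can be checked offline:

```shell
# Cardinality is a HyperLogLog++ estimate; precision_threshold sets the
# count below which results are expected to be close to exact. The
# Shakespeare data has far fewer than 3000 distinct speakers per play, so
# the per-bucket counts here should be exact on a single shard.
BODY='{
  "size": 0,
  "aggs": {
    "speakers": {
      "cardinality": {
        "field": "speaker",
        "precision_threshold": 3000
      }
    }
  }
}'

# Validate the request body locally before sending it.
printf '%s' "$BODY" | python3 -m json.tool > /dev/null && echo "valid query"

# Against a live cluster (uncomment to run):
# curl -s -XPOST -H "Content-Type: application/json" \
#   "localhost:9200/shakespeare/_search?pretty" -d"$BODY"
```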

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 21, 2021
The cardinality agg delays calculating stuff until just before it is
needed. Before elastic#64016 it used the `postCollect` phase to do this work
which was perfect for the `terms` agg but we decided that `postCollect`
was dangerous because some aggs, notably the `parent` and `child` aggs
need to know which children to build and they *can't* during
`postCollect`. After elastic#64016 we built the cardinality agg results when we
built the buckets. But if you sort on the cardinality agg then you
need to do the `postCollect` stuff in order to know which buckets
to build! So you have a chicken and egg problem. Sort of.

This change splits the difference by running the delayed cardinality agg
stuff as soon as you *either* try to build the buckets *or* read the
cardinality for use with sorting. This works, but is a little janky and
feels wrong. It feels like we could make a structural fix to the way we
read metric values from aggs before building the buckets that would make
this sort of bug much more difficult to cause. But any sort of solution
to this is a larger structural change. So this fixes the bug in the
quick and janky way and we hope to do a more structural fix to the way
we read metrics soon.

Closes elastic#67782

LeeDr commented Jan 21, 2021

Reference to the Kibana issue: elastic/kibana#82206


nik9000 commented Jan 21, 2021

#67839 produces the right results:

        {
          "key" : "Richard III",
          "doc_count" : 3941,
          "1" : {
            "value" : 71
          }
        },
        {
          "key" : "Henry VI Part 2",
          "doc_count" : 3334,
          "1" : {
            "value" : 65
          }
        },
        {
          "key" : "Coriolanus",
          "doc_count" : 3992,
          "1" : {
            "value" : 62
          }
        },

nik9000 added a commit that referenced this issue Jan 26, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 26, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 26, 2021
nik9000 added a commit that referenced this issue Jan 26, 2021
nik9000 added a commit that referenced this issue Jan 26, 2021