sorted cardinality results don't include the largest bucket #67782

Closed
LeeDr opened this issue Jan 20, 2021 · 5 comments · Fixed by #67839
Assignees: nik9000
Labels: :Analytics/Aggregations, blocker, >bug, >regression, Team:Analytics (meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

LeeDr commented Jan 20, 2021

Elasticsearch version (bin/elasticsearch --version): 8.0.0, 7.12.0, 7.11.0

Plugins installed: [] none, default distribution

JVM version (java -version): built-in JDK

OS version (uname -a if on a Unix-like system): all (this is my current master source running but this impacts 7.x and 7.11 branches as well)

{
  "name" : "LEEDR-XPS",
  "cluster_name" : "es-test-cluster",
  "cluster_uuid" : "s-GANDlNSZ2nNdr00SQw3g",
  "version" : {
    "number" : "8.0.0-SNAPSHOT",
    "build_flavor" : "oss",
    "build_type" : "zip",
    "build_hash" : "3454a094f73e7696446dbd2c0525041293dd4460",
    "build_date" : "2021-01-19T19:31:16.897887417Z",
    "build_snapshot" : true,
    "lucene_version" : "8.8.0",
    "minimum_wire_compatibility_version" : "7.12.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

Description of the problem including expected versus actual behavior: A terms aggregation ordered by a cardinality sub-aggregation no longer returns the term with the largest distinct count. Results vary based on the terms "size".

Almost 3 years ago we automated the Shakespeare Kibana getting started tutorial https://www.elastic.co/guide/en/kibana/6.8/tutorial-load-dataset.html
The test had been passing with the same expected results until about Oct 29, 2020, when the results returned by the aggregation changed. Unfortunately the test was skipped to allow Kibana to take the new Elasticsearch snapshot, and it wasn't investigated until now.

Steps to reproduce:

Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query, etc. The easier
you make it for us to reproduce, the more likely somebody will take the
time to look at it.

  1. Download this data: https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json
  2. Create this mapping:
PUT /shakespeare
{
 "mappings": {
   "properties": {
     "speaker": {
       "type": "keyword"
     },
     "play_name": {
       "type": "keyword"
     },
     "line_id": {
       "type": "integer"
     },
     "speech_number": {
       "type": "integer"
     }
   }
 }
}
  3. Load the data:
    curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json
  4. Count the docs to make sure we have the same data: curl -XGET 'localhost:9200/shakespeare/_count' returns "count":111396
  5. Run the same query as the Kibana visualization test:
GET /shakespeare/_search
{
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "speaker"
          }
        }
      }
    }
  },
  "size": 0,
  "fields": [],
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "_source": {
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

The results I get on latest master are incorrect:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "2" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 94454,
      "buckets" : [
        {
          "key" : "Henry VI Part 2",
          "doc_count" : 3334,
          "1" : {
            "value" : 65
          }
        },
        {
          "key" : "Coriolanus",
          "doc_count" : 3992,
          "1" : {
            "value" : 62
          }
        },
        {
          "key" : "Antony and Cleopatra",
          "doc_count" : 3862,
          "1" : {
            "value" : 55
          }
        },
        {
          "key" : "Henry VI Part 1",
          "doc_count" : 2983,
          "1" : {
            "value" : 53
          }
        },
        {
          "key" : "Julius Caesar",
          "doc_count" : 2771,
          "1" : {
            "value" : 51
          }
        }
      ]
    }
  }
}

If we increase the terms agg size to 12, we get results that include the largest bucket value of 71, which is what the Kibana test has expected since it was written almost 3 years ago and is what 7.10 returns:

      "buckets" : [
        {
          "key" : "Richard III",
          "doc_count" : 3941,
          "1" : {
            "value" : 71
          }
        },
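The size-12 variant can be replayed as a self-contained snippet. This is a sketch that assumes a local cluster at localhost:9200; the curl call is left commented out so the request body can be checked offline first:

```shell
# Same query as the report, but with the terms "size" raised to 12 so the
# agg keeps enough candidate buckets to surface "Richard III" (71 distinct
# speakers) at the top.
BODY='{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": { "1": "desc" },
        "size": 12
      },
      "aggs": {
        "1": { "cardinality": { "field": "speaker" } }
      }
    }
  }
}'

# Validate the request body locally before sending it.
printf '%s' "$BODY" | python3 -m json.tool > /dev/null && echo "valid query"

# Against a live cluster (uncomment to run):
# curl -s -XPOST -H "Content-Type: application/json" \
#   "localhost:9200/shakespeare/_search?pretty" -d"$BODY"
```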

Provide logs (if relevant):

LeeDr added the >bug, >regression, :Analytics/Aggregations, Team:Analytics, and needs:triage (requires assignment of a team area label) labels Jan 20, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 self-assigned this Jan 20, 2021
nik9000 removed the needs:triage label Jan 20, 2021
LeeDr added the blocker label Jan 20, 2021

nik9000 commented Jan 21, 2021

Here is a copy-and-paste-able bash reproduction:

[ ! -f shakespeare_6.0.json ] && wget https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json
curl -XDELETE -uelastic:password localhost:9200/shakespeare?pretty
curl -XPUT -uelastic:password -HContent-Type:application/json localhost:9200/shakespeare?pretty -d'{
  "mappings": {
    "properties": {
      "speaker": {
        "type": "keyword"
      },
      "play_name": {
        "type": "keyword"
      },
      "line_id": {
        "type": "integer"
      },
      "speech_number": {
        "type": "integer"
      }
    }
  }
}'

curl -H 'Content-Type: application/x-ndjson' -XPOST -uelastic:password 'localhost:9200/shakespeare/_bulk?pretty&refresh' --data-binary @shakespeare_6.0.json


curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/shakespeare/_search?pretty -d'{
  "aggs": {
    "2": {
      "terms": {
        "field": "play_name",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "speaker"
          }
        }
      }
    }
  },
  "size": 0
}'

For those following along at home, the test finds the play with the most distinct speakers in Shakespeare's corpus.
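To eyeball just the (play, distinct-speaker-count) pairs from a search response, the aggregation buckets can be pulled out with a few lines of Python from the shell. A sketch, with an abridged sample of the response shape from this thread inlined so it runs without a cluster:

```shell
# Abridged sample of the search response shape from this thread.
RESPONSE='{"aggregations":{"2":{"buckets":[
  {"key":"Richard III","doc_count":3941,"1":{"value":71}},
  {"key":"Henry VI Part 2","doc_count":3334,"1":{"value":65}}]}}}'

# Print each play with its distinct-speaker count.
# prints "Richard III: 71" then "Henry VI Part 2: 65"
printf '%s' "$RESPONSE" | python3 -c '
import json, sys
for b in json.load(sys.stdin)["aggregations"]["2"]["buckets"]:
    print(b["key"], b["1"]["value"], sep=": ")
'
```

Against a live cluster, the same extraction works on the output of the search curl above, piped in place of `$RESPONSE`.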


nik9000 commented Jan 21, 2021

For what it's worth, you could sometimes see this fail even without a bug if documents wind up in the wrong spot. But on a single-shard index it should always pass, because cardinality is only an estimate when you have more than a bunch of distinct values. There is a bug and I'm tracking it down now.
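The "only an estimate" point can be probed with the cardinality agg's `precision_threshold` option: below the threshold (default 3000, max 40000), counts are expected to be close to exact. A sketch assuming a local cluster at localhost:9200, with the curl call commented out so the body can be checked offline:

```shell
# Cardinality is a HyperLogLog++ estimate; precision_threshold sets the
# count below which results are expected to be close to exact. The
# Shakespeare data has far fewer than 3000 distinct speakers per play, so
# the per-bucket counts here should be exact on a single shard.
BODY='{
  "size": 0,
  "aggs": {
    "speakers": {
      "cardinality": {
        "field": "speaker",
        "precision_threshold": 3000
      }
    }
  }
}'

# Validate the request body locally before sending it.
printf '%s' "$BODY" | python3 -m json.tool > /dev/null && echo "valid query"

# Against a live cluster (uncomment to run):
# curl -s -XPOST -H "Content-Type: application/json" \
#   "localhost:9200/shakespeare/_search?pretty" -d"$BODY"
```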

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 21, 2021
The cardinality agg delays calculating stuff until just before it is
needed. Before elastic#64016 it used the `postCollect` phase to do this work
which was perfect for the `terms` agg but we decided that `postCollect`
was dangerous because some aggs, notably the `parent` and `child` aggs
need to know which children to build and they *can't* during
`postCollect`. After elastic#64016 we built the cardinality agg results when we
built the buckets. But if you sort on the cardinality agg then you
need to do the `postCollect` stuff in order to know which buckets
to build! So you have a chicken and egg problem. Sort of.

This change splits the difference by running the delayed cardinality agg
stuff as soon as you *either* try to build the buckets *or* read the
cardinality for use with sorting. This works, but is a little janky and
feels wrong. It feels like we could make a structural fix to the way we
read metric values from aggs before building the buckets that would make
this sort of bug much more difficult to cause. But any sort of solution
to this is a larger structural change. So this fixes the bug in the
quick and janky way and we hope to do a more structural fix to the way
we read metrics soon.

Closes elastic#67782

LeeDr commented Jan 21, 2021

Reference to the Kibana issue: elastic/kibana#82206


nik9000 commented Jan 21, 2021

#67839 produces the right results:

        {
          "key" : "Richard III",
          "doc_count" : 3941,
          "1" : {
            "value" : 71
          }
        },
        {
          "key" : "Henry VI Part 2",
          "doc_count" : 3334,
          "1" : {
            "value" : 65
          }
        },
        {
          "key" : "Coriolanus",
          "doc_count" : 3992,
          "1" : {
            "value" : 62
          }
        },

nik9000 added a commit that referenced this issue Jan 26, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 26, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 26, 2021
nik9000 added a commit that referenced this issue Jan 26, 2021
nik9000 added a commit that referenced this issue Jan 26, 2021