bucket_sort aggregation misplace the first bucket #36322

odespesse · 2018-12-06T18:12:18Z

Elasticsearch version 6.5.1

Plugins installed: []

JVM version JVM 1.8.0_192

OS version Debian 8.11

Description of the problem including expected versus actual behavior:

Actual behavior:
When sorting an aggregation with a bucket_sort based on _count, if the doc_count values are equal, the item that should be in first position ends up last; the other items are in the right order.
Expected:
Every item should be in the right order, including the first one.

Steps to reproduce:

Create a basic index :

curl -XPUT 'http://localhost:9200/messages' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas" : 0
  },
  "mappings": {
    "user": {
      "properties": {
        "rank": {
          "type": "keyword"
        }
      }
    }
  }
}'

Insert one user with a different rank each time (from a to d) :

user with rank a

curl -XPUT 'http://localhost:9200/messages/user/001' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "a"
}'

user with rank b

curl -XPUT 'http://localhost:9200/messages/user/002' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "b"
}'

user with rank c

curl -XPUT 'http://localhost:9200/messages/user/003' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "c"
}'

user with rank d

curl -XPUT 'http://localhost:9200/messages/user/004' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "d"
}'

Aggregate on the rank property and sort by count with a bucket_sort :

curl -XGET 'http://localhost:9200/messages/_search?pretty=true' -H 'Content-Type: application/json' -d '
    {
      "size": 0,
      "aggregations": {
        "top_by_color": {
          "composite": {
            "size": 1000,
            "sources": [
              {
                "rank": {
                  "terms": {
                    "field": "rank",
                    "missing_bucket": true,
                    "order": "asc"
                  }
                }
              }
            ]
          },
          "aggregations": {
            "top_bucket_sort": {
              "bucket_sort": {
                "sort": [
                  {
                    "_count": {
                      "order": "asc"
                    }
                  }
                ],
                "size": 1000
              }
            }
          }
        }
      }
    }'

Aggregation result is :

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0,
        "hits": [

        ]
      },
      "aggregations": {
        "top_by_color": {
          "after_key": {
            "rank": "d"
          },
          "buckets": [
            {
              "key": {
                "rank": "b"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "c"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "d"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "a"
              },
              "doc_count": 1
            }
          ]
        }
      }
    }

The ranks are sorted as b, c, d, a instead of a, b, c, d.


elasticmachine · 2018-12-07T07:39:16Z

Pinging @elastic/es-analytics-geo

jimczi · 2018-12-07T07:46:52Z

The sort order is based on the count, and there is no guarantee that equal elements will not be reordered as a result of the sort. We could switch to a stable sort or change the documentation to explain how the sort works, but why are you using the composite aggregation and the bucket sort instead of a terms aggregation sorted by _count? In the latter case you could add a tiebreaker based on the _key to your sort, to make sure that buckets with equal counts are sorted by the value of the term.
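To illustrate the tiebreaker idea outside of Elasticsearch, here is a minimal sketch in plain Java. The `Bucket` class and the `TieBreakerDemo` name are hypothetical stand-ins, not Elasticsearch types; the point is only that a secondary comparison on the key makes the order of equal counts deterministic, whatever sorting algorithm is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TieBreakerDemo {
    // Hypothetical stand-in for an aggregation bucket; not an Elasticsearch class.
    static final class Bucket {
        final String key;
        final long docCount;

        Bucket(String key, long docCount) {
            this.key = key;
            this.docCount = docCount;
        }
    }

    // Primary order: doc count ascending. Tiebreaker: key ascending.
    static final Comparator<Bucket> BY_COUNT_THEN_KEY =
        Comparator.comparingLong((Bucket b) -> b.docCount).thenComparing(b -> b.key);

    // Returns the bucket keys in sorted order without modifying the input list.
    static List<String> sortedKeys(List<Bucket> buckets) {
        List<Bucket> copy = new ArrayList<>(buckets);
        copy.sort(BY_COUNT_THEN_KEY);
        List<String> keys = new ArrayList<>();
        for (Bucket b : copy) {
            keys.add(b.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        // All counts are equal, so only the key tiebreaker decides the order.
        List<Bucket> shuffled = Arrays.asList(
            new Bucket("d", 1), new Bucket("b", 1), new Bucket("a", 1), new Bucket("c", 1));
        System.out.println(sortedKeys(shuffled)); // prints [a, b, c, d]
    }
}
```

With an explicit tiebreaker, the result no longer depends on the order in which the buckets arrive or on whether the sort is stable.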

odespesse · 2018-12-07T12:50:35Z

This is a simplified version of my real use case.

I need to aggregate 3 different fields together. I tried with terms but the result of 3 nested aggregations is a pain to parse and the composite aggregation seems to be the best solution for this scenario.

What seems strange to me is that in every attempt I made, everything is sorted as expected except the first item, which is misplaced at the end. It is as if the internal Elasticsearch "loop" that sorts the buckets does the right thing for every item except the first.

jimczi · 2018-12-07T20:00:12Z

> It is as if the internal Elasticsearch "loop" that sorts the buckets does the right thing for every item except the first.

I didn't look carefully at the implementation of the priority queue we use to perform the sort, but as I said in my previous comment, the output of the sort is correct. Though I wonder why we use a priority queue rather than a list and then Collections#sort, which provides a stable sort.
@dimitris-athanasiou any reason to prefer a priority queue here? I understand that it can be faster if the buckets are truncated, but that seems like an edge case?
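The difference between the two approaches can be sketched in plain Java. This is an illustration under assumptions, not the Elasticsearch code: it uses `java.util.PriorityQueue` as a stand-in (not necessarily the queue Elasticsearch uses internally), and `Bucket`, `sample`, and `QueueVersusStableSort` are hypothetical names. A binary heap makes no promise about the relative order of equal elements, whereas `Collections.sort` is documented as stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class QueueVersusStableSort {
    // Hypothetical stand-in for an aggregation bucket; not an Elasticsearch class.
    static final class Bucket {
        final String key;
        final long docCount;

        Bucket(String key, long docCount) {
            this.key = key;
            this.docCount = docCount;
        }
    }

    // Compares by doc count only, so every bucket in sample() is a tie.
    static final Comparator<Bucket> BY_COUNT =
        Comparator.comparingLong((Bucket b) -> b.docCount);

    // Four buckets with equal counts, arriving in key order a, b, c, d.
    static List<Bucket> sample() {
        return Arrays.asList(
            new Bucket("a", 1), new Bucket("b", 1), new Bucket("c", 1), new Bucket("d", 1));
    }

    // Drains a heap-based queue; ties may come out in any order.
    static List<String> viaPriorityQueue(List<Bucket> buckets) {
        PriorityQueue<Bucket> queue = new PriorityQueue<>(BY_COUNT);
        queue.addAll(buckets);
        List<String> keys = new ArrayList<>();
        while (!queue.isEmpty()) {
            keys.add(queue.poll().key);
        }
        return keys;
    }

    // Stable sort: ties keep the order they had in the input list.
    static List<String> viaStableSort(List<Bucket> buckets) {
        List<Bucket> copy = new ArrayList<>(buckets);
        Collections.sort(copy, BY_COUNT);
        List<String> keys = new ArrayList<>();
        for (Bucket b : copy) {
            keys.add(b.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println("priority queue: " + viaPriorityQueue(sample()));
        System.out.println("stable sort:    " + viaStableSort(sample()));
    }
}
```

Both results are valid orderings by count; only the stable sort is guaranteed to keep the a, b, c, d input order for equal counts.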

dimitris-athanasiou · 2018-12-07T20:44:39Z

@jimczi Indeed, the reason a priority queue was used was because it seemed suitable for also dealing with trimmed buckets in an efficient way. I vaguely recall discussing this and deciding it was a desirable trade-off against sort stability.

jimczi · 2018-12-10T16:52:19Z

We discussed this offline and we agreed that we should switch to a List and use Collections#sort. This should preserve the order for equal values and simplify the code since we don't really need a priority queue.
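A rough sketch of what the list-based approach could look like, including the truncation that originally motivated the priority queue. This is not the code from BucketSortPipelineAggregator; `Bucket`, `sortAndTruncate`, and the other names are hypothetical, and the `from` and `size` arguments are only modeled on the bucket_sort options of the same name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StableBucketSort {
    // Hypothetical stand-in for an aggregation bucket; not an Elasticsearch class.
    static final class Bucket {
        final String key;
        final long docCount;

        Bucket(String key, long docCount) {
            this.key = key;
            this.docCount = docCount;
        }
    }

    static final Comparator<Bucket> BY_COUNT =
        Comparator.comparingLong((Bucket b) -> b.docCount);

    // Four buckets with equal counts, arriving in key order a, b, c, d.
    static List<Bucket> sample() {
        return Arrays.asList(
            new Bucket("a", 1), new Bucket("b", 1), new Bucket("c", 1), new Bucket("d", 1));
    }

    // Sorts a copy of the buckets with a stable sort, then keeps the window
    // [from, from + size). Equal buckets keep their incoming order.
    static List<Bucket> sortAndTruncate(
            List<Bucket> buckets, Comparator<Bucket> order, int from, int size) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        Collections.sort(sorted, order);
        int start = Math.min(Math.max(from, 0), sorted.size());
        // Widen to long so that a very large size cannot overflow the int sum.
        int end = (int) Math.min((long) start + Math.max(size, 0), sorted.size());
        return new ArrayList<>(sorted.subList(start, end));
    }

    static List<String> keys(List<Bucket> buckets) {
        List<String> keys = new ArrayList<>();
        for (Bucket b : buckets) {
            keys.add(b.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Same window as the request in this issue: everything, up to 1000 buckets.
        System.out.println(keys(sortAndTruncate(sample(), BY_COUNT, 0, 1000))); // prints [a, b, c, d]
        // A truncated window over the same stable order.
        System.out.println(keys(sortAndTruncate(sample(), BY_COUNT, 1, 2))); // prints [b, c]
    }
}
```

Sorting the whole list and then slicing it costs more than a bounded queue when only a few buckets are kept, which is the trade-off discussed above; in exchange, the code is simpler and ties stay in their original order.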

govi20 · 2018-12-11T17:21:40Z

I would like to work on it.

SivagurunathanV · 2018-12-18T01:55:28Z

I would like to work on this issue if @govi20 is not working on it.

polyfractal · 2018-12-18T03:20:26Z

Looks like someone just sent a PR for this (#36748)

If you're still interested in contributing, you can generally just leave a note that says "I'm going to start working on this" then raise a PR when you are ready. No need to get approval first.

Thanks for helping out! We have more "help wanted" and "good first issue" tickets if you want to look around for other bugs that need fixing :)

Razi007 · 2018-12-18T11:54:31Z

Can we still work on an issue or raise a PR if there is already a PR raised?

polyfractal · 2018-12-18T12:30:11Z

I would avoid working on issues that have an active PR. Sometimes a PR might go dormant (the contributor gets busy and abandons the PR, technical issues cause it to be closed, etc.), in which case we'll close the PR and the issue can be worked on again.

But if the PR is active and being worked on, it's probably best to choose a different issue to work on.

…aggregator (#36748) Update BucketSortPipelineAggregator to use a List and Collections.sort() for sorting instead of a priority queue. This preserves the order for equal values. Closes #36322.

jimczi added the :Analytics/Aggregations Aggregations label Dec 7, 2018

jimczi added the >feature label Dec 7, 2018

jimczi added good first issue low hanging fruit help wanted adoptme labels Dec 10, 2018

chatzikalymnios mentioned this issue Dec 18, 2018

Use List instead of priority queue for stable sorting in bucket sort aggregator #36748

Merged

dimitris-athanasiou closed this as completed in #36748 Jan 9, 2019
