Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bucket_sort aggregation misplace the first bucket #36322

Closed
odespesse opened this issue Dec 6, 2018 · 11 comments
Closed

bucket_sort aggregation misplace the first bucket #36322

odespesse opened this issue Dec 6, 2018 · 11 comments

Comments

@odespesse
Copy link

Elasticsearch version 6.5.1

Plugins installed: []

JVM version JVM 1.8.0_192

OS version Debian 8.11

Description of the problem including expected versus actual behavior:

  • Actual behavior :
    When sorting an aggregation with a bucket_sort based on its _count, if doc_count are equals the item that should be in first position is the last one, other items are in the right order.
  • Expected :
    Every items should be in the right order, including the first one.

Steps to reproduce:

  1. Create a basic index :
curl -XPUT 'http://localhost:9200/messages' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas" : 0
  },
  "mappings": {
    "user": {
      "properties": {
        "rank": {
          "type": "keyword"
        }
      }
    }
  }
}'
  1. Insert one user with a different rank each time (from a to d) :
  • user with rank a
curl -XPUT 'http://localhost:9200/messages/user/001' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "a"
}'
  • user with rank b
curl -XPUT 'http://localhost:9200/messages/user/002' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "b"
}'
  • user with rank c
curl -XPUT 'http://localhost:9200/messages/user/003' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "c"
}'
  • user with rank d
curl -XPUT 'http://localhost:9200/messages/user/004' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "d"
}'
  1. Aggregate on the rank property and sort by count with a bucket_sort :
curl -XGET 'http://localhost:9200/messages/_search?pretty=true' -H 'Content-Type: application/json' -d '
    {
      "size": 0,
      "aggregations": {
        "top_by_color": {
          "composite": {
            "size": 1000,
            "sources": [
              {
                "rank": {
                  "terms": {
                    "field": "rank",
                    "missing_bucket": true,
                    "order": "asc"
                  }
                }
              }
            ]
          },
          "aggregations": {
            "top_bucket_sort": {
              "bucket_sort": {
                "sort": [
                  {
                    "_count": {
                      "order": "asc"
                    }
                  }
                ],
                "size": 1000
              }
            }
          }
        }
      }
    }'

Aggregation result is :

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0,
        "hits": [

        ]
      },
      "aggregations": {
        "top_by_color": {
          "after_key": {
            "rank": "d"
          },
          "buckets": [
            {
              "key": {
                "rank": "b"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "c"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "d"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "a"
              },
              "doc_count": 1
            }
          ]
        }
      }
    }

We have ranks sorted has : b, c, d, a instead of a, b, c, d.

@jimczi jimczi added the :Analytics/Aggregations Aggregations label Dec 7, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@jimczi
Copy link
Contributor

jimczi commented Dec 7, 2018

The sort order is based on count and there is no guarantee that equals element will not be reordered as a result of the sort. We could change to a stable sort or change the documentation to explain how the sort works but why are you using the composite aggregation and the bucket sort instead of a terms aggregation sorted by _count ? In the latter case you could add a tiebreaker to your sort based on the _key to make sure that equals count are sorted based on the value of the term ?

@odespesse
Copy link
Author

This is a simplified version of my real use case.

I need to aggregate 3 different fields together. I tried with terms but the result of 3 nested aggregations is a pain to parse and the composite aggregation seems to be the best solution for this scenario.

What seems strange to me is that on every attempts I tried, everything is sorted as expected except the first one which is misplaced at the end. It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.

@jimczi
Copy link
Contributor

jimczi commented Dec 7, 2018

It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.

I didn't look carefully the implementation of the priority queue we use to perform the sort but as I said in my previous comment the output of the sort is correct. Though I wonder why we use a priority queue rather than a list and then Collections#sort which provides stable sort.
@dimitris-athanasiou any reason to prefer a priority queue here ? I understand that it can be faster if the buckets are truncated but that seems an edge case ?

@jimczi jimczi added the >feature label Dec 7, 2018
@dimitris-athanasiou
Copy link
Contributor

dimitris-athanasiou commented Dec 7, 2018

@jimczi Indeed, the reason a priority queue was used was because it seemed suitable for also dealing with trimmed buckets in an efficient way. I vaguely recall discussing this and deciding it was a desirable trade-off against sort stability.

@jimczi
Copy link
Contributor

jimczi commented Dec 10, 2018

We discussed this offline and we agreed that we should switch to a List and use Collections#sort. This should preserve the order for equal values and simplify the code since we don't really need a priority queue.

@jimczi jimczi added good first issue low hanging fruit help wanted adoptme labels Dec 10, 2018
@govi20
Copy link

govi20 commented Dec 11, 2018

I would like to work on it.

@SivagurunathanV
Copy link
Contributor

I would like to work on this issue if @govi20 is not working on it.

@polyfractal
Copy link
Contributor

Looks like someone just sent a PR for this (#36748)

If you're still interested in contributing, you can generally just leave a note that says "I'm going to start working on this" then raise a PR when you are ready. No need to get approval first.

Thanks for helping out! We have more "help wanted" and "good first issue" tickets if you want to look around for other bugs that need fixing :)

@Razi007
Copy link

Razi007 commented Dec 18, 2018

can we still work on issue or raise PR if there is already a PR raised.

@polyfractal
Copy link
Contributor

I would avoid working on issues that have an active PR. Sometimes a PR might go dormant (contributor gets busy and abandons the PR, technical issues make it close, etc), in which case we'll close the PR and can be worked on again.

But if the PR is active and being worked on, it's probably best to choose a different issue to work on.

dimitris-athanasiou pushed a commit that referenced this issue Jan 9, 2019
…aggregator (#36748)

Update BucketSortPipelineAggregator to use a List and Collections.sort() for sorting instead of a priority queue. This preserves the order for equal values. Closes #36322.
dimitris-athanasiou pushed a commit that referenced this issue Jan 9, 2019
…aggregator (#36748)

Update BucketSortPipelineAggregator to use a List and Collections.sort() for sorting instead of a priority queue. This preserves the order for equal values. Closes #36322.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants