Select global top N records with grouping by multiple fields #121

ylwu-amzn · 2021-06-10T23:21:55Z

Is your feature request related to a problem? Please describe.
We need to query top N records by grouping by multiple fields and sorting by doc count. Checked the explained DSL, seems no sort logic. Does that mean SQL plugin will take in the query result and sort ? In this way, the sorted result may be not globally top records? For example, we have 10K documents. The query returns 1K documents by default, then SQL plugin just sorts these 1K documents without checking the other 9K documents.

My testing query

POST _opendistro/_sql/_explain
{
  "query": "SELECT count(value) as c FROM test_data2 group by field1, field2 order by c"
}

# Response
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[c]"
    },
    "children": [
      {
        "name": "SortOperator",
        "description": {
          "sortList": {
            "count(value)": {
              "sortOrder": "ASC",
              "nullOrder": "NULL_FIRST"
            }
          }
        },
        "children": [
          {
            "name": "ElasticsearchIndexScan",
            "description": {
              "request": """ElasticsearchQueryRequest(indexName=test_data2, sourceBuilder={"from":0,"size":0,"timeout":"1m","aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"field1":{"terms":{"field":"field1","missing_bucket":true,"order":"asc"}}},{"field2":{"terms":{"field":"field2","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"count(value)":{"value_count":{"field":"value"}}}}}}, searchDone=false)"""
            },
            "children": []
          }
        ]
      }
    ]
  }
}

The text was updated successfully, but these errors were encountered:

chloe-zh · 2021-06-10T23:40:15Z

Hi @ylwu-amzn I tried out a similar query with your requirements, and the sort operator is optimized by pushing it down to the OpenSearch engine, which means it will sort on all 10k documents down into the OS engine before return to the plugin and should be ok for your use case.

POST _plugins/_sql/_explain
{
  "query": "SELECT count(*) as c from opensearch_dashboards_sample_data_flights group by Origin, Dest order by AvgTicketPrice"
}

Response:
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[c]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":0,"timeout":"1m","aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"Origin":{"terms":{"field":"Origin","missing_bucket":true,"order":"asc"}}},{"Dest":{"terms":{"field":"Dest","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"count(*)":{"value_count":{"field":"_index"}}}}}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

As for the case you mentioned where the sort operator is not pushed down and the sorting operation is done in the plugin, it happens when the it fails to proceed the optimization, for example when the expression to sort on is a complicated script or function rather than a simple field:

select value from table order by abs(fieldA + fieldB) + C

Then the sort is failed to push down, and will be done in the plugin runtime rather than in OS engine. If the users do want to run query like this and need a accurate result to sort all the documents (rather than fetch part of the index like 1k then sort locally), the users need to update the size limit to a larger number of the index size, and sometimes need to config the search size limit of the OS engine. But the tradeoff is very obvious that it leads to bad performance since all the computation is done in local JVM of the plugin.
This is a limitation of our query engine, also documented here: https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-push-down

ylwu-amzn · 2021-06-11T18:24:06Z

hi, @chloe-zh , thanks for your answer. From this doc https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-merge-into-opensearch-aggregation,

Because the OpenSearch Composite Aggregation doesn’t support order by metrics field, then if the sort list include fields which refer to metrics aggregation, then the sort operator can’t be push down

In our case, we use composite aggregation, so SQL plugin can't push the sort down. As SQL plugin just sort on the queried result from composite aggregation, the sorted result is not globally true. It will be great if SQL can support global sort on composite aggregation.

dai-chen · 2021-06-11T20:31:56Z

hi, @chloe-zh , thanks for your answer. From this doc https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-merge-into-opensearch-aggregation,

Because the OpenSearch Composite Aggregation doesn’t support order by metrics field, then if the sort list include fields which refer to metrics aggregation, then the sort operator can’t be push down

In our case, we use composite aggregation, so SQL plugin can't push the sort down. As SQL plugin just sort on the queried result from composite aggregation, the sorted result is not globally true. It will be great if SQL can support global sort on composite aggregation.

Sure, we will evaluate all the limitations of current aggregation support and see how we can improve. Thanks!

ylwu-amzn added the enhancement New feature or request label Jun 10, 2021

dai-chen added the SQL label Jun 10, 2021

dai-chen mentioned this issue Jun 11, 2021

Aggregation support improvement #124

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Select global top N records with grouping by multiple fields #121

Select global top N records with grouping by multiple fields #121

ylwu-amzn commented Jun 10, 2021

chloe-zh commented Jun 10, 2021 •

edited

Loading

ylwu-amzn commented Jun 11, 2021

dai-chen commented Jun 11, 2021

Select global top N records with grouping by multiple fields #121

Select global top N records with grouping by multiple fields #121

Comments

ylwu-amzn commented Jun 10, 2021

chloe-zh commented Jun 10, 2021 • edited Loading

ylwu-amzn commented Jun 11, 2021

dai-chen commented Jun 11, 2021

chloe-zh commented Jun 10, 2021 •

edited

Loading