Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select global top N records with grouping by multiple fields #121

Open
ylwu-amzn opened this issue Jun 10, 2021 · 3 comments
Open

Select global top N records with grouping by multiple fields #121

ylwu-amzn opened this issue Jun 10, 2021 · 3 comments
Labels
enhancement New feature or request SQL

Comments

@ylwu-amzn
Copy link
Contributor

Is your feature request related to a problem? Please describe.
We need to query top N records by grouping by multiple fields and sorting by doc count. Checked the explained DSL, seems no sort logic. Does that mean SQL plugin will take in the query result and sort ? In this way, the sorted result may be not globally top records? For example, we have 10K documents. The query returns 1K documents by default, then SQL plugin just sorts these 1K documents without checking the other 9K documents.

My testing query

POST _opendistro/_sql/_explain
{
  "query": "SELECT count(value) as c FROM test_data2 group by field1, field2 order by c"
}

# Response
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[c]"
    },
    "children": [
      {
        "name": "SortOperator",
        "description": {
          "sortList": {
            "count(value)": {
              "sortOrder": "ASC",
              "nullOrder": "NULL_FIRST"
            }
          }
        },
        "children": [
          {
            "name": "ElasticsearchIndexScan",
            "description": {
              "request": """ElasticsearchQueryRequest(indexName=test_data2, sourceBuilder={"from":0,"size":0,"timeout":"1m","aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"field1":{"terms":{"field":"field1","missing_bucket":true,"order":"asc"}}},{"field2":{"terms":{"field":"field2","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"count(value)":{"value_count":{"field":"value"}}}}}}, searchDone=false)"""
            },
            "children": []
          }
        ]
      }
    ]
  }
}
@ylwu-amzn ylwu-amzn added the enhancement New feature or request label Jun 10, 2021
@dai-chen dai-chen added the SQL label Jun 10, 2021
@chloe-zh
Copy link
Contributor

chloe-zh commented Jun 10, 2021

Hi @ylwu-amzn I tried out a similar query with your requirements, and the sort operator is optimized by pushing it down to the OpenSearch engine, which means it will sort on all 10k documents down into the OS engine before return to the plugin and should be ok for your use case.

POST _plugins/_sql/_explain
{
  "query": "SELECT count(*) as c from opensearch_dashboards_sample_data_flights group by Origin, Dest order by AvgTicketPrice"
}

Response:
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[c]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":0,"timeout":"1m","aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"Origin":{"terms":{"field":"Origin","missing_bucket":true,"order":"asc"}}},{"Dest":{"terms":{"field":"Dest","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"count(*)":{"value_count":{"field":"_index"}}}}}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

As for the case you mentioned where the sort operator is not pushed down and the sorting operation is done in the plugin, it happens when the it fails to proceed the optimization, for example when the expression to sort on is a complicated script or function rather than a simple field:

select value from table order by abs(fieldA + fieldB) + C

Then the sort is failed to push down, and will be done in the plugin runtime rather than in OS engine. If the users do want to run query like this and need a accurate result to sort all the documents (rather than fetch part of the index like 1k then sort locally), the users need to update the size limit to a larger number of the index size, and sometimes need to config the search size limit of the OS engine. But the tradeoff is very obvious that it leads to bad performance since all the computation is done in local JVM of the plugin.
This is a limitation of our query engine, also documented here: https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-push-down

@ylwu-amzn
Copy link
Contributor Author

hi, @chloe-zh , thanks for your answer. From this doc https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-merge-into-opensearch-aggregation,

Because the OpenSearch Composite Aggregation doesn’t support order by metrics field, then if the sort list include fields which refer to metrics aggregation, then the sort operator can’t be push down

In our case, we use composite aggregation, so SQL plugin can't push the sort down. As SQL plugin just sort on the queried result from composite aggregation, the sorted result is not globally true. It will be great if SQL can support global sort on composite aggregation.

@dai-chen
Copy link
Collaborator

hi, @chloe-zh , thanks for your answer. From this doc https://github.com/opensearch-project/sql/blob/main/docs/user/optimization/optimization.rst#sort-merge-into-opensearch-aggregation,

Because the OpenSearch Composite Aggregation doesn’t support order by metrics field, then if the sort list include fields which refer to metrics aggregation, then the sort operator can’t be push down

In our case, we use composite aggregation, so SQL plugin can't push the sort down. As SQL plugin just sort on the queried result from composite aggregation, the sorted result is not globally true. It will be great if SQL can support global sort on composite aggregation.

Sure, we will evaluate all the limitations of current aggregation support and see how we can improve. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SQL
Projects
None yet
Development

No branches or pull requests

3 participants