Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow individual shards to be targeted during query execution [FEATURE] #1478

Open
akuzin1 opened this issue Mar 27, 2023 · 9 comments · Fixed by #1771 · May be fixed by #2030
Open

Allow individual shards to be targeted during query execution [FEATURE] #1478

akuzin1 opened this issue Mar 27, 2023 · 9 comments · Fixed by #1771 · May be fixed by #2030
Labels
enhancement New feature or request Priority-High

Comments

@akuzin1
Copy link

akuzin1 commented Mar 27, 2023

Is your feature request related to a problem?

In MySQL one can retrieve partition information about a table, which can later be used to target specific partitions during query execution.

The following is an example of a query that can be used to retrieve partition information of a specific table.

"SELECT DISTINCT partition_name FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = <table_name> AND TABLE_SCHEMA = <table_schema> " +
            "AND partition_name IS NOT NULL"

Next is an example of a query targeting a specific partition in a table.

SELECT * FROM table PARTITION (partitionName);

When applying this to opensearch, partitions could be treated as the equivalent of shards for our use case.

What solution would you like?
It would be great to be able to treat shards in Opensearch as the equivalent to MySQL Partitions and be able to query individual shards.

What alternatives have you considered?
We've considered generating splits based on hashing a key or tuple of keys and then modulo that against some fixed number of splits that we want to generate.

Example:
For example for 3 splits:

split 1:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 0

split 2:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 1

split 3:
SELECT *
FROM some_unpartitioned_table
WHERE hash(col1, col2, col3) % 10 == 2

However, it doesn't seem like there is a hashing function like that available.

Therefore, the above mentioned solution would be a great behavior to add for all sql users, to more closely mimic the behavior and syntax of MySQL.

@akuzin1 akuzin1 added enhancement New feature or request untriaged labels Mar 27, 2023
@dai-chen
Copy link
Collaborator

@akuzin1 Thanks for the feature request! This recall an old request previously: opendistro-for-elasticsearch/sql#1151. @acarbonetto is adding OpenSearch meta field support to our SQL engine. I'm thinking can we support this as part of meta field work.

@akuzin1
Copy link
Author

akuzin1 commented Mar 29, 2023

Sounds good, that would be great to see. Is there an estimated timeline for when this feature would be released?
Thank you

@dai-chen
Copy link
Collaborator

@acarbonetto
Copy link
Collaborator

This will be a dependency for #1441
The API on OpenSearch side is _routing which we should consider using instead of _shard (or both!)

@acarbonetto
Copy link
Collaborator

related: #339

@akuzin1
Copy link
Author

akuzin1 commented Jun 9, 2023

Hi! Saw that the label was changed to "Priority-High" which is great to see, so wanted to check in and see if there is an estimated date for delivering of this feature? Thank you.

@acarbonetto
Copy link
Collaborator

Hi! Saw that the label was changed to "Priority-High" which is great to see, so wanted to check in and see if there is an estimated date for delivering of this feature? Thank you.

Target release is 2.10 at the moment. Hoping to get this in a little sooner.

@acarbonetto
Copy link
Collaborator

acarbonetto commented Jun 30, 2023

Proposal for setting routing field on the search request. There's two ways forward that I'd like to propose. But first a little contexts:

Goal

The objective of the partition/routing shard is to include the routing ID in the SearchRequest builder.

Example:

new SearchRequest()
        .indices(indexName.getIndexNames())
        .routing(routingId.getIndexNames())
        .scroll(scrollTimeout)
        .source(sourceBuilder);

Getting the routing id(s) from the initial query into pushdown can take one of two obvious routes.

Proposal 1: Request Parameter ("routing")

Syntax addition, includes a new json parameter only available in the V2 engine.

{
    "query": "select _id, _index, _routing, int0 from calcs_routing limit 5",
    "routing": "FURNITURE"
}

This includes an opensearch-similar API to target individual or lists of shards. The string will accept a comma-separated list of shard ID targets.

PoC Architecture Change

To get the request into OpenSearchRequest as part of pushdown requires that we create an AbstractPlan, LogicalPlan and PhysicalPlan operator like the Paginate and LogicalPaginate operators. This would allow us to pushdown a routingId string into the OpenSearchIndexScanQueryBuilder during execution. We would create Partition operators in a similar manner as the Pagination operators without much business logic. Reference to how paginate works: https://github.com/opensearch-project/sql/blob/main/docs/dev/Pagination-v2.md#unresolved-query-plan

Considerations: we may consider combining Paginate and Partition into a single set of operators, and call them something like PushDownConfiguration and push all the parameters at once. This would allow us to scale by configuration values without needing to add more operators.

Pros

  • OpenSearch-specific query syntax is favourable to OpenSearch users adopting the SQL language, and can use existing documentation to understand the usage of the routing query parameter.
  • No specific change required for PPL vs SQL.

Cons

  • This syntax differs from other SQL syntax and will cause a disruption in the JDBC driver as the option is not built-in for SQL users.
  • Architecture changes are quite disruptive.
  • Is not specifically obvious what the benefit is for other storage options (e.g. Prometheus) since this is an OpenSearch-specific query syntax.
  • Won't work (initially) with the SQL-CLI

Proposal 2: SQL FROM table PARTITION (key, key, ...) syntax

As defined in https://dev.mysql.com/doc/refman/8.0/en/partitioning-selection.html, MySQL (and other SQL engines) allows a query to target a specific partition by id. This also allows for multiple ids in the PARTITION function.

{
    "query": "select _id, _index, _routing, int0 from calcs_routing PARTITON(\"FUNITURE, OFFICE SPACE\") limit 5"
}

PoC Architecture Change

Update the parser syntax to allow for the PARTITION function in the FROM clause.

Ultimately, to get the routing ID(s) into the SearchRequest, we need to add the PARTITON keys to to IndexScan (which extends Table). Then in the IndexScanBuilder, send the routing ID(s) to the SearchRequest.

To accomplish this, we could push the partition keys into the Table (OpenSearchIndexScan) as partition keys during the analysis phase. On pushdown, the partition keys would be part of the IndexScan and could semi-easily push down to the OpenSearchRequest as part of the Index.

OpenSearchIndex already has the concept of an Index, and Routing Keys would have the same architecture:

Pros

  • SQL PARTITION is a MySQL concept that is simple to adopt and expose to users. No specific JDBC change is required.
  • May also be useful for other storage options, or ignored by those engines.

Cons

  • PPL and SQL syntax is different and will require that we add PPL syntax to push the routing ids into the storage engine.
  • Architecture changes are less disruptive but also not very usable.

@macohen
Copy link

macohen commented Feb 8, 2024

Will there be a new projected release target for this feature, @acarbonetto? I see it was targeted for 2.11 and 2.12 is being released soon. Can this make 2.13, perhaps?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment