Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discuss] Add struct Query for datafusion #2181

Closed
Tracked by #474 ...
jackwener opened this issue Apr 8, 2022 · 22 comments
Closed
Tracked by #474 ...

[Discuss] Add struct Query for datafusion #2181

jackwener opened this issue Apr 8, 2022 · 22 comments
Assignees
Labels
enhancement New feature or request

Comments

@jackwener
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm try to implement InSubquery.

I found that there isn't Query struct in datafusion. I think that it's because there isn't subquery handle in datafusion.

In the roadmap, datafusion include subquery implementation. I think Query struct is necessary just like datafusion-expr.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

@jackwener jackwener added the enhancement New feature or request label Apr 8, 2022
@jackwener
Copy link
Member Author

@alamb @yahoNanJing @yjshen @mingmwang @tustvold How do you think about it?

@jackwener
Copy link
Member Author

I guess that reason why use datafusion-expr instead of sqlparser is sqlparser need to adapt to various sql types and a little complex.

Maybe we also need to use customized Query and other.

@alamb
Copy link
Contributor

alamb commented Apr 8, 2022

@jackwener I am not sure about the need for Query -- if there is one, perhaps we can combine it with what @andygrove is working on in #2172

I was imagining that a query like this

SELECT f.x, sq.a FROM foo as f JOIN (select a from bar) as sq ON (f.x = sq.a),

results in a logical plan like this:

Projection(f.x, sq.a)
  --> Join(Inner f.x = sq.a)
    --> TableScan(foo)
    --> Projection(a)  <--- Here is where the "subquery"'s LogicalPlan is attached 
      --> TableScan(bar)

Similarly for

SELECT f.x  FROM foo as f WHERE f.x IN  (select a from bar) as sq;

results in a logical plan like this:

Projection(f.x, sq.a)
  --> Join(Semi f.x = sq.a)
    --> TableScan(foo)
    --> Projection(a)  <--- Here is where the "subquery"'s LogicalPlan is attached 
      --> TableScan(bar)

@alamb
Copy link
Contributor

alamb commented Apr 8, 2022

See #1209 for some more thinking on this matter

@xudong963
Copy link
Member

xudong963 commented Apr 9, 2022

Here is a framework #1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as in, scalar subquery, and this framework takes decorrelated into account at the beginning.

Currently this ticket encounters some problems because datafusion currently records columns by name, not by index. So some work need to do before continuing.

Since I've been busy with changing jobs and adjusting to a new life, I haven't had much time to move forward with this ticket, so it's a bit delayed :(

@jackwener
Copy link
Member Author

Here is a framework #1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as in, scalar subquery, and this framework takes decorrelated into account at the beginning.

I also see this PR. I think it's not good choice to embed a logicplan to Expr.

@xudong963
Copy link
Member

xudong963 commented Apr 9, 2022

I think it's not good choice to embed a logicplan to Expr.

Why

@mingmwang
Copy link
Contributor

Here is a framework #1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as in, scalar subquery, and this framework takes decorrelated into account at the beginning.

I also see this PR. I think it's not good choice to embed a logicplan to Expr.

I also have the same doubt.

@alamb
Copy link
Contributor

alamb commented Apr 12, 2022

There are two kinds of subqueries

  • Uncorrelated (aka subqueries that appear in the from clause that don't have references to the outer query)
  • Correlated (subqueries that appear in the from clause or IN / NOT IN /etc and that refer to the outer query)

For correlated queries that that refer to the outer query using = predicates they can be transformed into SEMI-JOIN / ANTI-JOIN typically

Correlated queries that use something other than = (non equijoins) eventually have to be run once per each row in the outer query and would likely be expressed as a LogicalPlan in an Expr.

that being said, I think we can get very far in DataFusion (maybe forever) by handling only uncorrelated and equi-join correlated subqueries and avoid LogicalPlan in Expr

@jackwener
Copy link
Member Author

jackwener commented Apr 12, 2022

Correlated queries that use something other than = (non equijoins) eventually have to be run once per each row in the outer query and would likely be expressed as a LogicalPlan in an Expr.

Maybe, we can use Query in an Expr (this is why I bring up this discussion).

Because the process is

| AST and binder --> Logical plan --> Execute plan |

Query is in AST and binder and will be transferred to LogicalPlan.

@xudong963
Copy link
Member

xudong963 commented Apr 12, 2022

Subquery processing is a complex matter, and I highly recommend that we figure out how to do it at the beginning (how to handle correlated and non-correlated), otherwise there will be a lot of bugs, and hard to cover areas later.

I recommend reading this article by materialize: https://materialize.com/decorrelation-subquery-optimization/, which mentions three articles, especially the tum one, which are very informative.

My previous framework was implemented according to this, and I will close it first. I'll continue when I'm free later, if others haven't done complete support :).

If anyone else has a better idea, feel free to ping me for a review!

@andygrove
Copy link
Member

andygrove commented Apr 14, 2022

I just wanted to add a note here to say that I have dedicated time now to help with this effort. I think we can break this work down into phases and the logical starting point (pun intended) would be to decide on how we want to represent the various types of subquery in the LogicalPlan (unoptimized). Once we have that we can create separate issues and PRs for SQL query planning, DataFrame API, logical plan optimizations, and implementing the physical plans.

I am going to start by reviewing Apache Spark's logical plan for subqueries and then will put a proposal up. I'm not sure if this will be in the form of a PR or a Google document yet but I will share the proposal here, probably early next week.

@alamb
Copy link
Contributor

alamb commented Apr 15, 2022

See dask-contrib/dask-sql#474 and its subtasks such as #2238

@andygrove
Copy link
Member

So I started prototyping and ran into an issue right away. As others have pointed out, we sometimes need to have an Expr that refers to a LogicalPlan, such as:

enum Expr {
  Exists {
    subquery: Box<LogicalPlan>
  }
  ...
}

However, the LogicalPlan enum is in the datafusion crate which depends on the datafusion-expr crate so I cannot reference LogicalPlan from Expr.

I was not paying much attention when this refactoring happened. Does it make sense to now move LogicalPlan as well to datafusion-expr crate?

@andygrove
Copy link
Member

Perhaps @jimexist can offer an opinion on this.

@andygrove
Copy link
Member

andygrove commented Apr 15, 2022

that being said, I think we can get very far in DataFusion (maybe forever) by handling only uncorrelated and equi-join correlated subqueries and avoid LogicalPlan in Expr

For dask-sql, I have a requirement to support a popular decision support SQL benchmark that has correlated subqueries that are not all equi-join. Even if we were just looking to support equi-join, I would expect to build a simple logical plan first and then have an optimizer rule re-write the plan to a SEMI-JOIN or ANTI-JOIN. I don't think it would be feasible to try and do this while translating a SQL AST to a logical plan before the entire query is built.

@andygrove
Copy link
Member

@andygrove andygrove mentioned this issue Apr 16, 2022
15 tasks
@alamb
Copy link
Contributor

alamb commented Apr 18, 2022

I have a requirement to support a popular decision support SQL benchmark that has correlated subqueries that are not all equi-join.

What is the query? Does it have equi-join correlation as well as some non equijoins?

@paveltiunov
Copy link
Contributor

@andygrove @alamb I guess great timing! We just implemented the correlated subquery here cube-js@d14f0de. If you're open I can bring it as a PR and would love to shape it to a more canonical structure. There're multiple caveats some of which are covered in the design document like circular dependency between Expr and LogicalPlan. We'd love to see correlated subquery support merged as we don't have incentives to support such a big change in our fork.

@alamb
Copy link
Contributor

alamb commented Apr 19, 2022

I plan to read the design document carefully tomorrow morning and hopefully then I can have a useful conversation about this topic. Thank you @andygrove for driving it

@andygrove
Copy link
Member

@andygrove @alamb I guess great timing! We just implemented the correlated subquery here cube-js@d14f0de. If you're open I can bring it as a PR and would love to shape it to a more canonical structure. There're multiple caveats some of which are covered in the design document like circular dependency between Expr and LogicalPlan. We'd love to see correlated subquery support merged as we don't have incentives to support such a big change in our fork.

Hi @paveltiunov Yes I would love to see a PR to add subquery support. The circular dependency issue is resolved in #2294 so that should help. I will start reviewing your implementation now to get up to speed.

@andygrove
Copy link
Member

I think this issue can be closed now that we have the following logical plan structures:

  • Expr::Exists
  • Expr::InSubquery
  • Expr::ScalarSubquery
  • LogicalPlan::Subquery
  • LogicalPlan::SubqueryAlias

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants