-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discuss] Add struct Query
for datafusion
#2181
Comments
@alamb @yahoNanJing @yjshen @mingmwang @tustvold How do you think about it? |
I guess that Maybe we also need to use customized |
@jackwener I am not sure about the need for I was imagining that a query like this SELECT f.x, sq.a FROM foo as f JOIN (select a from bar) as sq ON (f.x = sq.a), results in a logical plan like this:
Similarly for SELECT f.x FROM foo as f WHERE f.x IN (select a from bar) as sq; results in a logical plan like this:
|
See #1209 for some more thinking on this matter |
Here is a framework #1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as Currently this ticket encounters some problems because datafusion currently records columns by name, not by index. So some work need to do before continuing. Since I've been busy with changing jobs and adjusting to a new life, I haven't had much time to move forward with this ticket, so it's a bit delayed :( |
I also see this PR. I think it's not good choice to embed a |
Why |
I also have the same doubt. |
There are two kinds of subqueries
For correlated queries that that refer to the outer query using Correlated queries that use something other than that being said, I think we can get very far in DataFusion (maybe forever) by handling only uncorrelated and equi-join correlated subqueries and avoid LogicalPlan in Expr |
Maybe, we can use Because the process is | AST and binder --> Logical plan --> Execute plan | Query is in |
Subquery processing is a complex matter, and I highly recommend that we figure out how to do it at the beginning (how to handle correlated and non-correlated), otherwise there will be a lot of bugs, and hard to cover areas later. I recommend reading this article by materialize: https://materialize.com/decorrelation-subquery-optimization/, which mentions three articles, especially the tum one, which are very informative. My previous framework was implemented according to this, and I will close it first. I'll continue when I'm free later, if others haven't done complete support :). If anyone else has a better idea, feel free to ping me for a review! |
I just wanted to add a note here to say that I have dedicated time now to help with this effort. I think we can break this work down into phases and the logical starting point (pun intended) would be to decide on how we want to represent the various types of subquery in the LogicalPlan (unoptimized). Once we have that we can create separate issues and PRs for SQL query planning, DataFrame API, logical plan optimizations, and implementing the physical plans. I am going to start by reviewing Apache Spark's logical plan for subqueries and then will put a proposal up. I'm not sure if this will be in the form of a PR or a Google document yet but I will share the proposal here, probably early next week. |
See dask-contrib/dask-sql#474 and its subtasks such as #2238 |
So I started prototyping and ran into an issue right away. As others have pointed out, we sometimes need to have an enum Expr {
Exists {
subquery: Box<LogicalPlan>
}
...
} However, the LogicalPlan enum is in the I was not paying much attention when this refactoring happened. Does it make sense to now move |
Perhaps @jimexist can offer an opinion on this. |
For dask-sql, I have a requirement to support a popular decision support SQL benchmark that has correlated subqueries that are not all equi-join. Even if we were just looking to support equi-join, I would expect to build a simple logical plan first and then have an optimizer rule re-write the plan to a SEMI-JOIN or ANTI-JOIN. I don't think it would be feasible to try and do this while translating a SQL AST to a logical plan before the entire query is built. |
I have started a design document: https://docs.google.com/document/d/1j5vHyva-T_5l3POnHSS0-r5TOp86sEkfSUcRLqJZaK4/edit?usp=sharing |
What is the query? Does it have equi-join correlation as well as some non equijoins? |
@andygrove @alamb I guess great timing! We just implemented the correlated subquery here cube-js@d14f0de. If you're open I can bring it as a PR and would love to shape it to a more canonical structure. There're multiple caveats some of which are covered in the design document like circular dependency between |
I plan to read the design document carefully tomorrow morning and hopefully then I can have a useful conversation about this topic. Thank you @andygrove for driving it |
Hi @paveltiunov Yes I would love to see a PR to add subquery support. The circular dependency issue is resolved in #2294 so that should help. I will start reviewing your implementation now to get up to speed. |
I think this issue can be closed now that we have the following logical plan structures:
|
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm try to implement
InSubquery
.I found that there isn't
Query
struct in datafusion. I think that it's because there isn't subquery handle in datafusion.In the roadmap, datafusion include
subquery
implementation. I thinkQuery
struct is necessary just likedatafusion-expr
.Describe the solution you'd like
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: