-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ordered read_postgres tables #2596
base: main
Are you sure you want to change the base?
Conversation
Interestingly enough, this seems to have the desired behavior (returning results in the correct order), which is really cool. Since we don't have a good way to assert about explain in SLTs (or anywhere else, I think?) I don't know if we have actually successfully exposed the fact that the the results will be ordered from the data source. If we don't do this DF will buffer all of the results to sort the (already sorted) data, which will be faster than having to sort the data, but still less than ideal. The other big question that I think we should make sure to decide on before landing this (and duplicating it in other relevant data sources/table providers and table funcs) is the syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't expose a new dsl for datasource specific ordering. All pushdowns need to be handled in the optimizer.
So the optimizer, as I'm sure you know, doesn't pass sort information to table providers/data sources, so while I'm not wed to this particular approach, you're proposing something that isn't possible in datafusion at the moment. If you have a better idea for this I'd love to hear it. |
It doesn't natively have any mechanisms to push down anything beyond slice, predicate and projections. This doesn't stop us for writing custom optimizer rules for pushing these down. This is how cube does sort pushdowns (among others), and how datafusion advises handling these kind of specialized optimizations |
So I think the syntax in the table function is probably wrong, but "just write an optimizer plan to solve this generically" is definitely at the very least "making the perfect the enemy of the good enough." I definitely think as an option in From an implementation perspective, implementing a physical optimizer pass, would (via it's own path) {maybe} end up constructing a table provider with roughly exactly the same code as the one here. From an ergonomics perspective this isn't all together that different from the way that transactional systems require users to create indexes on their tables to support fast sort operations, and it wouldn't be reasonable for those systems to say "users shouldn't need an explicit syntax, the optimizer should be able to handle it." Similarly, such systems have some kind of Thinking about it in these terms, pushing down sorts is the kind of thing that you'd really only want to do in some situations. While it'd be faster for some queries and some tables, if the tables weren't properly indexed, users might really not want the sort pushed down. This ends up not being so much about "how can we best optimize things" and more "how do users want [glaredb/etc] to interact with their databases," and the second part can't really be solved with optimizer abstractions. Apart from all of this I think the core use case here is pretty valid, and I don't think we should just walk away from any user who wants to sort their data without table scanning their operational databases. |
marking as draft as it's not actively waiting on review |
massive WIP;
want to see if we can get something that resembles pushdown sorts into
data sources that can sort on their own, without rearchitecting
datafusion.
if this works, relevant group-by and sorts can stay streaming in DF
and take advantage of indexes in the upstream datasource.