feat: ordered read_postgres tables #2596

tychoish · 2024-02-04T01:52:58Z

massive WIP;

want to see if we can get something that resembles pushdown sorts into
data sources that can sort on their own, without rearchitecting
datafusion.

if this works, relevant group-by and sorts can stay streaming in DF
and take advantage of indexes in the upstream datasource.

…rough-sorts

tychoish · 2024-02-04T17:31:50Z

Interestingly enough, this seems to have the desired behavior (returning results in the correct order), which is really cool.

Since we don't have a good way to assert about explain in SLTs (or anywhere else, I think?) I don't know if we have actually successfully exposed the fact that the the results will be ordered from the data source. If we don't do this DF will buffer all of the results to sort the (already sorted) data, which will be faster than having to sort the data, but still less than ideal.

The other big question that I think we should make sure to decide on before landing this (and duplicating it in other relevant data sources/table providers and table funcs) is the syntax.

universalmind303

We shouldn't expose a new dsl for datasource specific ordering. All pushdowns need to be handled in the optimizer.

tychoish · 2024-02-05T16:46:33Z

We shouldn't expose a new dsl for datasource specific ordering. All pushdowns need to be handled in the optimizer.

So the optimizer, as I'm sure you know, doesn't pass sort information to table providers/data sources, so while I'm not wed to this particular approach, you're proposing something that isn't possible in datafusion at the moment. If you have a better idea for this I'd love to hear it.

universalmind303 · 2024-02-05T18:14:45Z

So the optimizer, as I'm sure you know, doesn't pass sort information to table providers/data sources, so while I'm not wed to this particular approach, you're proposing something that isn't possible in datafusion at the moment. If you have a better idea for this I'd love to hear it.

It doesn't natively have any mechanisms to push down anything beyond slice, predicate and projections. This doesn't stop us for writing custom optimizer rules for pushing these down.

This is how cube does sort pushdowns (among others), and how datafusion advises handling these kind of specialized optimizations

tychoish · 2024-02-05T19:28:40Z

So I think the syntax in the table function is probably wrong, but "just write an optimizer plan to solve this generically" is definitely at the very least "making the perfect the enemy of the good enough." I definitely think as an option in CREATE EXTERNAL TABLE (perhaps) it's more reasonable.

From an implementation perspective, implementing a physical optimizer pass, would (via it's own path) {maybe} end up constructing a table provider with roughly exactly the same code as the one here.

From an ergonomics perspective this isn't all together that different from the way that transactional systems require users to create indexes on their tables to support fast sort operations, and it wouldn't be reasonable for those systems to say "users shouldn't need an explicit syntax, the optimizer should be able to handle it." Similarly, such systems have some kind of HINT functionality, that let users explicitly indicate which indexes to use, in the case when the optimizer selects the wrong index (usually because optimizers don't have a great view into workload or data distribution,) and doing something like this (creating an ordered table) is a lot like a hint.

Thinking about it in these terms, pushing down sorts is the kind of thing that you'd really only want to do in some situations. While it'd be faster for some queries and some tables, if the tables weren't properly indexed, users might really not want the sort pushed down. This ends up not being so much about "how can we best optimize things" and more "how do users want [glaredb/etc] to interact with their databases," and the second part can't really be solved with optimizer abstractions.

Apart from all of this I think the core use case here is pretty valid, and I don't think we should just walk away from any user who wants to sort their data without table scanning their operational databases.

universalmind303 · 2024-03-05T14:38:45Z

marking as draft as it's not actively waiting on review

tychoish added 5 commits February 3, 2024 18:37

feat: ordered read_postgres tables

bc505a5

df integration

e0e6e00

clippy and more tests

a4790a4

Merge remote-tracking branch 'origin/main' into tycho/postgres-passth…

8567149

…rough-sorts

bools

a3db321

universalmind303 suggested changes Feb 5, 2024

View reviewed changes

tychoish requested a review from universalmind303 February 5, 2024 16:46

universalmind303 marked this pull request as draft March 5, 2024 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: ordered read_postgres tables #2596

feat: ordered read_postgres tables #2596

tychoish commented Feb 4, 2024

tychoish commented Feb 4, 2024

universalmind303 left a comment

tychoish commented Feb 5, 2024

universalmind303 commented Feb 5, 2024

tychoish commented Feb 5, 2024

universalmind303 commented Mar 5, 2024

feat: ordered read_postgres tables #2596

Are you sure you want to change the base?

feat: ordered read_postgres tables #2596

Conversation

tychoish commented Feb 4, 2024

tychoish commented Feb 4, 2024

universalmind303 left a comment

Choose a reason for hiding this comment

tychoish commented Feb 5, 2024

universalmind303 commented Feb 5, 2024

tychoish commented Feb 5, 2024

universalmind303 commented Mar 5, 2024