Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: limit push down #8844

Closed
lmatz opened this issue Mar 29, 2023 · 17 comments
Closed

feat: limit push down #8844

lmatz opened this issue Mar 29, 2023 · 17 comments
Assignees
Labels
help wanted Issues that need help from contributors
Milestone

Comments

@lmatz
Copy link
Contributor

lmatz commented Mar 29, 2023

Describe the bug

query:

select cast(UDF(CONSTANT_1, Column_1) as jsonb)  from TABLE1 where name like '%ZZZ%' limit 1;

plan:

BatchLimit { limit: 1, offset: 0 }
 └─BatchExchange { order: [], dist: Single }
   └─BatchLimit { limit: 1, offset: 0 }
     └─BatchProject { exprs: [UserDefinedFunction { ... }::Jsonb as $expr1] }
       └─BatchFilter { predicate: Like(TABLE1.name, '%ZZZ%':Varchar) }
         └─BatchScan { table: TABLE1, columns: [COLUMN1, name] }
(6 rows)

Is it a bug or is limit 1 doing what it is supposed to do?

To Reproduce

No response

Expected behavior

No response

Additional context

Reported by a design partner

@lmatz lmatz added the type/bug Something isn't working label Mar 29, 2023
@github-actions github-actions bot added this to the release-0.19 milestone Mar 29, 2023
@xxchan
Copy link
Member

xxchan commented Mar 30, 2023

IIRC we don't have any "limit pushdown", but only "two phase limit"

@lmatz
Copy link
Contributor Author

lmatz commented Mar 30, 2023

Oh, now I remembered, we count on the early termination of limit to early terminate the scan.

The scan probably did a read to fetch 1024 rows anyway in this case and udf processed them all.

@lmatz lmatz added the help wanted Issues that need help from contributors label Mar 30, 2023
@xxchan
Copy link
Member

xxchan commented Mar 30, 2023

So are you making a feature request to pushing limit to the lowest level? 🤔

@lmatz
Copy link
Contributor Author

lmatz commented Mar 30, 2023

Yes, not a bug but a feature request

@lmatz lmatz removed the type/bug Something isn't working label Mar 30, 2023
@lmatz lmatz changed the title maybe bug: limit is not pushed down when there is a UDF feat: limit push down Mar 30, 2023
@y-wei
Copy link
Contributor

y-wei commented Mar 30, 2023

I think I've done similar things when optimizing TopN.

index_scan.set_chunk_size(
((u32::MAX as u64).min(logical_top_n.limit() + logical_top_n.offset())) as u32,
);

Do you want more cases that set chunk_size for scan?

@xxchan
Copy link
Member

xxchan commented Mar 30, 2023

Wow I don't know we have that 👀. But I don't think limit push down is like that. I think we just need to blindly push limit down until a leaf node 🤔

@y-wei
Copy link
Contributor

y-wei commented Mar 30, 2023

Oh I see, then IIUC it's pushing down until leaf or filter?

@xxchan
Copy link
Member

xxchan commented Mar 30, 2023

Oh yes and no... By mentioning filter you also remind me of other cases, like join or agg.

@y-wei y-wei self-assigned this Apr 3, 2023
@y-wei
Copy link
Contributor

y-wei commented Apr 3, 2023

I find the problem is more tricky that it might seem...

A solvable problem is the project set generates multiple rows for each row input, e.g., select generate_series(1, 1000, 1) from t limit 1;, although we can have a limit directly following the scan, we have to keep the original limit to guarantee our output is limited as desired.

A more tricky case is that, say one UDF generates 0 records for the first row in TABLE1 and generate 1 records for the second row in TABLE1, we cannot push down the limit to the scan as then no records will in the output. Generally speaking, we have to ensure the current op will generate at least 1 row(s) per input row to push down a limit, which there is no existing method to check and I think we may introduce some checks in LogicalProjectSet (or more?) to achieve it, if we do need this feature.

Therefore I find the use case where we can push the limit down could be very limited, could you please reconfirm this feature request?

@lmatz
Copy link
Contributor Author

lmatz commented Apr 4, 2023

I think UDF always returns a single row.
But your analysis does apply to UDTF which can return 0, 1 or multiple rows.👍

@y-wei
Copy link
Contributor

y-wei commented Apr 4, 2023

Cool, then we can do something. Could you please provide a runnable use case of UDF? I cannot find any in our e2e or planner tests😄

@y-wei
Copy link
Contributor

y-wei commented Apr 7, 2023

Hello guys. I find my pr a bit stuck, and thus I want to followup. Since pushing down ProjectSet can be dangerous as we've agreed above, my current implementation just specifically matches the limit <- project pattern. Do you have any concerns with it? I was implementing it by a blacklist way (pushing down until some nodes), yet I found it might only left us Project😅

@xxchan
Copy link
Member

xxchan commented Apr 7, 2023

I don't have any other concerns, but can also only imagine Limit -> Project case. 😅 So it seems too limited and ad-hoc. Wanted @lmatz to elaborate the use cases.

cc @wangrunji0408

@wangrunji0408
Copy link
Contributor

wangrunji0408 commented Apr 10, 2023

Could you please provide a runnable use case of UDF?

You can find them here.

I think pushing Limit down through Project should be no concern, as Project (including UDF) always produces chunks with the same number of rows as the inputs.

@st1page
Copy link
Contributor

st1page commented Apr 10, 2023

it is not a very big optimization because the LimitExecutor can stop the stream early when the input achieves the limit row count. But generally LGTM.

@lmatz
Copy link
Contributor Author

lmatz commented Apr 10, 2023

My input comes from one user who complains about why a query with limit 1 and UDF is so slow. That's all.

@y-wei
Copy link
Contributor

y-wei commented Apr 10, 2023

completed by #8971

@y-wei y-wei closed this as completed Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues that need help from contributors
Projects
None yet
Development

No branches or pull requests

5 participants