
reimplement limit_push_down to remove global-state, enhance optimize and simplify code. #4276

Merged: 4 commits into apache:master on Nov 22, 2022

Conversation

@jackwener (Member) commented Nov 18, 2022:

Which issue does this PR close?

Part of #4267.
Closes #4263

In this PR, I reimplement this rule.

  • Use pattern matching to find the subtree we want to match, then push the limit down (see the sketch after the next list).
  • If there is no match, continue optimizing top-down.

Original implementation:

  • It used a global-state Ancestor to record limit information for the whole tree, which was error-prone and complex.
  • It had to traverse the whole tree once just to collect that state.
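To make this concrete, here is a minimal sketch of the pattern-matching approach (my illustration, not the PR's actual code; `LogicalPlan` below is a simplified stand-in for DataFusion's plan type):

```rust
// Illustrative only: a simplified plan type, not DataFusion's.
enum LogicalPlan {
    Limit { skip: usize, fetch: Option<usize>, input: Box<LogicalPlan> },
    TableScan { table: String, fetch: Option<usize> },
    Other(Box<LogicalPlan>),
}

fn push_down_limit(plan: LogicalPlan) -> LogicalPlan {
    match plan {
        // Pattern: a Limit directly above a TableScan. Push the limit
        // into the scan, but keep the Limit node because the scan's
        // `fetch` is only a hint (see the review discussion below).
        LogicalPlan::Limit { skip, fetch: Some(fetch), input } => match *input {
            LogicalPlan::TableScan { table, .. } => LogicalPlan::Limit {
                skip,
                fetch: Some(fetch),
                input: Box::new(LogicalPlan::TableScan {
                    table,
                    // the scan must yield skip + fetch rows so the Limit
                    // above can discard the first `skip` of them
                    fetch: Some(skip + fetch),
                }),
            },
            // No match: recurse top-down into the child.
            other => LogicalPlan::Limit {
                skip,
                fetch: Some(fetch),
                input: Box::new(push_down_limit(other)),
            },
        },
        LogicalPlan::Other(child) => {
            LogicalPlan::Other(Box::new(push_down_limit(*child)))
        }
        other => other,
    }
}
```

Matching the subtree locally like this removes the need for any global ancestor state: each call only sees the node it is currently rewriting.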


@github-actions bot added the optimizer (Optimizer rules) label on Nov 18, 2022.
Comment on lines +428 to 400
\n Limit: skip=10, fetch=1000\
\n TableScan: test, fetch=1010";
jackwener (Member, Author):

Regarding Limit over TableScan: I'm not sure whether the TableScan can apply the limit precisely, so I keep the Limit node.

If the TableScan limit were guaranteed to be exact, we could remove this Limit for the Limit-over-TableScan case.
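For concreteness (my own working from the snippet above, not a claim from the PR): with `Limit: skip=10, fetch=1000`, the scan is asked for `skip + fetch = 1010` rows; the Limit above then drops the first 10 and returns at most 1000. Since the scan's `fetch` is only best-effort, the Limit node stays to enforce the exact bound.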

Contributor:

I don't think we currently expect the TableScan to return exactly (at most) the limit number of rows, AFAIK.
To make that possible, we would need to put more information into the table-scan trait, just as we do with statistics.
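For context, a paraphrased sketch of what such a hint looks like (a hypothetical, simplified trait; the linked `datasource.rs` has the real signature, which takes more parameters):

```rust
/// Placeholder for whatever plan/stream a scan produces.
struct ScanPlan;

/// Hypothetical, simplified table-scan trait for illustration.
trait SimpleTableProvider {
    /// `limit`: if set, the provider may stop early once it has produced
    /// enough rows, but it is not required to return exactly `limit`
    /// rows -- which is why a Limit node above the scan is still needed.
    fn scan(&self, limit: Option<usize>) -> ScanPlan;
}
```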

Contributor:

https://github.com/apache/arrow-datafusion/blob/8c02485cfc3d2c48bc48a557db58e3e5a0f75777/datafusion/core/src/datasource/datasource.rs#L66-L70

This implies that we in fact still need the Limit node even when the limit has been pushed down into the scan.

@Dandandan (Contributor) left a comment:

Looks like a good improvement

@alamb (Contributor) left a comment:

I didn't review the code but I reviewed all the plan changes in the tests and they looked good to me. Thank you @jackwener

FYI @ming535 who originally contributed this code in #2638


\n Limit: skip=0, fetch=10\
\n Limit: skip=10, fetch=10\
\n TableScan: test, fetch=20";
let expected = "Limit: skip=10, fetch=10\
Contributor:

Those plans look much nicer
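Checking the merge arithmetic in the plan above (my own working, not from the PR): the inner `Limit: skip=10, fetch=10` skips 10 rows and keeps at most 10; the outer `Limit: skip=0, fetch=10` then skips nothing and keeps at most 10, so the composition is exactly `skip=10, fetch=10`, and the scan must supply `skip + fetch = 20` rows, hence `TableScan: test, fetch=20`.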

@jackwener force-pushed the reimplement_limit_push_down branch 3 times, most recently from 303452a to ad78bf1, on November 19, 2022.
let expected = "Limit: skip=1000, fetch=0\
\n TableScan: test, fetch=1000";

assert_optimized_plan_eq(&plan, expected)
jackwener (Member, Author):

The original tests were missing cases that confirm limits merge correctly and that overflow is avoided. I added more unit tests to check these.
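To illustrate the overflow concern, here is a minimal sketch of merging two stacked limits with saturating arithmetic (a hypothetical `Limit` struct and `merge_limits` helper, not the PR's actual code):

```rust
/// Hypothetical, simplified Limit node for illustration.
#[derive(Debug, PartialEq)]
struct Limit {
    skip: usize,
    fetch: Option<usize>,
}

/// Merge an `outer` Limit applied on top of an `inner` Limit.
fn merge_limits(outer: &Limit, inner: &Limit) -> Limit {
    // Rows that survive the inner fetch after the outer skip; saturating
    // subtraction keeps `skip > fetch` from overflowing.
    let inner_remaining = inner.fetch.map(|f| f.saturating_sub(outer.skip));
    let fetch = match (inner_remaining, outer.fetch) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (Some(a), None) => Some(a),
        (None, b) => b,
    };
    Limit { skip: inner.skip + outer.skip, fetch }
}

#[test]
fn merge_saturates_instead_of_overflowing() {
    // Outer skip consumes the entire inner fetch: fetch saturates to 0,
    // yielding the `Limit: skip=1000, fetch=0` shape seen in the test above.
    let merged = merge_limits(
        &Limit { skip: 1000, fetch: Some(0) },
        &Limit { skip: 0, fetch: Some(1000) },
    );
    assert_eq!(merged, Limit { skip: 1000, fetch: Some(0) });
}
```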

@jackwener jackwener changed the title Reimplement limit_push_down reimplement limit_push_down to remove global-state, enhance optimize and simplify code. Nov 22, 2022
@liukun4515 (Contributor) commented:

@jackwener Thanks

@liukun4515 liukun4515 merged commit 1bcb333 into apache:master Nov 22, 2022
@ursabot commented Nov 22, 2022:

Benchmark runs are scheduled for baseline = afe2333 and contender = 1bcb333 (the master commit associated with this PR). Results will be available as each benchmark run completes. All Conbench comparison runs were skipped: benchmarking of arrow-datafusion commits is not supported on the available runners (ec2-t3-xlarge-us-east-2, test-mac-arm, ursa-i9-9960x, ursa-thinkcentre-m75q).

@jackwener jackwener deleted the reimplement_limit_push_down branch November 24, 2022 02:48