Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expr: avoid repeating the same scalar into an array #9052

Open
BugenZhao opened this issue Apr 7, 2023 · 9 comments
Open

expr: avoid repeating the same scalar into an array #9052

BugenZhao opened this issue Apr 7, 2023 · 9 comments
Assignees
Labels
component/common Common components, such as array, data chunk, expression. needs-discussion type/perf

Comments

@BugenZhao
Copy link
Member

For example, there's an EXTRACT(HOUR FROM col) in Nexmark Q14, where the HOUR is compiled to a literal VARCHAR expression. When evaluating the EXTRACT, we need to first repeat the same scalar "HOUR" 1024 times into an array, then evaluate the outer EXTRACT function. This is not efficient. #8503 (comment)

Possible solutions:

  • Introduce ConstantArray than only stores the scalar and the time it appears, which is essentially a special case of Run-Length Encoding (arrow-array). This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

  • Check whether an argument of the expression is constant (literal) during the build_from_proto with macro (introduced in refactor(expr): generate build-from-prost with procedural macros #8499). In this case, we're not able to handle the structure where a literal is nested under another expression, though in most cases this should be folded by the optimizer.

  • Allow the expression to directly return a scalar, and expands it into an array by repeating only if necessary. This sounds much simpler and the refactoring can be progressive.

@BugenZhao BugenZhao added component/common Common components, such as array, data chunk, expression. type/perf needs-discussion labels Apr 7, 2023
@github-actions github-actions bot added this to the release-0.19 milestone Apr 7, 2023
@xxchan
Copy link
Member

xxchan commented Apr 7, 2023

This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate this? How is "static type" a problem and how dynamic is ConstantArray/RunArrary?

@BugenZhao
Copy link
Member Author

BugenZhao commented Jun 12, 2023

This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate this? How is "static type" a problem and how dynamic is ConstantArray/RunArrary?

For example, arrays in arrow and arrow2 are all trait objects, so it can introduce a RunArray wrapper easily without exposing it to any callers. However in our type system, we need to write a lot of stuff like MaybeRun<Utf8Array> or MaybeRun<ArrayImpl>. 🤔

@BugenZhao BugenZhao removed this from the release-0.20 milestone Jun 12, 2023
@xxchan
Copy link
Member

xxchan commented Jun 20, 2023

How is the situation now after #9049?

@BugenZhao
Copy link
Member Author

BugenZhao commented Jun 20, 2023

How is the situation now after #9049?

I guess the ultimate solution should be allowing Value::Scalar to directly be passed among different executors and even remote actors, as described in #9733 (comment). But yes, It appears that #9049 has accomplished everything we can do without introducing a significant refactor. 😄

@xxchan
Copy link
Member

xxchan commented Jul 3, 2023

FWIW this looks similar 👀 apache/arrow-rs#1047

@kwannoel
Copy link
Contributor

kwannoel commented May 13, 2024

Wonder if we can further generalize this into some compact encoding for multiple repeated datums. It could potentially optimize join performance, since the datums in the join key don't need to be expanded inline.

@kwannoel
Copy link
Contributor

kwannoel commented May 20, 2024

Specifically for high amplification join, when building the new chunk, the probe side's record, just needs to convert its scalar values into constant array, then we can just concat that with the build side to form the new stream chunk.

@BugenZhao
Copy link
Member Author

Wonder if we can further generalize this into some compact encoding for multiple repeated datums.

Yes. Are you referring to...

@BugenZhao
Copy link
Member Author

Just FYI: eval_v2, introduced in #9049, is not adopted by all proc-macro-generated function impl any more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/common Common components, such as array, data chunk, expression. needs-discussion type/perf
Projects
None yet
Development

No branches or pull requests

4 participants