expr: avoid repeating the same scalar into an array #9052

BugenZhao · 2023-04-07T12:10:23Z

For example, there's an EXTRACT(HOUR FROM col) in Nexmark Q14, where the HOUR is compiled to a literal VARCHAR expression. When evaluating the EXTRACT, we need to first repeat the same scalar "HOUR" 1024 times into an array, then evaluate the outer EXTRACT function. This is not efficient. #8503 (comment)

Possible solutions:

1. Introduce ConstantArray that only stores the scalar and the number of times it appears, which is essentially a special case of Run-Length Encoding (arrow-array). This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.
2. Check whether an argument of the expression is a constant (literal) during build_from_proto with the macro (introduced in refactor(expr): generate build-from-prost with procedural macros #8499). In this case, we cannot handle a literal nested under another expression, though in most cases this should be folded by the optimizer.
3. Allow the expression to directly return a scalar, and expand it into an array by repeating only if necessary. This sounds much simpler, and the refactoring can be progressive.
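
To make the third option concrete, here is a minimal sketch of what "return a scalar, expand only if necessary" could look like. It is not the actual RisingWave interface: `Value` here is a simplified stand-in holding strings, and `Field`, `parse_field`, `extract_one` and `extract` are hypothetical helpers over seconds-since-midnight, made up purely for illustration.

```rust
/// Hypothetical result of evaluating a child expression over a chunk.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    /// The same value for every row, stored once.
    Scalar(String),
    /// One value per row.
    Array(Vec<String>),
}

impl Value {
    /// Expand into a full array only when a caller really needs one.
    fn into_array(self, len: usize) -> Vec<String> {
        match self {
            Value::Scalar(s) => vec![s; len],
            Value::Array(a) => a,
        }
    }
}

#[derive(Clone, Copy)]
enum Field {
    Hour,
    Minute,
    Second,
}

fn parse_field(name: &str) -> Option<Field> {
    match name {
        "HOUR" => Some(Field::Hour),
        "MINUTE" => Some(Field::Minute),
        "SECOND" => Some(Field::Second),
        _ => None,
    }
}

fn extract_one(field: Field, secs: i64) -> i64 {
    match field {
        Field::Hour => secs / 3600 % 24,
        Field::Minute => secs / 60 % 60,
        Field::Second => secs % 60,
    }
}

/// Toy stand-in for `EXTRACT(field FROM col)`; returns `None` on an unknown field.
fn extract(field: &Value, secs: &[i64]) -> Option<Vec<i64>> {
    match field {
        // Scalar path: the field name is parsed once for the whole chunk.
        Value::Scalar(name) => {
            let f = parse_field(name)?;
            Some(secs.iter().map(|&s| extract_one(f, s)).collect())
        }
        // Array path: the field name is parsed again for every row.
        Value::Array(names) => names
            .iter()
            .zip(secs)
            .map(|(n, &s)| Some(extract_one(parse_field(n)?, s)))
            .collect(),
    }
}
```

With `Value::Scalar("HOUR")`, nothing is repeated 1024 times; `into_array` is only called by consumers that cannot handle the scalar form, which is what would let the refactoring proceed one function at a time.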


xxchan · 2023-04-07T13:12:02Z

> This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate on this? How is "static type" a problem and how dynamic is ConstantArray/RunArray?

BugenZhao · 2023-06-12T06:44:49Z

> > This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.
>
> Can you elaborate on this? How is "static type" a problem and how dynamic is ConstantArray/RunArray?

For example, arrays in arrow and arrow2 are all trait objects, so they can introduce a RunArray wrapper easily without exposing it to any callers. However, in our type system, we would need to write a lot of things like MaybeRun<Utf8Array> or MaybeRun<ArrayImpl>. 🤔
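
A rough sketch of the contrast, under simplified assumptions: `DynUtf8`, `RunUtf8`, `Utf8Array` and this `MaybeRun` are toy definitions invented for illustration, not the real arrow, arrow2 or RisingWave types. With trait objects the caller's signature stays the same when a run-encoded array shows up; with static types the wrapper has to appear in every signature that might receive one.

```rust
// (a) Trait-object style: callers only ever see `&dyn DynUtf8`.
trait DynUtf8 {
    fn len(&self) -> usize;
    fn value(&self, idx: usize) -> &str;
}

struct Utf8Array(Vec<String>);

/// A single value repeated `len` times.
struct RunUtf8 {
    value: String,
    len: usize,
}

impl DynUtf8 for Utf8Array {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn value(&self, idx: usize) -> &str {
        &self.0[idx]
    }
}

impl DynUtf8 for RunUtf8 {
    fn len(&self) -> usize {
        self.len
    }
    fn value(&self, idx: usize) -> &str {
        assert!(idx < self.len, "index out of bounds");
        &self.value
    }
}

/// Unchanged whether it receives a dense or a run-encoded array.
fn total_bytes_dyn(arr: &dyn DynUtf8) -> usize {
    (0..arr.len()).map(|i| arr.value(i).len()).sum()
}

// (b) Static style: the wrapper leaks into the caller's signature.
enum MaybeRun<A> {
    Dense(A),
    Run { value: String, len: usize },
}

/// Every function like this one must be rewritten to name `MaybeRun<...>`.
fn total_bytes_static(arr: &MaybeRun<Utf8Array>) -> usize {
    match arr {
        MaybeRun::Dense(a) => a.0.iter().map(|s| s.len()).sum(),
        MaybeRun::Run { value, len } => value.len() * *len,
    }
}
```

In (a), adding `RunUtf8` touches no existing function. In (b), each function taking a `Utf8Array` needs its signature and body updated, which is the "lot of changes" mentioned above.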

xxchan · 2023-06-20T07:11:42Z

How is the situation now after #9049?

BugenZhao · 2023-06-20T09:18:10Z

> How is the situation now after #9049?

I guess the ultimate solution should be to allow Value::Scalar to be passed directly among different executors and even remote actors, as described in #9733 (comment). But yes, it appears that #9049 has accomplished everything we can do without introducing a significant refactor. 😄

xxchan · 2023-07-03T12:50:19Z

FWIW this looks similar 👀 apache/arrow-rs#1047

kwannoel · 2024-05-13T08:32:47Z

Wonder if we can further generalize this into some compact encoding for multiple repeated datums. It could potentially optimize join performance, since the datums in the join key don't need to be expanded inline.

kwannoel · 2024-05-20T03:35:56Z

Specifically for high-amplification joins: when building the new chunk, the probe side's record just needs to convert its scalar values into constant arrays, and then we can concatenate those with the build side to form the new stream chunk.
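
A minimal sketch of that idea, assuming a toy `Column` type with only `i64` values; `Column`, `join_output` and the chunk layout are hypothetical and only meant to show the shape of the optimization, not how stream chunks are actually built.

```rust
/// Toy column: either one value per row, or one value shared by all rows.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Dense(Vec<i64>),
    Constant { value: i64, len: usize },
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Dense(v) => v.len(),
            Column::Constant { len, .. } => *len,
        }
    }

    fn get(&self, row: usize) -> Option<i64> {
        match self {
            Column::Dense(v) => v.get(row).copied(),
            Column::Constant { value, len } => (row < *len).then_some(*value),
        }
    }
}

/// Output chunk for ONE probe row matched against the given build-side columns.
/// The probe row's values are never copied per match; each becomes a constant column.
fn join_output(probe_row: &[i64], build_cols: Vec<Vec<i64>>) -> Vec<Column> {
    // Number of matched build-side rows (0 if there are no build columns).
    let matches = build_cols.first().map_or(0, |c| c.len());
    let mut out: Vec<Column> = probe_row
        .iter()
        .map(|&value| Column::Constant { value, len: matches })
        .collect();
    out.extend(build_cols.into_iter().map(Column::Dense));
    out
}
```

For a probe row with k columns matching n build rows, the probe side costs O(k) here instead of O(k * n), which is where the gain for high amplification would come from.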

BugenZhao · 2024-05-20T08:52:56Z

> Wonder if we can further generalize this into some compact encoding for multiple repeated datums.

Yes. Are you referring to...

> Run-Length Encoding (arrow-array)

BugenZhao · 2024-08-28T06:17:36Z

Just FYI: eval_v2, introduced in #9049, is no longer adopted by all proc-macro-generated function impls.

BugenZhao added the component/common (Common components, such as array, data chunk, expression), type/perf, and needs-discussion labels Apr 7, 2023

github-actions bot added this to the release-0.19 milestone Apr 7, 2023

BugenZhao mentioned this issue Apr 7, 2023

perf(expr): new interface for expression directly returning scalar #9049

Merged

7 tasks

lmatz mentioned this issue May 12, 2023

perf: nexmark q14 #8503

Closed

TennyZhuang assigned BugenZhao May 19, 2023

TennyZhuang modified the milestones: release-0.19, release-0.20 May 19, 2023

BugenZhao removed this from the release-0.20 milestone Jun 12, 2023

kwannoel mentioned this issue Aug 28, 2024

Optimize case of high join amplification #16679

Open

4 tasks
