expr: avoid repeating the same scalar into an array #9052
Comments
Can you elaborate this? How is "static type" a problem and how dynamic is |
For example, arrays in |
How is the situation now after #9049? |
I guess the ultimate solution should be allowing |
FWIW this looks similar 👀 apache/arrow-rs#1047 |
Wonder if we can further generalize this into some compact encoding for multiple repeated datums. It could potentially optimize join performance, since the datums in the join key don't need to be expanded inline. |
Specifically, for a high-amplification join, when building the new chunk, the probe side's record just needs to convert its scalar values into constant arrays; we can then concatenate those with the build side's columns to form the new stream chunk. |
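To make the join idea above concrete, here is a minimal sketch. It assumes a hypothetical `Column` enum and a hypothetical `join_one_probe_row` helper; neither exists in the codebase under these names, and real chunks would also carry types, nulls, and visibility. The probe row's values become constant columns whose length equals the number of build-side matches, and the build-side columns are appended unchanged.

```rust
/// Hypothetical column type: either fully materialized, or a single value
/// that logically repeats `len` times (a one-run special case of RLE).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Plain(Vec<i64>),
    Constant { value: i64, len: usize },
}

impl Column {
    /// Logical number of rows in the column.
    fn len(&self) -> usize {
        match self {
            Column::Plain(values) => values.len(),
            Column::Constant { len, .. } => *len,
        }
    }

    /// Read the value at `idx` without materializing a constant column.
    fn get(&self, idx: usize) -> i64 {
        match self {
            Column::Plain(values) => values[idx],
            Column::Constant { value, len } => {
                assert!(idx < *len, "index out of bounds");
                *value
            }
        }
    }
}

/// Build the output columns for one probe row that matched every row of
/// `build_cols`. Probe values are stored once each, not once per match.
fn join_one_probe_row(probe_row: &[i64], build_cols: &[Vec<i64>]) -> Vec<Column> {
    // All build columns are assumed to have the same length.
    let matches = build_cols.first().map_or(0, |c| c.len());
    let mut out: Vec<Column> = probe_row
        .iter()
        .map(|&value| Column::Constant { value, len: matches })
        .collect();
    out.extend(build_cols.iter().cloned().map(Column::Plain));
    out
}
```

With this layout, a probe row that matches N build rows costs one stored value per probe column, whatever N is. Downstream operators would have to understand the constant variant, which is the architectural cost raised in the issue.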
Yes. Are you referring to...
|
Just FYI: |
For example, there's an `EXTRACT(HOUR FROM col)` in Nexmark Q14, where the `HOUR` is compiled to a literal `VARCHAR` expression. When evaluating the `EXTRACT`, we need to first repeat the same scalar `"HOUR"` 1024 times into an array, then evaluate the outer `EXTRACT` function. This is not efficient. #8503 (comment)

Possible solutions:

1. Introduce a `ConstantArray` that only stores the scalar and the number of times it appears, which is essentially a special case of Run-Length Encoding (arrow-array). This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.
2. Check whether an argument of the expression is constant (a literal) during `build_from_proto` with the macro (introduced in refactor(expr): generate build-from-prost with procedural macros #8499). In this case, we're not able to handle the structure where a `literal` is nested under another expression, though in most cases this should be folded by the optimizer.
3. Allow the expression to directly return a scalar, and expand it into an array by repeating only if necessary. This sounds much simpler, and the refactoring can be progressive.
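The last option above (letting an expression return a scalar and expanding it only when necessary) could look roughly like the following sketch. The `Value` enum, `extract_one`, and `extract` are hypothetical names chosen for illustration, not the project's actual API, and the toy `extract_one` works on seconds since midnight rather than real timestamps.

```rust
/// Hypothetical evaluation result: one value per row, or a single value
/// that stands for every row of the chunk.
#[derive(Debug, Clone, PartialEq)]
enum Value<T> {
    Array(Vec<T>),
    Scalar(T),
}

impl<T: Clone> Value<T> {
    /// Expand to a full array only when a caller really needs one.
    fn into_array(self, len: usize) -> Vec<T> {
        match self {
            Value::Array(values) => values,
            Value::Scalar(value) => vec![value; len],
        }
    }
}

/// Toy stand-in for EXTRACT on a "seconds since midnight" column.
fn extract_one(field: &str, secs: i64) -> i64 {
    match field {
        "HOUR" => secs / 3600,
        "MINUTE" => secs / 60 % 60,
        other => panic!("unsupported field: {}", other),
    }
}

/// Evaluate EXTRACT over a column. A scalar `field` is used directly,
/// so the literal is never repeated into an array.
fn extract(field: &Value<&str>, secs: &[i64]) -> Vec<i64> {
    match field {
        Value::Scalar(f) => secs.iter().map(|&s| extract_one(f, s)).collect(),
        Value::Array(fields) => {
            assert_eq!(fields.len(), secs.len(), "length mismatch");
            fields
                .iter()
                .zip(secs)
                .map(|(f, &s)| extract_one(f, s))
                .collect()
        }
    }
}
```

Functions that have not been migrated could keep calling `into_array` on their inputs, which is what would make the refactoring progressive: each function can add a scalar fast path independently.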