Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch datafusion to using eq_dyn_scalar, etc kernels #1610

Closed
alamb opened this issue Jan 18, 2022 · 9 comments · Fixed by #1685
Closed

Switch datafusion to using eq_dyn_scalar, etc kernels #1610

alamb opened this issue Jan 18, 2022 · 9 comments · Fixed by #1685
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jan 18, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would (primarily) like DataFusion to stop unpacking DictionaryArrays when comparing to scalar values. It would also be nice to reduce the amount of type dispatch in the datafusion BinaryExpr physical expression

@matthewmturner has implemented eq_dyn_scalar kernels (which among other things have support for DictionaryArrays apache/arrow-rs#1113)

Describe the solution you'd like
I would like the the big type dispatch in binary_primitive_array_op_scalar to be replaced by calls to eq_dyn_scalar, etc

https://github.com/apache/arrow-datafusion/blob/c549d5145755e3ec07800425a56f86541d476f1f/datafusion/src/physical_plan/expressions/binary.rs#L570-L590

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added the enhancement New feature or request label Jan 18, 2022
@matthewmturner
Copy link
Contributor

I'll give this a shot

@alamb
Copy link
Contributor Author

alamb commented Jan 18, 2022

@matthewmturner note you'll likely hit some types that datafusion covers that arrow does not (yet) such as Decimal -- I think leaving in a partial switch statement in DataFusion while we ported the other types to arrow might be a nice way to make progress here

@matthewmturner
Copy link
Contributor

@alamb ive been looking through the existing code and im a bit confused. specifically, when looking at the below, it seems the array + scalar ops are not using the scalar arrow kernels. is my understanding correct? if so, could you provide color on that and if we should be moving to use those?

https://github.com/apache/arrow-datafusion/blob/ad392fd529f9f8631b90271f7dbf3a4f2feadb7b/datafusion/src/physical_plan/expressions/binary.rs#L884-L900

@alamb
Copy link
Contributor Author

alamb commented Jan 19, 2022

binary_array_op_scalar! calls compute_op_scalar! https://github.com/apache/arrow-datafusion/blob/ad392fd529f9f8631b90271f7dbf3a4f2feadb7b/datafusion/src/physical_plan/expressions/binary.rs#L595-L601

compute_op_scalar! then calls the eq_scalar or other kernel in https://github.com/apache/arrow-datafusion/blob/ad392fd529f9f8631b90271f7dbf3a4f2feadb7b/datafusion/src/physical_plan/expressions/binary.rs#L475-L489

Specifically, this bit of macro magic:
https://github.com/apache/arrow-datafusion/blob/ad392fd529f9f8631b90271f7dbf3a4f2feadb7b/datafusion/src/physical_plan/expressions/binary.rs#L484
Slaps a _scalar on to the argument

So when invoked with $OP = lt, this

paste::expr! {[<$OP _scalar>]}

Becomes

lt_scalar

Does that help @matthewmturner ?

@matthewmturner
Copy link
Contributor

@alamb ah yes i did trace that all the way back but didnt realize that paste::expr! {[<$OP _scalar>]} was doing that. thank you for the explanation.

@matthewmturner
Copy link
Contributor

@alamb I'm wondering if we could just remove compute_op_scalar macro (and friends) and use the dyn kernels at the call site in evaluate_array_scalar. This would also get us very close to being able to remove the paste dependency which i believe is only used in this binary.rs file. The remaining place its used that i would still need to look into is the macro compute_utf8_flag_op_scalar.

@alamb
Copy link
Contributor Author

alamb commented Jan 19, 2022

@alamb I'm wondering if we could just remove compute_op_scalar macro (and friends) and use the dyn kernels at the call site in evaluate_array_scalar.

I think this is a good avenue to try -- my suspicion is you won't be able to remove the compute_op_scalar quite yet because it is used for kernels that don't have a scalar equivalent yet (e.g. plus, etc)

@matthewmturner
Copy link
Contributor

@alamb im not sure we can remove the type dispatch in binary_primitive_array_op_scalar since its used for other operations such as add / multiply / etc. binary_array_op_scalar is used for all the operators we added xx_dyn_scalar kernels for as it handles primitive, utf8, decimal, timestamps, dates, etc. It's my understanding that that is actually the macro that will need to be updated (or new one to take its place that uses dyn scalar kernels).

Im still playing with it but I was thinking of adding new functions like call_dyn_cmp_scalar, call_dyn_cmp_utf8_scalar, etc (drawing inspirating from @liukun4515 in #1475) and then type dispatching based on if the data type of the array / data type of the values in dictionary array. Roughly something like:

if primitive_or_dict_vals_primitive {
    call_dyn_cmp_scalar(array, scalar, op)
} else if string_or_dict_vals_string {
    call_dyn_cmp_utf8_scalar(array, scalar, op)
} else {
    // Existing implementation for other types
}

Does this make sense?

@alamb
Copy link
Contributor Author

alamb commented Jan 21, 2022

Does this make sense?

Yes I think that makes sense to me.

I think they key outcome of this PR should be that eq_dyn_scalar etc are called, rather than any particular macros are removed from DataFusion's code. Overtime we can look to remove / simplify macros as the kernels in arrow get more full features, but it doesn't have to happen now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants