
Add Scalar/Datum abstraction (#1047) #4393

Merged: 4 commits, Jul 4, 2023

Conversation

@tustvold (Contributor) commented Jun 9, 2023:

Which issue does this PR close?

Closes #1047
Relates to #3999
Relates to #2837
Relates to #2766

Rationale for this change

This is a proposal on how to represent scalars within arrow-rs kernels, with a view to providing a more consistent experience for users, and allowing us to consolidate a non-trivial amount of dispatch logic currently found in downstreams.

There are a couple of aspects worth highlighting

Arrays are Preeminent

Arrow is a specification for arrays and not scalars. As such this approach makes a conscious decision to not define a parallel specification for representing scalars. This has a number of advantages:

  • Can use the existing stories for serialization, FFI, display, casting, etc...
  • Avoids ambiguity over how to represent scalars with non-native encodings, e.g. dictionaries, run-end encoded, etc...
  • Allows sharing more logic, by not bifurcating the types
  • Supports all array types with no additional effort

It does come with some obvious downsides compared to a first-class scalar approach

  • Less efficient representation
  • Potentially more confusing / less ergonomic to use

However, it is my assertion that preserving the arrow-ness of the representation is more important than either of these

Non-Owning

Related to the above, data is only stored in arrays. Datum and Scalar solely act as wrappers to influence dispatch within kernels; you cannot store a scalar value. Similarly, the return type of a kernel will always be an array.

This reflects both the desire to keep arrays as the canonical representation, and also to discourage use of the scalar abstraction within data structures. We want to encourage the use of arrays, not constructions like Vec<Scalar> or HashMap<Scalar, _>. Permitting such usage not only creates confusion, but performing type-erasure per field in this way is grossly inefficient both from a memory and performance standpoint.
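
For illustration, a rough sketch of what such a non-owning wrapper looks like (simplified; the actual definitions are in the diff excerpts further down):

use arrow_array::Array;

/// Borrows an existing one-element array and marks it as a scalar for kernel dispatch.
/// It owns no data: dropping the Scalar leaves the underlying array untouched.
pub struct Scalar<'a>(&'a dyn Array);

impl<'a> Scalar<'a> {
    /// Wrap an array so that kernels broadcast its single value
    pub fn new(array: &'a dyn Array) -> Self {
        Self(array)
    }
}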

Type-Erasure / Dynamic Dispatch

Both Scalar and Datum are abstractions relying on type-erasure. This reflects a couple of goals:

  • Designing generics for many operations, like arithmetic, is extremely hard and represents a significant API commitment; type-erasure presents a much smaller API surface that is therefore easier to evolve without breaking changes
  • Avoiding generics helps to reduce code bloat from monomorphisation and hopefully improve compile times
  • Is consistent with the vast majority of kernels which use &dyn Array

The implied assumption is that the overheads of this dynamic dispatch will be irrelevant when amortised over the number of values in an array. Or to phrase it differently, we are explicitly not optimising for performance of purely scalar operations. IMO such operations are outside the remit of a vectorised execution engine.
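
Concretely, the type-erased entry point is a small trait; this mirrors the definition that appears in the diff excerpts below:

use arrow_array::Array;

/// Erases the distinction between an array and a scalar-wrapped array.
/// Kernels accept &dyn Datum, so new implementors can be added later
/// without changing kernel signatures.
pub trait Datum {
    /// Returns the underlying array and a flag indicating whether kernels
    /// should broadcast it as a scalar
    fn get(&self) -> (&dyn Array, bool);
}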

What changes are included in this PR?

Adds Scalar and Datum along with examples of their usage

Are there any user-facing changes?

No, this just adds the new types

github-actions bot added the arrow (Changes to the arrow crate) label on Jun 9, 2023
@@ -93,6 +93,17 @@ impl BooleanArray {
Self { values, nulls }
}

/// Create a new [`BooleanArray`] with length `len` consisting only of nulls
pub fn new_null(len: usize) -> Self {
@tustvold (author): This was a change to make the example work.
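
As a usage sketch (not part of the diff), the new constructor builds an all-null array of the requested length:

use arrow_array::{Array, BooleanArray};

let arr = BooleanArray::new_null(3);
assert_eq!(arr.len(), 3);
assert_eq!(arr.null_count(), 3);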

@tustvold (author): Split out into #4402.

@@ -321,16 +321,6 @@ fn filter_array(
// actually filter
_ => downcast_primitive_array! {
values => Ok(Arc::new(filter_primitive(values, predicate))),
DataType::Decimal128(p, s) => {
@tustvold (author): Drive-by fix I noticed whilst implementing this: downcast_primitive_array already handles the decimal case.

Reviewer (contributor): Might make sense to put this up as a new PR.

/// b_scalar: bool,
/// ) -> BooleanArray {
/// let (array, scalar) = match (a_scalar, b_scalar) {
/// (true, true) | (false, false) => {
@tustvold (author): Here we can see a nice property of encoding scalars this way: the case where both inputs are scalars is handled identically to the case where both are arrays.
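
For example, reusing the eq doc example quoted later in this thread (a sketch; imports elided as in that example):

// Both inputs wrapped as Scalar: dispatch takes exactly the same path as
// comparing two length-1 arrays, and the result is a length-1 BooleanArray
let a = Int32Array::from(vec![1]);
let b = Int32Array::from(vec![1]);
let r = eq(&Scalar::new(&a), &Scalar::new(&b)).unwrap();
let values: Vec<_> = r.values().iter().collect();
assert_eq!(values, vec![true]);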

@alamb (Contributor) left a comment:
TL;DR: I think this is really nice -- thank you @tustvold

BTW for anyone else reading this, I think one of the major reasons @tustvold is tackling this now is so that as we add scalar kernels for the various timestamp / duration / interval types we don't make the existing problem worse.

I focused on this example, which shows how someone would use the kernels with scalar values:

/// // Comparison of an array and a scalar
/// let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
/// let b = Int32Array::from(vec![1]);
/// let r = eq(&a, &Scalar::new(&b)).unwrap();
/// let values: Vec<_> = r.values().iter().collect();
/// assert_eq!(values, &[true, false, false, false, false]);

I really like the construction of a Scalar that wraps b and signals to the kernels that the one-element array should be treated differently, rather than implicitly treating all one-element arrays differently.

I also think the use of a trait for Datum is important for use cases like DataFusion, which will likely need some sort of owned variant of Scalar (e.g. to represent the 1 in an expression such as column + 1). Because Datum is a trait, we can create such an abstraction.

It might make sense to add an OwnedScalar in arrow-rs for convenience (OwnedScalar(ArrayRef)), but we can sort that out later.
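
Such an owned variant might look roughly like this (purely illustrative, not part of this PR; Datum is the trait added here):

use arrow_array::{Array, ArrayRef};

/// Hypothetical owned scalar: a length-1 ArrayRef treated as a scalar
pub struct OwnedScalar(pub ArrayRef);

impl Datum for OwnedScalar {
    fn get(&self) -> (&dyn Array, bool) {
        (self.0.as_ref(), true)
    }
}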

Questions:
Have we thought about how we will migrate the existing kernels? Given the construction above, we could perhaps leave the old signatures around for a while, deprecate them, and have them call through to the new variants.
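
One possible shape of that migration, as a hypothetical sketch (eq_scalar stands in for an existing scalar-specialised kernel, and eq is the Datum-based kernel from the doc example above):

#[deprecated(note = "use eq with a Scalar argument instead")]
pub fn eq_scalar(left: &Int32Array, right: i32) -> Result<BooleanArray, ArrowError> {
    // Wrap the scalar value in a one-element array and defer to the new kernel
    let right = Int32Array::from(vec![right]);
    eq(left, &Scalar::new(&right))
}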


/// assert_eq!(values, &[true, false, false, false, false]);
pub trait Datum {
/// Returns the value for this [`Datum`] and a boolean indicating if the value is scalar
fn get(&self) -> (&dyn Array, bool);
Reviewer (contributor): It might make sense to have two separate methods, so that if we wanted to add more information to the Datum we could do it in a backwards-compatible way by adding a new method to the trait:

fn get(&self) -> &dyn Array;
fn is_scalar(&self) -> bool {
    false
}

@tustvold (author): I think that, given neither piece of information can be used in isolation, it makes sense to return them both together. We can always add further trait methods to expose additional information in future.

@tustvold (author):

> Have we thought about how we will migrate the existing kernels

My hope is that we can simply replace &dyn Array with &dyn Datum and have everything work without requiring downstreams to modify their code, but I haven't tested this extensively yet.

@tustvold (author): Another thing to potentially consider at the same time is whether we want to roll in any notion of selection vector support (#4095). I will think on this more.

@alamb (Contributor) commented Jun 14, 2023:

At the very least the "Datum" notion allows for the inclusion of a Selection Vector at a later time -- perhaps like

pub trait Datum {
    ...
    /// If there are certain rows that should be ignored by any kernels.
    /// Defaults to None (all rows are selected).
    fn selection(&self) -> Option<&BooleanBuffer> { None }
}

@tustvold (author): #4465 contains a POC of using this abstraction to implement scalar kernels; I'm therefore confident that this is a sensible abstraction.

@tustvold marked this pull request as ready for review on June 29, 2023, 13:29
    fn get(&self) -> (&dyn Array, bool) {
        (self, false)
    }
}
Reviewer (member): Hmm, so an Array with length 1 will also return false, indicating it is not a scalar? It is only treated as a scalar if we explicitly wrap it in a Scalar?

@tustvold (author): Yes, it was suggested this might be less confusing for users.

Reviewer (member): Hmm, then what is the difference between that and what the Datum implementation for Array returns when the Array has length 1?

@tustvold (author):
Reviewer (member): That is the Datum impl for Scalar; I mean the Datum impl for dyn Array. So will an Array with length 1 be treated as a scalar without wrapping it in Scalar?

@tustvold (author): An array with length 1 won't be treated as a scalar; it will only be treated as a scalar if wrapped in Scalar.

Reviewer (member): Yeah, that is what I wanted to ask, I think. Not all arrays with length 1 are scalars; they are scalars only when wrapped with Scalar.

Maybe the view should be reversed: it is not that an array is treated as a scalar, but that a scalar is expressed as an array. When anyone wants to talk to this crate, the crate only understands the language of arrays. So if you want to express a scalar, you need to fit it into an array and let the crate know it behaves like a scalar.

Reviewer (member): There are some downsides, but I think you already mentioned them in the description.

Reviewer (contributor):

> An array with length 1 won't be treated as a scalar; it will only be treated as a scalar if wrapped in Scalar

FWIW I think the only practical difference is that add(arr1, arr2) will fail if arr1 has one row (but is not marked as a scalar) and arr2 has some other number of rows (like 100).

I think @tustvold also considered simply treating any array that has one row as a scalar, but felt (as do I) that making it explicit would make for a less confusing experience. Or maybe that was only my opinion 😆

> When anyone wants to talk to this crate, the crate only understands the language of arrays. So if you want to express a scalar, you need to fit it into an array and let the crate know it behaves like a scalar.

I think this is an excellent description 👍
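
To make the practical difference concrete, a sketch built from the eq doc example above (the unwrapped case fails because lengths 5 and 1 do not match):

let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
let b = Int32Array::from(vec![1]);

// Length-1 array passed as-is: treated as an ordinary array, so this is a length mismatch
assert!(eq(&a, &b).is_err());

// Wrapped in Scalar: the single value is broadcast against all five rows
let r = eq(&a, &Scalar::new(&b)).unwrap();
let values: Vec<_> = r.values().iter().collect();
assert_eq!(values, vec![true, false, false, false, false]);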

@tustvold (author): I intend to merge this after I cut the arrow 43 release.


impl<'a> Datum for Scalar<'a> {
    fn get(&self) -> (&dyn Array, bool) {
        (self.0, true)
Reviewer (member): Do we need to consider the datatype of the wrapped Array? Is a complex-type array with length 1 also a scalar?

@tustvold (author): If wrapped in Scalar then yes; this is consistent with DataFusion's ScalarValue, which can contain struct or list elements.
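
For instance, a one-row list array can act as a list-typed scalar (a sketch; ListArray::from_iter_primitive is the standard constructor in arrow_array):

use arrow_array::{ListArray, Scalar};
use arrow_array::types::Int32Type;

// A single-row list array whose only entry is the list [1, 2, 3]
let list = ListArray::from_iter_primitive::<Int32Type, _, _>(
    vec![Some(vec![Some(1), Some(2), Some(3)])],
);

// Wrapped in Scalar, kernels treat it as one list-typed value
let scalar = Scalar::new(&list);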

Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues: Add Scalar / Datum support to compute kernels