Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up inlist for strings and primitives #813

Merged
merged 4 commits into from
Aug 4, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Aug 2, 2021

Which issue does this PR close?

Closes #814

Rationale for this change

This gives a ~10-15% speed up for TCP-H query 12 (~24ms vs 28ms in memory) which spends quite some time in the code to execute l_shipmode in ('MAIL', 'SHIP')

For the micro bench on primitives (f32 values):

filter_scalar in list   time:   [1.1045 ms 1.1285 ms 1.1607 ms]                                   
                        change: [-23.382% -21.418% -19.513%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

A large part comes from the use of the slower BooleanArray::from_iter and some more branching inside the loop.

This PR basically optimizes the easier cases when the list to compare to doesn't contain nulls.

What changes are included in this PR?

  • Reusing the compare_op_scalar from arrow.
  • Special casing on not-contains null and negated or not.

Are there any user-facing changes?

Some code now returns DataFusionError::NotImplemented error instead of runtime panic.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 2, 2021
@Dandandan
Copy link
Contributor Author

FYI @jhorstmann

@@ -99,6 +123,20 @@ macro_rules! make_contains {
}};
}

fn values_in_list_utf8<OffsetSize: StringOffsetSizeTrait>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could move to Arrow sometime.

@alamb
Copy link
Contributor

alamb commented Aug 2, 2021

The test appear to be failing due to #818

@houqp houqp added the performance Make DataFusion faster label Aug 3, 2021
@Dandandan Dandandan changed the title Speed up inlist for strings Speed up inlist for strings and primitives Aug 3, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks very cool -- nice work @Dandandan

I wonder if there is sufficient coverage of all these cases (e.g now there are different code paths for lists with/without nulls and negated/not negated that is different for primitive cases

let values = $LIST_VALUES
.iter()
.flat_map(|expr| match expr {
ColumnarValue::Scalar(s) => match s {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this code just seems to pass along the values, wonder if it would be clearer / faster here to look for nulls using the ScalarValue::is_null(), perhaps like

let contains_nulls = $LIST_VALUES.iter().any(|expr| expr.is_null());

And that way you could stop when you hit the first one as well as potentially avoid a collect into a new Vec

}};
}

fn values_in_list_primitive<T: ArrowPrimitiveType>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding some comments here noting these functions assume that there are no null values in the column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could for clarity. Note that the values slice can not contain nulls based on the type, but the array can have nulls (and should work for that case).

})
.collect::<BooleanArray>(),
)))
if negated {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment about finding NULL by checking for the first match applies to the code right above this as well

self.negated,
Int32,
Int32Array
)
}
DataType::Int64 => {
make_contains!(array, list_values, self.negated, Int64, Int64Array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to apply the same transformation to Int64 and Int8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, great catch.

unimplemented!("InList does not support datatype {:?}.", datatype)
}
datatype => Result::Err(DataFusionError::NotImplemented(format!(
"InList does not support datatype {:?}.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👨‍🍳 👌

@Dandandan Dandandan merged commit 119948f into apache:master Aug 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize Inlist for strings and primitives
3 participants