Implement the contained method of RowGroupPruningStatistics #8669

yahoNanJing · 2023-12-28T17:00:53Z

Which issue does this PR close?

Closes #8668.

Rationale for this change

The basic idea is to check whether all of the filter values fall outside the row group's min-max boundary.

If any value is within the min-max boundary, the row group will not be skipped.
Otherwise, the row group can be skipped.

This is especially useful for predicates like high_cardinality_col IN (v1, v2, ...) when the underlying Parquet files are sorted by high_cardinality_col.
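
A minimal sketch of the check described above, using plain i32 values rather than DataFusion's ScalarValue. The names RowGroupRange and can_skip are hypothetical and do not appear in the PR; they only illustrate the min-max reasoning.

```rust
/// Min/max statistics for one column in one row group (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct RowGroupRange {
    min: i32,
    max: i32,
}

/// Returns true when none of the IN-list values can appear in the row group,
/// i.e. every value lies outside the inclusive range [min, max].
fn can_skip(range: RowGroupRange, values: &[i32]) -> bool {
    values.iter().all(|v| *v < range.min || *v > range.max)
}

fn main() {
    // In a file sorted by the filtered column, row groups have narrow,
    // non-overlapping ranges, so most of them can be skipped.
    let row_groups = [
        RowGroupRange { min: 0, max: 99 },
        RowGroupRange { min: 100, max: 199 },
        RowGroupRange { min: 200, max: 299 },
    ];
    let in_list = [5, 250];

    for (i, rg) in row_groups.iter().enumerate() {
        println!("row group {i}: skip = {}", can_skip(*rg, &in_list));
    }
}
```

With sorted data, only the row groups whose range covers one of the listed values need to be read; here that is the first and the third.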

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

yahoNanJing · 2023-12-28T17:01:38Z

Hi @alamb, could you help review this PR?

alamb · 2023-12-28T19:27:02Z

Thank you @yahoNanJing -- I have this on my review queue. Your use case of large IN lists is an excellent example (and one the current PruningPredicate rewrite won't handle well, as I think it only handles IN lists that are rewritten into OR chains).

@NGA-TRAN and I have been thinking about how to make PruningPredicate handle cases where not all columns have statistics (e.g. #7869) which may be related

alamb · 2023-12-28T20:37:05Z

I believe CI will be fixed by merging up from main -- the clippy issue was fixed in #8662

alamb

Thank you @yahoNanJing -- this code is looking quite good to me. I didn't quite make it through the tests today and I ran out of time, but I will finish the review tomorrow

alamb · 2023-12-28T20:40:24Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

    ) -> Option<BooleanArray> {
-        None
+        let min_values = self.min_values(column)?;


it is very unfortunate that we have to use ScalarValues here when the underlying code uses ArrayRefs (though I realize this PR is just following the same model as the existing code)

alamb · 2023-12-28T20:44:04Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

+        let has_null = c.statistics()?.null_count() > 0;
+        let mut known_not_present = true;
+        for value in values {
+            // If it's null, check whether the null exists from the statistics


If value is null, I think it means that the statistics value is not known. To the best of my knowledge, NULL values on the column are never encoded in parquet statistics.

Thus I think this check needs to be something like

if value.is_null() { known_not_present = false; break; }

Hi @alamb, this case is for filters like col is null. It's not related to the statistics. The values are from the filter literals.

Thank you for the clarification @yahoNanJing. I am still confused -- I don't think col IS NULL is handled by the LiteralGuarantee code so I am not sure how it would result in a value of NULL here.

col IN (NULL) (as opposed to col IS NULL) always evaluates to NULL (can never be true) which perhaps we should also handle 🤔

I think col IN (NULL) will not match anything, same as col = NULL, which means col IN (a, b, c) is the same as col IN (a, b, c, NULL). Is there any rule to remove the NULL from an IN list? 🤔 @alamb

@Ted-Jiang -- I think we discussed this in #8688
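
The NULL semantics discussed in this thread can be sketched with SQL three-valued logic, modeling NULL as None. This is an illustration only: sql_in and row_passes_filter are hypothetical helpers, not DataFusion functions.

```rust
/// Evaluates `value IN (list)` with SQL three-valued logic.
/// `None` stands for SQL NULL; a result of `None` means "unknown".
fn sql_in(value: Option<i32>, list: &[Option<i32>]) -> Option<bool> {
    // NULL IN (...) is NULL.
    let v = value?;
    // A definite match on a non-NULL element wins.
    if list.iter().any(|item| *item == Some(v)) {
        return Some(true);
    }
    // No match, but a NULL element makes the answer unknown.
    if list.iter().any(|item| item.is_none()) {
        return None;
    }
    Some(false)
}

/// A WHERE clause keeps a row only when the predicate is definitely true.
fn row_passes_filter(value: Option<i32>, list: &[Option<i32>]) -> bool {
    sql_in(value, list) == Some(true)
}

fn main() {
    // col IN (NULL) is never true, whatever the column value is.
    println!("{:?}", sql_in(Some(1), &[None]));
    // Adding NULL to the list does not change which rows pass the filter.
    println!("{}", row_passes_filter(Some(1), &[Some(1), None]));
    println!("{}", row_passes_filter(Some(2), &[Some(1), None]));
}
```

Because a filter keeps only rows where the predicate is true, col IN (a, b, c, NULL) selects the same rows as col IN (a, b, c), which is the observation made above.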

alamb · 2023-12-28T20:46:03Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

+            // The filter values should be cast to the boundary's data type
+            if !can_cast_types(&value.data_type(), &target_data_type) {
+                return None;
+            }
+            let value =
+                cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
+                    .ok()?;


I think you could combine these checks:

Suggested change

-            // The filter values should be cast to the boundary's data type
-            if !can_cast_types(&value.data_type(), &target_data_type) {
-                return None;
-            }
-            let value =
-                cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
-                    .ok()?;
+            // The filter values should be cast to the boundary's data type
+            let Ok(value) = cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS) else {
+                return None;
+            };

Good suggestion. I will refine it.
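
For readers less familiar with the let-else form used in the suggestion, here is a self-contained sketch of the same pattern. The function cast_to_i32 is a hypothetical stand-in for cast_scalar_value and simply parses a string; cast_all is likewise an illustrative name.

```rust
use std::num::ParseIntError;

/// Hypothetical stand-in for a fallible cast: parses a literal as i32.
fn cast_to_i32(literal: &str) -> Result<i32, ParseIntError> {
    literal.trim().parse::<i32>()
}

/// Casts every literal, returning None as soon as one cast fails.
/// The early return mirrors how the suggested change stops the pruning check.
fn cast_all(literals: &[&str]) -> Option<Vec<i32>> {
    let mut out = Vec::with_capacity(literals.len());
    for literal in literals {
        // let-else binds on success and must diverge on failure.
        let Ok(value) = cast_to_i32(literal) else {
            return None;
        };
        out.push(value);
    }
    Some(out)
}

fn main() {
    println!("{:?}", cast_all(&["1", " 2 "]));
    println!("{:?}", cast_all(&["1", "x"]));
}
```

The pattern attempts the cast once and treats any failure as "cannot decide", instead of checking castability separately and then casting.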

alamb · 2023-12-28T20:49:03Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

        let schema = Arc::new(Schema::new(vec![
            Field::new("c1", DataType::Int32, false),
            Field::new("c2", DataType::Boolean, false),
        ]));
        let schema_descr = arrow_to_parquet_schema(&schema).unwrap();
+
+        // int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0


Suggested change

-        // int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0
+        // c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0

alamb

First of all, thank you again @yahoNanJing -- this is really important functionality.

After reviewing the tests carefully, I think this PR needs the following before merging:

A resolution to the col IS NULL question (I may just still be confused)
Some additional tests to avoid regressions

In terms of additional tests, I think the current tests would not fail if certain behaviors of the code were broken. Thus I suggest we add the following cases that have special handling in the code:

A test ensuring that multiple guarantees are correctly applied. For example, a predicate like col1 IN (10) AND col2 IN (20) and row groups such that col1 IN (10) can be true but col2 IN (20) cannot.
A test with a predicate on a column that has no statistics
A test where the statistics return the incorrect data type (i.e. where the cast has to be present).
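
The first two cases in the list above can be sketched in simplified form. This is an assumption-laden model, not the PR's code: MinMax, InGuarantee, known_not_present, and can_prune are hypothetical names, columns hold i32 values, and a column without statistics is modeled as a missing map entry.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column of a row group (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i32,
    max: i32,
}

/// A `column IN (values)` guarantee (simplified, hypothetical model).
#[derive(Debug, Clone, Copy)]
struct InGuarantee<'a> {
    column: &'a str,
    values: &'a [i32],
}

/// Some(true) if no value can be in the row group, Some(false) if one might
/// be, and None when the column has no statistics at all.
fn known_not_present(stats: &HashMap<&str, MinMax>, g: &InGuarantee) -> Option<bool> {
    let range = stats.get(g.column)?;
    Some(g.values.iter().all(|v| *v < range.min || *v > range.max))
}

/// Guarantees are AND-ed: the row group can be pruned as soon as any single
/// guarantee is known to be unsatisfiable. Missing statistics never prune.
fn can_prune(stats: &HashMap<&str, MinMax>, guarantees: &[InGuarantee]) -> bool {
    guarantees
        .iter()
        .any(|g| known_not_present(stats, g) == Some(true))
}

fn main() {
    let mut stats = HashMap::new();
    stats.insert("col1", MinMax { min: 0, max: 50 });
    stats.insert("col2", MinMax { min: 100, max: 200 });

    let g1 = InGuarantee { column: "col1", values: &[10] };
    let g2 = InGuarantee { column: "col2", values: &[20] };
    let g3 = InGuarantee { column: "col3", values: &[1] };

    // col1 IN (10) may match, but col2 IN (20) cannot, so the group is pruned.
    println!("{}", can_prune(&stats, &[g1, g2]));
    // col3 has no statistics, so nothing can be concluded from it.
    println!("{}", can_prune(&stats, &[g1, g3]));
}
```

A regression test in this spirit would fail if the guarantees were combined with "all" instead of "any", or if a missing statistic were treated as proof of absence.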

alamb · 2023-12-29T11:46:24Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

+        let has_null = c.statistics()?.null_count() > 0;
+        let mut known_not_present = true;
+        for value in values {
+            // If it's null, check whether the null exists from the statistics


Thank you for the clarification @yahoNanJing. I am still confused -- I don't think col IS NULL is handled by the LiteralGuarantee code so I am not sure how it would result in a value of NULL here.

col IN (NULL) (as opposed to col IS NULL) always evaluates to NULL (can never be true) which perhaps we should also handle 🤔

alamb · 2023-12-29T12:08:54Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -598,19 +662,39 @@ mod tests {
            ),
            vec![1]
        );
+


I don't think this new test case covers the new code in this PR (as col IS NULL doesn't result in a literal guarantee). Is the idea to extend the test coverage?

If I remove the check if has_null && value.is_null() { known_not_present = false; break; }, the unit test row_group_pruning_predicate_eq_null_expr fails. So I think the values parameter can be a set with a single element that is a null scalar value.

I filed #8688 to track simplifying expressions that have null literals in them (e.g. X IN (NULL))

alamb · 2023-12-29T12:10:47Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -632,6 +716,29 @@ mod tests {
            ),
            vec![1]
        );
+
+        // c1 < 5 and c2 IS NULL  => c1_min < 5 and bool_null_count > 0


Suggested change

-        // c1 < 5 and c2 IS NULL => c1_min < 5 and bool_null_count > 0
+        // c1 < 5 and c2 = NULL => c1_min < 5 and bool_null_count > 0

alamb · 2023-12-29T12:12:14Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

        let schema = Arc::new(Schema::new(vec![
            Field::new("c1", DataType::Int32, false),
            Field::new("c2", DataType::Boolean, false),
        ]));
        let schema_descr = arrow_to_parquet_schema(&schema).unwrap();
+
+        // c1 > 15 and c2 IS NULL  => c1_max > 15 and bool_null_count > 0


Suggested change

-        // c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
+        // c1 > 15 and c2 = NULL => c1_max > 15 and bool_null_count > 0

alamb · 2023-12-29T12:14:05Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

+        let groups = gen_row_group_meta_data_for_pruning_predicate();
+
+        let metrics = parquet_file_metrics();
+        // bool = NULL always evaluates to NULL (and thus will not


I think this comment is out of date -- both row groups are actually pruned, as the vec is empty

alamb · 2023-12-29T12:16:23Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -854,9 +956,9 @@ mod tests {
        let rgm2 = get_row_group_meta_data(
            &schema_descr,
            vec![ParquetStatistics::fixed_len_byte_array(
-                // 5.00
+                // 10.00


can you explain why you changed this value in the test?

Would it be possible to change this PR to not change existing tests so it is clear that the code change in this PR doesn't cause a regression in existing behavior? Maybe we can add a new test case with the different values?

Its main purpose is to prune rgm2 and keep rgm1 with the filter c1 in (8, 300, 400). If that's a concern, I can introduce a separate, independent test case for it.

alamb · 2023-12-29T12:17:21Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

                None,
                Some(&pruning_predicate),
                &metrics
            ),
            vec![1, 2]
        );
+        // c1 in (10, 300, 400)


Isn't the first value 0.8?

Suggested change

-        // c1 in (10, 300, 400)
+        // c1 in (0.8, 300, 400)

The same comment applies to several other comments below

My bad. It should be // c1 in (8, 300, 400)

alamb · 2023-12-29T12:19:37Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

+                Some(&pruning_predicate),
+                &metrics
+            ),
+            vec![0, 2]


Suggested change

-            vec![0, 2]
+            // rgm2 (index 1) has ranges between 10 and 200. None of the
+            // constants are in that range so expect this is pruned by literals
+            vec![0, 2]

yahoNanJing · 2023-12-30T13:35:50Z

Thanks @alamb for your review and suggestions. It's my bad. Maybe multiple rules are mixed together so that the RowGroupPruningStatistics's contained implementation does not affect the final result. I will introduce independent test cases for this implementation in a few days.

alamb · 2023-12-30T13:55:09Z

Thanks @yahoNanJing -- I thought about this more. What would you think about adding the code that checks the min/max statistics against LiteralGuarantees for ALL predicates, not just the Parquet row statistics?

Perhaps we could add the check after the call to contains here: https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/core/src/physical_optimizer/pruning.rs#L245

That would have the benefits of:

Working for all statistics (not just Parquet)
Might be easier to write tests using the existing framework in pruning.rs

alamb · 2023-12-31T12:35:42Z

Marking as draft so it is clear this PR is not waiting on feedback

yahoNanJing · 2024-01-09T06:44:34Z

The added unit test, https://github.com/apache/arrow-datafusion/blob/b37fc00d1d04114c61c9d2312cbf5044584df3d8/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L996-L1045, does not go through RowGroupPruningStatistics's contained() because the column in the IN list expression is wrapped in a cast, and literal guarantees are not created for such expressions; see https://github.com/apache/arrow-datafusion/blob/0e53c6d816f3a9d3d27c6ebb6d25b1699e5553e7/datafusion/physical-expr/src/utils/guarantee.rs#L132-L138

github-actions · 2024-04-14T01:59:51Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

kyotoYaho added 2 commits December 29, 2023 00:56

Implement the contained method of RowGroupPruningStatistics

21e7242

Add unit test

ee5ca8d

github-actions bot added the core Core DataFusion crate label Dec 28, 2023

alamb mentioned this pull request Dec 28, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 25, 2023 #8655

Closed

7 tasks

alamb mentioned this pull request Dec 28, 2023

Support IN lists with more than three constants in predicates for bloom filters #8436

Closed

alamb reviewed Dec 28, 2023

kyotoYaho added 2 commits December 29, 2023 09:20

Refine for PR comments

9764500

Merge branch 'main' into issue-8668

b37fc00

alamb mentioned this pull request Dec 29, 2023

[pruning] Add shortcut when all units have been pruned #8675

Merged

alamb reviewed Dec 29, 2023

This was referenced Dec 30, 2023

Add NULL in list simplifications #8688

Closed

Config the length of list when using In_list on parquet, rather than a const of 20. #8609

Open

alamb marked this pull request as draft December 31, 2023 12:35

alamb mentioned this pull request Jan 1, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 1, 2024 #8704

Closed

9 tasks

github-actions bot added the Stale PR has not had any activity for some time label Apr 14, 2024

github-actions bot closed this Apr 22, 2024
