Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the contained method of RowGroupPruningStatistics #8669

Closed
wants to merge 4 commits into from

Conversation

yahoNanJing
Copy link
Contributor

Which issue does this PR close?

Closes #8668.

Rationale for this change

The basic idea is to check whether all of the values are not within the min-max boundary.

  • If any one value is within the min-max boundary, then this row group will not be skipped.
  • Otherwise, this row group will be able to be skipped.

This implementation will be very useful for the case that high_cardinality_col in (v1, v2, ...) with bottom parquet files sorted with the high_cardinality_col.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 28, 2023
@yahoNanJing
Copy link
Contributor Author

Hi @alamb, could you help review this PR?

@alamb
Copy link
Contributor

alamb commented Dec 28, 2023

Thank you @yahoNanJing -- I have this on my review queue. Your use case of large IN lists is an excellent example (and one the current PruningPredicate rewrite won't handle well, as I think it only handles IN lists that are rewritten into OR chains

@NGA-TRAN and I have been thinking about how to make PruningPredicate handle cases where not all columns have statistics (e.g. #7869) which may be related

@alamb
Copy link
Contributor

alamb commented Dec 28, 2023

I believe CI will be fixed by merging up from main -- the clippy issue was fixed in #8662

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yahoNanJing -- this code is looking quite good to me. I didn't quite make it through the tests today and I ran out of time, but I will finish the review tomorrow

) -> Option<BooleanArray> {
None
let min_values = self.min_values(column)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is very unfortunate that we have to use ScalarValues here when the underlying code uses ArrayRefs (though I realize this PR is just following the same model as the existing code)

let has_null = c.statistics()?.null_count() > 0;
let mut known_not_present = true;
for value in values {
// If it's null, check whether the null exists from the statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If value is null, I think it means that the statistics value is not known. To the best of my knowledge, NULL values on the column are never encoded in parquet statistics .

Thus I think this check needs to be something like

if value.is_null() { 
  known_not_present = false;
  break;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb, this case is for filters like col is null. It's not related to the statistics. The values are from the filter literals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarification @yahoNanJing. I am still confused -- I don't think col IS NULL is handled by the LiteralGuarantee code so I am not sure how it would result in a value of NULL here.

col IN (NULL) (as opposed to col IS NULL) always evaluates to NULL (can never be true) which perhaps we should also handle 🤔

Copy link
Member

@Ted-Jiang Ted-Jiang Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think col in (NULL) will not match any thing, same as col = null, which means col in (a,b,c) same as col in (a,b,c, null). is there any rule to remove the null out of in list 🤔 @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think col in (NULL) will not match any thing, same as col = null, which means col in (a,b,c) same as col in (a,b,c, null). is there any rule to remove the null out of in list 🤔 @alamb

@Ted-Jiang -- I think we discussed this in #8688

Comment on lines 313 to 319
// The filter values should be cast to the boundary's data type
if !can_cast_types(&value.data_type(), &target_data_type) {
return None;
}
let value =
cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
.ok()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could combine these checks:

Suggested change
// The filter values should be cast to the boundary's data type
if !can_cast_types(&value.data_type(), &target_data_type) {
return None;
}
let value =
cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
.ok()?;
// The filter values should be cast to the boundary's data type
let Ok(value) = cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS) else {
return None;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I will refine it.

let schema = Arc::new(Schema::new(vec![
Field::new("c1", DataType::Int32, false),
Field::new("c2", DataType::Boolean, false),
]));
let schema_descr = arrow_to_parquet_schema(&schema).unwrap();

// int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0
// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you again @yahoNanJing -- this is really important functionality.

After reviewing the tests carefully, before merging, I think this PR needs

  1. We need to resolve the col IS NULL question (I may just still be confused)
  2. Some additional tests to avoid regressions

In terms of additional tests I think the current tests would not fail if certain behaviors of the code were broken. Thus I suggest we add the following cases that have special handling int he code.

  1. A test ensuring that multiple guarantees are correctly applied. For example, a predicate like col1 IN (10) AND col2 IN (20) and row groups such that col1 IN (10) is can be true but col2 IN (20) does not.
  2. A test with a predicate on a column that has no statistics
  3. A test where the statistics return the incorrect data type (e.g. that the cast has to be present).

let has_null = c.statistics()?.null_count() > 0;
let mut known_not_present = true;
for value in values {
// If it's null, check whether the null exists from the statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarification @yahoNanJing. I am still confused -- I don't think col IS NULL is handled by the LiteralGuarantee code so I am not sure how it would result in a value of NULL here.

col IN (NULL) (as opposed to col IS NULL) always evaluates to NULL (can never be true) which perhaps we should also handle 🤔

@@ -598,19 +662,39 @@ mod tests {
),
vec![1]
);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this new test case covers the new code in this PR (as col IS NULL doesn't result in a literal guarantee). Is the idea to extend the test coverage?

Copy link
Contributor Author

@yahoNanJing yahoNanJing Dec 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove the check if has_null && value.is_null() { known_not_present = false; break; }, the unit test of row_group_pruning_predicate_eq_null_expr would fail. Then I think the parameter values can be a set of one element which is a null scalar value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #8688 to track simplifying expressions that have null literals in them (e.g. X IN (NULL))

@@ -632,6 +716,29 @@ mod tests {
),
vec![1]
);

// c1 < 5 and c2 IS NULL => c1_min < 5 and bool_null_count > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// c1 < 5 and c2 IS NULL => c1_min < 5 and bool_null_count > 0
// c1 < 5 and c2 = NULL => c1_min < 5 and bool_null_count > 0

let schema = Arc::new(Schema::new(vec![
Field::new("c1", DataType::Int32, false),
Field::new("c2", DataType::Boolean, false),
]));
let schema_descr = arrow_to_parquet_schema(&schema).unwrap();

// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
// c1 > 15 and c2 = NULL => c1_max > 15 and bool_null_count > 0

let groups = gen_row_group_meta_data_for_pruning_predicate();

let metrics = parquet_file_metrics();
// bool = NULL always evaluates to NULL (and thus will not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is incorrect of date -- both row groups are actually pruned as the vec is empty

@@ -854,9 +956,9 @@ mod tests {
let rgm2 = get_row_group_meta_data(
&schema_descr,
vec![ParquetStatistics::fixed_len_byte_array(
// 5.00
// 10.00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why you changed this value in the test?

Would it be possible to change this PR  to not change existing tests so it is clear that the code change in this PR doesn't cause a regression in existing behavior? Maybe we can add a new test case with the different values?

Copy link
Contributor Author

@yahoNanJing yahoNanJing Dec 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's main purpose is to prune the rgm2 and keep the rgm1 by the filter c1 in (8, 300, 400). If it's a concern, maybe I can introduce another independent new test case for it.

None,
Some(&pruning_predicate),
&metrics
),
vec![1, 2]
);
// c1 in (10, 300, 400)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the first value 0.8?

Suggested change
// c1 in (10, 300, 400)
// c1 in (0.8, 300, 400)

The same comment applies to several other comments below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May bad. It should be // c1 in (8, 300, 400)

Some(&pruning_predicate),
&metrics
),
vec![0, 2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vec![0, 2]
// rgm2 (index 1) has ranges between 10 and 200. None of the
// constants are in that range so expect this is pruned by lliterals
vec![0, 2]

@yahoNanJing
Copy link
Contributor Author

First of all, thank you again @yahoNanJing -- this is really important functionality.

After reviewing the tests carefully, before merging, I think this PR needs

  1. We need to resolve the col IS NULL question (I may just still be confused)
  2. Some additional tests to avoid regressions

In terms of additional tests I think the current tests would not fail if certain behaviors of the code were broken. Thus I suggest we add the following cases that have special handling int he code.

  1. A test ensuring that multiple guarantees are correctly applied. For example, a predicate like col1 IN (10) AND col2 IN (20) and row groups such that col1 IN (10) is can be true but col2 IN (20) does not.
  2. A test with a predicate on a column that has no statistics
  3. A test where the statistics return the incorrect data type (e.g. that the cast has to be present).

Thanks @alamb for your review and suggestions. It's my bad. Maybe multiple rules are mixed together so that the RowGroupPruningStatistics's contained implementation does not affect the final result. I will introduce independent test cases for this implementation in a few days.

@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

Thanks @alamb for your review and suggestions. It's my bad. Maybe multiple rules are mixed together so that the RowGroupPruningStatistics's contained implementation does not affect the final result. I will introduce independent test cases for this implementation in a few days.

Thanks @yahoNanJing -- I thought about this more. What would you think about adding the code that checks the min/max statistics against LiteralGuarantees for ALL predicates, not just the Parquet row statistics

Perhaps we could add the check after the call to contains here: https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/core/src/physical_optimizer/pruning.rs#L245

That would have the benefits of:

  1. Working for all statistics (not just Parquet)
  2. Might be easier to write tests using the existing framework in pruning.rs

@alamb
Copy link
Contributor

alamb commented Dec 31, 2023

Marking as draft so it is clear this PR is not waiting on feedback

@yahoNanJing
Copy link
Contributor Author

yahoNanJing commented Jan 9, 2024

The reason that the added unit test, https://github.com/apache/arrow-datafusion/blob/b37fc00d1d04114c61c9d2312cbf5044584df3d8/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L996-L1045, does not go through the RowGroupPruningStatistics's contained() is because there is a cast wrapped the related column for the inlist expr, which does not support to create the literal_guarantees due to https://github.com/apache/arrow-datafusion/blob/0e53c6d816f3a9d3d27c6ebb6d25b1699e5553e7/datafusion/physical-expr/src/utils/guarantee.rs#L132-L138

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 14, 2024
@github-actions github-actions bot closed this Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the contained method of RowGroupPruningStatistics introduce by #8440
4 participants