-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update parquet page pruning code to use the StatisticsExtractor
#11483
Update parquet page pruning code to use the StatisticsExtractor
#11483
Conversation
StatisticsExtractor
StatisticsExtractor
c9bd0af
to
23f3efd
Compare
23f3efd
to
62bacdd
Compare
@@ -225,7 +223,7 @@ pub struct ParquetExec { | |||
/// Optional predicate for pruning row groups (derived from `predicate`) | |||
pruning_predicate: Option<Arc<PruningPredicate>>, | |||
/// Optional predicate for pruning pages (derived from `predicate`) | |||
page_pruning_predicate: Option<Arc<PagePruningPredicate>>, | |||
page_pruning_predicate: Option<Arc<PagePruningAccessPlanFilter>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this to be consistent with what this is -- it isn't a pruning predicate per se
@@ -749,26 +740,6 @@ fn should_enable_page_index( | |||
.unwrap_or(false) | |||
} | |||
|
|||
// Convert parquet column schema to arrow data type, and just consider the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now handled entirely in the StatisticsConverter
@@ -1136,6 +1136,16 @@ pub struct StatisticsConverter<'a> { | |||
} | |||
|
|||
impl<'a> StatisticsConverter<'a> { | |||
/// Return the index of the column in the parquet file, if any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two new APIs I found I needed to add to the statistics converter API that is being ported upstream from @efredine in apache/arrow-rs#6046 (I'll do so later today)
.map(|(c, _s, _f)| c) | ||
.collect::<HashSet<_>>() | ||
.len() | ||
/// Returns Some(column) if this is a single column predicate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was an easier API to work with
update: and @Dandandan 's suggestion I think has made it faster (avoids the HashSet)
Some(Ok(p)) | ||
} | ||
_ => None, | ||
let pp = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added a more logging for the cases when predicates can't be used for pruning
let row_group_indexes = access_plan.row_group_indexes(); | ||
for r in row_group_indexes { | ||
for row_group_index in row_group_indexes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this now reads easier -- more of the index manipulation is captured in PagesPruningStatistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
debug!("Error evaluating page index predicate values {e}"); | ||
metrics.predicate_evaluation_errors.add(1); | ||
return None; | ||
} | ||
}; | ||
|
||
// Convert the information of which pages to skip into a RowSelection | ||
// that describes the ranges of rows to skip. | ||
let Some(page_row_counts) = pruning_stats.page_row_counts() else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed row_vec
to page_row_counts
to make the logic clearer, and also added logging when it wasn't possible to construct
@@ -378,206 +354,143 @@ fn prune_pages_in_one_row_group( | |||
Some(RowSelection::from(vec)) | |||
} | |||
|
|||
fn create_row_count_in_each_page( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved into a function on PagesPruningStatistics
} | ||
|
||
// Extract the min or max value calling `func` from page idex | ||
macro_rules! get_min_max_values_for_page_index { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is replaced by StatisticsConverter
which we have now tested quite thoroughly (kudos to @marvinlanhenke and others)
StatisticsExtractor
StatisticsExtractor
@liukun4515 or @Ted-Jiang I wonder if you have time to review this code? |
…tistics_converter
@alamb thanks for ping me , i will carefully review this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 thanks @alamb the code looks more elegant
let Some(page_row_counts) = pruning_stats.page_row_counts() else { | ||
debug!( | ||
"Can not determine page row counts for row group {row_group_index}, skipping" | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here need add metrics.predicate_evaluation_errors.add(1);
as above Returns None if there is an error evaluating the predicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this in 79ecd80
I think it is somewhat debatable if missing row counts is a predicate evaluation error, but signaling that something went wrong will certainly help debug issues.
Thank you very much for the review @Ted-Jiang FYI @thinkharderdev as I think you use this feature as well. I don't expect this PR to have any effect but positive but wanted to give you a heads up |
StatisticsExtractor
StatisticsExtractor
/// * `a > 5 OR b < 10` returns `None` | ||
/// * `true` returns None | ||
pub(crate) fn single_column(&self) -> Option<&phys_expr::Column> { | ||
let cols = self.iter().map(|(c, _s, _f)| c).collect::<HashSet<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit wasteful to collect into a HashSet
only to decide whether it's a single column?
We can do e.g. self.columns.windows(2).all(|[x, y]| x.0 == y.0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent call. I did it in 7886e29. Thank you for the suggestion
🚀 |
…ache#11483) * Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor` * fix doc * Increase evaluation error counter if error determining data page row counts * Optimize `single_column`
…ache#11483) * Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor` * fix doc * Increase evaluation error counter if error determining data page row counts * Optimize `single_column`
Which issue does this PR close?
Closes #11480
Rationale for this change
Let's use the nice API added in #10922 which isbetter tested, more performant, and handles more data types than the current code
What changes are included in this PR?
StatisticsExtractor
Are these changes tested?
Yes, by existing tests
Here are the integration tests for page pruning
https://github.com/apache/datafusion/blob/77352b2411b5d9340374c30e21b861b0d0d46f82/datafusion/core/tests/parquet/page_pruning.rs#L83-L82
Also, the code for statistics extraction is quite well tested
Are there any user-facing changes?
No (though some queries might go faster as they will be better able to take advantage of the page index)