Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update parquet page pruning code to use the StatisticsExtractor #11483

Merged
merged 7 commits into from
Jul 18, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 15, 2024

Which issue does this PR close?

Closes #11480

Rationale for this change

Let's use the nice API added in #10922 which isbetter tested, more performant, and handles more data types than the current code

What changes are included in this PR?

  1. Rewrite the page pruning code to use StatisticsExtractor
  2. Remove the previous data page extraction code
  3. Improve comments and debug logging

Are these changes tested?

Yes, by existing tests

Here are the integration tests for page pruning
https://github.com/apache/datafusion/blob/77352b2411b5d9340374c30e21b861b0d0d46f82/datafusion/core/tests/parquet/page_pruning.rs#L83-L82

Also, the code for statistics extraction is quite well tested

Are there any user-facing changes?

No (though some queries might go faster as they will be better able to take advantage of the page index)

@alamb alamb changed the title Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor WIP: Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor Jul 15, 2024
@github-actions github-actions bot added the core Core DataFusion crate label Jul 16, 2024
@alamb alamb force-pushed the alamb/prune_pages_statistics_converter branch from c9bd0af to 23f3efd Compare July 16, 2024 12:43
@alamb alamb force-pushed the alamb/prune_pages_statistics_converter branch from 23f3efd to 62bacdd Compare July 16, 2024 12:55
@@ -225,7 +223,7 @@ pub struct ParquetExec {
/// Optional predicate for pruning row groups (derived from `predicate`)
pruning_predicate: Option<Arc<PruningPredicate>>,
/// Optional predicate for pruning pages (derived from `predicate`)
page_pruning_predicate: Option<Arc<PagePruningPredicate>>,
page_pruning_predicate: Option<Arc<PagePruningAccessPlanFilter>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this to be consistent with what this is -- it isn't a pruning predicate per se

@@ -749,26 +740,6 @@ fn should_enable_page_index(
.unwrap_or(false)
}

// Convert parquet column schema to arrow data type, and just consider the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now handled entirely in the StatisticsConverter

@@ -1136,6 +1136,16 @@ pub struct StatisticsConverter<'a> {
}

impl<'a> StatisticsConverter<'a> {
/// Return the index of the column in the parquet file, if any
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two new APIs I found I needed to add to the statistics converter API that is being ported upstream from @efredine in apache/arrow-rs#6046 (I'll do so later today)

.map(|(c, _s, _f)| c)
.collect::<HashSet<_>>()
.len()
/// Returns Some(column) if this is a single column predicate.
Copy link
Contributor Author

@alamb alamb Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was an easier API to work with

update: and @Dandandan 's suggestion I think has made it faster (avoids the HashSet)

Some(Ok(p))
}
_ => None,
let pp =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added a more logging for the cases when predicates can't be used for pruning

let row_group_indexes = access_plan.row_group_indexes();
for r in row_group_indexes {
for row_group_index in row_group_indexes {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this now reads easier -- more of the index manipulation is captured in PagesPruningStatistics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

debug!("Error evaluating page index predicate values {e}");
metrics.predicate_evaluation_errors.add(1);
return None;
}
};

// Convert the information of which pages to skip into a RowSelection
// that describes the ranges of rows to skip.
let Some(page_row_counts) = pruning_stats.page_row_counts() else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed row_vec to page_row_counts to make the logic clearer, and also added logging when it wasn't possible to construct

@@ -378,206 +354,143 @@ fn prune_pages_in_one_row_group(
Some(RowSelection::from(vec))
}

fn create_row_count_in_each_page(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into a function on PagesPruningStatistics

}

// Extract the min or max value calling `func` from page idex
macro_rules! get_min_max_values_for_page_index {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is replaced by StatisticsConverter which we have now tested quite thoroughly (kudos to @marvinlanhenke and others)

@alamb alamb changed the title WIP: Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor Update the parquet page pruning code to use the StatisticsExtractor Jul 16, 2024
@alamb alamb marked this pull request as ready for review July 16, 2024 13:41
@alamb alamb requested a review from Ted-Jiang July 16, 2024 13:41
@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2024

@liukun4515 or @Ted-Jiang I wonder if you have time to review this code?

@Ted-Jiang
Copy link
Member

@alamb thanks for ping me , i will carefully review this.

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 thanks @alamb the code looks more elegant

let Some(page_row_counts) = pruning_stats.page_row_counts() else {
debug!(
"Can not determine page row counts for row group {row_group_index}, skipping"
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here need add metrics.predicate_evaluation_errors.add(1);
as above Returns None if there is an error evaluating the predicate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this in 79ecd80

I think it is somewhat debatable if missing row counts is a predicate evaluation error, but signaling that something went wrong will certainly help debug issues.

@alamb
Copy link
Contributor Author

alamb commented Jul 17, 2024

Thank you very much for the review @Ted-Jiang

FYI @thinkharderdev as I think you use this feature as well. I don't expect this PR to have any effect but positive but wanted to give you a heads up

@alamb alamb changed the title Update the parquet page pruning code to use the StatisticsExtractor Update parquet page pruning code to use the StatisticsExtractor Jul 17, 2024
/// * `a > 5 OR b < 10` returns `None`
/// * `true` returns None
pub(crate) fn single_column(&self) -> Option<&phys_expr::Column> {
let cols = self.iter().map(|(c, _s, _f)| c).collect::<HashSet<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit wasteful to collect into a HashSet only to decide whether it's a single column?
We can do e.g. self.columns.windows(2).all(|[x, y]| x.0 == y.0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent call. I did it in 7886e29. Thank you for the suggestion

@alamb
Copy link
Contributor Author

alamb commented Jul 18, 2024

🚀

@alamb alamb merged commit b197449 into apache:main Jul 18, 2024
23 checks passed
@alamb alamb deleted the alamb/prune_pages_statistics_converter branch July 18, 2024 10:04
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
…ache#11483)

* Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor`

* fix doc

* Increase evaluation error counter if error determining data page row counts

* Optimize `single_column`
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
…ache#11483)

* Update the parquet code prune_pages_in_one_row_group to use the `StatisticsExtractor`

* fix doc

* Increase evaluation error counter if error determining data page row counts

* Optimize `single_column`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor
3 participants