Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return Vec<bool> from PredicateBuilder rather than an Fn #370

Merged
merged 2 commits into from
May 24, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 20, 2021

Which issue does this PR close?

re #363

Rationale for this change

As explained on #363 the high level idea goal is to make the parquet row group pruning logic generic to any types of min/max statistics (not just parquet metadata)

What changes are included in this PR?

  1. Changes the output of PruningPredicateBuilder to be a bool for each of the input statistics
  2. Moves the parquet specific functionality (aka the function signature required for the ParquetFileReader) into the parquet.rs module
  3. Returns errors from build_pruning_predicate rather than silently ignoring them (though they are still silently ignored in parquet.rs as before)
  4. Improves some docstrings

Are there any user-facing changes?

No change in parquet functionality is intended in this PR

Sequence:

My next PR will change the input of the PruningPredicateBuilder to be generic

I am trying to do this in a few small PRs to reduce review burden; Here is how I plan that they will connect together:

Planned changes:

@alamb alamb added api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate labels May 20, 2021
pub fn build_pruning_predicate(
&self,
row_group_metadata: &[RowGroupMetaData],
) -> Box<dyn Fn(&RowGroupMetaData, usize) -> bool> {
statistics: &[RowGroupMetaData],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change in this PR is this function's signature. The rest of the changes are fallout from doing that

Ok(values) => values,
// stats filter array could not be built
// return a closure which will not filter out any row groups
_ => return Box::new(|_r, _i| true),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code also (silently) ignores error so we continue that tradition in this PR

@alamb alamb force-pushed the alamb/generic_pruning_output branch from 9741d87 to c8bea51 Compare May 21, 2021 12:25
@alamb alamb marked this pull request as ready for review May 21, 2021 12:25
@alamb
Copy link
Contributor Author

alamb commented May 21, 2021

fyi @yordan-pavlov and @returnString

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Makes sense, and I agree that it is a simpler approach. The number of record batches should be small anyways that is fine to allocate a Vec<bool> here.

Great work and thanks a lot, @alamb !

datafusion/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
@codecov-commenter
Copy link

Codecov Report

Merging #370 (1db9363) into master (db4f098) will decrease coverage by 0.00%.
The diff coverage is 52.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
- Coverage   74.94%   74.93%   -0.01%     
==========================================
  Files         146      146              
  Lines       24314    24318       +4     
==========================================
+ Hits        18221    18223       +2     
- Misses       6093     6095       +2     
Impacted Files Coverage Δ
datafusion/src/physical_optimizer/pruning.rs 89.73% <38.46%> (-0.88%) ⬇️
datafusion/src/physical_plan/parquet.rs 81.76% <83.33%> (+0.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db4f098...1db9363. Read the comment docs.

@yordan-pavlov
Copy link
Contributor

yordan-pavlov commented May 21, 2021

@alamb thank you for making this more generic so that it can be useful in more cases; I do like how the error case (where a Box::new(|_r, _i| true) is returned) is now handled in a single place); overall looks like a good change.

@alamb
Copy link
Contributor Author

alamb commented May 21, 2021

(BTW @yordan-pavlov the more I work with this code the cooler I think it (and its algorithm) is. Thank you for contributing it in the first place)

@alamb alamb merged commit 8b31714 into apache:master May 24, 2021
@alamb alamb deleted the alamb/generic_pruning_output branch May 24, 2021 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants