Add parquet predicate pushdown metrics #3989
  match self
      .physical_expr
      .evaluate(&batch)
      .map(|v| v.into_array(batch.num_rows()))
  {
      Ok(array) => {
          if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-             Ok(BooleanArray::from(mask.data().clone()))
+             let bool_arr = BooleanArray::from(mask.data().clone());
+             // TODO is there a more efficient way to count the rows that are filtered?
@tustvold do you have any suggestions on how to count the number of true values in a boolean array faster/better than this?
If you don't care about nulls:
bool_arr.values().count_set_bits_offset(bool_arr.offset(), bool_arr.len())
If you do care about nulls, it is slightly more complicated. I'll get something into arrow-rs.
You could copy/paste it for now; it isn't hugely complicated.
I do care about nulls, sadly -- a row only counts if it is non-null and true.
I can file a ticket in arrow-rs if that would be helpful
Thanks again @tustvold -- I filed apache/arrow-rs#2963 and will copy/paste your implementation here for now.
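For readers following the thread, here is a minimal sketch of the "non-null and true" count discussed above. It is not the arrow-rs implementation and uses no arrow types: it assumes the values and the validity mask are plain byte slices packed least-significant-bit first (the layout Arrow bitmaps use), and the function names `bit` and `count_true_non_null` are made up for illustration. Whole bytes are ANDed and counted with `count_ones`; only the unaligned edges are checked bit by bit.

```rust
/// Read bit `i` from an LSB-first packed bitmap.
fn bit(buf: &[u8], i: usize) -> bool {
    (buf[i / 8] >> (i % 8)) & 1 == 1
}

/// Count positions in `offset..offset + len` that are set in `values` and,
/// when a validity bitmap is present, also set in `validity`.
/// Hypothetical helper for illustration; not an arrow-rs API.
fn count_true_non_null(
    values: &[u8],
    validity: Option<&[u8]>,
    offset: usize,
    len: usize,
) -> usize {
    let end = offset + len;
    let is_set = |i: usize| bit(values, i) && validity.map_or(true, |v| bit(v, i));

    // Bits before the first byte boundary (or the whole range if it is short).
    let head_end = ((offset + 7) / 8 * 8).min(end);
    // Bits after the last byte boundary.
    let tail_start = (end / 8 * 8).max(head_end);

    let head = (offset..head_end).filter(|&i| is_set(i)).count();
    let tail = (tail_start..end).filter(|&i| is_set(i)).count();

    // Whole bytes in the middle: AND values with validity, then popcount.
    let body: usize = (head_end / 8..tail_start / 8)
        .map(|b| {
            let mask = validity.map_or(0xFF, |v| v[b]);
            (values[b] & mask).count_ones() as usize
        })
        .sum();

    head + body + tail
}
```

Passing `None` for `validity` gives the "don't care about nulls" case; passing the null bitmap gives the count the metric needs.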
cc @Ted-Jiang I am thinking we can use a similar approach to validate / verify the PageIndex pruning you are working on
Sounds great!
LGTM. If there is more than one predicate in a query, should we record each predicate's input record count so we can calculate its efficiency? 🤔 That could guide the user in reordering the predicates.
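As a rough illustration of this suggestion (not something this PR implements), per-predicate row counts could be turned into a selectivity figure and used to order predicates. Everything here is hypothetical: `PredicateStats`, its fields, and `most_selective_first` are invented names, not DataFusion metrics, and a real ordering would probably also weigh evaluation time.

```rust
/// Hypothetical per-predicate counters; not part of DataFusion's metrics.
#[derive(Debug, Clone, PartialEq)]
struct PredicateStats {
    name: String,
    rows_in: usize,     // rows the predicate was evaluated on
    rows_passed: usize, // rows that were non-null and true
}

impl PredicateStats {
    /// Fraction of input rows that pass; lower means more rows pruned.
    fn selectivity(&self) -> f64 {
        if self.rows_in == 0 {
            1.0
        } else {
            self.rows_passed as f64 / self.rows_in as f64
        }
    }
}

/// Order predicates so the one that prunes the most rows comes first.
fn most_selective_first(mut stats: Vec<PredicateStats>) -> Vec<PredicateStats> {
    stats.sort_by(|a, b| a.selectivity().total_cmp(&b.selectivity()));
    stats
}
```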
It is a great idea -- I filed #3998 and will work on that next.
I am going to merge this in shortly unless there are objections, as I have several other PRs (#3976 and one in IOx) that depend on it.
I am not sure when I will have time to add per-predicate metrics -- I'll see how my other projects go.
Benchmark runs are scheduled for baseline = 71f05a3 and contender = afc299a. afc299a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
* Log error building row filters
  Inspired by @liukun4515 at https://github.com/apache/arrow-datafusion/pull/3380/files#r970198755
* Add parquet predicate pushdown metrics
* more efficient bit counting
Which issue does this PR close?
Part of #3463
Rationale for this change
I am trying to verify the correctness and efficiency of parquet predicate pushdown, so that we can turn it on in datafusion by default. Thus I want metrics telling me how many rows were pruned as well as how long it took.
What changes are included in this PR?
Are there any user-facing changes?
New metrics:
You can see them in explain analyze.