Add parquet predicate pushdown metrics #3989

alamb · 2022-10-27T19:35:47Z

Which issue does this PR close?

Part of #3463

Rationale for this change

I am trying to verify the correctness and efficiency of parquet predicate pushdown, so that we can turn it on in datafusion by default. Thus I want metrics telling me how many rows were pruned as well as how long it took.

What changes are included in this PR?

Add a metric for how many rows were pruned using predicate pushdown
Add a metric for how long the pruning took
Add tests

Are there any user-facing changes?

New metrics:

You can see them in explain analyze:

+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[COUNT(UInt8(1))@0 as COUNT(UInt8(1))], metrics=[output_rows=1, elapsed_compute=291ns, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |   AggregateExec: mode=Final, gby=[], aggr=[COUNT(UInt8(1))], metrics=[output_rows=1, elapsed_compute=7.855µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |     CoalescePartitionsExec, metrics=[output_rows=16, elapsed_compute=13.775µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |       AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))], metrics=[output_rows=16, elapsed_compute=38.622µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |         CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=25550, elapsed_compute=213.035µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |           FilterExec: container@0 = database_container_1, metrics=[output_rows=25550, elapsed_compute=185.685µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[send_time{inputPartition=0}=4.525µs, fetch_time{inputPartition=0}=4.6344ms, repart_time{inputPartition=0}=1ns]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |               ParquetExec: limit=None, partitions=[data.parquet], predicate=container_min@0 <= database_container_1 AND database_container_1 <= container_max@1, projection=[container], metrics=[output_rows=25550, elapsed_compute=1ns, spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned{filename=data.parquet}=0, bytes_scanned{filename=data.parquet}=354, num_predicate_creation_errors=0, pushdown_rows_filtered{filename=data.parquet}=225441, predicate_evaluation_errors{filename=data.parquet}=0, time_elapsed_processing=4.43559ms, pushdown_eval_time{filename=data.parquet}=3.602505ms, time_elapsed_scanning=4.247177ms, time_elapsed_opening=285.456µs] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

You can see:

pushdown_rows_filtered{filename=data.parquet}=225441
pushdown_eval_time{filename=data.parquet}=3.602505ms
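As a rough illustration of how these numbers can be read, the sketch below pulls integer metrics out of a `metrics=[...]` string like the one above and estimates what fraction of rows the pushed-down predicate removed. The helper names `metric_value` and `pruned_fraction` are hypothetical and not part of DataFusion. It assumes entries are comma-separated, that labels in `{...}` contain no commas, and that `output_rows + pushdown_rows_filtered` approximates the rows decoded (which holds only when no row groups were pruned, as in the plan above).

```rust
/// Return the integer value of the first metric named `name`.
/// Labels such as `{filename=data.parquet}` are ignored when matching.
fn metric_value(metrics: &str, name: &str) -> Option<u64> {
    metrics.split(',').find_map(|entry| {
        // Split on the LAST '=' because labels may contain '=' too.
        let (key, value) = entry.trim().rsplit_once('=')?;
        let base = key.split('{').next()?;
        if base == name {
            value.parse().ok()
        } else {
            None
        }
    })
}

/// Fraction of rows removed by predicate pushdown, under the assumption
/// that rows decoded = rows output + rows filtered.
fn pruned_fraction(metrics: &str) -> Option<f64> {
    let filtered = metric_value(metrics, "pushdown_rows_filtered")? as f64;
    let output = metric_value(metrics, "output_rows")? as f64;
    let total = filtered + output;
    if total == 0.0 {
        None
    } else {
        Some(filtered / total)
    }
}
```

Applied to the `ParquetExec` metrics shown above (`output_rows=25550`, `pushdown_rows_filtered=225441`), `pruned_fraction` comes out a little under 0.9, meaning roughly nine in ten rows were dropped before reaching `FilterExec`.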

Inspired by @liukun4515 at https://github.com/apache/arrow-datafusion/pull/3380/files#r970198755

alamb · 2022-10-27T20:12:35Z

datafusion/core/src/physical_plan/file_format/row_filter.rs

        match self
            .physical_expr
            .evaluate(&batch)
            .map(|v| v.into_array(batch.num_rows()))
        {
            Ok(array) => {
                if let Some(mask) = array.as_any().downcast_ref::<BooleanArray>() {
-                    Ok(BooleanArray::from(mask.data().clone()))
+                    let bool_arr = BooleanArray::from(mask.data().clone());
+                    // TODO is there a more efficient way to count the rows that are filtered?


@tustvold do you have any suggestions on how to count the number of true values in a boolean array faster/better than this?

If you don't care about nulls

bool_arr.values().count_set_bits_offset(bool_arr.offset(), bool_arr.len())

If you do care about nulls it is slightly more complicated, I'll get something into arrow-rs

apache/arrow-rs#2957

Could copy paste for now, it isn't hugely complicated

Sadly, I do care about nulls: a row only counts if the value is non-null and true.

I can file a ticket in arrow-rs if that would be helpful

Thanks again @tustvold -- I filed apache/arrow-rs#2963 and will copy/paste your implementation here for now
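For readers following the thread, here is a minimal sketch of the null-aware count being discussed. It works on plain byte slices rather than arrow-rs types so that it stands alone. The name `true_count` echoes the one proposed in apache/arrow-rs#2957, but the body is illustrative and is not the arrow-rs implementation. It assumes LSB-first bit order within each byte, as in the Arrow format, and checks one bit at a time for clarity; a production version would more likely AND whole words and use a popcount.

```rust
/// Test bit `i` of an LSB-first packed bitmap.
fn bit_is_set(buf: &[u8], i: usize) -> bool {
    (buf[i / 8] & (1u8 << (i % 8))) != 0
}

/// Count positions in `offset..offset + len` that are true AND non-null.
/// `validity` is `None` when the array has no null bitmap, in which case
/// every position counts as valid.
fn true_count(values: &[u8], validity: Option<&[u8]>, offset: usize, len: usize) -> usize {
    (offset..offset + len)
        .filter(|&i| bit_is_set(values, i) && validity.map_or(true, |v| bit_is_set(v, i)))
        .count()
}
```

With values `0b0000_1101` and validity `0b0000_0111`, positions 0, 2 and 3 are true but position 3 is null, so `true_count(&values, Some(&validity[..]), 0, 4)` counts only positions 0 and 2.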

alamb · 2022-10-27T20:21:55Z

cc @Ted-Jiang I am thinking we can use a similar approach to validate / verify the PageIndex pruning you are working on

Ted-Jiang · 2022-10-28T03:09:22Z

> cc @Ted-Jiang I am thinking we can use a similar approach to validate / verify the PageIndex pruning you are working on

Sounds great!

Ted-Jiang

LGTM. If a query has more than one predicate, should we record each predicate's input row count so we can calculate its efficiency? 🤔 That could guide the user in reordering the predicates.

alamb · 2022-10-28T14:14:12Z

> LGTM. If a query has more than one predicate, should we record each predicate's input row count so we can calculate its efficiency? 🤔 That could guide the user in reordering the predicates.

It is a great idea -- filed #3998 and I will work on that next
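To make the per-predicate idea concrete, here is a small sketch of what such bookkeeping could look like. Everything in it (`PredicateStats`, `filter_with_stats`, the field names) is hypothetical and only illustrates the suggestion; it is not the design tracked in #3998 and does not use DataFusion's metrics API. Predicates are applied in order, so each one sees only the rows that survived the previous one.

```rust
use std::time::{Duration, Instant};

/// Counters for a single predicate (hypothetical names).
#[derive(Debug, Default)]
struct PredicateStats {
    rows_in: usize,
    rows_matched: usize,
    eval_time: Duration,
}

impl PredicateStats {
    /// Fraction of input rows that passed; `None` if nothing was evaluated.
    fn selectivity(&self) -> Option<f64> {
        (self.rows_in > 0).then(|| self.rows_matched as f64 / self.rows_in as f64)
    }
}

/// Apply `predicates` in order, updating one `PredicateStats` per predicate.
/// `stats` is expected to have the same length as `predicates`.
fn filter_with_stats<T>(
    mut rows: Vec<T>,
    predicates: &[&dyn Fn(&T) -> bool],
    stats: &mut [PredicateStats],
) -> Vec<T> {
    for (pred, stat) in predicates.iter().zip(stats.iter_mut()) {
        let start = Instant::now();
        stat.rows_in += rows.len();
        rows.retain(|r| pred(r));
        stat.rows_matched += rows.len();
        stat.eval_time += start.elapsed();
    }
    rows
}
```

A predicate with low selectivity and low `eval_time` is a good candidate to run first, which is the kind of reordering hint the comment above is asking for.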

alamb · 2022-10-28T16:58:14Z

Unless there are objections, I am going to merge this shortly, as several other PRs (#3976 and one in IOx) depend on it.

alamb · 2022-10-30T11:32:37Z

I am not sure when I will have time to add per-predicate metrics -- I'll see how my other projects go.

ursabot · 2022-10-30T11:42:09Z

Benchmark runs are scheduled for baseline = 71f05a3 and contender = afc299a. afc299a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

* Log error building row filters (inspired by @liukun4515 at https://github.com/apache/arrow-datafusion/pull/3380/files#r970198755)
* Add parquet predicate pushdown metrics
* more efficient bit counting

Log error building row filters

d7f2fba

Inspired by @liukun4515 at https://github.com/apache/arrow-datafusion/pull/3380/files#r970198755

alamb marked this pull request as draft October 27, 2022 19:35

github-actions bot added the core Core DataFusion crate label Oct 27, 2022

Add parquet predicate pushdown metrics

b7171b4

alamb force-pushed the alamb/parquet_predicate_pushdown_metrics branch from 576d488 to b7171b4 Compare October 27, 2022 20:00

alamb requested a review from thinkharderdev October 27, 2022 20:04

alamb marked this pull request as ready for review October 27, 2022 20:04

tustvold mentioned this pull request Oct 27, 2022

Add BooleanArray::true_count and BooleanArray::false_count apache/arrow-rs#2957

Merged

Ted-Jiang approved these changes Oct 28, 2022

alamb mentioned this pull request Oct 28, 2022

Optimized way to count the numbers of true and false values in a BooleanArray apache/arrow-rs#2963

Closed

Merge remote-tracking branch 'apache/master' into alamb/parquet_predicate_pushdown_metrics

c883f06

alamb mentioned this pull request Oct 28, 2022

Record per-predicate statistics for effectiveness of parquet predicate pushdown #3998

Open

more efficient bit counting

2b3f70c

alamb mentioned this pull request Oct 28, 2022

Correctness integration test for parquet filter pushdown #3976

Merged

alamb merged commit afc299a into apache:master Oct 30, 2022

alamb deleted the alamb/parquet_predicate_pushdown_metrics branch October 30, 2022 11:32

alamb mentioned this pull request Oct 30, 2022

Fix predicate pushdown bugs: project columns within DatafusionArrowPredicate (#4005) (#4006) #4021

Merged

Ted-Jiang mentioned this pull request Nov 1, 2022

Add parquet page index pushdown metrics #4058

Closed

alamb mentioned this pull request Nov 2, 2022

Add metrics for parquet page level skipping #4086

Closed

alamb mentioned this pull request Nov 7, 2022

Minor: Use upstream BooleanArray::true_count #4129

Merged
