Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pushdown multi-columns in PageIndex pruning. #3967

Merged
merged 5 commits into from
Nov 2, 2022

Conversation

Ted-Jiang
Copy link
Member

@Ted-Jiang Ted-Jiang commented Oct 26, 2022

Signed-off-by: yangjiang yangjiang@ebay.com

Which issue does this PR close?

Closes #3834.
Closes #4002

Rationale for this change

What changes are included in this PR?

Thanks the original pic from @alamb ❤️

I tried to draw a diagram to illustrate what I think is going on:

┏━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ 
   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ┃
┃     ┌──────────────┐  │     ┌──────────────┐  │  ┃
┃  │  │              │     │  │              │     ┃
┃     │              │  │     │     Page     │  │   
   │  │              │     │  │      3       │     ┃
┃     │              │  │     │   min: "A"   │  │  ┃
┃  │  │              │     │  │   max: "C"   │     ┃
┃     │     Page     │  │     │ first_row: 0 │  │   
   │  │      1       │     │  │              │     ┃
┃     │   min: 10    │  │     └──────────────┘  │  ┃
┃  │  │   max: 20    │     │  ┌──────────────┐     ┃
┃     │ first_row: 0 │  │     │              │  │   
   │  │              │     │  │     Page     │     ┃
┃     │              │  │     │      4       │  │  ┃
┃  │  │              │     │  │   min: "D"   │     ┃
┃     │              │  │     │   max: "G"   │  │   
   │  │              │     │  │first_row: 100│     ┃
┃     └──────────────┘  │     │              │  │  ┃
┃  │  ┌──────────────┐     │  │              │     ┃
┃     │              │  │     └──────────────┘  │   
   │  │     Page     │     │  ┌──────────────┐     ┃
┃     │      2       │  │     │              │  │  ┃
┃  │  │   min: 30    │     │  │     Page     │     ┃
┃     │   max: 40    │  │     │      5       │  │   
   │  │first_row: 200│     │  │   min: "H"   │     ┃
┃     │              │  │     │   max: "Z"   │  │  ┃
┃  │  │              │     │  │first_row: 250│     ┃
┃     └──────────────┘  │     │              │  │   
   │                       │  └──────────────┘     ┃
┃   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  ┃
┃       ColumnChunk            ColumnChunk         ┃
┃            A                      B               
 ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━┛
                                                    
       total rows: 300                                            

Multi Column Predicates

Given the predicate A > 35 AND B = "F":

Using A > 35, we can rule out page 1 using statistics as before, get selector(200~300)

Using the B = "F" part, we could rule out pages 3 and 5 (row_index 0->99 and row_index 250->onward) , get selector(100~244)

Finally do the intersection get selector(200~244)

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 26, 2022
offset_indexes: &'a Vec<Vec<PageLocation>>,
parquet_schema: &'a Schema,
col_id: usize,
col_page_indexes: &'a Index,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplify the PagesPruningStatistics no need pass all cols index!

//
// returned: NNNNNNNNY
// set `need_combine` true will combine result: Select(2) + Select(1) + Skip(2) -> Select(3) + Skip(2)
pub(crate) fn intersect_row_selection(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will move this to arrow-rs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: yangjiang <yangjiang@ebay.com>
metrics.predicate_evaluation_errors.add(1);
return Ok(vec![RowSelector::select(group.num_rows() as usize)]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb related to #4002 add one fallback, which select all rows.

got

❯ select count(*) from foo where container = 'backend_container_1';
[2022-10-29T15:51:31Z ERROR datafusion::physical_plan::file_format::parquet] Error evaluating page index predicate values Error during planning: Can not create statistics record batch: Invalid argument error: Column 'container_min' is declared as non-nullable but contains null values
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15963           |
+-----------------+
1 row in set. Query took 0.035 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! I hope to have #3976 ready for review later today -- with that additional testing coverage I will feel pretty good about this particular optimization 🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this PR fixes #4002 -- I have also prepared a PR to enable page filtering in the integration test #4062 which we can run when this PR is merged

Signed-off-by: yangjiang <yangjiang@ebay.com>
@Ted-Jiang Ted-Jiang marked this pull request as ready for review October 29, 2022 16:25
@alamb
Copy link
Contributor

alamb commented Nov 1, 2022

Sorry @Ted-Jiang -- i will review this PR today

@alamb
Copy link
Contributor

alamb commented Nov 1, 2022

@Ted-Jiang - I took the liberty of merging up from master and fixing a merge conflict

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @Ted-Jiang 🦾 🏆 -- I think it can be merged as is.

I am impressed with the tests in this PR and I also verified that this optimization passes my parquet predicate torture integration test #4062 that found #4002

I have some follow up suggestions but I think we can do them as follow on PRs if you prefer

Other items to do:

  1. Add statistics to verify and report on the efficiency of page index pruning (I will file a follow on ticket)
  2. Add some more variations to the parquet predicate integration tests to ensure multiple pages are being used (will file a follow on ticket)

datafusion/core/src/physical_plan/file_format/parquet.rs Outdated Show resolved Hide resolved
}),
);
} else {
// fallback select all rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/core/src/physical_plan/file_format/parquet.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/file_format/parquet.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/file_format/parquet.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/file_format/parquet.rs Outdated Show resolved Hide resolved
}
}
_ => {
if a.row_count < b.row_count {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be taking which one of a or b was skipped (rather than just the one that was smaller)? I tried messing around and trying to write some test that failed, but was not able to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think two element at least one is skip, we can sure about skip the min len is correct🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// create filter (year = 2009 and id = 1) or (year = 2010)
// this filter use two columns will not push down
// todo but after use CNF rewrite it could rewrite to (year = 2009 or year = 2010) and (id = 1 or year = 2010)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the CNF rewrite comment here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we file a ticket do some tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what tests you have in mind here so I did not file a ticket -- perhaps you could add to one of the existing tickets (#4087 ?) or file a new one?

@Ted-Jiang
Copy link
Member Author

I took the liberty of merging up from master and fixing a merge conflict

Thanks for the helping!😄

Add statistics to verify and report on the efficiency of page index pruning (I will file a follow on ticket)

Add some more variations to the parquet predicate integration tests to ensure multiple pages are being used (will file a follow on ticket)

Agree! will try to familiar with the IT, support these✌️

Signed-off-by: yangjiang <yangjiang@ebay.com>
Signed-off-by: yangjiang <yangjiang@ebay.com>
@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

Looks good -- thanks @Ted-Jiang -- I'll merge this in and file some follow on tickets for the remaining items

@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

I filed #4085 to track enabling this feature by default

@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

Filed #4086 to track adding statistics

@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

Filed #4087 for increasing test coverage

@ursabot
Copy link

ursabot commented Nov 2, 2022

Benchmark runs are scheduled for baseline = 065a478 and contender = 1a5f6ab. 1a5f6ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
* Support pushdown multi-columns in PageIndex pruning.

Signed-off-by: yangjiang <yangjiang@ebay.com>

* fix test

Signed-off-by: yangjiang <yangjiang@ebay.com>

* fix comments

Signed-off-by: yangjiang <yangjiang@ebay.com>

* avoid extract predicates when enable is false

Signed-off-by: yangjiang <yangjiang@ebay.com>

Signed-off-by: yangjiang <yangjiang@ebay.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
3 participants