-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics for parquet page level skipping #4105
Conversation
Signed-off-by: yangjiang <yangjiang@ebay.com>
|
||
let metrics = rt.parquet_exec.metrics().unwrap(); | ||
|
||
// todo fix this https://github.com/apache/arrow-rs/issues/2941 release change to row limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/arrow-rs/pull/2942/files#r1013838557
I think theres is a bug in should_add_data_page
self.encoder.num_values()
always zero.
So no matter how to set data_pagesize_limit
and write_batch_size
always return 1 page in on colunn chunk.
🤔 I think someone metion this before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But from the real world file, this metic works fine😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in #4039
@@ -449,7 +449,7 @@ impl FileOpener for ParquetOpener { | |||
// page index pruning: if all data on individual pages can | |||
// be ruled using page metadata, rows from other columns | |||
// with that range can be skipped as well | |||
if let Some(row_selection) = enable_page_index | |||
if let Some(row_selection) = (enable_page_index && !row_groups.is_empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when all rowGroups are pruned by rg_metadata(min max), we do this skip fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @Ted-Jiang -- thank you 🎉
Note that apache/arrow-rs#2941 should be included as part of #4039 so hopefully we can fix it soon.
I think it would be fine to either merge this PR as is and fix the test as a follow on, or else wait for #4039 to merge and then update this PR with the better stats.
@@ -63,13 +67,22 @@ impl ParquetFileMetrics { | |||
let pushdown_eval_time = MetricBuilder::new(metrics) | |||
.with_new_label("filename", filename.to_string()) | |||
.subset_time("pushdown_eval_time", partition); | |||
let page_index_rows_filtered = MetricBuilder::new(metrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: yangjiang <yangjiang@ebay.com>
@alamb fix ut in use page row limit fix ut. just happened to arrow be released 😂 |
Signed-off-by: yangjiang <yangjiang@ebay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @Ted-Jiang |
Benchmark runs are scheduled for baseline = b7a3331 and contender = 4d23cae. 4d23cae is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Signed-off-by: yangjiang yangjiang@ebay.com
Which issue does this PR close?
Closes #4086 .
Rationale for this change
we can get metric like
What changes are included in this PR?
Are there any user-facing changes?
We i try to create single column with multi pages, i think there is a bug in page size check. but i found apache/arrow-rs#2941 add page row number check(not release).