feat: Add support for Utf8Type and TimeStamp in Parquet statistics #9129
Conversation
Force-pushed from 90d162e to d2a4f19
Force-pushed from d2a4f19 to b792e0f
Thank you @Weijun-H -- I plan to review this PR hopefully today or tomorrow.
Thank you @Weijun-H -- this looks like a great start. I really appreciate you working on this issue.
I poked around and I also found the following code that does something similar (converts parquet statistics into Arrays), but that is used for Row Group Pruning:
Given that I am quite confident in how that code works and it has had multiple contributors, I wonder if you would be willing to consider refactoring the parquet statistics extraction code so that it all goes through a single path?
This would look something like making summarize_min_max call get_statistic!
I think you could avoid a non-trivial amount of new code.
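To make the suggestion concrete, here is a minimal, hypothetical sketch of what a single conversion path could look like. The helper name `parquet_stat_to_scalar` and its signature are invented for illustration; this is not the actual get_statistic! macro or DataFusion API:

```rust
// Hypothetical sketch only: one shared routine that both summarize_min_max
// (file-level statistics) and the row-group pruning code could call.
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;
use parquet::file::statistics::Statistics as ParquetStatistics;

/// Convert a single parquet column statistic (min or max) into a ScalarValue
/// of the requested Arrow type.
fn parquet_stat_to_scalar(
    stat: &ParquetStatistics,
    target_type: &DataType,
    want_max: bool,
) -> Option<ScalarValue> {
    // Real code must handle statistics that are not present at all.
    if !stat.has_min_max_set() {
        return None;
    }
    match (stat, target_type) {
        (ParquetStatistics::Int32(s), DataType::Int32) => {
            let v = if want_max { *s.max() } else { *s.min() };
            Some(ScalarValue::Int32(Some(v)))
        }
        (ParquetStatistics::ByteArray(s), DataType::Utf8) => {
            let v = if want_max { s.max() } else { s.min() };
            v.as_utf8()
                .ok()
                .map(|v| ScalarValue::Utf8(Some(v.to_string())))
        }
        // ... remaining (physical type, arrow type) combinations
        _ => None,
    }
}
```

With something along these lines, summarize_min_max would only be responsible for folding each converted value into the running min/max accumulators.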
@@ -1003,6 +1006,246 @@ mod tests {
        );
    }

    #[test]
    fn row_group_pruning_predicate_utf8() {
I believe the tests in this module are for row group pruning, which uses the statistics extraction code in
https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57, which confusingly isn't the same code used to extract statistics for the entire file.
A way to test this might be to create a ParquetExec that reads alltypes_plain.parquet and verify that statistics are present.
For example, I think this information is encoded in the physical_plan_with_stats line, like this:
[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]]
❯ explain verbose select * from './parquet-testing/data/alltypes_plain.parquet';
....
| physical_plan_with_stats | ParquetExec: file_groups={1 group: [[Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col], statistics=[Rows=Exact(8), Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]] |
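As a rough sketch of such a test (the table name, file path, and the string asserted on below are illustrative assumptions, not taken from this PR):

```rust
// Illustrative test sketch only. It registers the parquet-testing file and
// checks that the EXPLAIN VERBOSE output contains the statistics line; a real
// test would assert on the actual min/max values for the Utf8 columns.
use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::test]
async fn file_level_statistics_for_alltypes_plain() -> Result<()> {
    let ctx = SessionContext::new();
    // The path is an assumption; it depends on where the test binary runs.
    ctx.register_parquet(
        "alltypes_plain",
        "./parquet-testing/data/alltypes_plain.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    let batches = ctx
        .sql("EXPLAIN VERBOSE SELECT * FROM alltypes_plain")
        .await?
        .collect()
        .await?;

    let explain = pretty_format_batches(&batches)?.to_string();
    assert!(explain.contains("physical_plan_with_stats"));
    Ok(())
}
```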
ParquetStatistics::ByteArray(s)
    if matches!(fields[i].data_type(), DataType::Utf8 | DataType::LargeUtf8) =>
{
    if let Some(max_value) = &mut max_values[i] {
I believe byte arrays are also used to store DataType::Decimal values (though hopefully, if we consolidate the statistics conversion code, it will "just work").
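For illustration, a sketch of why the ByteArray arm may need to fan out by the target Arrow type. The helper name here is hypothetical, and the Decimal handling is simplified to the ByteArray physical type only:

```rust
// Sketch only: parquet ByteArray statistics may back Utf8 columns, but they
// can also back Decimal128 columns, whose statistics are stored as big-endian
// two's complement unscaled integers.
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;
use parquet::data_type::ByteArray;

fn byte_array_stat_to_scalar(bytes: &ByteArray, target: &DataType) -> Option<ScalarValue> {
    match target {
        DataType::Utf8 => bytes
            .as_utf8()
            .ok()
            .map(|s| ScalarValue::Utf8(Some(s.to_string()))),
        DataType::Decimal128(precision, scale) => {
            let data = bytes.data();
            if data.is_empty() || data.len() > 16 {
                return None;
            }
            // Sign-extend the big-endian bytes to a full 16-byte i128.
            let fill = if data[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut be = [fill; 16];
            be[16 - data.len()..].copy_from_slice(data);
            Some(ScalarValue::Decimal128(
                Some(i128::from_be_bytes(be)),
                *precision,
                *scale,
            ))
        }
        _ => None,
    }
}
```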
Yes, I was also considering refactoring the code to avoid duplication. But inside fn summarize_min_max it would look something like this, where we still need to match the target arrow type again:

fn summarize_min_max(...) {
    match stat {
        ParquetStatistics::Boolean(_) => {
            let value = get_statistic!(...); // need to match target_arrow_type again
        }
        // ... other variants
    }
}
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
I don't think I will be able to make it before then, sadly.

Thank you @matthewmturner -- I think this would be a very impactful change. Part of the challenge is that there are two copies of the statistics extraction code. A first step may be to figure out how to consolidate them.

Here is one copy (used for row group pruning):

Here is the second copy (used for file level statistics): https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L196

I think this code eventually belongs in Arrow -- see apache/arrow-rs#4328, but getting it working in DataFusion initially is probably the right thing.
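To sketch what that consolidation might look like, here is one hypothetical shape for a single entry point. The function name and signature are invented for illustration and are not the eventual arrow-rs or DataFusion API:

```rust
// Hypothetical shape for a single statistics-extraction entry point that both
// row-group pruning and file-level statistics could build on: one Arrow array
// per column, with one element per row group.
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use datafusion_common::{DataFusionError, Result};
use parquet::file::metadata::RowGroupMetaData;
use parquet::file::statistics::Statistics as ParquetStatistics;

/// Extract the minimum value of `column_idx` for each row group as an Arrow
/// array of `target_type` (null where the statistic is missing).
fn min_statistics_as_array(
    row_groups: &[RowGroupMetaData],
    column_idx: usize,
    target_type: &DataType,
) -> Result<ArrayRef> {
    match target_type {
        DataType::Int32 => {
            let values: Vec<Option<i32>> = row_groups
                .iter()
                .map(|rg| match rg.column(column_idx).statistics() {
                    Some(ParquetStatistics::Int32(s)) if s.has_min_max_set() => Some(*s.min()),
                    _ => None,
                })
                .collect();
            Ok(Arc::new(Int32Array::from(values)) as ArrayRef)
        }
        // ... the other Arrow types (Utf8, Timestamp, Decimal128, ...) would
        // be handled the same way, in one place.
        other => Err(DataFusionError::NotImplemented(format!(
            "min statistics not implemented for {other}"
        ))),
    }
}
```

A helper along these lines would let the Utf8 and Timestamp support added in this PR serve both call sites at once.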
Which issue does this PR close?
Closes #8295
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?