Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for Utf8Type and TimeStamp in Parquet statistics #9129

Closed
wants to merge 7 commits into from

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Feb 5, 2024

Which issue does this PR close?

Closes #8295

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Weijun-H Weijun-H marked this pull request as ready for review February 7, 2024 09:13
@alamb
Copy link
Contributor

alamb commented Feb 8, 2024

Thank you @Weijun-H -- I plan to review this PR hopefully today or tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H -- this looks like a great start. I really appreciate you working on this issue

I poked around and I also found the following code that does something similar (converts parquet statistics into Arrays) but that is used for Row Group Pruning:

https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57

Given I am quite confident in how that code works and it has had multiple contributors, I wonder would you be willing to consider refactoring the parquet statistics extraction code so that it all goes through a single path?

This would look something like making summarize_min_max call get_statistic!

I think you could avoid a non trivial amount of new code.

@@ -1003,6 +1006,246 @@ mod tests {
);
}

#[test]
fn row_group_pruning_predicate_utf8() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the tests in this module are for row group pruning which use the statistics extraction code in
https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57, which confusingly isn't the same code used to extract statistics for the entire file.

A way to test this might be to create a parquet exec to read alltypes_plain.parquet' and verify that statistics are present

For example, I think this information is encoded in the physical_plan_with_stats line like this

[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]]     
❯ explain verbose select * from './parquet-testing/data/alltypes_plain.parquet';
....
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col], statistics=[Rows=Exact(8), Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]]                                                                                                                                                                                                                              |
|                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

ParquetStatistics::ByteArray(s)
if matches!(fields[i].data_type(), DataType::Utf8 | DataType::LargeUtf8) =>
{
if let Some(max_value) = &mut max_values[i] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe byte arrays are also used to store DataType::Decimal values as well (though hopefully if we consolidate the statistics conversion code it will "just work")

@Weijun-H Weijun-H marked this pull request as draft February 12, 2024 01:47
@Weijun-H
Copy link
Member Author

Weijun-H commented Feb 12, 2024

Thank you @Weijun-H -- this looks like a great start. I really appreciate you working on this issue

I poked around and I also found the following code that does something similar (converts parquet statistics into Arrays) but that is used for Row Group Pruning:

6c41090/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57

Given I am quite confident in how that code works and it has had multiple contributors, I wonder would you be willing to consider refactoring the parquet statistics extraction code so that it all goes through a single path?

This would look something like making summarize_min_max call get_statistic!

I think you could avoid a non trivial amount of new code.

Yes, I also consider refactoring the code to avoid code duplication. But in summarize_min_max, the Accumulator needs to update_batch, which will increase the number of times in the match statement. @alamb

fn summarize_min_max{
  match stat {
    ParquetStatistics::Boolean => {
          let value = get_statistic!(); // need to match target_arrow_type again
    }
  }
}

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024
@matthewmturner
Copy link
Contributor

@alamb @Weijun-H i have plans to pick up #8295 next week unless you both think that this can be completed before then (I havent looked yet to see whether it makes sense to continue on this PR or make a new one).

Happy to get both of your thoughts!

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Apr 18, 2024
@alamb
Copy link
Contributor

alamb commented Apr 19, 2024

@alamb @Weijun-H i have plans to pick up #8295 next week unless you both think that this can be completed before then (I havent looked yet to see whether it makes sense to continue on this PR or make a new one).

Happy to get both of your thoughts!

I don't think I will be able to make it before then, sadly.

Thank you @matthewmturner -- I think this would be a very impactful change.

Part of the challenge is that there are two copies of the statistics extraction code. A first step may be to figure out how consolidate that

Here is one copy (used for row group pruning):
https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L321-L329

Here is the second copy (used for file level statistics): https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L196

I think this code eventually belongs in Arrow -- see apache/arrow-rs#4328, but getting it working in DataFusion initially is probably the right thing

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 19, 2024
@Weijun-H Weijun-H closed this Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ParquetExec::statistics() does not read statistics for many column types (like timstamps, strings, etc)
3 participants