feat: Add support for Utf8Type and TimeStamp in Parquet statistics #9129

Weijun-H · 2024-02-05T04:41:54Z

Which issue does this PR close?

Closes #8295

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb · 2024-02-08T12:53:59Z

Thank you @Weijun-H -- I plan to review this PR hopefully today or tomorrow

alamb

Thank you @Weijun-H -- this looks like a great start. I really appreciate you working on this issue

I poked around and I also found the following code that does something similar (converts parquet statistics into Arrays) but that is used for Row Group Pruning:

https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57

Given I am quite confident in how that code works and it has had multiple contributors, I wonder if you would be willing to consider refactoring the parquet statistics extraction code so that it all goes through a single path?

This would look something like making summarize_min_max call get_statistic!

I think you could avoid a non-trivial amount of new code.
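To illustrate the "single path" idea, here is a minimal sketch. The `ParquetStat`, `TargetType`, and `Scalar` enums and the `min_max_as_scalars` function are hypothetical stand-ins invented for this example so that it compiles without external crates; they are not the real `ParquetStatistics`, `DataType`, or `get_statistic!` definitions. The point is only that both callers could share one conversion from a physical statistic plus a target type to a typed min/max pair.

```rust
// Hypothetical stand-ins for the real Parquet and Arrow types, so the sketch
// compiles without external crates.
#[derive(Debug, Clone, PartialEq)]
enum ParquetStat {
    Int64 { min: i64, max: i64 },
    ByteArray { min: Vec<u8>, max: Vec<u8> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TargetType {
    Int64,
    TimestampMicros,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    TimestampMicros(i64),
    Utf8(String),
}

/// Single conversion path: (physical statistic, target type) -> typed (min, max).
/// Returns None when the combination is unsupported or the bytes are not valid UTF-8.
fn min_max_as_scalars(stat: &ParquetStat, target: TargetType) -> Option<(Scalar, Scalar)> {
    match (stat, target) {
        (ParquetStat::Int64 { min, max }, TargetType::Int64) => {
            Some((Scalar::Int64(*min), Scalar::Int64(*max)))
        }
        (ParquetStat::Int64 { min, max }, TargetType::TimestampMicros) => Some((
            Scalar::TimestampMicros(*min),
            Scalar::TimestampMicros(*max),
        )),
        (ParquetStat::ByteArray { min, max }, TargetType::Utf8) => {
            let min = String::from_utf8(min.clone()).ok()?;
            let max = String::from_utf8(max.clone()).ok()?;
            Some((Scalar::Utf8(min), Scalar::Utf8(max)))
        }
        _ => None,
    }
}

fn main() {
    let stat = ParquetStat::ByteArray {
        min: b"apple".to_vec(),
        max: b"pear".to_vec(),
    };
    // In this sketch, file-level summaries and row group pruning would both
    // call the same function instead of keeping separate conversion code.
    println!("{:?}", min_max_as_scalars(&stat, TargetType::Utf8));
}
```

With this shape, adding a new supported type means adding one match arm in one place rather than in two code paths.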

alamb · 2024-02-10T22:12:49Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -1003,6 +1006,246 @@ mod tests {
        );
    }

+    #[test]
+    fn row_group_pruning_predicate_utf8() {


I believe the tests in this module are for row group pruning, which uses the statistics extraction code in
https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57, which confusingly isn't the same code used to extract statistics for the entire file.

A way to test this might be to create a parquet exec to read alltypes_plain.parquet and verify that statistics are present.

For example, I think this information is encoded in the physical_plan_with_stats line, like this:

[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]]

❯ explain verbose select * from './parquet-testing/data/alltypes_plain.parquet';
....
| physical_plan_with_stats | ParquetExec: file_groups={1 group: [[Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col], statistics=[Rows=Exact(8), Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]] |

alamb · 2024-02-10T22:14:19Z

datafusion/core/src/datasource/file_format/parquet.rs

+        ParquetStatistics::ByteArray(s)
+            if matches!(fields[i].data_type(), DataType::Utf8 | DataType::LargeUtf8) =>
+        {
+            if let Some(max_value) = &mut max_values[i] {


I believe byte arrays are also used to store DataType::Decimal values (though hopefully if we consolidate the statistics conversion code it will "just work")
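As a rough illustration of why a ByteArray statistic cannot be assumed to hold a string: in Parquet, a decimal stored in a byte array holds the unscaled value as a big-endian two's-complement integer. The helper below, `decimal_bytes_to_i128`, is a hypothetical name for this sketch and is not DataFusion's own function; it also assumes the value fits in 16 bytes and treats an empty slice as invalid.

```rust
/// Decode a big-endian two's-complement byte slice (1 to 16 bytes) into an i128.
/// Returns None for empty input or input wider than 16 bytes (an assumption of
/// this sketch, not a statement about what DataFusion does).
fn decimal_bytes_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: pad with 0xFF when the most significant bit is set.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}

fn main() {
    // 0x04D2 is 1234; with a scale of 2 this would represent 12.34.
    println!("{:?}", decimal_bytes_to_i128(&[0x04, 0xD2]));
    // 0xFF85 is -123 in two's complement.
    println!("{:?}", decimal_bytes_to_i128(&[0xFF, 0x85]));
}
```

The same bytes interpreted as UTF-8 would either fail to decode or produce a meaningless string, which is why the match arm needs to consider the target type as well as the physical type.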

Weijun-H · 2024-02-12T04:59:18Z

> Thank you @Weijun-H -- this looks like a great start. I really appreciate you working on this issue
>
> I poked around and I also found the following code that does something similar (converts parquet statistics into Arrays) but that is used for Row Group Pruning:
>
> 6c41090/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57
>
> Given I am quite confident in how that code works and it has had multiple contributors, I wonder would you be willing to consider refactoring the parquet statistics extraction code so that it all goes through a single path?
>
> This would look something like making summarize_min_max call get_statistic!
>
> I think you could avoid a non trivial amount of new code.

Yes, I also considered refactoring the code to avoid duplication. But in summarize_min_max, the Accumulator needs to call update_batch, so each arm of the match statement would have to match on the target Arrow type a second time, which increases the amount of matching. @alamb

fn summarize_min_max(/* ... */) {
    match stat {
        ParquetStatistics::Boolean(_) => {
            // get_statistic! would need to match on target_arrow_type again here
            let value = get_statistic!(/* ... */);
        }
        // ... other ParquetStatistics variants
    }
}
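One possible way around the double match, sketched under assumptions: if the shared conversion already yields a typed value, the accumulation step can compare values directly without matching on the type again. The `Scalar` enum and `MinMax` struct below are hypothetical and written only for this example; they are not DataFusion's `ScalarValue` or its `Accumulator` API, and the sketch assumes every value for a given column uses the same variant.

```rust
// Hypothetical typed value standing in for the output of a shared conversion.
// The derived ordering is only meaningful when both sides are the same variant.
#[derive(Debug, Clone, PartialEq, PartialOrd)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// Running min/max for one column across row groups.
#[derive(Debug, Default)]
struct MinMax {
    min: Option<Scalar>,
    max: Option<Scalar>,
}

impl MinMax {
    /// Fold in one row group's (min, max). No per-type match is needed here
    /// because the values arrive already typed.
    fn update(&mut self, min: Scalar, max: Scalar) {
        self.min = Some(match self.min.take() {
            Some(cur) if cur <= min => cur,
            _ => min,
        });
        self.max = Some(match self.max.take() {
            Some(cur) if cur >= max => cur,
            _ => max,
        });
    }
}

fn main() {
    let mut acc = MinMax::default();
    acc.update(Scalar::Utf8("kiwi".into()), Scalar::Utf8("pear".into()));
    acc.update(Scalar::Utf8("apple".into()), Scalar::Utf8("mango".into()));
    println!("{:?}", acc);
}
```

Whether this fits the existing Accumulator-based code is a separate question; the sketch only shows that the type dispatch can live in one place.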

github-actions · 2024-04-13T01:40:04Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

matthewmturner · 2024-04-17T15:41:21Z

@alamb @Weijun-H I have plans to pick up #8295 next week unless you both think that this can be completed before then (I haven't looked yet to see whether it makes sense to continue on this PR or make a new one).

Happy to get both of your thoughts!

alamb · 2024-04-19T11:48:31Z

> @alamb @Weijun-H i have plans to pick up #8295 next week unless you both think that this can be completed before then (I havent looked yet to see whether it makes sense to continue on this PR or make a new one).
>
> Happy to get both of your thoughts!

I don't think I will be able to make it before then, sadly.

Thank you @matthewmturner -- I think this would be a very impactful change.

Part of the challenge is that there are two copies of the statistics extraction code. A first step may be to figure out how to consolidate them.

Here is one copy (used for row group pruning):
https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L321-L329

Here is the second copy (used for file level statistics): https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L196

I think this code eventually belongs in Arrow -- see apache/arrow-rs#4328, but getting it working in DataFusion initially is probably the right thing

github-actions · 2024-06-19T01:49:19Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added the core Core DataFusion crate label Feb 5, 2024

Weijun-H force-pushed the support-utf8-stat-in-parquet branch from 90d162e to d2a4f19 Compare February 6, 2024 03:39

Weijun-H mentioned this pull request Feb 6, 2024

Support PrimitiveTypeBuilder when Logical type is Timestamp and Physical type is BYTE_ARRAY apache/arrow-rs#5365

Closed

Weijun-H added 4 commits February 7, 2024 11:26

Refactor parquet.rs: Add support for Utf8Type and ByteArray statistics

86cf9a6

support timestamp stat in row group

e4175e0

Suppot TimeStamp for Parquet Statistics

2d6abca

Fix timestamp conversion bug in Parquet row_groups.rs and statistics.rs

b792e0f

Weijun-H force-pushed the support-utf8-stat-in-parquet branch from d2a4f19 to b792e0f Compare February 7, 2024 08:21

Weijun-H added 2 commits February 7, 2024 16:25

fix fmt and clippy

4c05c91

Add row group pruning tests for different timestamp units

6c41090

Weijun-H marked this pull request as ready for review February 7, 2024 09:13

Update ParquetTimeUnit import

2192cf5

alamb mentioned this pull request Feb 8, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 5, 2024 #9121

Closed

6 tasks

alamb reviewed Feb 10, 2024

View reviewed changes

Weijun-H marked this pull request as draft February 12, 2024 01:47

alamb mentioned this pull request Feb 12, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 12, 2024 #9200

Closed

8 tasks

github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024

github-actions bot removed the Stale PR has not had any activity for some time label Apr 18, 2024

github-actions bot added the Stale PR has not had any activity for some time label Jun 19, 2024

Weijun-H closed this Jun 19, 2024
