Extract parquet statistics to its own module, add tests #8294
Conversation
@@ -303,112 +298,6 @@ struct RowGroupPruningStatistics<'a> {
    parquet_schema: &'a Schema,
}

/// Extract the min/max statistics from a `ParquetStatistics` object
macro_rules! get_statistic {
This macro is moved, without modification, into statistics.rs
    .find(&column.name)
    .map(|(_idx, field)| field)?;

RowGoupStatisticsConverter::new(&field)
The idea here is (eventually) to prune more than one row group at a time. However, this PR still does it one at a time
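To make that direction concrete, here is a minimal sketch of an API that accepts any number of row groups and returns their statistics as one Arrow array; the converter name, types, and signature are assumptions for illustration, not the exact API in this PR:

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};

/// Hypothetical converter for one column's statistics, greatly simplified:
/// the per-row-group minimums are assumed to be already extracted as Option<i64>.
struct MinStatsConverter;

impl MinStatsConverter {
    /// Accepts minimums for any number of row groups and returns them as a
    /// single ArrayRef, so a caller could prune many row groups in one call.
    fn min<I: IntoIterator<Item = Option<i64>>>(&self, row_group_mins: I) -> ArrayRef {
        Arc::new(Int64Array::from_iter(row_group_mins)) as ArrayRef
    }
}

fn main() {
    let converter = MinStatsConverter;
    // Today DataFusion passes a single row group at a time...
    assert_eq!(converter.min([Some(42)]).len(), 1);
    // ...but the same interface naturally extends to many row groups at once.
    assert_eq!(converter.min([Some(42), None, Some(7)]).len(), 3);
}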
@@ -718,28 +719,6 @@ pub async fn plan_to_parquet(
    Ok(())
}

// Copy from the arrow-rs
Moved to statistics.rs
/// * `$func` is the function (`min`/`max`) to call to get the value
/// * `$bytes_func` is the function (`min_bytes`/`max_bytes`) to call to get the value as bytes
/// * `$target_arrow_type` is the [`DataType`] of the target statistics
macro_rules! get_statistic {
This implementation leaves a lot to be desired, but I want to get tests in place before I start changing it
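As a rough illustration of the macro shape described in the doc comment above, here is a greatly simplified, self-contained version; the Stats stand-in enum and the single `$func` parameter are assumptions (the real get_statistic! matches on parquet's ParquetStatistics variants and also takes `$bytes_func` and `$target_arrow_type`):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, Int64Array};

// Stand-in for parquet's typed statistics (illustrative only).
struct TypedStats<T> { min: T, max: T }
impl<T: Copy> TypedStats<T> {
    fn min(&self) -> T { self.min }
    fn max(&self) -> T { self.max }
}
enum Stats {
    Int32(TypedStats<i32>),
    Int64(TypedStats<i64>),
}

/// `$func` is the accessor (`min`/`max`) invoked on whichever typed variant
/// the statistics happen to be, producing a one-element Arrow array.
macro_rules! get_statistic {
    ($stats:expr, $func:ident) => {
        match $stats {
            Stats::Int32(s) => Arc::new(Int32Array::from(vec![s.$func()])) as ArrayRef,
            Stats::Int64(s) => Arc::new(Int64Array::from(vec![s.$func()])) as ArrayRef,
        }
    };
}

fn main() {
    let stats = Stats::Int32(TypedStats { min: 1, max: 9 });
    assert_eq!(get_statistic!(&stats, min).len(), 1);
    assert_eq!(get_statistic!(&stats, max).len(), 1);
}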
) -> Result<ArrayRef> {
    let mut row_group_meta_data = row_group_meta_data.into_iter().peekable();

    // if it is empty, return empty array
this handling of empty iterators is new, to support the new array ref interface
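A minimal sketch of that empty-iterator handling, with the row-group metadata replaced by pre-extracted Option<i32> minimums (an assumption to keep the example self-contained):

use std::sync::Arc;
use arrow::array::{new_empty_array, ArrayRef, Int32Array};
use arrow::datatypes::DataType;

/// Returns per-row-group minimums as a single array; with no row groups at all,
/// the ArrayRef-based interface still needs a (zero length) array of the right type.
fn min_values<I: IntoIterator<Item = Option<i32>>>(row_group_mins: I) -> ArrayRef {
    let mut iter = row_group_mins.into_iter().peekable();
    if iter.peek().is_none() {
        // if it is empty, return empty array
        return new_empty_array(&DataType::Int32);
    }
    Arc::new(Int32Array::from_iter(iter)) as ArrayRef
}

fn main() {
    assert_eq!(min_values(std::iter::empty()).len(), 0);
    assert_eq!(min_values([Some(1), None, Some(3)]).len(), 3);
}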
    }
}

#[test]
I added a bunch of tests for reading statistics out of existing files to document what the current behavior is. Sadly, all of the example files in parquet-testing appear to have a single row group.
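For reference, a small standalone sketch of how to check that; the file path is just an example from the parquet-testing repository, and the API shown is the parquet crate's serialized reader rather than anything added by this PR:

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Example file; most files in parquet-testing contain a single row group,
    // which limits how well multi-row-group statistics can be exercised.
    let file = File::open("parquet-testing/data/alltypes_plain.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    println!("row groups: {}", reader.metadata().num_row_groups());
    Ok(())
}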
I would also be interested in opinions about potentially moving this implementation upstream into parquet-rs eventually.
FYI @viirya @liukun4515 and @Ted-Jiang
Left some comments, I think part of the confusion at the moment is that the current logic does not make a clear distinction between leaf and group columns. I think this will make it very hard to correctly handle parquet logical type mapping, etc... I would recommend making this explicit, e.g. by making the statistics conversion explicitly only handle leaf, i.e. non-nested columns as they appear in parquet, and then composing this into the arrow model at a higher level, e.g. within PruningStatistics.
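A sketch of the suggested layering, where the names and the Option<i32> stand-ins for parquet metadata are illustrative assumptions: the low level is keyed purely by parquet leaf index, and only the higher, arrow-aware level (e.g. a PruningStatistics implementation) decides which leaf a possibly nested column reference maps to.

use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};

/// Low level, leaf only: statistics are looked up by parquet *leaf index*,
/// never by an arrow field name, so nested and top-level leaves cannot collide.
fn leaf_min_values(per_leaf_mins: &[Vec<Option<i32>>], leaf_idx: usize) -> ArrayRef {
    Arc::new(Int32Array::from_iter(per_leaf_mins[leaf_idx].iter().copied())) as ArrayRef
}

/// Higher level, arrow aware: resolve the column reference to a leaf index first
/// (e.g. by walking the parquet SchemaDescriptor), then delegate to the leaf layer.
fn column_min_values(
    per_leaf_mins: &[Vec<Option<i32>>],
    resolved_leaf_idx: Option<usize>,
) -> Option<ArrayRef> {
    resolved_leaf_idx.map(|idx| leaf_min_values(per_leaf_mins, idx))
}

fn main() {
    // one Vec per leaf, one entry per row group (a single row group here)
    let per_leaf_mins = vec![vec![Some(0)], vec![Some(1)], vec![Some(100)]];
    assert_eq!(column_min_values(&per_leaf_mins, Some(2)).unwrap().len(), 1);
    // a column that cannot be resolved to a leaf simply has no statistics
    assert!(column_min_values(&per_leaf_mins, None).is_none());
}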
    .columns()
    .iter()
    .enumerate()
    .find(|(_idx, c)| c.column_descr().name() == self.field.name())
Aside from being slow, this will be incorrect in the presence of nested fields
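The hazard can be seen with plain strings (the paths below mirror the struct test that appears later in this thread; they are illustrative, not taken from a real SchemaDescriptor): a parquet leaf descriptor's name() is the bare leaf name, so matching on it alone cannot distinguish a nested leaf from a top-level column with the same name.

fn main() {
    // (full path, bare leaf name) for each parquet leaf column, in schema order
    let leaves = [
        ("struct_col.bool_col", "bool_col"),
        ("struct_col.int_col", "int_col"),
        ("int_col", "int_col"),
    ];
    // Looking up the top-level field "int_col" by bare leaf name finds the
    // nested struct_col.int_col leaf first, so its statistics would be used.
    let found = leaves.iter().find(|(_path, name)| *name == "int_col").unwrap();
    assert_eq!(found.0, "struct_col.int_col"); // not the column that was asked for
}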
I added tests for this -- and I didn't find a bug 🤔
Left a comment on how to see the bug
    .find(&column.name)
    .map(|(_idx, field)| field)?;

RowGroupStatisticsConverter::new(field)
There is a slight mismatch here as parquet handles schema nesting differently from arrow. I'm not sure how Column addresses nested fields, but I would expect to see something walking SchemaDescriptor to compute this mapping, or something similar.
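One possible shape for that mapping (the helper below is an assumption, not code from this PR): walk the SchemaDescriptor's leaves and compare their full dotted paths, so a nested struct_col.int_col can never be confused with a top-level int_col.

use std::sync::Arc;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

/// Hypothetical helper: find the parquet leaf index whose full dotted path
/// matches the requested column, by walking the SchemaDescriptor.
fn leaf_index_for_path(schema: &SchemaDescriptor, dotted_path: &str) -> Option<usize> {
    (0..schema.num_columns()).find(|&i| schema.column(i).path().string() == dotted_path)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let message = "
        message schema {
            optional group struct_col {
                optional boolean bool_col;
                optional int32 int_col;
            }
            optional int32 int_col;
        }";
    let schema = SchemaDescriptor::new(Arc::new(parse_message_type(message)?));
    assert_eq!(leaf_index_for_path(&schema, "struct_col.int_col"), Some(1));
    assert_eq!(leaf_index_for_path(&schema, "int_col"), Some(2));
    Ok(())
}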
TLDR is that Column does not address nested fields. The structure that does is datafusion_physical_expr::expressions::GetFieldAccessExpr.
I spoke with @tustvold and we came up with the following plan:
@tustvold I think I have addressed your main concerns (handling of StructArrays) with tests.
I think I have updated the API to something better, though not quite the same as what you suggested.
        Arc::new(boolean) as ArrayRef,
    ),
    (
        Arc::new(Field::new("i", DataType::Int32, nullable)),
Arc::new(Field::new("i", DataType::Int32, nullable)), | |
Arc::new(Field::new("int_col", DataType::Int32, nullable)), |
Indeed -- when I made this change in a601fbf the test with structs and non-structs fails (as you predicted):
row_groups: 1
"struct_col.bool_col": Boolean({min: Some(true), max: Some(true), distinct_count: None, null_count: 1, min_max_deprecated: false, min_max_backwards_compatible: false})
"struct_col.int_col": Int32({min: Some(1), max: Some(3), distinct_count: None, null_count: 1, min_max_deprecated: false, min_max_backwards_compatible: false})
"int_col": Int32({min: Some(100), max: Some(300), distinct_count: None, null_count: 0, min_max_deprecated: false, min_max_backwards_compatible: false})
left: PrimitiveArray<Int32>
[
1,
]
right: PrimitiveArray<Int32>
[
100,
]
stack backtrace:
I filed #8335 for this issue
Thank you -- I plan to merge this tomorrow unless there are any other comments.
It turns out this PR introduced a regression: #8533
Which issue does this PR close?
Part of #8229
Closes #8335
Potentially part of apache/arrow-rs#4328
In order to avoid boiling the ocean and to document more clearly what the current code does, I am trying to do this work in stages. The first one is to consolidate how statistics are read from parquet.
Rationale for this change
I am in the process of trying to improve the statistics in DataFusion, which have grown organically over time. I would like to refactor them, but I need to ensure that I don't break anything.
There are tests for the existing pruning predicate code, but not the underlying statistics conversion.
There are a few problems with the existing code: for example, the extracted statistics are not returned as TimestampSecondArray or TimestampNanosecondArray. The pruning statistics work around this with a cast, but @tustvold tells me this is not always correct (especially for certain timestamps and intervals).
What changes are included in this PR?
Extracts the statistics conversion code into its own parquet/statistics.rs module, and adds a columnar API (returns the value as an ArrayRef).
Adds tests for reading statistics from files written with the parquet rust writer, as well as using the existing parquet test data.
Are these changes tested?
Yes (most of this PR is new tests)
Are there any user-facing changes?
None are intended.
This implementation uses the same existing code, so it is not a functional change, but it does add many tests for the existing code.
I plan to improve the existing code in follow on PRs.