[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

alamb · 2024-06-14T20:37:21Z

Is your feature request related to a problem or challenge?

I consolidated the content of our previous tickets about better statistics #10806 and #10806 into a new Epic and cleaned up the subtasks

Describe the solution you'd like

Subtasks:

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

alamb · 2024-07-03T21:48:25Z

Wow this looks like it is basically done now. 😮 Thanks to everyone who helped

#10609 still needs a look from @Lordworms

I think the final piece of work is to port the code + tests upstream to arrow-rs

Lordworms · 2024-07-03T22:12:05Z

Wow this looks like it is basically done now. 😮 Thanks to everyone who helped

#10609 still needs a look from @Lordworms

I think the final piece of work is to port the code + tests upstream to arrow-rs

I'll take a look then, I forget about this one.... 😅

efredine · 2024-07-03T22:16:39Z

I think I've learnt enough that I can probably manage the upstream port to arrow-rs @alamb

There are probably a few performance improvements we might be able to make (using StringBuilder for example) and it's perhaps worth revisiting whether the tests can/should be restructured as per #11000?

efredine · 2024-07-03T22:49:10Z

Further to the performance discussion @alamb - the StringBuilder pattern you suggested in #11136 (comment) does seem to materially improve performance:

Extract data page statistics for String/extract_statistics/String
                        time:   [15.368 µs 15.405 µs 15.446 µs]
                        change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

One question - I notice in that ticket that you appended nulls for missing values. However, I think in most cases, missing values are simply omitted because all the None values are removed by flattening. So, in general, users of the data page statistics will need to check whether or not the length of the array matches the number of actual data pages? This is different from how the row group statistics are handled - they will instead have a null value for any missing statistics.

Is this difference in behaviour expected or just a side effect of the implementation.

efredine · 2024-07-04T17:22:00Z

@marvinlanhenke @alamb We always flatten the date page stats iterator - following the pattern from the initial PR: https://github.com/apache/datafusion/pull/10852/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R582

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using flatten in this case will mean that the length of result for that page will be shorter than the number of data pages? So, is it possible that rather than flatten we instead want to do something like a flat map where the Some values are flattened and None values are mapped to a null value?

(It's entirely possible I'm misunderstanding something here, if so, apologies in advance!)

alamb · 2024-07-05T11:06:55Z

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using flatten in this case will mean that the length of result for that page will be shorter than the number of data pages? So, is it possible that rather than flatten we instead want to do something like a flat map where the Some values are flattened and None values are mapped to a null value?

I think you are correct -- that is a very insightful conclusion @efredine

Ideally what I think we should do is to write up a test case (using your suggestion of a column / page that is entirely null) and verify there is a problem / fix it.

Is this something you are willing to do? I filed #11280 to track

alamb · 2024-07-05T11:12:17Z

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

I agree @efredine -- thank you -- I filed #11281 to track

alamb · 2024-07-15T15:56:55Z

Update here is @efredine has a PR up for porting this upstream: apache/arrow-rs#6046 ❤

If no one beats me to it I plan to review that PR this and then make a draft PR to use the upstream implementation in DataFusion when it is available, and then we can close this issue. Very exciting

alamb · 2024-07-22T15:14:05Z

Closing this epic as I think it is basically complete. The final piece #11479 is waiting on the next arrow-rs release apache/arrow-rs#5998) but I don't think there is any reason to leave this open

alamb added the enhancement New feature or request label Jun 14, 2024

tshauck mentioned this issue Jun 18, 2024

feat: extract floats for parquet data pages #10972

Closed

This was referenced Jun 30, 2024

Support FixedSizedBinaryArray Parquet Data Page Statistics #11184

Closed

Support DictionaryArray Parquet Data Page Statistics #11185

Closed

DataFusion weekly project plan (Andrew Lamb) - July 1, 2024 #11190

Closed

efredine mentioned this issue Jul 3, 2024

Clean Up Data Page Statistics Tests and Fix Bug #11235

Closed

alamb mentioned this issue Jul 5, 2024

Improve performance of DataPage statistics extraction using StringBuilder #11281

Closed

alamb mentioned this issue Jul 8, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 #11334

Closed

9 tasks

alamb closed this as completed Jul 22, 2024

alamb mentioned this issue Jul 22, 2024

DataFusion weekly project plan (Andrew Lamb) - July 22, 2024 #11601

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

alamb commented Jun 14, 2024 •

edited

Loading

alamb commented Jul 3, 2024

Lordworms commented Jul 3, 2024

efredine commented Jul 3, 2024

efredine commented Jul 3, 2024

efredine commented Jul 4, 2024

alamb commented Jul 5, 2024

alamb commented Jul 5, 2024

alamb commented Jul 15, 2024

alamb commented Jul 22, 2024

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

Comments

alamb commented Jun 14, 2024 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Jul 3, 2024

Lordworms commented Jul 3, 2024

efredine commented Jul 3, 2024

efredine commented Jul 3, 2024

efredine commented Jul 4, 2024

alamb commented Jul 5, 2024

alamb commented Jul 5, 2024

alamb commented Jul 15, 2024

alamb commented Jul 22, 2024

alamb commented Jun 14, 2024 •

edited

Loading