[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922
Comments
Wow, this looks like it is basically done now. 😮 Thanks to everyone who helped. #10609 still needs a look from @Lordworms. I think the final piece of work is to port the code + tests upstream to arrow-rs. |
I'll take a look then; I forgot about this one... 😅 |
Further to the performance discussion @alamb - the StringBuilder pattern you suggested in #11136 (comment) does seem to materially improve performance:
So it seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar. One question: I notice in that ticket that you appended nulls for missing values. However, I think in most cases missing values are simply omitted, because all the None values are removed by flattening. So, in general, users of the data page statistics will need to check whether the length of the array matches the number of actual data pages? This is different from how the row group statistics are handled: they will instead have a null value for any missing statistics. Is this difference in behaviour expected, or just a side effect of the implementation? |
@marvinlanhenke @alamb We always flatten the data page stats iterator, following the pattern from the initial PR: https://github.com/apache/datafusion/pull/10852/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R582 But I'm wondering if flatten is the right thing to do here. The min or max values for a page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44 Using flatten in this case will mean that the resulting array is shorter than the number of data pages. So, is it possible that rather than flatten we instead want to do something like a flat map, where the Some values are flattened and None values are mapped to a null value? (It's entirely possible I'm misunderstanding something here; if so, apologies in advance!) |
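To make the concern above concrete, here is a minimal std-only sketch. It does not use the actual arrow-rs or DataFusion types: `Vec<Vec<Option<i32>>>` is an assumed stand-in for "per row group, the optional min of each data page", and the function names `flatten_twice` and `flatten_once` are hypothetical. It only illustrates how a second `flatten` drops the `None` entries, while a single `flatten` keeps one slot per page.

```rust
// Hypothetical stand-in: outer Vec = row groups, inner Vec = data pages,
// `None` = the page has no min statistic (e.g. every value on it is null).

/// Flattens row groups AND the Options, so pages without a statistic vanish.
fn flatten_twice(groups: &[Vec<Option<i32>>]) -> Vec<i32> {
    groups.iter().flatten().copied().flatten().collect()
}

/// Flattens only the row groups; a missing statistic stays as `None`,
/// which would correspond to a null slot in the resulting array.
fn flatten_once(groups: &[Vec<Option<i32>>]) -> Vec<Option<i32>> {
    groups.iter().flatten().copied().collect()
}

fn main() {
    // Two row groups, three data pages in total; the second page is all null.
    let groups = vec![vec![Some(1), None], vec![Some(7)]];
    let page_count: usize = groups.iter().map(|g| g.len()).sum();

    let dropped = flatten_twice(&groups);
    let kept = flatten_once(&groups);

    // The double-flattened result no longer lines up with the pages.
    println!("pages = {}, flatten twice = {:?}", page_count, dropped);
    println!("pages = {}, flatten once  = {:?}", page_count, kept);
}
```

With the second variant the output length always equals the number of data pages, so a consumer can index into it by page position without a separate length check.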
I think you are correct -- that is a very insightful conclusion, @efredine. Ideally, what I think we should do is write up a test case (using your suggestion of a column / page that is entirely null), verify there is a problem, and fix it. Is this something you are willing to do? I filed #11280 to track this. |
Update: @efredine has a PR up for porting this upstream: apache/arrow-rs#6046 ❤ If no one beats me to it, I plan to review that PR and then make a draft PR to use the upstream implementation in DataFusion when it is available, and then we can close this issue. Very exciting. |
Closing this epic as I think it is basically complete. The final piece, #11479, is waiting on the next arrow-rs release (apache/arrow-rs#5998), but I don't think there is any reason to leave this open. |
Is your feature request related to a problem or challenge?
I consolidated the content of our previous tickets about better statistics #10806 and #10806 into a new Epic and cleaned up the subtasks
Describe the solution you'd like
Subtasks:
- StatisticsConverter #10923
- StatisticsConverter::row_group_null_counts incorrect for missing column #10926
- Int8, Int16, Int32 statistics from Parquet Data Pages #10928
- String/LargeString and Binary/LargeBinary Parquet Data Page Statistics #11026
- FixedSizedBinaryArray Parquet Data Page Statistics #11184
- DictionaryArray Parquet Data Page Statistics #11185
- Boolean Parquet Data Page Statistics #11027
- Decimal and Decimal256 Parquet Data Page Statistics #11111
- Timestamp Parquet Data Page Statistics #11112
- Date Parquet Data Page Statistics #11113
- Time Parquet Data Page Statistics #11114
- StatisticsConverter::row_group_counts to return None for non existent columns in parquet files #10965
- prune_pages_in_one_row_group to use the StatisticsExtractor #11480
- ParquetStatistics to arrow arrays ArrayRef arrow-rs#4328
Describe alternatives you've considered
No response
Additional context
No response