Change `StatisticsConverter::row_group_counts` to return None for non existent columns in parquet files #10965
Labels: enhancement
Is your feature request related to a problem or challenge?
While working on #10926 @marvinlanhenke has an excellent question:
(this was because there was some inconsistency between data pages and row counts)
The reason `StatisticsConverter::row_group_counts` returns row counts even for a non-existent column is that it is the API needed for `PruningStatistics` here: datafusion/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Lines 386 to 390 in a923c65
This is possible because the `ParquetMetadata` knows how many row groups there are even when there are no row group statistics, but it doesn't make logical sense. Furthermore, data pages are different: if the "page index" is not present, it doesn't even make sense to ask how many rows are in each data page, as we don't have any data pages.

Thus I think `row_group_row_counts` should also default to returning `None` if the column is not present, as @marvinlanhenke has done for `null_counts_page_statistics`.

So I guess my new proposal would be to return an `Option`.

The rationale for returning an `Option` rather than an error is that creating and then ignoring a `DataFusionError` via `ok()` still requires a string allocation, which is wasteful.

I realize this is done in many places already in the statistics extraction code, but I think in those cases it is meant to make the code resilient to incorrectly encoded parquet files, rather than to handle something that is "expected" to happen.
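To make the trade-off concrete, here is a minimal, self-contained sketch of the two shapes being compared. It is not the DataFusion code: `RowGroupMeta`, `Converter`, `parquet_column_index`, and `row_group_row_counts_or_err` are hypothetical stand-ins, and a plain `Vec<u64>` takes the place of an Arrow array.

```rust
/// Hypothetical stand-in for the metadata of one parquet row group.
struct RowGroupMeta {
    num_rows: u64,
}

/// Hypothetical stand-in for a statistics converter bound to a single column.
struct Converter {
    /// Index of the column in the parquet file, or `None` if the file
    /// does not contain the column at all.
    parquet_column_index: Option<usize>,
}

impl Converter {
    /// Option-based shape: a missing column is an expected outcome,
    /// so it is reported as `None` without building any error value.
    fn row_group_row_counts(&self, row_groups: &[RowGroupMeta]) -> Option<Vec<u64>> {
        // Return early with `None` when the column is absent.
        self.parquet_column_index?;
        Some(row_groups.iter().map(|rg| rg.num_rows).collect())
    }

    /// Error-based shape, for comparison: the message `String` is
    /// allocated even when the caller immediately discards it with `.ok()`.
    fn row_group_row_counts_or_err(
        &self,
        column_name: &str,
        row_groups: &[RowGroupMeta],
    ) -> Result<Vec<u64>, String> {
        match self.parquet_column_index {
            Some(_) => Ok(row_groups.iter().map(|rg| rg.num_rows).collect()),
            None => Err(format!("column {} not found in parquet file", column_name)),
        }
    }
}
```

With the `Option` shape, a caller can tell "column not in this file" apart from "counts available" with a plain `match` or `if let`, and the missing-column path does no allocation.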
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response