Support Boolean Parquet Data Page Statistics #11027
Comments
Sorry for the noise |
Actually, I don't think this is done. This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯). The data page statistics are extracted here: datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs, lines 612 to 627 in 18042fd
In order to complete this issue, we need to change
to check: Check::Both, and make the tests pass |
Oh sorry, that was stupid of me. |
No worries at all -- this stuff is tricky |
Yeah, all those similar names do get to me sometimes... On another note, I tried to implement this the same way as all the others, but the test fails with:
The implementation is like this:

make_data_page_stats_iterator!(
MinBooleanDataPageStatsIterator,
|x: &PageIndex<bool>| { x.min },
Index::BOOLEAN,
bool
);
make_data_page_stats_iterator!(
MaxBooleanDataPageStatsIterator,
|x: &PageIndex<bool>| { x.max },
Index::BOOLEAN,
bool
);
...
macro_rules! get_data_page_statistics {
($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
paste! {
match $data_type {
Some(DataType::Boolean) => Ok(Arc::new(
BooleanArray::from_iter(
[<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
)
)),
...
}

The call path jumps through a lot of macros, functions, and tests before it reaches the caller where this panic happens. Do you or anyone else know why this happens? |
The "iterator must be sized" thing comes from arrow. One workaround is to collect the values into a Vec first and then create the array. I don't know why boolean is different from the other data page types 🤔 |
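To make the workaround concrete, here is a minimal std-only sketch of what is probably going on. It assumes the panic comes from a constructor that requires an upper bound from `Iterator::size_hint`. `build_sized` is a hypothetical stand-in for such a constructor (it is not arrow's API), and the sample data is made up. The point is that `flatten()` over nested collections reports no upper bound, while a `Vec` collected from the same iterator does.

```rust
/// Hypothetical stand-in for an array constructor that needs the iterator's
/// upper size bound up front. This is NOT arrow's real API, only an
/// illustration of the "Iterator must be sized" requirement.
fn build_sized<I>(iter: I) -> Vec<Option<bool>>
where
    I: IntoIterator<Item = Option<bool>>,
{
    let iter = iter.into_iter();
    let (_, upper) = iter.size_hint();
    // Panics when the iterator cannot report an upper bound.
    let len = upper.expect("Iterator must be sized");
    let mut out = Vec::with_capacity(len);
    out.extend(iter);
    out
}

/// Made-up per-page boolean minimums, one inner Vec per row group.
fn sample_page_mins() -> Vec<Vec<Option<bool>>> {
    vec![vec![Some(false), None], vec![Some(true)]]
}

/// Upper bound reported by the flattened iterator. While outer items remain,
/// `flatten()` cannot know how many inner items are still to come.
fn flattened_upper_bound() -> Option<usize> {
    sample_page_mins().into_iter().flatten().size_hint().1
}

/// The workaround from the comment above: collect into a Vec first (a Vec's
/// iterator knows its exact length), then build from the Vec.
fn build_via_vec() -> Vec<Option<bool>> {
    let collected: Vec<Option<bool>> =
        sample_page_mins().into_iter().flatten().collect();
    build_sized(collected)
}
```

Passing `sample_page_mins().into_iter().flatten()` straight to `build_sized` panics with "Iterator must be sized", while `build_via_vec()` succeeds. In the real code the analogous change would be to collect the flattened statistics into a `Vec<Option<bool>>` and build the `BooleanArray` from that; whether this is the fix the maintainers want is not settled in this thread.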
take |
Is your feature request related to a problem or challenge?
Part of #10922
We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
Describe the solution you'd like
Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 637 to 656 in a923c65
Describe alternatives you've considered
You can follow the model from @Weijun-H in #10931
test_int64
datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
Lines 506 to 529 in a923c65
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 575 to 586 in 2f43476
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Line 90 in 2f43476
Additional context
No response