Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Boolean Parquet Data Page Statistics #11027

Closed
Tracked by #10922
alamb opened this issue Jun 20, 2024 · 8 comments · Fixed by #11054
Closed
Tracked by #10922

Support Boolean Parquet Data Page Statistics #11027

alamb opened this issue Jun 20, 2024 · 8 comments · Fixed by #11054
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above

/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
data_type: Option<&DataType>,
iterator: I,
) -> Result<ArrayRef>
where
I: Iterator<Item = (usize, &'a Index)>,
{
get_data_page_statistics!(Min, data_type, iterator)
}
/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
data_type: Option<&DataType>,
iterator: I,
) -> Result<ArrayRef>
where
I: Iterator<Item = (usize, &'a Index)>,
{

Describe alternatives you've considered

You can follow the model from @Weijun-H in #10931

  1. Update the test for the listed data types following the model of test_int64
    async fn test_int_64() {
    // This creates a parquet files of 4 columns named "i8", "i16", "i32", "i64"
    let reader = TestReader {
    scenario: Scenario::Int,
    row_per_group: 5,
    }
    .build()
    .await;
    // since each row has only one data page, the statistics are the same
    Test {
    reader: &reader,
    // mins are [-5, -4, 0, 5]
    expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
    // maxes are [-1, 0, 4, 9]
    expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
    // nulls are [0, 0, 0, 0]
    expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
    // row counts are [5, 5, 5, 5]
    expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
    column_name: "i64",
    check: Check::Both,
    }
    .run();
  2. Add any required implementation in
    make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
    make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);
    macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
    paste! {
    match $data_type {
    Some(DataType::Int64) => Ok(Arc::new(Int64Array::from_iter([<$stat_type_prefix Int64DataPageStatsIterator>]::new($iterator).flatten()))),
    _ => unimplemented!()
    }
    }
    }
    (follow the model of the row counts, )

Additional context

No response

@LorrensP-2158466
Copy link
Contributor

Boolean type is supported, it was added in commit 9845e6e of PR #10711

@alamb alamb closed this as completed Jun 20, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 20, 2024

Sorry for the noise

@alamb
Copy link
Contributor Author

alamb commented Jun 21, 2024

Actually I don't think this is actually done

This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯 )

The data page statistics are extracted here

macro_rules! get_data_page_statistics {
($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
paste! {
match $data_type {
Some(DataType::UInt8) => Ok(Arc::new(
UInt8Array::from_iter(
[<$stat_type_prefix Int32DataPageStatsIterator>]::new($iterator)
.map(|x| {
x.into_iter().filter_map(|x| {
x.and_then(|x| u8::try_from(x).ok())
})
})
.flatten()
)
)),
Some(DataType::UInt16) => Ok(Arc::new(

In order to complete this issue, we need to change

to

 check: Check::Both, 

And make the tests pass

@alamb alamb reopened this Jun 21, 2024
@LorrensP-2158466
Copy link
Contributor

Oh sorry, that was stupid of me.

@alamb
Copy link
Contributor Author

alamb commented Jun 21, 2024

Oh sorry, that was stupid of me.

No worries at all -- this stuff is tricky

@LorrensP-2158466
Copy link
Contributor

Yeah, all those similar names do get to me sometimes...

On another note, I tried to implement this like all the others did, but the test fails with :

thread 'parquet::arrow_statistics::test_boolean' panicked at src/array/boolean_array.rs:407:33:
Iterator must be sized

The implementation is like this:

make_data_page_stats_iterator!(
    MinBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.min },
    Index::BOOLEAN,
    bool
);
make_data_page_stats_iterator!(
    MaxBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.max },
    Index::BOOLEAN,
    bool
);
...
macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
        paste! {
            match $data_type {
                Some(DataType::Boolean) => Ok(Arc::new(
                    BooleanArray::from_iter(
                        [<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
                    )
                )),
       ...
}

These macros, functions, and tests jump around a lot before I get to the caller, which causes this panic. Do you or anyone else know why this happens?

@alamb
Copy link
Contributor Author

alamb commented Jun 21, 2024

The iterator must be sized thing comes from arrow -- one workaround is to collect the values into a Vec first and then create the array

I don't know why boolean is different than the other data page types 🤔

@LorrensP-2158466
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants