Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

Closed
19 of 23 tasks
alamb opened this issue Jun 14, 2024 · 9 comments · Fixed by #11479
Closed
19 of 23 tasks

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

alamb opened this issue Jun 14, 2024 · 9 comments · Fixed by #11479
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 14, 2024

Is your feature request related to a problem or challenge?

I consolidated the content of our previous tickets about better statistics #10806 and #10806 into a new Epic and cleaned up the subtasks

Describe the solution you'd like

Subtasks:

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 14, 2024
@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2024

Wow this looks like it is basically done now. 😮 Thanks to everyone who helped

#10609 still needs a look from @Lordworms

I think the final piece of work is to port the code + tests upstream to arrow-rs

@Lordworms
Copy link
Contributor

Wow this looks like it is basically done now. 😮 Thanks to everyone who helped

#10609 still needs a look from @Lordworms

I think the final piece of work is to port the code + tests upstream to arrow-rs

I'll take a look then, I forget about this one.... 😅

@efredine
Copy link
Contributor

efredine commented Jul 3, 2024

I think I've learnt enough that I can probably manage the upstream port to arrow-rs @alamb

There are probably a few performance improvements we might be able to make (using StringBuilder for example) and it's perhaps worth revisiting whether the tests can/should be restructured as per #11000?

@efredine
Copy link
Contributor

efredine commented Jul 3, 2024

Further to the performance discussion @alamb - the StringBuilder pattern you suggested in #11136 (comment) does seem to materially improve performance:

Extract data page statistics for String/extract_statistics/String
                        time:   [15.368 µs 15.405 µs 15.446 µs]
                        change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

One question - I notice in that ticket that you appended nulls for missing values. However, I think in most cases, missing values are simply omitted because all the None values are removed by flattening. So, in general, users of the data page statistics will need to check whether or not the length of the array matches the number of actual data pages? This is different from how the row group statistics are handled - they will instead have a null value for any missing statistics.

Is this difference in behaviour expected or just a side effect of the implementation.

@efredine
Copy link
Contributor

efredine commented Jul 4, 2024

@marvinlanhenke @alamb We always flatten the date page stats iterator - following the pattern from the initial PR: https://github.com/apache/datafusion/pull/10852/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R582

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using flatten in this case will mean that the length of result for that page will be shorter than the number of data pages? So, is it possible that rather than flatten we instead want to do something like a flat map where the Some values are flattened and None values are mapped to a null value?

(It's entirely possible I'm misunderstanding something here, if so, apologies in advance!)

@alamb
Copy link
Contributor Author

alamb commented Jul 5, 2024

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using flatten in this case will mean that the length of result for that page will be shorter than the number of data pages? So, is it possible that rather than flatten we instead want to do something like a flat map where the Some values are flattened and None values are mapped to a null value?

I think you are correct -- that is a very insightful conclusion @efredine

Ideally what I think we should do is to write up a test case (using your suggestion of a column / page that is entirely null) and verify there is a problem / fix it.

Is this something you are willing to do? I filed #11280 to track

@alamb
Copy link
Contributor Author

alamb commented Jul 5, 2024

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

I agree @efredine -- thank you -- I filed #11281 to track

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

Update here is @efredine has a PR up for porting this upstream: apache/arrow-rs#6046

If no one beats me to it I plan to review that PR this and then make a draft PR to use the upstream implementation in DataFusion when it is available, and then we can close this issue. Very exciting

@alamb
Copy link
Contributor Author

alamb commented Jul 22, 2024

Closing this epic as I think it is basically complete. The final piece #11479 is waiting on the next arrow-rs release apache/arrow-rs#5998) but I don't think there is any reason to leave this open

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants