ParquetExec::statistics::is_exact likely wrong/misunderstood #5614

Open
crepererum opened this issue Mar 15, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@crepererum
Contributor

A ParquetExec is created from a FileScanConfig and an optional filter predicate[1]. These are two different, independent parameters; at least the documentation does not imply that the predicate should be considered when constructing the FileScanConfig. The statistics for the ParquetExec, however, are calculated by FileScanConfig::project:

https://github.com/apache/arrow-datafusion/blob/0f6931caa6f8b48e116a8e77e989c404f31f3f8d/datafusion/core/src/physical_plan/file_format/mod.rs#L213-L219

This forwards is_exact from the input, which might have been set to true. However, if there is a predicate, is_exact should likely be false, because the predicate may remove some data and the forwarded statistics would then no longer be exact. So either the forwarding is wrong (at least when a predicate is given) or the docs are imprecise.
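
To make the first reading concrete, here is a minimal sketch of what "do not forward is_exact when a predicate is given" could look like. It is not DataFusion code: ScanStatistics and scan_statistics are hypothetical stand-ins that model only a row count and the is_exact flag, and the real Statistics type carries more fields.

```rust
/// Hypothetical, simplified stand-in for a scan's statistics.
/// Only the two fields relevant to this issue are modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanStatistics {
    /// Row count taken from file metadata, if known.
    pub num_rows: Option<usize>,
    /// Whether the values above are exact rather than estimates.
    pub is_exact: bool,
}

/// Hypothetical helper: derive the statistics a scan reports from the
/// file-level statistics and from whether a filter predicate is attached.
pub fn scan_statistics(file_stats: &ScanStatistics, has_predicate: bool) -> ScanStatistics {
    ScanStatistics {
        // The file-level row count is kept, but with a predicate it is
        // only an upper bound on what the scan will actually emit.
        num_rows: file_stats.num_rows,
        // Exactness survives only if the input was exact AND no predicate
        // can remove data during the scan.
        is_exact: file_stats.is_exact && !has_predicate,
    }
}
```

With file-level statistics of 100 rows and is_exact = true, this sketch reports is_exact = true without a predicate and is_exact = false with one, while leaving the row count untouched. Whether the flag should be downgraded or the documentation clarified instead is exactly the open question here.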

Note that this is unrelated to #5613 because this issue here is about the is_exact=true case.

Footnotes

  1. And a metadata size hint, but this is irrelevant here.

@crepererum crepererum added the bug Something isn't working label Mar 15, 2023