ParquetExec::statistics()
does not read statistics for many column types (like timestamps, strings, etc.)
#8295
Comments
Note that the pruning predicate code does correctly read the statistics for strings and timestamps, because it uses a different code path
I plan to fix this
Could I pick this ticket up?
In
I think there is some subtlety related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does -- I think it is here https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57
At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined
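To illustrate the idea of deriving file-level statistics from the row-group-level ones through a single code path, here is a minimal std-only sketch. It is not DataFusion's or the parquet crate's actual code: `Value`, `ColumnStats`, `pick`, and `merge_row_groups` are hypothetical names made up for this example, and the real implementation has to deal with many more types (including the decimal subtleties mentioned above).

```rust
use std::mem::discriminant;

/// Hypothetical typed min/max value (a stand-in, not DataFusion's `ScalarValue`).
#[derive(Debug, Clone, PartialEq, PartialOrd)]
enum Value {
    Int64(i64),
    Utf8(String),
    TimestampNanos(i64),
}

/// Min/max of one column in one row group.
/// `None` means no statistic is available for that bound.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<Value>,
    max: Option<Value>,
}

/// Pick the smaller (or larger) of two values of the same variant.
/// Mixed variants yield `None` rather than a guessed ordering.
fn pick(a: &Value, b: &Value, want_min: bool) -> Option<Value> {
    if discriminant(a) != discriminant(b) {
        return None;
    }
    let a_wins = if want_min { a <= b } else { a >= b };
    Some(if a_wins { a.clone() } else { b.clone() })
}

/// Fold per-row-group statistics into file-level statistics for one column.
/// If any row group lacks a bound, the file-level bound is unknown too.
fn merge_row_groups(row_groups: &[ColumnStats]) -> ColumnStats {
    let (first, rest) = match row_groups.split_first() {
        Some(parts) => parts,
        None => return ColumnStats { min: None, max: None },
    };
    let mut acc = first.clone();
    for rg in rest {
        acc.min = match (&acc.min, &rg.min) {
            (Some(a), Some(b)) => pick(a, b, true),
            _ => None,
        };
        acc.max = match (&acc.max, &rg.max) {
            (Some(a), Some(b)) => pick(a, b, false),
            _ => None,
        };
    }
    acc
}
```

Because strings and timestamps go through the same `merge_row_groups` function as integers in this sketch, a type cannot be handled at the row-group level and silently skipped at the file level, which is the kind of divergence this issue describes.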
I believe we have fixed this with #10453 -- statistics are now correctly extracted
Describe the bug
While working on #8229 I found another bug that is non-obvious, but that can now be seen clearly thanks to #8110 and #8111 from @NGA-TRAN
To Reproduce
Then look at the EXPLAIN VERBOSE output, where you can see there are no min/max statistics shown:
Expected behavior
I expect there to be min/max values extracted in the statistics for the strings, as there are for integers (Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))).
Additional context
No response