-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix index out of bounds for stats on nested fields #1392
Fix index out of bounds for stats on nested fields #1392
Conversation
There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please? |
I suspect that is due to the new Rust version that is released and clippy got a little more finicky -- we just need to fix it on master. I'll handle it tomorrow, unless someone beats me to it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andrei-ionescu for the contribution ❤️ -- would it be possible to add a test that covers this case to protect against regressions?
Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181 |
Re clippy failures, @xudong963 is working on them in this PR #1395 -- once that gets merged we should be able to merge this PR with master and clippy would pass |
@houqp This what I'm seeing - and please correct me if I'm wrong - there is no support in reading nested parquet files in DataFusion. I've tried to use the With all these being said, what's the strategy on supporting nested source structures in DataFusion? |
@houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support. I got into this error:
And this error is happening in these lines of code: The chunk of data that has been read has only 6 columns ( The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source. |
Do you think column statistics should include something similar to https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/field.rs#L110-L125 |
Sorry for the late reply, @andrei-ionescu the problem you are getting is basically caused by the problem I mentioned in #1392 (comment). Fundamentally, it's due to differences between how nested struct fields are handled in Arrow and Parquet. @lst-codes managing stats in a nested data structure could fix the problem. However, being inspired by apache/arrow#11704, I think it would be more efficient to resolve the nested column key path during planning by traversing the |
210b6d9
to
a25fd49
Compare
Marking PRs without activity in the last month as stale. I'll plan to close it in another month or so without activity, though feel free to reopen it when you have time to work on it) |
Closing stale PRs. Please reopen (or open a new one) if you plan to keep working on this feature. |
@alamb Please re-open this PR. It is not stale just waiting for the apache/arrow#11704 to get merged and then rebase the work on that. |
@andrei-ionescu it looks like the issue you referenced waiting on was closed. with that in mind, do you have an idea how youd like to manage this PR now? |
Next week I'll take another stab at it. First I need to check the issue that I've seen before and refactor the PR changes if that still applies. I'll let you know. |
Marking as draft for the time being |
Closing for now to clean up PR list -- feel free to reopen to work on it more |
Which issue does this PR close?
Closes #1383
Rationale for this change
This is a step in supporting nested fields in data fusion being read from parquet in
What changes are included in this PR?
This adds a helper method to get the flattened schema. This is similar to how the columns are stored in the parquet metadata section.
Are there any user-facing changes?
No.