Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix index out of bounds for stats on nested fields #1392

Conversation

andrei-ionescu
Copy link

Which issue does this PR close?

Closes #1383

Rationale for this change

This is a step in supporting nested fields in data fusion being read from parquet in

What changes are included in this PR?

This adds a helper method to get the flattened schema. This is similar to how the columns are stored in the parquet metadata section.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Dec 2, 2021
@andrei-ionescu
Copy link
Author

There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?

@alamb
Copy link
Contributor

alamb commented Dec 2, 2021

There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?

I suspect that is due to the new Rust version that is released and clippy got a little more finicky -- we just need to fix it on master. I'll handle it tomorrow, unless someone beats me to it

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andrei-ionescu for the contribution ❤️ -- would it be possible to add a test that covers this case to protect against regressions?

@houqp
Copy link
Member

houqp commented Dec 3, 2021

Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181

@alamb
Copy link
Contributor

alamb commented Dec 3, 2021

Re clippy failures, @xudong963 is working on them in this PR #1395 -- once that gets merged we should be able to merge this PR with master and clippy would pass

datafusion/src/datasource/mod.rs Outdated Show resolved Hide resolved
@andrei-ionescu
Copy link
Author

andrei-ionescu commented Dec 3, 2021

@houqp This what I'm seeing - and please correct me if I'm wrong - there is no support in reading nested parquet files in DataFusion.

I've tried to use the nested_struct.rust.parquet file that is present in the parquet-testing git submodule building some DataFusion tests and are not passing. And I couldn't find any other tests with nested fields in parquet in DataFusion.

With all these being said, what's the strategy on supporting nested source structures in DataFusion?

@andrei-ionescu
Copy link
Author

andrei-ionescu commented Dec 3, 2021

@houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support.

I got into this error:

Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))

And this error is happening in these lines of code:
https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_plan/file_format/mod.rs#L223-L229

The chunk of data that has been read has only 6 columns (file_batch.columns().len()) while the expected number of columns (expected_cols) is 36.

The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the nested_struct.rust.parquet it reads the first column with 6 leaves and then tries to project that over all 36 top columns of that parquet file. This is root cause of the error above.

It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source.

@lexi-sh
Copy link

lexi-sh commented Dec 8, 2021

Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like

https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181

@houqp

Do you think column statistics should include something similar to Field#fields(), which returns a list of subfields for nested structures?

https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/field.rs#L110-L125

@houqp
Copy link
Member

houqp commented Dec 9, 2021

Sorry for the late reply, @andrei-ionescu the problem you are getting is basically caused by the problem I mentioned in #1392 (comment). Fundamentally, it's due to differences between how nested struct fields are handled in Arrow and Parquet.

@lst-codes managing stats in a nested data structure could fix the problem. However, being inspired by apache/arrow#11704, I think it would be more efficient to resolve the nested column key path during planning by traversing the Expr::GetIndexedField expression , then only load corresponding parquet column stats into memory. This way, we can skip columns that are not accessed by the query.

@alamb
Copy link
Contributor

alamb commented Jan 31, 2022

Marking PRs without activity in the last month as stale. I'll plan to close it in another month or so without activity, though feel free to reopen it when you have time to work on it)

@alamb alamb added the stale-pr label Jan 31, 2022
@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

Closing stale PRs. Please reopen (or open a new one) if you plan to keep working on this feature.

@alamb alamb closed this Feb 15, 2022
@andrei-ionescu
Copy link
Author

@alamb Please re-open this PR. It is not stale just waiting for the apache/arrow#11704 to get merged and then rebase the work on that.

@houqp houqp reopened this Feb 21, 2022
@matthewmturner
Copy link
Contributor

@andrei-ionescu it looks like the issue you referenced waiting on was closed. with that in mind, do you have an idea how youd like to manage this PR now?

@andrei-ionescu
Copy link
Author

Next week I'll take another stab at it. First I need to check the issue that I've seen before and refactor the PR changes if that still applies.

I'll let you know.

@alamb
Copy link
Contributor

alamb commented Apr 6, 2022

Marking as draft for the time being

@alamb alamb marked this pull request as draft April 6, 2022 18:20
@alamb
Copy link
Contributor

alamb commented May 2, 2022

Closing for now to clean up PR list -- feel free to reopen to work on it more

@alamb alamb closed this May 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading nested parquet files results in index out of bounds
6 participants