Fix index out of bounds for stats on nested fields #1392

andrei-ionescu · 2021-12-02T18:39:36Z

Which issue does this PR close?

Rationale for this change

This is a step in supporting nested fields in data fusion being read from parquet in

What changes are included in this PR?

This adds a helper method to get the flattened schema. This is similar to how the columns are stored in the parquet metadata section.

Are there any user-facing changes?

No.

andrei-ionescu · 2021-12-02T19:15:33Z

There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?

alamb · 2021-12-02T22:56:10Z

There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?

I suspect that is due to the new Rust version that is released and clippy got a little more finicky -- we just need to fix it on master. I'll handle it tomorrow, unless someone beats me to it

alamb

Thank you @andrei-ionescu for the contribution ❤️ -- would it be possible to add a test that covers this case to protect against regressions?

houqp · 2021-12-03T09:02:20Z

Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181

alamb · 2021-12-03T12:01:53Z

Re clippy failures, @xudong963 is working on them in this PR #1395 -- once that gets merged we should be able to merge this PR with master and clippy would pass

datafusion/src/datasource/mod.rs

andrei-ionescu · 2021-12-03T17:49:47Z

@houqp This what I'm seeing - and please correct me if I'm wrong - there is no support in reading nested parquet files in DataFusion.

I've tried to use the nested_struct.rust.parquet file that is present in the parquet-testing git submodule building some DataFusion tests and are not passing. And I couldn't find any other tests with nested fields in parquet in DataFusion.

With all these being said, what's the strategy on supporting nested source structures in DataFusion?

andrei-ionescu · 2021-12-03T19:32:05Z

@houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support.

I got into this error:

Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))

And this error is happening in these lines of code:
https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_plan/file_format/mod.rs#L223-L229

The chunk of data that has been read has only 6 columns (file_batch.columns().len()) while the expected number of columns (expected_cols) is 36.

The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the nested_struct.rust.parquet it reads the first column with 6 leaves and then tries to project that over all 36 top columns of that parquet file. This is root cause of the error above.

It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source.

lexi-sh · 2021-12-08T17:39:32Z

Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like

https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181

@houqp

Do you think column statistics should include something similar to Field#fields(), which returns a list of subfields for nested structures?

https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/field.rs#L110-L125

houqp · 2021-12-09T08:44:17Z

Sorry for the late reply, @andrei-ionescu the problem you are getting is basically caused by the problem I mentioned in #1392 (comment). Fundamentally, it's due to differences between how nested struct fields are handled in Arrow and Parquet.

@lst-codes managing stats in a nested data structure could fix the problem. However, being inspired by apache/arrow#11704, I think it would be more efficient to resolve the nested column key path during planning by traversing the Expr::GetIndexedField expression , then only load corresponding parquet column stats into memory. This way, we can skip columns that are not accessed by the query.

Closes apache#1383

alamb · 2022-01-31T20:44:07Z

Marking PRs without activity in the last month as stale. I'll plan to close it in another month or so without activity, though feel free to reopen it when you have time to work on it)

alamb · 2022-02-15T19:08:43Z

Closing stale PRs. Please reopen (or open a new one) if you plan to keep working on this feature.

andrei-ionescu · 2022-02-21T10:19:19Z

@alamb Please re-open this PR. It is not stale just waiting for the apache/arrow#11704 to get merged and then rebase the work on that.

matthewmturner · 2022-03-04T19:02:36Z

@andrei-ionescu it looks like the issue you referenced waiting on was closed. with that in mind, do you have an idea how youd like to manage this PR now?

andrei-ionescu · 2022-03-04T19:15:48Z

Next week I'll take another stab at it. First I need to check the issue that I've seen before and refactor the PR changes if that still applies.

I'll let you know.

alamb · 2022-04-06T18:20:23Z

Marking as draft for the time being

alamb · 2022-05-02T17:58:45Z

Closing for now to clean up PR list -- feel free to reopen to work on it more

github-actions bot added the datafusion Changes in the datafusion crate label Dec 2, 2021

alamb reviewed Dec 2, 2021

View reviewed changes

Dandandan requested changes Dec 3, 2021

View reviewed changes

datafusion/src/datasource/mod.rs Outdated Show resolved Hide resolved

andrei-ionescu mentioned this pull request Dec 12, 2021

Implement statistics estimation for partition columns #1186

Open

andrei-ionescu added 2 commits December 28, 2021 17:24

Fix index out of bounds for stats on nested fields

d689140

Closes apache#1383

Added tests for flattening schema and counting total number of leaves

a25fd49

andrei-ionescu force-pushed the ai_fix_index_out_of_bounds_1383 branch from 210b6d9 to a25fd49 Compare December 28, 2021 15:24

alamb added the stale-pr label Jan 31, 2022

alamb closed this Feb 15, 2022

houqp reopened this Feb 21, 2022

alamb marked this pull request as draft April 6, 2022 18:20

alamb closed this May 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix index out of bounds for stats on nested fields #1392

Fix index out of bounds for stats on nested fields #1392

andrei-ionescu commented Dec 2, 2021

andrei-ionescu commented Dec 2, 2021

alamb commented Dec 2, 2021

alamb left a comment

houqp commented Dec 3, 2021

alamb commented Dec 3, 2021

andrei-ionescu commented Dec 3, 2021 •

edited

Loading

andrei-ionescu commented Dec 3, 2021 •

edited

Loading

lexi-sh commented Dec 8, 2021

houqp commented Dec 9, 2021 •

edited

Loading

alamb commented Jan 31, 2022

alamb commented Feb 15, 2022

andrei-ionescu commented Feb 21, 2022

matthewmturner commented Mar 4, 2022

andrei-ionescu commented Mar 4, 2022

alamb commented Apr 6, 2022

alamb commented May 2, 2022

Fix index out of bounds for stats on nested fields #1392

Fix index out of bounds for stats on nested fields #1392

Conversation

andrei-ionescu commented Dec 2, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

andrei-ionescu commented Dec 2, 2021

alamb commented Dec 2, 2021

alamb left a comment

Choose a reason for hiding this comment

houqp commented Dec 3, 2021

alamb commented Dec 3, 2021

andrei-ionescu commented Dec 3, 2021 • edited Loading

andrei-ionescu commented Dec 3, 2021 • edited Loading

lexi-sh commented Dec 8, 2021

houqp commented Dec 9, 2021 • edited Loading

alamb commented Jan 31, 2022

alamb commented Feb 15, 2022

andrei-ionescu commented Feb 21, 2022

matthewmturner commented Mar 4, 2022

andrei-ionescu commented Mar 4, 2022

alamb commented Apr 6, 2022

alamb commented May 2, 2022

andrei-ionescu commented Dec 3, 2021 •

edited

Loading

andrei-ionescu commented Dec 3, 2021 •

edited

Loading

houqp commented Dec 9, 2021 •

edited

Loading