Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ORC read error when read schema reorders file schema columns #3015

Merged
merged 1 commit into from
Jul 28, 2021

Conversation

wbo4958
Copy link
Collaborator

@wbo4958 wbo4958 commented Jul 23, 2021

The orc reader reads the needed columns according to the column order of
original orc file, but we are writing the file schema using reading schema
order. If the reading schema order is not following the file schema order
then the re-constructed orc file buffer will be mangled.

See the issue #3007

Signed-off-by: Bobby Wang wbo4958@gmail.com

The orc reader reads the needed columns according to the column order of
original orc file, but we are writing the file schema using reading schema
order. If the reading schema order is not following the file schema order
then the re-constructed orc file buffer will be mangled.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jul 23, 2021

This PR is not good, since it will change code again when we try to support union/map for orc reading.
If we can re-construct the read schema order, that will be better.

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jul 23, 2021

build

@wbo4958 wbo4958 changed the title [WIP] Fix orc data reading mess up issue Fix orc data reading mess up issue Jul 23, 2021
@jlowe jlowe changed the title Fix orc data reading mess up issue Fix ORC read error when read schema reorders file schema columns Jul 27, 2021
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK to me, but I don't think it will address @firestarman's desire to satisfy nested schema pruning. I see that checkSchemaCompatibility will trim the top-level schema for columns that we don't need, but I think that needs to be a recursive function to potentially prune unnecessary child fields out of structs underneath the top-level schema. As it is coded now, it takes the entire top-level column as-is, which I think means any struct columns will always have all of their child fields even if the read schema only wants a portion of the fields.

@sameerz sameerz added the bug Something isn't working label Jul 27, 2021
@sameerz sameerz added this to the July 19 - July 30 milestone Jul 27, 2021
@firestarman
Copy link
Collaborator

firestarman commented Jul 28, 2021

This looks OK to me, but I don't think it will address @firestarman's desire to satisfy nested schema pruning. I see that checkSchemaCompatibility will trim the top-level schema for columns that we don't need, but I think that needs to be a recursive function to potentially prune unnecessary child fields out of structs underneath the top-level schema. As it is coded now, it takes the entire top-level column as-is, which I think means any struct columns will always have all of their child fields even if the read schema only wants a portion of the fields.

No, it will not.
As mentioned, at least we need to update checkSchemaCompatibility accordingly to support nested column pruning. It will come in another PR.

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jul 28, 2021

This looks OK to me, but I don't think it will address @firestarman's desire to satisfy nested schema pruning. I see that checkSchemaCompatibility will trim the top-level schema for columns that we don't need, but I think that needs to be a recursive function to potentially prune unnecessary child fields out of structs underneath the top-level schema. As it is coded now, it takes the entire top-level column as-is, which I think means any struct columns will always have all of their child fields even if the read schema only wants a portion of the fields.

Yeah, we need to re-work checkSchemaCompatibility, I just have a simple test for nested-column-prune by just returning the read schema, and the dumped orc file has pruned the necessary columns. Let's leave it on the next PR Liangcai will work on

@wbo4958 wbo4958 merged commit 29e9487 into NVIDIA:branch-21.08 Jul 28, 2021
@wbo4958 wbo4958 deleted the orc-data branch July 28, 2021 03:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants