Fix ORC read error when read schema reorders file schema columns #3015
The ORC reader reads the needed columns in the column order of the original ORC file, but we were writing the file schema in the read schema's order. If the read schema's column order does not follow the file schema's order, the reconstructed ORC file buffer is mangled. Signed-off-by: Bobby Wang <wbo4958@gmail.com>
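To illustrate the ordering problem described above, here is a minimal sketch in Python. It is not the plugin's code: `order_like_file` and the column names are hypothetical, and schemas are reduced to flat lists of top-level column names. It only shows the idea that the schema written into the reconstructed buffer should list the requested columns in file order, not in read-schema order.

```python
def order_like_file(file_columns, read_columns):
    """Return the requested columns in the order they appear in the file.

    Hypothetical helper: both arguments are plain lists of top-level
    column names, which is a simplification of a real ORC schema.
    """
    wanted = set(read_columns)
    # Walk the file schema so the result follows the file's column order.
    return [name for name in file_columns if name in wanted]


file_columns = ["a", "b", "c", "d"]   # order stored in the ORC file
read_columns = ["c", "a"]             # order requested by the read schema

# The column data is copied in file order (a, then c), so the schema
# written alongside it has to use that same order.
print(order_like_file(file_columns, read_columns))  # ['a', 'c']
```

If the schema were written as `["c", "a"]` instead, the data copied for `a` would be labeled as `c` and vice versa, which is the kind of mismatch the description calls a mangled buffer.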
This PR is not good, since the code will have to change again when we try to support union/map for ORC reading.
build
This looks OK to me, but I don't think it will address @firestarman's desire to satisfy nested schema pruning. I see that checkSchemaCompatibility will trim columns that we don't need from the top-level schema, but I think it needs to be a recursive function so it can also prune unnecessary child fields out of structs underneath the top-level schema. As it is coded now, it takes each top-level column as-is, which I think means any struct column will always have all of its child fields even if the read schema only wants some of them.
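The recursive pruning suggested here could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the actual checkSchemaCompatibility: `prune_schema` is a hypothetical name, a struct is modeled as a dict from field name to type, and every other type is a plain string.

```python
def prune_schema(file_type, read_type):
    """Keep only the fields of file_type that read_type asks for.

    Hypothetical model: a struct is a dict mapping field name to type,
    and any other type is a string. Fields are kept in file order.
    """
    # Non-struct types have no children to prune, so keep the file's type.
    if not isinstance(file_type, dict) or not isinstance(read_type, dict):
        return file_type
    # Recurse into each requested field, iterating in file order.
    return {
        name: prune_schema(child, read_type[name])
        for name, child in file_type.items()
        if name in read_type
    }


file_schema = {
    "id": "int",
    "person": {
        "name": "string",
        "age": "int",
        "address": {"city": "string", "zip": "string"},
    },
    "score": "double",
}
# The read schema reorders columns and wants only part of each struct.
read_schema = {
    "person": {"address": {"city": "string"}, "name": "string"},
    "id": "int",
}

pruned = prune_schema(file_schema, read_schema)
print(pruned)
```

In this sketch the result keeps `id` before `person` and `name` before `address`, following the file schema, while `age`, `zip`, and `score` are dropped.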
No, it will not.
Yeah, we need to rework checkSchemaCompatibility. I ran a simple nested-column-pruning test by just returning the read schema, and the dumped ORC file had its columns pruned as needed. Let's leave that for the next PR, which Liangcai will work on.
See issue #3007.