-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support decimal type in orc reader #3239
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java
Outdated
Show resolved
Hide resolved
// We should honor the actual decimal type used for the cudf column first, then the precision. | ||
// Instead of always inferring a deciaml type from the precision of Spark. Because both | ||
// DECIMAL32 and DECIMAL64 can support precisions <= MAX_INT_DIGITS. Cudf may use either one | ||
// for this case. e.g. ORC reader will always read decimal as DECIMAL64, even the precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the plugin's ORC reader be downcasting columns to DECIMAL32
in the case of excessively-precise decimals? That could save space on operations downstream from the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will be better to cast columns to DECIMAL32
when the precision allows. However I think this is an improvement since it is not required for the functionality.
I thought about this for a while. Seems we need to iterate the columns (including nested columns) to find out the ones matching the case and do the casting. And compond columns need to be rebuilt with new cast children. The whole process looks like a rule for space optimization, which can be used in more places than just the ORC reader.
Instead of integrating the process with ORC readers, how about implementing it as an utils method and applying it to the ORC reader in another PR ?
Or any other suggestions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be consistent with how the decimals are loaded. The Parquet reader is already converting DECIMAL64 columns into DECIMAL32 columns on read when appropriate, so I think we should do the same here. The code to convert is already written and can be refactored from the Parquet code as a utility method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Parquet code for this is part of ParquetPartitionReaderBase#evolveSchemaIfNeededAndClose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Updated.
…or.java Address the comment Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build |
LGTM |
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit for me
} | ||
|
||
/** | ||
* Cast columns with precision that can be stored in an int to DECIMAL32, to save space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It might be good to expand on this a bit. Our plugin makes the assumption that if the precision is small enough to fit in a DECIMAL32, then CUDF has it stored as a DECIMAL32. Getting this wrong can lead to a number of problems later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add it in a following PR to avoid running the premerge again for just doc udpate to get new approvals.
* Support decimal type in orc reader Signed-off-by: Firestarman <firestarmanllc@gmail.com> Signed-off-by: Raza Jafri <rjafri@nvidia.com>
This fixes #3177
Signed-off-by: Firestarman firestarmanllc@gmail.com