
Support decimal type in orc reader #3239

Merged
merged 6 commits into NVIDIA:branch-21.10 from decimal-orc on Aug 20, 2021

Conversation

firestarman
Collaborator

This fixes #3177

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@jlowe jlowe added this to the Aug 16 - Aug 27 milestone Aug 17, 2021
// We should honor the actual decimal type used for the cudf column first, then the precision,
// instead of always inferring a decimal type from the Spark precision, because both
// DECIMAL32 and DECIMAL64 can support precisions <= MAX_INT_DIGITS and cudf may use either one
// in that case. For example, the ORC reader always reads decimals as DECIMAL64, even when the
// precision would fit in DECIMAL32.
Member

Should the plugin's ORC reader be downcasting columns to DECIMAL32 in the case of excessively-precise decimals? That could save space on operations downstream from the query.

Collaborator Author
@firestarman firestarman Aug 18, 2021

Yes, it would be better to cast columns to DECIMAL32 when the precision allows. However, I think this is an improvement rather than something required for the functionality.

I thought about this for a while. It seems we need to iterate over the columns (including nested columns) to find the ones matching this case and cast them, and compound columns then need to be rebuilt with the newly cast children. The whole process looks like a rule for space optimization, which could be used in more places than just the ORC reader.

Instead of integrating the process into the ORC reader, how about implementing it as a utility method and applying it to the ORC reader in another PR? Or do you have any other suggestions?
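
A rough sketch of that idea in Scala, assuming the cudf Java API (the helper name and the precisions parameter are hypothetical, not the plugin's actual utility, and nested/compound columns are glossed over rather than rebuilt as described above):

import ai.rapids.cudf.{ColumnVector, DType, Table}

object DecimalDowncast {
  // Largest precision that fits in an Int (Spark's MAX_INT_DIGITS); kept local to stay self-contained.
  private val MaxIntDigits = 9

  // Cast top-level DECIMAL64 columns whose Spark precision fits in an int to DECIMAL32.
  // Struct/list columns would additionally need their children cast and the parent rebuilt,
  // which is omitted here for brevity.
  def downcastDecimal64(table: Table, precisions: Array[Int]): Table = {
    val cols = (0 until table.getNumberOfColumns).map { i =>
      val col = table.getColumn(i)
      val dt = col.getType
      if (dt.getTypeId == DType.DTypeEnum.DECIMAL64 && precisions(i) <= MaxIntDigits) {
        // Same scale, smaller storage: DECIMAL64 -> DECIMAL32.
        col.castTo(DType.create(DType.DTypeEnum.DECIMAL32, dt.getScale))
      } else {
        col.incRefCount() // keep ownership uniform so the caller can close everything
      }
    }
    try {
      new Table(cols: _*) // the Table takes its own references to the columns
    } finally {
      cols.foreach(_.close())
    }
  }
}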

Member

I think we should be consistent with how the decimals are loaded. The Parquet reader is already converting DECIMAL64 columns into DECIMAL32 columns on read when appropriate, so I think we should do the same here. The code to convert is already written and can be refactored from the Parquet code as a utility method.

Member

The Parquet code for this is part of ParquetPartitionReaderBase#evolveSchemaIfNeededAndClose

Collaborator Author

Got it. Updated.

…or.java


Address the comment

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@sameerz sameerz added the feature request New feature or request label Aug 17, 2021
@firestarman
Collaborator Author

build

@sperlingxx
Collaborator

LGTM

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Collaborator
@revans2 revans2 left a comment

Just a nit for me

}

/**
* Cast columns with precision that can be stored in an int to DECIMAL32, to save space.
Collaborator

nit: It might be good to expand on this a bit. Our plugin makes the assumption that if the precision is small enough to fit in a DECIMAL32, then CUDF has it stored as a DECIMAL32. Getting this wrong can lead to a number of problems later on.
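
Roughly, the expanded comment might read something like this (illustrative wording only, based on the suggestion above):

/**
 * Cast columns with a precision small enough to fit in an int to DECIMAL32, to save space.
 *
 * The plugin assumes that any decimal column whose precision fits in a DECIMAL32 is stored
 * by cudf as DECIMAL32. A DECIMAL64 column that slips through here breaks that assumption
 * and can lead to a number of problems in downstream operations.
 */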

Collaborator Author

Will add it in a follow-up PR, to avoid re-running premerge and collecting new approvals just for a doc update.

@firestarman firestarman merged commit d3ae0d0 into NVIDIA:branch-21.10 Aug 20, 2021
@firestarman firestarman deleted the decimal-orc branch August 20, 2021 01:59
razajafri pushed a commit to razajafri/spark-rapids that referenced this pull request Aug 23, 2021
* Support decimal type in orc reader

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support decimal type in ORC reader
5 participants