
Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU. #4484

Conversation

HaoYang670
Collaborator

Signed-off-by: remzi <13716567376yh@gmail.com>
Closes #3949

In this PR, we:

  1. Copy some code from apache/spark@4a34db9, which adds support for writing Hive bucketed tables. However, since Hive hash partitioning is not supported on the GPU, we throw an exception in RAPIDS instead.
  2. Create a new file, datasourcev2_write_test.py, and add a test to verify that we fall back to the CPU when someone attempts a Hive bucketed write (a rough sketch of that scenario follows below).
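For illustration only, here is a minimal sketch of the kind of write the new test is meant to exercise. The table and column names are made up, and the real test in datasourcev2_write_test.py uses the plugin's own test harness rather than a bare SparkSession.

```python
# Hypothetical sketch of a Hive bucketed write that should fall back to the CPU,
# since the RAPIDS Accelerator does not support Hive hash partitioning.
# Table and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-bucketed-write-fallback-sketch")
         .enableHiveSupport()   # Hive support is required for the Hive DDL below
         .getOrCreate())

# Create a Hive bucketed table (Hive DDL, stored as Parquet).
spark.sql("""
    CREATE TABLE IF NOT EXISTS tmp_bucketed_table (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS PARQUET
""")

# Writing into the bucketed table involves Hive hash partitioning,
# so the write is expected to run on the CPU, not the GPU.
df = spark.range(100).selectExpr("CAST(id AS INT) AS id",
                                 "CAST(id AS STRING) AS name")
df.write.insertInto("tmp_bucketed_table")
```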

throw an exception when trying to do hive hash partition on GPU

@HaoYang670 HaoYang670 marked this pull request as draft January 10, 2022 08:49
@HaoYang670 HaoYang670 changed the title Issue3949 support writing hive bucketed table Support writing Hive bucketed table Jan 10, 2022
@HaoYang670
Collaborator Author

build

@HaoYang670 HaoYang670 added the audit_3.3.0 Audit related tasks for 3.3.0 label Jan 10, 2022
Member

@jlowe jlowe left a comment


The headline implies that this adds support for writing Hive bucketed tables, but it is not supported even after this PR is merged. We weren't accidentally trying to support this before the PR either, so maybe the headline should just state that we're updating GpuWriteJobDescription to stay in sync with recent Spark changes.

@HaoYang670 HaoYang670 changed the title Support writing Hive bucketed table Update GpuFileFormatWriter to stay in sync with recent Spark changes. Jan 11, 2022
@HaoYang670
Collaborator Author

build

Because Spark330 and Spark301 behave differently on insertInto

@HaoYang670
Collaborator Author

build

Member

@jlowe jlowe left a comment


Looks better, but tests are failing with:

The format of the existing table default.tmp_table_574308_0 is `HiveFileFormat`. It doesn't match the specified format `ParquetDataSourceV2`.

We may need a .format("hive") when writing the dataframe, but I'm not an expert on Spark's Hive support.
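For context only: with Hive support enabled, Spark can write through the Hive data source by adding .format("hive") to the writer, which avoids the HiveFileFormat vs. ParquetDataSourceV2 mismatch above. A hedged sketch, not necessarily how the test was ultimately fixed:

```python
# Hypothetical sketch: assumes a Hive-enabled SparkSession `spark` as in the
# earlier example; the DataFrame and schema are made up for illustration.
df = spark.range(10).selectExpr("CAST(id AS INT) AS id")

# Append through the Hive data source rather than ParquetDataSourceV2,
# so the write format matches the existing HiveFileFormat table.
df.write.format("hive").mode("append").saveAsTable("tmp_table_574308_0")

# Alternatively, insertInto() writes using the existing table's own format.
df.write.insertInto("tmp_table_574308_0")
```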

@revans2
Collaborator

revans2 commented Jan 11, 2022

Should we update the title of this PR, since we are not adding support for Hive bucketed table writes?

@HaoYang670
Collaborator Author

build

@HaoYang670 HaoYang670 changed the title Update GpuFileFormatWriter to stay in sync with recent Spark changes. Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU. Jan 12, 2022
@HaoYang670
Collaborator Author

The title has been updated to tell users that we do not yet support writing Hive bucketed tables on the GPU.

@HaoYang670 HaoYang670 marked this pull request as ready for review January 12, 2022 08:12
@jlowe jlowe merged commit d21f794 into NVIDIA:branch-22.02 Jan 12, 2022
@HaoYang670 HaoYang670 deleted the issue3949_support_writing_hive_bucketed_table branch January 13, 2022 01:20
Labels
audit_3.3.0 Audit related tasks for 3.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Audit][SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
3 participants