
Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU. #4484

Conversation

HaoYang670
Collaborator

Signed-off-by: remzi <13716567376yh@gmail.com>
Closes #3949

In this PR, we:

  1. Copy some code from apache/spark@4a34db9, which adds support for writing Hive bucketed tables. However, since Hive hash partitioning is not supported on the GPU, we throw an exception in RAPIDS instead.
  2. Create a new file, datasourcev2_write_test.py, and add a test to verify that we fall back to the CPU when someone attempts a Hive bucketed write (a rough sketch of that scenario follows below).
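For illustration only, here is a minimal sketch of the kind of write the new test is meant to exercise. The table and column names are made up, and the real test in datasourcev2_write_test.py uses the plugin's own test harness rather than a bare SparkSession.

```python
# Hypothetical sketch of a Hive bucketed write that should fall back to the CPU,
# since the RAPIDS Accelerator does not support Hive hash partitioning.
# Table and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-bucketed-write-fallback-sketch")
         .enableHiveSupport()   # Hive support is required for the Hive DDL below
         .getOrCreate())

# Create a Hive bucketed table (Hive DDL, stored as Parquet).
spark.sql("""
    CREATE TABLE IF NOT EXISTS tmp_bucketed_table (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS PARQUET
""")

# Writing into the bucketed table involves Hive hash partitioning,
# so the write is expected to run on the CPU, not the GPU.
df = spark.range(100).selectExpr("CAST(id AS INT) AS id",
                                 "CAST(id AS STRING) AS name")
df.write.insertInto("tmp_bucketed_table")
```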

throw an exception when trying to do hive hash partition on GPU

@HaoYang670 HaoYang670 marked this pull request as draft January 10, 2022 08:49
@HaoYang670 HaoYang670 changed the title Issue3949 support writing hive bucketed table Support writing Hive bucketed table Jan 10, 2022
@HaoYang670
Collaborator Author

build

@HaoYang670 HaoYang670 added the audit_3.3.0 Audit related tasks for 3.3.0 label Jan 10, 2022
Member

@jlowe jlowe left a comment


The headline implies that this adds support for writing Hive bucketed tables, but it is not supported even after this PR is merged. We weren't accidentally trying to support this before the PR either, so maybe the headline should just state that we're updating GpuWriteJobDescription to stay in sync with recent Spark changes.

@HaoYang670 HaoYang670 changed the title Support writing Hive bucketed table Update GpuFileFormatWriter to stay in sync with recent Spark changes. Jan 11, 2022
@HaoYang670
Collaborator Author

build

Because Spark330 and Spark301 behave differently on insertInto

@HaoYang670
Collaborator Author

build

Member

@jlowe jlowe left a comment


Looks better, but tests are failing with:

The format of the existing table default.tmp_table_574308_0 is `HiveFileFormat`. It doesn't match the specified format `ParquetDataSourceV2`.

We may need a .format("hive") when writing the dataframe, but I'm not an expert on Spark's Hive support.
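For context only: with Hive support enabled, Spark can write through the Hive data source by adding .format("hive") to the writer, which avoids the HiveFileFormat vs. ParquetDataSourceV2 mismatch above. A hedged sketch, not necessarily how the test was ultimately fixed:

```python
# Hypothetical sketch: assumes a Hive-enabled SparkSession `spark` as in the
# earlier example; the DataFrame and schema are made up for illustration.
df = spark.range(10).selectExpr("CAST(id AS INT) AS id")

# Append through the Hive data source rather than ParquetDataSourceV2,
# so the write format matches the existing HiveFileFormat table.
df.write.format("hive").mode("append").saveAsTable("tmp_table_574308_0")

# Alternatively, insertInto() writes using the existing table's own format.
df.write.insertInto("tmp_table_574308_0")
```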

@revans2
Collaborator

revans2 commented Jan 11, 2022

Should we update the title of this PR, since we are not adding support for Hive bucketed table writes?

@HaoYang670
Collaborator Author

build

@HaoYang670 HaoYang670 changed the title Update GpuFileFormatWriter to stay in sync with recent Spark changes. Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU. Jan 12, 2022
@HaoYang670
Collaborator Author

The title has been updated to tell users that we do not yet support writing Hive bucketed tables on the GPU.

@HaoYang670 HaoYang670 marked this pull request as ready for review January 12, 2022 08:12
@jlowe jlowe merged commit d21f794 into NVIDIA:branch-22.02 Jan 12, 2022
@HaoYang670 HaoYang670 deleted the issue3949_support_writing_hive_bucketed_table branch January 13, 2022 01:20
Labels
audit_3.3.0 Audit related tasks for 3.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Audit][SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
3 participants