Update GPU ORC statistics write support #5715

Merged: 3 commits merged on Jun 2, 2022

amahussein
Collaborator

Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me

closes #4860

Follow-up after the closing of cuDF issue 5826, "[BUG] ORC file-level statistics omitted with chunked writes".

  • There is no need to write the ORC file inside a CPU session.
  • I had to use mode("overwrite"). Otherwise, the integration test fails because the file already exists.
  • Updated the compatibility doc to indicate that writing ORC file statistics is not supported prior to release 22.06.
  • There is still a pending issue, "[FEA] Add File Statistic when writing the ORC file". We need to check whether it can be closed or whether the ORC statistics feature still has limitations.
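
A minimal sketch of the overwrite point above, assuming a PySpark DataFrame. The helper `write_orc`, the `demo` function, and the path are hypothetical and are not taken from the integration tests; only `DataFrameWriter.mode` and `DataFrameWriter.orc` are actual PySpark calls.

```python
def write_orc(df, path, overwrite=True):
    """Write a Spark DataFrame to `path` as ORC.

    Spark's default save mode raises an error when `path` already exists,
    so a test that reuses the same path needs mode("overwrite").
    """
    writer = df.write
    if overwrite:
        writer = writer.mode("overwrite")  # replace any existing output at `path`
    writer.orc(path)


def demo(path="/tmp/orc_overwrite_demo"):
    """Hypothetical usage; requires a local PySpark installation."""
    from pyspark.sql import SparkSession  # imported lazily so the helper has no hard dependency

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.range(10)
    write_orc(df, path)  # first write creates the output directory
    write_orc(df, path)  # second write succeeds only because of overwrite mode
    spark.stop()
```

Calling `write_orc(df, path, overwrite=False)` a second time on the same path would be expected to fail with a "path already exists" error, which matches the integration test failure described above.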

How it was tested:

./integration_tests/run_pyspark_from_build.sh -k orc_test.py --tmp_path=/tmp/pyspark_tests

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein amahussein added this to the May 23 - Jun 3 milestone Jun 1, 2022
@amahussein amahussein self-assigned this Jun 1, 2022
@amahussein amahussein marked this pull request as ready for review June 1, 2022 20:21
@amahussein
Collaborator Author

build

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein
Collaborator Author

build

- RAPIDS does not support whole file statistics in ORC file. We are working with
- [CUDF](https://github.com/rapidsai/cudf) to support writing statistics and you can track it
- [here](https://github.com/rapidsai/cudf/issues/5826).
+ RAPIDS does not support whole file statistics in ORC file in releases _prior_ to release 22.06.
Collaborator

nit

Suggested change
- RAPIDS does not support whole file statistics in ORC file in releases _prior_ to release 22.06.
+ RAPIDS does not support whole file statistics in ORC file in releases prior to release 22.06.

@sameerz sameerz added the test Only impacts tests label Jun 2, 2022
tgravescs
tgravescs previously approved these changes Jun 2, 2022
Collaborator

@tgravescs tgravescs left a comment

looks fine pending the existing comment

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein
Collaborator Author

build

@sameerz sameerz merged commit 34c9aea into NVIDIA:branch-22.06 Jun 2, 2022
Labels
test Only impacts tests
Development

Successfully merging this pull request may close these issues.

[BUG] GPU writing ORC columns statistics
3 participants