Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the row index stream order in ORC reader #13242

Merged
merged 8 commits into from
May 10, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Apr 27, 2023

Description

Fixes #11890

Use fixed order when reading row index streams.
The order is PRESENT, DATA, SECONDARY/LENGTH (maps to DATA2).
Is any of these is absent in the column, relative order is maintained. Thus, the order is a sub-array of the one above.

Also simplified some logic related to stream order, as we do not need to pass it from the host. Instead, we only pass a bitmap to denote which streams are present.
Updated the xfail test, as it now passes :)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Apr 27, 2023
@vuule vuule self-assigned this Apr 27, 2023
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 27, 2023
@vuule vuule changed the title Bug orc index stream order Fix the row index stream order in ORC reader Apr 28, 2023
@vuule vuule marked this pull request as ready for review April 28, 2023 06:34
@vuule vuule requested review from a team as code owners April 28, 2023 06:34
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python test update LGTM

Thanks for this fix @vuule 😄

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving Python test change.

cpp/src/io/orc/reader_impl.cu Show resolved Hide resolved
cpp/src/io/orc/stripe_init.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/stripe_init.cu Show resolved Hide resolved
*
* @return The order of index streams
*/
static auto __device__ index_order_from_index_types(uint32_t index_types_bitmap)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically all free functions are static.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm misunderstanding the comment, but static is used here to give the function internal linkage - it is not called outside of this translation unit. Maybe this is not relevant for __device__ functions, but we use this extensively in cuIO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @ttnghia is confusing the concept of static member variables with usage of static as a storage class specifier? They are two completely different concepts that unfortunately reuse the same keyword. Otherwise I am not sure what the comment is referring to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but static is used here to give the function internal linkage

This makes sense. Otherwise, I don't see why it needs to be static. Sorry for the confusion.

Typically we can just put the function into an anonymous namespace for that purpose.

@vuule vuule requested review from ttnghia and vyasr May 9, 2023 23:26
cpp/src/io/orc/stripe_init.cu Show resolved Hide resolved
*
* @return The order of index streams
*/
static auto __device__ index_order_from_index_types(uint32_t index_types_bitmap)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @ttnghia is confusing the concept of static member variables with usage of static as a storage class specifier? They are two completely different concepts that unfortunately reuse the same keyword. Otherwise I am not sure what the comment is referring to.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 10, 2023
@vuule
Copy link
Contributor Author

vuule commented May 10, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3d814bd into rapidsai:branch-23.06 May 10, 2023
@vuule vuule deleted the bug-orc-index-stream-order branch August 10, 2023 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow
6 participants