Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up and simplify gpuDecideCompression #13202

Merged
merged 13 commits into from
May 2, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Apr 22, 2023

Description

Changed the block size to single warp, since only 32 threads are used in the kernel.
Simplify the kernel logic a bit and remove unnecessary atomic operations.

FWIW, the kernel is faster now; not important as it is a tiny part of E2E time.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 22, 2023
@vuule vuule self-assigned this Apr 22, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 22, 2023
@vuule vuule marked this pull request as ready for review April 25, 2023 20:38
@vuule vuule requested a review from a team as a code owner April 25, 2023 20:38
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
compressed_data_size += comp_res->bytes_written;
if (comp_res->status != compression_status::SUCCESS) { atomicAdd(&error_count, 1); }
}
__syncwarp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write this kernel in a block-size agnostic way? Unlike __syncthreads();, using __syncwarp(); assumes that block_size == warp_size == 32.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on how we would scale the parallelism with multiple warps. If any warps worked on a single chunks element, then, yes, we would need to syn all threads in the block. But, with multiple warps, IMO this kernel should actually have each warp would work on a separate chunks element. In this case we don't need to synchronize different warps and __syncwarp is still the right option.
I understand that my change left this ambiguous as warp size is used interchangeably for block and warp size. I'll try to make this clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the kernel to work with any number of warps in a block. The size can be adjusted via constexpr decide_compression_warps_in_block. Used warp_size as well, so we should be magic number-free now :)

Comment on lines +1331 to +1332
auto const lane_id = threadIdx.x % cudf::detail::warp_size;
auto const warp_id = threadIdx.x / cudf::detail::warp_size;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for the reviewers: Are there maybe helper functions for this? Looks very generic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I am aware of.

@vuule vuule requested a review from bdice April 27, 2023 02:24
@vuule vuule requested a review from karthikeyann May 1, 2023 17:57
@vuule vuule changed the base branch from branch-23.06 to branch-23.04 May 1, 2023 18:07
@vuule vuule requested review from a team as code owners May 1, 2023 18:07
@vuule vuule requested a review from a team as a code owner May 1, 2023 18:07
@vuule vuule requested review from isVoid and removed request for a team May 1, 2023 18:07
@vuule vuule changed the base branch from branch-23.04 to branch-23.06 May 1, 2023 18:07
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more fix for warpSize. Otherwise I think this is better!

Comment on lines +1331 to +1332
auto const lane_id = threadIdx.x % cudf::detail::warp_size;
auto const warp_id = threadIdx.x / cudf::detail::warp_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I am aware of.

cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
@ajschmidt8 ajschmidt8 removed the request for review from a team May 1, 2023 18:55
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@ttnghia ttnghia removed request for a team May 1, 2023 19:46
Comment on lines +1326 to +1327
__shared__ __align__(8) EncColumnChunk ck_g[decide_compression_warps_in_block];
__shared__ __align__(4) unsigned int compression_error[decide_compression_warps_in_block];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we align them manually? And why do we need to align them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows more efficient access, at least in theory. I'm not the one who added the alignment, and I also haven't tested how this alignment impacts performance in practice.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 1, 2023
@vuule
Copy link
Contributor Author

vuule commented May 2, 2023

/merge

@rapids-bot rapids-bot bot merged commit ef4ebce into rapidsai:branch-23.06 May 2, 2023
@vuule vuule deleted the clean-up-decide-compression branch May 2, 2023 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants