[FEA] Potential optimization: Batched memset. #15773
nvdbaranec added the feature request and Performance labels on May 17, 2024
nvdbaranec changed the title from "[FEA] Potential optimization:: Batched memset." to "[FEA] Potential optimization: Batched memset." on May 17, 2024
rapids-bot bot pushed a commit that referenced this issue on Aug 5, 2024
…der (#16281)

Under some situations in the Parquet reader (particularly the case with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. A good amount of this overhead seems to stem from the fact that we're simply launching many tiny kernels. This PR adds a batched memset kernel that takes a list of device spans as a single input and does all the work under a single kernel launch. This PR addresses issue #15773

## Improvements

Using our performance cluster, improvements of 2.39% were shown when running the overall NDS queries.

Additionally, benchmarks were added showing improvements of around 20%, especially on fixed-width data types, as shown below.

data_type | num_cols | cardinality | run_length | bytes_per_second_before_this_pr | bytes_per_second_after_this_pr | speedup
--- | --- | --- | --- | --- | --- | ---
INTEGRAL | 1000 | 0 | 1 | 36514934834 | 42756531566 | 1.170932709
INTEGRAL | 1000 | 1000 | 1 | 35364061247 | 39112512476 | 1.105996062
INTEGRAL | 1000 | 0 | 32 | 37349112510 | 39641370858 | 1.061373837
INTEGRAL | 1000 | 1000 | 32 | 39167079622 | 43740824957 | 1.116775245
FLOAT | 1000 | 0 | 1 | 51877322003 | 64083898838 | 1.235296973
FLOAT | 1000 | 1000 | 1 | 48983612272 | 58705522023 | 1.198472699
FLOAT | 1000 | 0 | 32 | 46544977658 | 53715018581 | 1.154045426
FLOAT | 1000 | 1000 | 32 | 54493432148 | 66617609904 | 1.22248879
DECIMAL | 1000 | 0 | 1 | 47616412888 | 57952310685 | 1.217065864
DECIMAL | 1000 | 1000 | 1 | 47166138095 | 54283772484 | 1.1509056
DECIMAL | 1000 | 0 | 32 | 45266163387 | 53770390830 | 1.18787162
DECIMAL | 1000 | 1000 | 32 | 52292176603 | 58847723569 | 1.125363819
TIMESTAMP | 1000 | 0 | 1 | 50245415328 | 60797982330 | 1.210020495
TIMESTAMP | 1000 | 1000 | 1 | 50300238706 | 60810368331 | 1.208947908
TIMESTAMP | 1000 | 0 | 32 | 55338354243 | 66786275739 | 1.206871376
TIMESTAMP | 1000 | 1000 | 32 | 55680028082 | 69029227374 | 1.23974843
DURATION | 1000 | 0 | 1 | 54680007758 | 66855201896 | 1.222662626
DURATION | 1000 | 1000 | 1 | 54305832171 | 66602436269 | 1.226432477
DURATION | 1000 | 0 | 32 | 60040760815 | 72663056969 | 1.210228784
DURATION | 1000 | 1000 | 32 | 60212221703 | 75646396131 | 1.256329595
STRING | 1000 | 0 | 1 | 29691707753 | 33388700976 | 1.12451265
STRING | 1000 | 1000 | 1 | 31411129876 | 35407241037 | 1.127219593
STRING | 1000 | 0 | 32 | 29680479388 | 33382478907 | 1.124728427
STRING | 1000 | 1000 | 32 | 35476213777 | 40478389269 | 1.141000827
LIST | 1000 | 0 | 1 | 6874253484 | 7370835717 | 1.072237987
LIST | 1000 | 1000 | 1 | 6763426009 | 7253762966 | 1.07249831
LIST | 1000 | 0 | 32 | 6981508808 | 7502741115 | 1.074658977
LIST | 1000 | 1000 | 32 | 6989374761 | 7506418252 | 1.073975643
STRUCT | 1000 | 0 | 1 | 2137525922 | 2189495762 | 1.024313081
STRUCT | 1000 | 1000 | 1 | 1057923939 | 1078475980 | 1.019426766
STRUCT | 1000 | 0 | 32 | 1637342446 | 1698913790 | 1.037604439
STRUCT | 1000 | 1000 | 32 | 1057587701 | 1082539399 | 1.02359303

Authors:
- Rahul Prabhu (https://github.com/sdrp713)
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- https://github.com/nvdbaranec
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Bradley Dice (https://github.com/bdice)

URL: #16281
In some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing `cudaMemset()` operations on output buffers. A good amount of this overhead seems to stem from the fact that we're simply launching many tiny kernels. It might be useful to have a batched/multi memset kernel that takes a list of addresses/sizes/values as a single input and does all the work under a single kernel launch, similar to the Cub multi-buffer memcpy or `contiguous_split`.
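To make the idea concrete, here is a minimal sketch of what a batched memset could look like. It is not the libcudf implementation: the `memset_task` struct, the `batched_memset` kernel, and the `run_batched_memset_demo` helper are hypothetical names made up for illustration, and the launch shape (one `blockIdx.y` slot per buffer, a grid-stride loop over that buffer's bytes) is just one simple way to cover many small buffers with a single launch.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical descriptor: one (address, size, value) triple per buffer.
struct memset_task {
  uint8_t* ptr;   // device address of the buffer to fill
  size_t size;    // number of bytes to set
  uint8_t value;  // byte value to write
};

// blockIdx.y picks the buffer; threads along x grid-stride over its bytes.
// The outer loop lets the grid handle more tasks than gridDim.y.
__global__ void batched_memset(memset_task const* tasks, int num_tasks)
{
  for (int t = blockIdx.y; t < num_tasks; t += gridDim.y) {
    memset_task const task = tasks[t];
    size_t const stride    = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < task.size;
         i += stride) {
      task.ptr[i] = task.value;
    }
  }
}

// Hypothetical demo: fills three small buffers with one launch and
// returns true if every byte holds the requested value.
bool run_batched_memset_demo()
{
  // Stand-ins for per-column output buffers of different sizes.
  std::vector<size_t> const sizes   = {1000, 37, 4096};
  std::vector<uint8_t> const values = {0x00, 0xFF, 0x2A};

  std::vector<memset_task> h_tasks;
  for (size_t i = 0; i < sizes.size(); ++i) {
    uint8_t* p = nullptr;
    if (cudaMalloc(&p, sizes[i]) != cudaSuccess) { return false; }
    h_tasks.push_back({p, sizes[i], values[i]});
  }

  // Copy the task list to the device so the kernel can read it.
  memset_task* d_tasks = nullptr;
  size_t const bytes   = h_tasks.size() * sizeof(memset_task);
  if (cudaMalloc(&d_tasks, bytes) != cudaSuccess) { return false; }
  cudaMemcpy(d_tasks, h_tasks.data(), bytes, cudaMemcpyHostToDevice);

  // One launch for all buffers, instead of one cudaMemset() per buffer.
  dim3 const grid(4, static_cast<unsigned>(h_tasks.size()));
  batched_memset<<<grid, 256>>>(d_tasks, static_cast<int>(h_tasks.size()));
  bool ok = (cudaDeviceSynchronize() == cudaSuccess);

  // Copy each buffer back and check its contents.
  for (auto const& t : h_tasks) {
    std::vector<uint8_t> host(t.size);
    ok = ok && (cudaMemcpy(host.data(), t.ptr, t.size, cudaMemcpyDeviceToHost) == cudaSuccess);
    for (uint8_t b : host) { ok = ok && (b == t.value); }
    cudaFree(t.ptr);
  }
  cudaFree(d_tasks);
  return ok;
}
```

Calling `run_batched_memset_demo()` from `main` exercises the kernel end to end. A byte-at-a-time store keeps the sketch short; a tuned version would likely use wider, aligned stores and balance work across buffers of very different sizes, which is the kind of problem the Cub batched memcpy mentioned above already deals with.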