
Set default RMM pool to ASYNC for CUDA 11.2+ #4606

Merged: @rongou merged 2 commits into NVIDIA:branch-22.04 on Jan 27, 2022

Conversation

@rongou (Collaborator) commented on Jan 21, 2022

Fixes #4515

Signed-off-by: Rong Ou <rong.ou@gmail.com>

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou added the documentation, performance, shuffle, and ease of use labels on Jan 21, 2022
@rongou added this to the Jan 10 - Jan 28 milestone on Jan 21, 2022
@rongou requested review from jlowe, abellina, and revans2 on Jan 21, 2022
@rongou self-assigned this on Jan 21, 2022
@jlowe (Member) left a comment


Are we really ready for this? It would be good to have benchmark results showing this is not going to be a regression for some use cases. It seems like we still have unresolved performance issues that appear to be tied to this allocator, e.g. #4536.

Signed-off-by: Rong Ou <rong.ou@gmail.com>
@rongou (Collaborator, Author) commented on Jan 22, 2022

We do have benchmarks on TPCDS at various scale factors. The idea is to turn this on early in 22.04 so it gets more widely tested and benchmarked. If we run into a blocker, we can always revert before the release is cut.
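(For an A/B comparison between the two allocators, a job can also pin the pool explicitly rather than rely on the new default. A minimal sketch, assuming the plugin exposes the choice through a spark.rapids.memory.gpu.pool setting accepting ARENA/ASYNC values, which is not confirmed in this thread:)

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the config key and accepted values below are assumptions about the
// plugin's pool setting, not verified here.
val spark = SparkSession.builder()
  .appName("arena-vs-async-comparison")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.memory.gpu.pool", "ARENA") // use "ASYNC" for the comparison run
  .getOrCreate()
```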

@jlowe (Member) commented on Jan 24, 2022

We do have benchmarks on TPCDS at various scale factors.

Do we have those metrics isolated to just this change? There are quite a few performance improvements going in recently, and I want to make sure we don't end up masking a regression from the async allocator with a performance improvement from elsewhere.

I would really like to see what this allocator costs for queries that don't spill, and last I knew we only had metrics at scale for UCX shuffle which of course can spill quite a lot due to its aggressive caching of shuffle outputs. The only metrics I've seen for a query that doesn't spill is from #4536, and I find it odd that we want to check this in without first understanding the impact footprint for a 500X slowdown.

@revans2 (Collaborator) commented on Jan 24, 2022

I would really like to see what this allocator costs for queries that don't spill, and last I knew we only had metrics at scale for UCX shuffle which of course can spill quite a lot due to its aggressive caching of shuffle outputs. The only metrics I've seen for a query that doesn't spill is from #4536, and I find it odd that we want to check this in without first understanding the impact footprint for a 500X slowdown.

To be clear, most of the tests didn't show that slowness. I would need to do some more analysis to see if there was any real difference beyond this one corner case.

@rongou (Collaborator, Author) commented on Jan 24, 2022

This is the result for TPCDS at 1TB. I'll dig more into #4536.
[Chart: ARENA vs ASYNC allocator]

@rongou (Collaborator, Author) commented on Jan 24, 2022

#4536 turns out to be a red herring.

@rongou (Collaborator, Author) commented on Jan 27, 2022

build

@rongou merged commit cbb9b14 into NVIDIA:branch-22.04 on Jan 27, 2022
Successfully merging this pull request may close these issues.

[FEA] Set RMM async allocator as default