
support CUDA async memory resource in JNI #9201

Merged: 5 commits merged into rapidsai:branch-21.10 on Sep 13, 2021

Conversation

@rongou (Contributor) commented Sep 9, 2021

CUDA 11.2 introduced a stream-ordered memory allocator that can potentially resolve memory fragmentation issues. See https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
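The key property of the stream-ordered allocator (`cudaMallocAsync`/`cudaFreeAsync` in CUDA 11.2+) is that allocation and free are enqueued on a stream, so memory freed earlier in a stream's order can be reused by later allocations on that same stream without a device-wide synchronization. A plain-Java toy model of that per-stream reuse (no CUDA involved; all names here are illustrative, not the cuDF API):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Toy model of stream-ordered allocation: a block freed on a stream is
// immediately reusable by later allocations on that same stream, while
// allocations on other streams must take fresh memory from the driver.
public class StreamOrderedPoolDemo {
    static class Pool {
        // stream id -> block sizes freed on that stream, available for reuse
        private final Map<Integer, ArrayDeque<Long>> freedPerStream = new HashMap<>();
        long freshAllocations = 0; // blocks that had to come "from the driver"

        long allocate(long size, int stream) {
            ArrayDeque<Long> freed = freedPerStream.get(stream);
            if (freed != null && freed.remove(size)) {
                return size; // reused a block freed earlier on this stream
            }
            freshAllocations++;
            return size;
        }

        void free(long size, int stream) {
            freedPerStream.computeIfAbsent(stream, s -> new ArrayDeque<>()).add(size);
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        pool.allocate(1024, 0); // fresh allocation
        pool.free(1024, 0);     // now reusable, but only in stream 0's order
        pool.allocate(1024, 0); // reused: no fresh allocation needed
        pool.allocate(1024, 1); // different stream: fresh allocation
        System.out.println(pool.freshAllocations); // prints 2
    }
}
```

Because reuse follows stream order instead of requiring the whole device to quiesce, the driver can recycle memory aggressively, which is what helps with the fragmentation issues mentioned above.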

@rongou added labels Sep 9, 2021: feature request (New feature or request), 3 - Ready for Review (Ready for review by team), RMM, Performance (Performance related issue), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS), non-breaking (Non-breaking change)
@rongou rongou requested review from jlowe and abellina September 9, 2021 02:52
@rongou rongou self-assigned this Sep 9, 2021
@rongou rongou requested a review from a team as a code owner September 9, 2021 02:52
@jlowe (Member) left a comment


It would be nice to have at least a smoke test of the new allocator type in RmmTest that sets up the allocator, allocates and frees memory to exercise it. Bonus points if it also sets up the allocator with a small limit and verifies it gets an OOM if it tries to allocate just beyond that size.

@rongou (Contributor, Author) left a comment


Added a smoke test, which will be skipped if CUDA < 11.2.

@abellina (Contributor) commented Sep 9, 2021

I just filed a PR that is nearly identical to this one: #9208. The main difference is that I am wrapping the async allocator with limiting_resource_adaptor.

I did not realize @rongou's PR was already up. I closed mine in favor of his, since folks had already spent time reviewing it.
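RMM's limiting_resource_adaptor wraps an upstream memory resource, tracks outstanding bytes, and fails any allocation that would push the total past a configured cap. That is also exactly the behavior the smoke test requested above needs to exercise (allocate just beyond the limit and expect an OOM). A plain-Java sketch of that bookkeeping, with illustrative names and no real GPU allocation:

```java
// Plain-Java sketch of the bookkeeping RMM's limiting_resource_adaptor does:
// count outstanding bytes and reject any allocation that would exceed the cap.
// The real adaptor wraps an rmm device_memory_resource in C++; this is a toy.
public class LimitingAdaptorDemo {
    static class LimitingAllocator {
        private final long limitBytes;
        private long outstanding = 0;

        LimitingAllocator(long limitBytes) { this.limitBytes = limitBytes; }

        long allocate(long bytes) {
            if (outstanding + bytes > limitBytes) {
                throw new OutOfMemoryError("allocating " + bytes
                        + " bytes would exceed the limit of " + limitBytes);
            }
            outstanding += bytes; // here the real adaptor forwards upstream
            return bytes;
        }

        void free(long bytes) { outstanding -= bytes; }

        long outstanding() { return outstanding; }
    }

    public static void main(String[] args) {
        LimitingAllocator alloc = new LimitingAllocator(1 << 20); // 1 MiB cap
        alloc.allocate(512 * 1024);        // fits under the cap
        boolean oom = false;
        try {
            alloc.allocate(768 * 1024);    // 512 KiB + 768 KiB > 1 MiB
        } catch (OutOfMemoryError e) {
            oom = true;
        }
        System.out.println(oom);           // prints true
    }
}
```

The adaptor approach keeps the limit logic independent of the underlying allocator, so the same wrapper works over the async resource, the pool, or the arena.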

@jlowe (Member) commented Sep 10, 2021

Note that we now support Java in CI, so Java PRs should no longer use [skip ci].

@jlowe jlowe changed the title support cuda async memory resource in jni [skip ci] support CUDA async memory resource in JNI Sep 10, 2021
@jlowe (Member) commented Sep 10, 2021

rerun tests

@rongou (Contributor, Author) commented Sep 10, 2021

rerun tests

@rongou (Contributor, Author) commented Sep 10, 2021

rerun tests

@jlowe (Member) commented Sep 13, 2021

rerun tests

@jrhemstad (Contributor) left a comment
fyi, you could also experiment with using cuda_async_resource as the upstream for arena. That might give you best of both worlds of what you're looking for.

codecov bot commented Sep 13, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4349232).
The diff coverage is n/a.

❗ Current head 6e8cf91 differs from pull request most recent head 5d91944. Consider uploading reports for the commit 5d91944 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9201   +/-   ##
===============================================
  Coverage                ?   10.82%           
===============================================
  Files                   ?      115           
  Lines                   ?    19166           
  Branches                ?        0           
===============================================
  Hits                    ?     2074           
  Misses                  ?    17092           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4349232...5d91944.

@rongou (Contributor, Author) commented Sep 13, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c6ddd46 into rapidsai:branch-21.10 Sep 13, 2021
@rongou (Contributor, Author) commented Sep 13, 2021

> fyi, you could also experiment with using cuda_async_resource as the upstream for arena. That might give you best of both worlds of what you're looking for.

@jrhemstad Yeah that's something we can try if it turns out small allocations are too expensive with async.

@abellina (Contributor) commented Sep 13, 2021

@jrhemstad filed rapidsai/rmm#868; we need to fix it before we start using the async allocator. He thought it was a quick fix that could be included in 21.10. FYI @sameerz

@jlowe (Member) commented Sep 13, 2021

> you could also experiment with using cuda_async_resource as the upstream for arena

It seems that would circumvent the fragmentation-solving feature we want from the async allocator. If arena only allocates large chunks from the async allocator, won't we still have fragmentation within the arena blocks, which the async allocator cannot solve since it is unaware of how its allocations are sub-utilized?

@rongou (Contributor, Author) commented Sep 14, 2021

The per-thread arenas are just caches for small allocations. If CUDA async proves to be slow for small allocations, we can use the arena allocator to speed them up, since a typical job performs tons of small allocations. The number of free blocks is now capped in each per-thread arena, so in theory it shouldn't cause too much additional fragmentation. If/when we decide to try this, we can probably tweak the algorithm further.
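The caching scheme described above can be sketched in plain Java: each thread keeps a small free list of blocks, and frees beyond the cap go back to the upstream resource instead of accumulating, which is what bounds the extra fragmentation. Illustrative only; the real implementation is RMM's arena allocator in C++.

```java
import java.util.ArrayDeque;

// Sketch of a per-thread small-allocation cache with a capped free list.
// Frees beyond MAX_CACHED_BLOCKS are returned to the upstream resource
// rather than accumulating, so per-thread caching stays bounded.
public class PerThreadArenaDemo {
    static final int MAX_CACHED_BLOCKS = 4;

    static class Arena {
        private final ArrayDeque<Long> freeBlocks = new ArrayDeque<>();
        long upstreamAllocs = 0;   // allocations served by the upstream resource
        long upstreamReturns = 0;  // blocks handed back to the upstream

        long allocate(long size) {
            if (freeBlocks.remove(size)) {
                return size; // cache hit: served from this thread's free list
            }
            upstreamAllocs++;
            return size;
        }

        void free(long size) {
            if (freeBlocks.size() < MAX_CACHED_BLOCKS) {
                freeBlocks.add(size); // keep it for reuse on this thread
            } else {
                upstreamReturns++;    // cap reached: give it back upstream
            }
        }
    }

    // one arena per thread, so the hot path needs no locking
    static final ThreadLocal<Arena> ARENAS = ThreadLocal.withInitial(Arena::new);

    public static void main(String[] args) {
        Arena arena = ARENAS.get();
        for (int i = 0; i < 6; i++) arena.free(256); // only 4 fit in the cache
        System.out.println(arena.freeBlocks.size()); // prints 4
        System.out.println(arena.upstreamReturns);   // prints 2
        arena.allocate(256);                         // cache hit
        System.out.println(arena.upstreamAllocs);    // prints 0
    }
}
```

With the async resource as the upstream, the cache absorbs the many tiny allocations while large, long-lived allocations still flow straight to the stream-ordered allocator.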
