[BUG] OOM when running NDS queries with UCX and GDS #2504
Comments
libcudf allocations go through RMM, and all RMM allocations are wrapped via the JNI memory handler that traps OOM exceptions and calls back into the JVM to try to spill. Therefore libcudf allocations should be covered, and allocating while spilling should not be a problem as long as there are buffers to spill. Was the spill storage exhausted before the OOM was ultimately thrown to the application? The executor logs should have a warning message logged when the OOM is thrown that looks something like this:
And just before that you should see one or more info log messages that show how much the device spill store had available at the time of each OOM that was trapped:
|
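To make the mechanism described above concrete, here is a minimal Python sketch of an allocate, spill, and retry loop. It is a toy model and not the plugin's code: `ToyPool`, `ToySpillStore`, and `alloc_with_spill` are hypothetical names, and the pool only counts bytes, so it cannot reproduce fragmentation.

```python
class ToyPool:
    """Byte-counting stand-in for a device memory pool (no fragmentation modeled)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.allocated = 0

    def alloc(self, size):
        if self.allocated + size > self.capacity:
            raise MemoryError(f"cannot allocate {size} bytes")
        self.allocated += size

    def free(self, size):
        self.allocated -= size


class ToySpillStore:
    """Holds spillable buffers that currently occupy pool memory."""

    def __init__(self, pool, buffer_sizes):
        self.pool = pool
        self.buffers = list(buffer_sizes)
        for size in self.buffers:
            pool.alloc(size)

    def spill(self, target):
        """Free buffers until at least `target` bytes are released or none remain."""
        freed = 0
        while self.buffers and freed < target:
            size = self.buffers.pop()
            self.pool.free(size)
            freed += size
        return freed


def alloc_with_spill(pool, store, size):
    """Retry the allocation after each spill; give up once the store is empty."""
    while True:
        try:
            pool.alloc(size)
            return True
        except MemoryError:
            if store.spill(size) == 0:
                return False  # store exhausted: the OOM reaches the caller


pool = ToyPool(capacity=100)
store = ToySpillStore(pool, [30, 30, 30])     # 90 of 100 bytes are spillable
ok = alloc_with_spill(pool, store, 50)        # succeeds after spilling two buffers
too_big = alloc_with_spill(pool, store, 200)  # never fits; the store drains to empty
```

In this model the only way to fail is to run out of buffers to spill, which is the situation the warning message reports.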
Ah that makes sense. Because gpu concurrency is set to 2, spilling lines are interspersed with adding new tables. Here is the output from
Executor task launch worker for task 15904 21/05/24 20:34:38:560 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 33414782208 bytes. Total RMM allocated is 39091921664 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:38:868 INFO DeviceMemoryEventHandler: Spilled 957535936 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:38:868 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 32511059648 bytes. Total RMM allocated is 38166767104 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:39:84 INFO DeviceMemoryEventHandler: Spilled 922488000 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:39:85 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 31592134400 bytes. Total RMM allocated is 37248090112 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:39:314 INFO DeviceMemoryEventHandler: Spilled 923951424 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:39:314 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 30671030144 bytes. Total RMM allocated is 36324637184 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:39:576 INFO DeviceMemoryEventHandler: Spilled 770921856 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:39:576 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 29749299776 bytes. Total RMM allocated is 35249146112 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:40:267 INFO DeviceMemoryEventHandler: Spilled 923513280 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:40:267 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 28831180736 bytes. Total RMM allocated is 34325514240 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:40:510 INFO DeviceMemoryEventHandler: Spilled 924609408 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:40:511 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 27911556992 bytes. Total RMM allocated is 33400769792 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:40:752 INFO DeviceMemoryEventHandler: Spilled 923612928 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:40:752 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 26994072704 bytes. Total RMM allocated is 32477029376 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:41:01 INFO DeviceMemoryEventHandler: Spilled 924478080 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:41:02 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 26074123520 bytes. Total RMM allocated is 31552404992 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:41:254 INFO DeviceMemoryEventHandler: Spilled 922651776 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:41:255 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 25155440768 bytes. Total RMM allocated is 30634233344 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:41:501 INFO DeviceMemoryEventHandler: Spilled 923119872 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:41:502 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 24236214080 bytes. Total RMM allocated is 29715189248 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:41:743 INFO DeviceMemoryEventHandler: Spilled 922323648 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:41:743 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 23317938560 bytes. Total RMM allocated is 28797192704 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:41:997 INFO DeviceMemoryEventHandler: Spilled 922265664 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:41:997 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 22399751360 bytes. Total RMM allocated is 27879741696 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:42:240 INFO DeviceMemoryEventHandler: Spilled 923620608 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:42:240 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 21479970944 bytes. Total RMM allocated is 26961197312 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:42:473 INFO DeviceMemoryEventHandler: Spilled 922864128 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:42:473 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 20561154752 bytes. Total RMM allocated is 26041150976 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:42:718 INFO DeviceMemoryEventHandler: Spilled 922604352 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:42:719 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 19642611008 bytes. Total RMM allocated is 25122911744 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:42:955 INFO DeviceMemoryEventHandler: Spilled 922385472 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:42:955 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 18724193792 bytes. Total RMM allocated is 24204867840 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:43:190 INFO DeviceMemoryEventHandler: Spilled 922324800 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:43:191 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 17806062464 bytes. Total RMM allocated is 23287089920 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:43:430 INFO DeviceMemoryEventHandler: Spilled 923287488 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:43:431 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 16886832512 bytes. Total RMM allocated is 22368237824 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:43:671 INFO DeviceMemoryEventHandler: Spilled 922334400 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:43:672 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 15968340032 bytes. Total RMM allocated is 21450115584 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:43:918 INFO DeviceMemoryEventHandler: Spilled 924839040 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:43:918 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 15047275904 bytes. Total RMM allocated is 20528381184 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:44:424 INFO DeviceMemoryEventHandler: Spilled 922786816 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:44:424 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 14128526656 bytes. Total RMM allocated is 19609404160 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:44:668 INFO DeviceMemoryEventHandler: Spilled 922778304 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:44:669 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 13209531136 bytes. Total RMM allocated is 18690748160 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:44:921 INFO DeviceMemoryEventHandler: Spilled 923547456 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:44:922 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 12289936768 bytes. Total RMM allocated is 17771073792 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:45:368 INFO DeviceMemoryEventHandler: Spilled 923549376 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:45:368 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 11369205760 bytes. Total RMM allocated is 16851182336 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:45:874 INFO DeviceMemoryEventHandler: Spilled 923342784 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:45:875 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 10449756736 bytes. Total RMM allocated is 15931006208 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:46:117 INFO DeviceMemoryEventHandler: Spilled 922883520 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:46:117 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 9530909248 bytes. Total RMM allocated is 15012487936 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:46:345 INFO DeviceMemoryEventHandler: Spilled 922749696 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:46:345 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 8612094976 bytes. Total RMM allocated is 14094076416 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:46:580 INFO DeviceMemoryEventHandler: Spilled 922618688 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:46:581 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 7693204736 bytes. Total RMM allocated is 13198959872 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:46:830 INFO DeviceMemoryEventHandler: Spilled 924134720 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:46:830 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 6775202112 bytes. Total RMM allocated is 12279469312 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:47:85 INFO DeviceMemoryEventHandler: Spilled 923583680 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:47:85 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 5855502400 bytes. Total RMM allocated is 11360095744 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:47:329 INFO DeviceMemoryEventHandler: Spilled 922276480 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:47:330 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 4937363712 bytes. Total RMM allocated is 10442732032 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:47:582 INFO DeviceMemoryEventHandler: Spilled 923401216 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:47:583 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 4018006976 bytes. Total RMM allocated is 9524222208 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:47:827 INFO DeviceMemoryEventHandler: Spilled 922386816 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:47:828 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 3099492224 bytes. Total RMM allocated is 8604730880 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:98 INFO DeviceMemoryEventHandler: Spilled 923372800 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:99 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 2180239168 bytes. Total RMM allocated is 7685363712 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:349 INFO DeviceMemoryEventHandler: Spilled 922616960 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:349 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 1261718912 bytes. Total RMM allocated is 6767174144 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:606 INFO DeviceMemoryEventHandler: Spilled 923679424 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:606 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 342021952 bytes. Total RMM allocated is 5847780864 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:717 INFO DeviceMemoryEventHandler: Spilled 342021952 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:717 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 3473664 bytes. Total RMM allocated is 5510065920 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:722 INFO DeviceMemoryEventHandler: Spilled 3472320 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:722 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 4190400 bytes. Total RMM allocated is 5510883840 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:725 INFO DeviceMemoryEventHandler: Spilled 4190400 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:725 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 494592 bytes. Total RMM allocated is 5507719936 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:725 INFO DeviceMemoryEventHandler: Spilled 494592 bytes from the device store
Executor task launch worker for task 15904 21/05/24 20:34:48:725 INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, device store has 0 bytes. Total RMM allocated is 5509950464 bytes.
Executor task launch worker for task 15904 21/05/24 20:34:48:725 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 922253344 bytes. Total RMM allocated is 5524215552 bytes.
Maybe the memory is just super fragmented? I do see a lot of tiny shuffle buffers. |
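A log like the one above can be summarized mechanically. The following is a minimal sketch, assuming the message format matches the lines quoted in this comment; the `summarize` helper and both regular expressions are illustrative and not part of the plugin.

```python
import re

FAILED = re.compile(
    r"Device allocation of (\d+) bytes failed, device store has (\d+) bytes\. "
    r"Total RMM allocated is (\d+) bytes"
)
SPILLED = re.compile(r"Spilled (\d+) bytes from the device store")


def summarize(lines):
    """Return (failed attempts, total bytes spilled, last RMM allocated total)."""
    attempts, spilled, last_rmm = 0, 0, None
    for line in lines:
        m = FAILED.search(line)
        if m:
            attempts += 1
            last_rmm = int(m.group(3))
            continue
        m = SPILLED.search(line)
        if m:
            spilled += int(m.group(1))
    return attempts, spilled, last_rmm


# The last five INFO lines of the log above, with the common prefix trimmed.
sample = [
    "INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, "
    "device store has 4190400 bytes. Total RMM allocated is 5510883840 bytes.",
    "INFO DeviceMemoryEventHandler: Spilled 4190400 bytes from the device store",
    "INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, "
    "device store has 494592 bytes. Total RMM allocated is 5507719936 bytes.",
    "INFO DeviceMemoryEventHandler: Spilled 494592 bytes from the device store",
    "INFO DeviceMemoryEventHandler: Device allocation of 922253344 bytes failed, "
    "device store has 0 bytes. Total RMM allocated is 5509950464 bytes.",
]
attempts, spilled, last_rmm = summarize(sample)
print(attempts, spilled, last_rmm)
```

Running it over a full executor log gives the total spilled per task, which can then be compared against the drop in "Total RMM allocated".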
There is a config feature to perform a heap dump on GPU OOM. It may be useful to take a look at the resulting heap dump and see what |
Looks like these errors happen with GDS disabled as well, so it's not specific to GDS spilling. |
Agree with @jlowe. The heap dump on OOM feature would help figure out what's going on here. |
We see a similar issue in Q64 TPCDS @ 3TB. In this case, it is a query that normally passes, but earlier in the month, it failed with what looks like a fragmented pool:
I'll take a look at running this query and capturing the heap at various places to see if we have leaks; it may overlap with this task. |
I got a hprof dump from one of the OOM queries. Looks like all the |
That's great @rongou. As far as I understand, spill is blocking, so if we wanted to allocate X and we freed Y where Y >= X, but we still can't allocate, it probably means that RMM is fragmented. But I can't recall what happens right after we free up to the target Y. Do we allocate in the same stack (holding a lock), or do we release the lock first? If we return and release the lock, then there could be a race with another task. In this case, the key seems to be in the first line and the last two lines of the log:
This was a ~40GB GPU, and it was mostly full at 20:34:38:560. At 20:34:48:725 we gave up, having freed about 33GB from the device store. Even so, the 900MB allocation in the last entry still fails while RMM reports only about 5GB allocated. So either the RMM allocated number is not accurate (because of races or some other issue), or we have enough fragmentation that a 900MB block can't be found. |
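The numbers behind that reasoning can be checked directly. This is a small sketch using values copied from the first and last "allocation failed" lines of the log posted earlier; the free list at the end is hypothetical and only illustrates how ample total free memory can still fail a contiguous request.

```python
GB = 10**9

# Values copied from the first and last "allocation failed" log lines.
rmm_first = 39_091_921_664    # Total RMM allocated at 20:34:38:560
store_first = 33_414_782_208  # device store size at 20:34:38:560
rmm_last = 5_509_950_464      # Total RMM allocated at 20:34:48:725
request = 922_253_344         # the allocation that kept failing

released = rmm_first - rmm_last
print(f"device store at start: {store_first / GB:.1f} GB")
print(f"RMM allocated dropped by: {released / GB:.1f} GB")
print(f"still allocated at the end: {rmm_last / GB:.1f} GB")
print(f"request: {request / GB:.2f} GB")


def can_allocate(free_blocks, size):
    """A contiguous allocation needs one free block that is large enough."""
    return any(block >= size for block in free_blocks)


# Hypothetical free list: plenty of free memory in total, but in small pieces.
free_blocks = [500_000_000] * 60
total_free = sum(free_blocks)
fits = can_allocate(free_blocks, request)
```

With 60 free blocks of 500 MB each, 30 GB is free in total, yet no single block can hold the 922 MB request.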
Looking at tweaking the arena allocator, which might help with this. |
Filed rapidsai/rmm#813 to be able to get more info from ARENA in cases like this; I believe that is a prerequisite for this issue. |
I think we may be able to learn quite a bit by leveraging the tracking allocator to build a rough approximation of the memory map. Minimally it could be used to help verify we don't have a small GPU memory leak that is aggravating the fragmentation. |
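As a rough illustration of that idea, here is a toy Python model of a tracking wrapper that records every outstanding allocation. It is not RMM's tracking allocator: `TrackingAllocator`, its bump-pointer addressing, and the tags are all assumptions made for the sketch.

```python
class TrackingAllocator:
    """Wraps a toy bump allocator and records every outstanding allocation."""

    def __init__(self):
        self.next_addr = 0
        self.live = {}  # address -> (size, tag)

    def alloc(self, size, tag):
        addr = self.next_addr
        self.next_addr += size
        self.live[addr] = (size, tag)
        return addr

    def free(self, addr):
        del self.live[addr]

    def outstanding_bytes(self):
        return sum(size for size, _ in self.live.values())

    def memory_map(self):
        """Live allocations sorted by address: a rough picture of the layout."""
        return sorted((addr, size, tag) for addr, (size, tag) in self.live.items())


tracker = TrackingAllocator()
a = tracker.alloc(1024, "shuffle buffer")
b = tracker.alloc(4096, "join output")
c = tracker.alloc(256, "shuffle buffer")
tracker.free(a)
tracker.free(b)
# `c` is never freed: anything still live at the end of a task is a leak suspect.
leaks = tracker.memory_map()
```

Anything left in the map once all tasks have finished points at a possible leak, and the sorted addresses show where live allocations sit relative to each other.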
Thanks @jlowe, I will take a look at that to make progress. |
I've made some progress on this and have a theory. If I set the
I came to this variable because of the logic on
If I set the
Because of the interleaving, the blocks likely cannot be merged (https://github.com/rapidsai/rmm/blob/branch-21.10/include/rmm/mr/device/detail/arena.hpp#L201), as they are not contiguous in VA space. |
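The point about merging can be shown with a small sketch. It assumes free blocks are plain `(start, size)` pairs and that two blocks merge only when one ends exactly where the next begins; `coalesce` is a hypothetical helper and not the arena allocator's code.

```python
def coalesce(free_blocks):
    """Merge free blocks that touch in address space.

    `free_blocks` is a list of (start, size) pairs. Two blocks merge only when
    one ends exactly where the next begins.
    """
    merged = []
    for start, size in sorted(free_blocks):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_size = merged.pop()
            merged.append((prev_start, prev_size + size))
        else:
            merged.append((start, size))
    return merged


# Contiguous free blocks collapse into one large block.
contiguous = coalesce([(0, 100), (100, 100), (200, 100)])

# Interleaved layout: every other 100-byte slot is still in use, so the free
# blocks never touch and nothing can be merged.
interleaved = coalesce([(0, 100), (200, 100), (400, 100)])
```

Both inputs hold 300 free bytes, but only the contiguous one yields a block large enough for a 300-byte request.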
I tracked allocations and frees and their respective sizes, especially separating the blocks under I am also tracking gaps between blocks: (i.e. |
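Gap tracking of this kind could look like the sketch below. It assumes a gap means the unallocated address range between two consecutive blocks when sorted by start address; the definition used in the actual experiment is cut off above, and `gaps_between` is a hypothetical helper.

```python
def gaps_between(blocks):
    """Return (gap_start, gap_size) for every hole between address-sorted blocks.

    `blocks` is a list of (start, size) pairs; overlapping blocks are not expected.
    """
    ordered = sorted(blocks)
    gaps = []
    for (start, size), (next_start, _) in zip(ordered, ordered[1:]):
        end = start + size
        if next_start > end:
            gaps.append((end, next_start - end))
    return gaps


blocks = [(0, 64), (64, 32), (128, 64), (256, 16)]
gaps = gaps_between(blocks)
largest_gap = max(size for _, size in gaps)
```

The largest gap is an upper bound on the biggest allocation that could be placed between existing blocks.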
@rongou has opened a PR for this issue here: rapidsai/rmm#845, so we are mostly working there. I'll close this issue once we have a cuDF that includes the fix. |
At this point this issue is waiting for Rong's PR. I made a comment there that is going to take a little bit of work to address, but @rongou mentioned he will be able to get to it (rapidsai/rmm#845 (comment)). The main concern is that if local arenas can't reach 0 free blocks, we could very well end up in a similar place as before. The plan is to fix that so we can reach 0 free blocks in all arenas on OOM, which ensures the arena can get back to its initial state. |
We've improved the arena allocator as much as we could. The OOM issue is better now but not completely eliminated. Switching to the async allocator should further help. #4515 |
Describe the bug
When running TPC-DS at scale factor 5000 on a YARN cluster with 8xA100 40 GB GPUs, we see out-of-memory errors in libcudf code (see full stacktrace below). They are probably caused by native code allocating memory while Java/Scala is actively spilling.
Steps/Code to reproduce bug
TPC-DS at scale factor 5000, query 14a/b, 16, 24b.
Expected behavior
Should not throw OOM errors.
Environment details (please complete the following information)
Additional context
Full stacktrace:
@jlowe @abellina @revans2