[FEA] move some deserialization code out of the scope of the gpu-semaphore to increase CPU concurrency #679
Comments
This is the related issue: #622
@JustPlay We should be able to move those APIs to make them more generic so we can reuse them in other places. @jlowe I know you don't like callbacks because they make the code hard to reason about, and I agree, but I was curious what you thought about this situation?
Why not just copy the related code snippet(s) into rapids-plugin?
We could in the short term, but it is a question of separation of concerns and the goals of each project. One of the reasons we put the serialization code in cudf was so that users of the cudf library did not have to know the exact memory layout. We have changed that somewhat since we started to implement the UCX shuffle, but I think it generally still holds true that a user of cudf should not need to know the memory layout to get decent performance. Cudf is trying to be a generic backend, and if we have run into an issue with Spark that is not some odd corner case of how Spark handles a specific SQL command, then others would likely want that same functionality. In this case I can totally see that other users might also want to know when data is completely off of the GPU, or when external I/O has completed and GPU processing is ready to start.
I don't mind callbacks within the JVM so much, as those can be traced without too much effort. What becomes really hard to trace are callbacks from JNI into the JVM, as the IDE won't help you there. If we keep the serialization logic in cudf, we could solve this in a couple of ways:
The main point of the GPU semaphore is to limit the amount of GPU memory consumed by limiting concurrent tasks on the GPU. The general rule is: if a task has unspillable buffers on the GPU then it must be holding the GPU semaphore. Building the hash map necessarily means the task has buffers on the GPU, therefore it holds the semaphore. Without it, a lot of tasks could all build large hash tables in GPU memory such that no single task can load a batch of the stream table and make progress without triggering an OOM. Our plan to fix this involves making the buffers spillable while a task is collecting columnar batches. Once the buffers are spillable, the task can release the semaphore while waiting for new batches to arrive. If GPU memory pressure from other tasks gets too high then these buffers can be spilled out of device memory to avoid the OOM becoming fatal to the application.
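To make that plan a bit more concrete, here is a minimal sketch. The helper names (`GpuSemaphore.acquireIfNecessary`/`releaseIfNecessary`, `SpillableColumnarBatch`) are assumptions modeled on the plugin code, not the final implementation:

```scala
// Sketch only: names and signatures below are assumed for illustration.
import scala.collection.mutable.ArrayBuffer

import com.nvidia.spark.rapids.{GpuSemaphore, SpillableColumnarBatch} // names assumed
import org.apache.spark.TaskContext
import org.apache.spark.sql.vectorized.ColumnarBatch

def collectBuildSide(batches: Iterator[ColumnarBatch]): Seq[SpillableColumnarBatch] = {
  val collected = new ArrayBuffer[SpillableColumnarBatch]()
  while (batches.hasNext) {
    // Hold the semaphore only while this task has unspillable data on the GPU.
    GpuSemaphore.acquireIfNecessary(TaskContext.get())
    val batch = batches.next()                  // produces a device-resident batch
    collected += SpillableColumnarBatch(batch)  // register it so it can spill under pressure
    // Everything this task owns is now spillable, so it is safe to let other tasks in
    // while we wait (possibly on I/O) for the next batch to arrive.
    GpuSemaphore.releaseIfNecessary(TaskContext.get())
  }
  collected.toSeq
}
```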
The issue is about memory management. We agree that in the ideal case I/O should not happen while the semaphore is held, but there is not a simple solution to this. We can fix it for the deserialization code, as was requested, but it will only help for the very first batch being read. We need a more holistic approach to make it work properly and quickly. We are still trying to think through what the ideal solution would look like, but with the 0.2.0 release close by, we have been a bit distracted trying to get everything ready for that. In general I see a few things we want.
Can you give a more detailed explanation about
When the buffer is needed later, the buffer's ID can be used to look up and acquire (lock) the buffer. If all of the GPU memory being held by a task is in the form of unlocked, spillable buffers, then it should be able to let go of the GPU semaphore while performing other operations, since a new task arriving will at worst force the old task's buffers out of the device store. It will not lead to an OOM that wouldn't also happen with the old task.
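In case it helps, here is a purely illustrative sketch of the lookup-by-ID pattern described above. The trait and method names are invented for illustration and are not the plugin's actual catalog API:

```scala
// Hypothetical sketch of a spillable device store: buffers are registered under an ID and,
// while unlocked, may be spilled to host memory or disk when another task needs device memory.
trait SpillableDeviceStore {
  // Register device data under an ID; while it stays unlocked it is a candidate for spilling.
  def addBuffer(id: Long, data: Array[Byte]): Unit
  // Look the buffer up by ID and lock it; if it was spilled, it is brought back to device first.
  def acquireBuffer(id: Long): AutoCloseable
}

def useBufferLater(store: SpillableDeviceStore, id: Long): Unit = {
  val locked = store.acquireBuffer(id)  // pinned in device memory while locked
  try {
    // ... operate on the device buffer ...
  } finally {
    locked.close()                      // unlock: the buffer becomes spillable again
  }
}
```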
I am a little confused. Whether
The gpu semaphore limits the task concurrency. Even if the number of cpu cores is bigger than the gpu concurrency, tasks cannot run once some of them are holding the semaphore (before releasing it). So the shuffle data reading (disk io, net, cpu decompress) cannot overlap with the gpu processing, and often the cpu/net/io cannot feed enough data to the gpu. Without spilling, the gpu semaphore cannot be released. So can we make some of the cpu part (e.g. data reading) async? e.g. we could add a background async cpu thread pool to read data for the spark tasks: the async threads read data from disk/net into memory, so the spark task can take shuffled data directly from memory. When the async threads read data quickly enough, the gpu will always have enough data to process, and the gpu semaphore limit will not be a problem. I am not familiar with shuffle fetch on the cpu, nor with the code path of shuffle fetch in rapids, so I need some help.
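To illustrate the idea, a rough sketch of such a background prefetcher (the class and its behaviour are hypothetical, and block ordering is ignored):

```scala
// Hypothetical background prefetcher: worker threads read serialized shuffle blocks from
// disk/network into host memory, so the task thread only ever consumes in-memory data.
import java.io.InputStream
import java.util.concurrent.{Executors, LinkedBlockingQueue}

class ShuffleBlockPrefetcher(blocks: Seq[() => InputStream], numThreads: Int = 2) {
  // Bounded queue so prefetching cannot use unbounded host memory.
  private val ready = new LinkedBlockingQueue[Array[Byte]](8)
  private val pool = Executors.newFixedThreadPool(numThreads)

  def start(): Unit = blocks.foreach { open =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        val in = open()
        try ready.put(in.readAllBytes())  // blocking I/O happens off the task thread
        finally in.close()
      }
    })
  }

  def next(): Array[Byte] = ready.take()  // the task thread takes fully-read blocks
  def stop(): Unit = pool.shutdown()
}
```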
I'm not sure a thread pool is the solution in this case. Spark already has a thread pool for these types of situations, and has a bunch of features to avoid doing a DDOS during the shuffle. Instead what we need to do is, in JCudfSerialization, have a callback between reading the data into host memory and putting that data on the GPU. In this case I think it is in between these two lines.
The second of those lines will copy the data from the host buffer to the device and then turn it into a table. After that is done, you would change the shuffle code to grab the semaphore as part of the callback instead. spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchSerializer.scala (line 152 in fdf731c)
This only fixes the reading side of the issue. Very similar changes could be made for the write side.
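If it helps to visualize that flow, here is a minimal sketch. The callback-taking overload of `readTableFrom` does not exist today and is purely hypothetical, and `GpuSemaphore.acquireIfNecessary` is assumed (from the plugin source) to be the existing way a task grabs the semaphore:

```scala
import java.io.InputStream

import ai.rapids.cudf.{JCudfSerialization, Table}
import com.nvidia.spark.rapids.GpuSemaphore   // name assumed from the plugin
import org.apache.spark.TaskContext

def deserializeBatch(in: InputStream): Table = {
  // The callback only grabs the semaphore; everything CPU-heavy (header parsing, reading
  // into pinned host memory) runs before it fires, so this task is not holding the
  // semaphore while it is mostly doing I/O.
  val grabSemaphore: Runnable = () =>
    GpuSemaphore.acquireIfNecessary(TaskContext.get())  // GPU is needed only from here on

  // Hypothetical overload: invoke the callback after host-side deserialization and just
  // before device memory allocation and the host-to-device copy.
  JCudfSerialization.readTableFrom(in, grabSemaphore)
}
```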
In the
@revans2 In RapidsShuffleManager, I see there is a producer-consumer model; is this an async thread pool to fetch shuffled data?
What do you mean by reading side and writing side? @revans2
@JustPlay there is a lot going on here with shuffle. The small change that I explained here is a simple thing that we can do right away to improve performance when spark shuffle is being used. You are right that it will not fix the issue totally, especially in the case of a join. But like I said here, that is going to take an in-depth design to work out how to fix it. If you want to try and do that design I would be happy to review it, but as it stands right now I have other things on my plate that prevent me from doing it.
In a shuffle there are two phases. The output of one task is written and becomes the input to another task that reads it. The GPU semaphore is involved in both cases, although the reading side tends to be more of a problem, because it is generally remote.
We have modified rapids and tested it; the performance improvement is negligible, just like what you have mentioned.
We will take a deeper look at it, and hope we have the ability to try to implement them. Thanks. @revans2
Is your feature request related to a problem? Please describe.
Move some deserialization code out of the scope of the gpu-semaphore to increase CPU concurrency.
Describe the solution you'd like
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchSerializer.scala (line 152 in 625cd6c)
JCudfSerialization.readTableFrom() contains both high-overhead cpu code and gpu code. The gpu code (including device memory allocation and copy) should be executed within the scope of the gpu-semaphore, but the cpu code does not need to be.
https://github.com/rapidsai/cudf/blob/19237a09fe9b80706ef463580c6a646f6f9ec2ae/java/src/main/java/ai/rapids/cudf/JCudfSerialization.java#L1505
This function, I think, does the following things:
Steps 1 and 2 do not need the gpu-semaphore, and it would be better to move them out of the gpu-semaphore's scope.
https://github.com/rapidsai/cudf/blob/19237a09fe9b80706ef463580c6a646f6f9ec2ae/java/src/main/java/ai/rapids/cudf/JCudfSerialization.java#L1522
Only this line of code needs the protection of the gpu-semaphore.
All the code before it does not touch device memory; it only touches the host buffer (pinned memory), so I think that code can be executed without holding the gpu-semaphore (just like the Scan operator).
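As a rough illustration of that split (the structure is paraphrased; readTableFrom's internals are private and the real code differs), only the last two steps below actually touch the GPU:

```scala
// Illustrative phases only; header parsing and the table-building step are elided.
import java.io.InputStream

import ai.rapids.cudf.{DeviceMemoryBuffer, HostMemoryBuffer}

def readTableFromSketch(in: InputStream): Unit = {
  // 1. Parse the serialized header (row count, column types, buffer length): CPU only.
  // 2. Read the column data into a (pinned) host buffer: CPU and I/O only.
  val host = HostMemoryBuffer.allocate(1024L)  // the real code sizes this from the header
  try {
    // ... fill `host` from `in` ...
    // 3. Allocate device memory and copy host -> device: the only part that actually
    //    needs the gpu-semaphore to be held.
    val dev = DeviceMemoryBuffer.allocate(host.getLength)
    try {
      dev.copyFromHostBuffer(host)
      // 4. Build the Table (column views) over the device buffer.
    } finally {
      dev.close()
    }
  } finally {
    host.close()
  }
}
```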
I think we should move some code from cuDF into rapids.