[QST] TPC-DS Query-93 on gpu is slow #622
Comments
@chenrui17 I have not been able to reproduce this yet, but I am going to keep trying. I have tried with scale factor 200 on a single node on top of spark-3.0.0, but looking at the logs you put in the other issue, it looks like you are using 3.0.1 with adaptive query execution enabled and a 1TB data set. But the detailed log you gave was for the CPU run, so I am not sure if these configs were the same for both runs.
Was the data partitioned? This can have a huge impact on GPU performance because of small files.
I have been playing around with AQE with this query, and I am seeing a lot of slowness when reading the input data because it highly partitions the output data. The slowness you are seeing could be related to fetching lots of small partitions. If you are running with 3000 partitions on the GPU side, it could be related to this.
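To illustrate the small-files concern, here is a minimal sketch (the path and values are hypothetical, and `spark.sql.files.maxPartitionBytes` is standard Spark rather than part of the plugin) of checking how many input partitions a partitioned table produces:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical path and values, only to illustrate checking for the
// many-small-partitions problem on a partitioned data set.
object InputPartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // placeholder; normally set by spark-submit
      .appName("tpcds-input-partition-check")
      // Larger input partitions mean fewer, bigger batches for the GPU to process.
      .config("spark.sql.files.maxPartitionBytes", "512m")
      .getOrCreate()

    // store_returns is one of the tables query 93 reads; the path is an example.
    val storeReturns = spark.read.parquet("/data/tpcds-100g/store_returns")
    println(s"input partitions: ${storeReturns.rdd.getNumPartitions}")
  }
}
```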
@revans2 I confirmed that the data is partitioned.
In fact, I used a 100 GB data set; why do you say I used a 1 TB data set? About AQE, I'll try turning it off for testing.
about "spark.sql.shuffle.partitions " , I found that it did affect performance , but if i set it smaller , it often OOM on big data set ,like 1TB data set, even I use concurrentTask=1 ,and rapids.BatchSize=256m,but it also OOM to TPC-DS query 14a |
In the log file you attached, the input path included a data size that suggested 1 TB. I ran with a partitioned 100 GB data set on a single 16 GB V100. I can try and play around with the data size to see what I end up with, but I had a much larger batch size than that too. It sounds like you might be running into issues with PTDS and memory fragmentation in RMM. @jlowe, what do you think? I have been running with a 1 GB batch size and a 1 GB input partition size, and I was able to run with a shuffle.partitions of 2 with PTDS disabled.
We've definitely seen increased OOM issues with per-thread default stream enabled, which led us to turn that off in the recent cudf-0.15-SNAPSHOT builds. @chenrui17 I would recommend retrying with smaller partitions and a cudf built with PTDS disabled. If you are getting the latest cudf-0.15-SNAPSHOT builds from Sonatype then it should have PTDS off. If you're building cudf on your own, make sure you do not specify PER_THREAD_DEFAULT_STREAM, or specify PER_THREAD_DEFAULT_STREAM=OFF, in both the cpp and java builds.
The amount of data displayed in the path is inconsistent with the actual data size.
Yes, it has been disabled in the snapshot builds since the 20200828.210405-58 snapshot. |
@revans2 In addition, I would like to ask a question: do I need to switch the Spark version from 3.0.1 to 3.0.0?
@jlowe my nsys file and history UI are about 100 MB+; how can I send them to you?
@chenrui17 you should not need to switch from 3.0.1 to 3.0.0. We "support" both of them, but until 3.0.1 is released it is a moving target. Note that the RAPIDS Accelerator requires 3.0.1+ to be able to support Adaptive Query Execution.
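A tiny, purely illustrative sanity check of that version requirement:

```scala
import org.apache.spark.SPARK_VERSION

// Lexicographic comparison is good enough for 3.0.x version strings.
require(SPARK_VERSION.compareTo("3.0.1") >= 0,
  s"AQE with the RAPIDS Accelerator assumes Spark 3.0.1+, found Spark $SPARK_VERSION")
```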
@chenrui17 I looked at the nsys trace. I don't see any task taking 36 seconds, as reported in the GPU join metrics in the history UI. I'm guessing this trace was of an executor that did not run the task that hit that long build time, so it's not necessarily representative of what happened on that task. None of the builds or joins took very long in the trace once the task started processing. I did see what appears to be tasks spread out in the second stage, and I'm assuming this is waiting for the GPU semaphore. I can't know for sure since Java NVTX ranges were missing from the trace. To add them, you would need to build your own cudf jar, as the snapshot-published version has NVTX ranges turned off for performance (i.e., the published snapshots build libcudf with NVTX disabled).

We've tried to reproduce this behavior locally but have not seen such large discrepancies between the join build time metric and one of the coalesce batch collect metrics above it.
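For reference, a minimal sketch of the NVTX-range pattern on the Java side (the range name is made up, and the ranges only show up in an nsys trace when cudf is built with NVTX enabled):

```scala
import ai.rapids.cudf.{NvtxColor, NvtxRange}

// Illustrative helper: wrap a section of work in an NVTX range so it appears
// as a named span in the nsys timeline.
def withNvtxRange[T](name: String)(body: => T): T = {
  val range = new NvtxRange(name, NvtxColor.GREEN)
  try body finally range.close()
}

// e.g. withNvtxRange("build hash table") { /* GPU work */ }
```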
I spoke too soon. @revans2 was able to reproduce the issue, and we discovered that GpuCoalesceBatches can call its input iterator inside its own metric timing, so time the upstream spends blocked (for example, waiting for the GPU semaphore) gets charged to the coalesce metric.

@chenrui17 to answer your original question, I'm fairly confident this large amount of time is mostly time the task spent waiting for its turn on the GPU.
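A simplified, hypothetical illustration of that metric-attribution problem (not the actual GpuCoalesceBatches code):

```scala
import java.util.concurrent.atomic.AtomicLong

// The timer is running while we pull from the upstream iterator, so any time
// the upstream spends blocked (e.g. waiting for the GPU semaphore) is charged
// to this operator's "collect" metric even though it is not this operator's work.
def collectWithMetric[T](upstream: Iterator[T], collectTimeNs: AtomicLong): Seq[T] = {
  val start = System.nanoTime()
  val batches = upstream.toList // may block on work this operator does not own
  collectTimeNs.addAndGet(System.nanoTime() - start)
  batches
}
```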
Do you mean that the task time is mainly spent waiting on the semaphore, fighting for the right to use the GPU?
The main bottleneck during this portion of the query is waiting to acquire the semaphore. However the tasks owning the semaphore are not making full use of the GPU. The main portion of time they are spending is decompressing shuffle data on the CPU and copying it down to the GPU. They need to own the GPU semaphore during this phase because they are placing data onto the GPU. The whole point of the GPU semaphore is to prevent too many tasks from placing data onto the GPU at the same time and exhausting the GPU's memory.

Essentially the main bottleneck in that stage is dealing with the shuffle data and transfer to the GPU, because that's what's taking so long for the tasks holding the GPU semaphore to release it. Once the shuffle data is loaded on the GPU, the rest of the stage processing is quite fast.

The RapidsShuffleManager was designed explicitly to target this shuffle problem, as it tries to keep shuffle targets in GPU memory and not rely on the CPU for compression/decompression which can be a bottleneck. Unfortunately there are a number of issues with RapidsShuffleManager that prevent it from working well in all situations, but we're actively working on improving it. Our goal is to eventually have that shuffle manager be the preferred shuffle when using the RAPIDS Accelerator, even if the cluster does not have RDMA-capable hardware.
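As a rough sketch of switching to that shuffle manager (the class name is shim-specific; a Spark 3.0.1 shim is assumed here, so check the plugin documentation for the class matching your Spark and plugin versions):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration only; exact class names and required settings
// depend on the plugin release in use.
val spark = SparkSession.builder()
  .master("local[*]") // placeholder; normally set by spark-submit
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark301.RapidsShuffleManager")
  // The external shuffle service is not compatible with this shuffle manager.
  .config("spark.shuffle.service.enabled", "false")
  .getOrCreate()
```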
If you add more GPUs (and thus more executors, since an executor with the plugin can only control one GPU), yes, performance should improve up to a point. This would be similar to adding executors to a CPU job. If you have enough GPUs in your cluster so that CPU_cores_per_executor == concurrent_GPU_tasks, then no task will ever wait on the GPU semaphore.
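A back-of-the-envelope version of that rule of thumb, with hypothetical numbers:

```scala
// If every executor core can hold the GPU semaphore at the same time,
// no task ever waits on it.
val executorCores      = 4 // spark.executor.cores
val concurrentGpuTasks = 2 // spark.rapids.sql.concurrentGpuTasks
val maxTasksBlockedOnSemaphore = math.max(0, executorCores - concurrentGpuTasks)
println(s"up to $maxTasksBlockedOnSemaphore tasks per executor can be waiting on the GPU semaphore")
```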
So, is it only a time-accounting problem? What is the root cause of TPC-DS query 93 being so slow?
I think this is common for most queries, but why does waiting on the GPU semaphore hurt performance? Would increasing the number of concurrent GPU tasks help?
Some questions below.

BTW: [spark-defaults.conf settings]
I cannot find the code snippet(s) used to compare the batch's size (or the coalesced batch's size) with the target size.
Two questions: why does the GpuCoalesceBatches node appear on only one side? And what is the strategy for judging whether to insert a GpuCoalesceBatches node?
It should be on both sides, and it is a bug if it is not. What query did you run, and what config settings did you use? My guess is that it is probably related to AQE in some way, but I am just speculating at this point.
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCoalesceBatches.scala, lines 268–323 (at commit 2ee1a8d)
The code is in a generic place because we use it for both the host-side and the device-side concat. The target size is inserted when the coalesce node is added to the physical plan.

I should also clarify that the only time you should not see a GpuCoalesceBatch after a shuffle is if the data right after the shuffle is being sent directly to the host side. In all other cases it should be inserted.
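A heavily simplified sketch of the concat-until-target-size idea the linked code implements (the real GpuCoalesceBatches also handles other goals, host- vs. device-side concat, spilling, and metrics):

```scala
import scala.collection.mutable.ArrayBuffer

// Batches are modeled here as (data, sizeBytes) pairs.
def coalesceToTarget[T](batches: Iterator[(T, Long)], targetSizeBytes: Long): Iterator[Seq[T]] =
  new Iterator[Seq[T]] {
    override def hasNext: Boolean = batches.hasNext
    override def next(): Seq[T] = {
      val out = ArrayBuffer[T]()
      var bytes = 0L
      // Keep appending input batches until the running size reaches the target
      // (so a group can overshoot the target by at most one batch).
      while (batches.hasNext && bytes < targetSizeBytes) {
        val (batch, size) = batches.next()
        out += batch
        bytes += size
      }
      out.toSeq
    }
  }
```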
[pic1–4: query plan screenshots]
But if AQE really takes effect, should there be a GpuCoalesceBatch after the shuffle?
Since this issue originated as a question about slowness on query 93, I filed #698 to track the missing coalesce issue separately. |
@JustPlay Yes, I would expect to see a GpuCoalesceBatch after the shuffle in that case.
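One quick way to check (suggested here, not taken from the thread) is to look for the operator in the executed plan string after the query runs:

```scala
import org.apache.spark.sql.DataFrame

// Run the query so AQE finalizes the plan, then look for the operator name.
def hasGpuCoalesce(df: DataFrame): Boolean = {
  df.collect() // force execution so the adaptive plan is final
  df.queryExecution.executedPlan.toString.contains("GpuCoalesceBatches")
}
```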
@chenrui17 is there anything left to answer or can this be closed? |
Closing since there was no response. Please reopen or file a new question if needed. |
This issue was moved to a discussion.
You can continue the conversation there.
This is a follow-on to rapidsai/cudf#6107.
@jlowe I will use Nsight Systems to profile this query later and upload the file.