It looks like the changes in PR #3288 are causing a performance regression for q82, pushing its runtime to several minutes. I noticed a lot of time spent in the following code, and @jlowe suggested reverting the change. After doing so, the query runs in roughly 20 seconds again.

I think I know what triggered this. In the new code, we always use Spark's build-side table to build the hash table used in the join. In the old code, however, the hash table was built implicitly as part of the libcudf join call, and libcudf's inner join logic would choose whichever table had fewer rows as the build side. For query 82, there is an inner join where the right-side table is enormous compared to the left-side table. That means the old code would build a tiny hash table from the left, whereas the new code builds an enormous one from the right. Building the hash table is the most expensive part of the join, so we end up losing badly to the old code, which picked the smaller table as the build table for inner joins.
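For illustration, here is a minimal sketch of the build-side heuristic described above. It is not the plugin's or libcudf's actual API; the names `BuildSide`, `InnerJoinBuildSide`, and `choose` are hypothetical, and the idea is simply that for an inner join the smaller side should be the build side rather than whatever side Spark designated.

```scala
// Hypothetical sketch of the heuristic discussed above, not the plugin's real API:
// for an inner join, build the hash table from whichever side has fewer rows,
// falling back to Spark's planned build side only when the counts are equal.

sealed trait BuildSide
case object BuildLeft extends BuildSide
case object BuildRight extends BuildSide

object InnerJoinBuildSide {
  /**
   * Pick the build side for an inner join based on row counts.
   * `sparkBuildSide` is the side Spark planned as the build side; it is only
   * used as a tie-breaker.
   */
  def choose(leftRows: Long, rightRows: Long, sparkBuildSide: BuildSide): BuildSide = {
    if (leftRows < rightRows) BuildLeft
    else if (rightRows < leftRows) BuildRight
    else sparkBuildSide
  }
}
```

In the q82 scenario, something like `InnerJoinBuildSide.choose(smallLeftCount, hugeRightCount, BuildRight)` would return `BuildLeft`, so the hash table is built from the tiny left table instead of the enormous right one.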