[BUG] NoClassDefFoundError with caller classloader off in GpuShuffleCoalesceIterator in local-cluster #5513

Closed
gerashegalov opened this issue May 17, 2022 · 1 comment · Fixed by #5614
Labels: bug (Something isn't working), P0 (Must have for release)

@gerashegalov (Collaborator)

Describe the bug
Running test_cartesian_join_special_case_count fails with:
Caused by: java.lang.NoClassDefFoundError: com/nvidia/spark/rapids/Arm

E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$3(GpuShuffleCoalesceExec.scala:214)
E                       at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
E                       at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.withResource(GpuShuffleCoalesceExec.scala:191)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$2(GpuShuffleCoalesceExec.scala:213)
E                       at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
E                       at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.withResource(GpuShuffleCoalesceExec.scala:191)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$1(GpuShuffleCoalesceExec.scala:207)
E                       at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
E                       at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.withResource(GpuShuffleCoalesceExec.scala:191)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:206)
E                       at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:191)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:283)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:238)
E                       at scala.Option.getOrElse(Option.scala:189)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:235)
E                       at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:181)
E                       at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$2(GpuColumnarToRowExec.scala:241)
E                       at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
E                       at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
E                       at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:187)
E                       at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:238)
E                       at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:215)
E                       at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:255)
E                       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
E                       at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
E                       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
E                       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
E                       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
E                       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
E                       at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
E                       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E                       at org.apache.spark.scheduler.Task.run(Task.scala:131)
E                       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
E                       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
E                       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
E                       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E                       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E                       ... 1 more

Steps/Code to reproduce bug
Invoke:

TEST_PARALLEL=0 \
  PYSP_TEST_spark_rapids_force_caller_classloader=false \
  NUM_LOCAL_EXECS=1 \
  ./integration_tests/run_pyspark_from_build.sh -k test_cartesian_join_special_case_count

Expected behavior
The test should pass.

Environment details (please complete the following information)

  • Environment location: Standalone

  • Additional context
    Originally reported by @pxLi (h/t).

@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 17, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 17, 2022
@gerashegalov gerashegalov removed their assignment May 20, 2022
@gerashegalov (Collaborator, Author)

This issue is caused by #4588, which in 22.04 added the Scala class ai.rapids.cudf.HostConcatResultUtil. Our build assumes that all classes under ai.rapids are Java classes from the rapidsai/cudf repo with no compatibility concerns, and keeps them in the conventional jar location. HostConcatResultUtil, however, is a Scala class with direct references to Arm, which is not visible to the conventional classloader. So when we do not forcefully modify the "conventional" classloader (the spark.rapids.force.caller.classloader=false case from the repro above), we hit this issue.
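
For illustration, a minimal sketch of the failing pattern, using simplified stand-ins (the trait body, the object name, and its method are hypothetical; only the Arm/withResource names come from the stack trace above). A class kept in the conventional jar area that mixes in a trait living only in the shimmed area cannot even be defined by the conventional classloader, because resolving its superinterface fails:

// Simplified stand-in for com.nvidia.spark.rapids.Arm, which lives in the
// shimmed part of the plugin jar and is visible only to the shim-aware
// classloader.
trait Arm {
  def withResource[T <: AutoCloseable, V](r: T)(block: T => V): V =
    try block(r) finally r.close()
}

// Hypothetical analogue of ai.rapids.cudf.HostConcatResultUtil: the build keeps
// everything under ai.rapids in the conventional jar location, yet this object
// mixes in Arm. When the conventional classloader defines the class it must
// resolve the Arm superinterface, cannot find it, and throws
// java.lang.NoClassDefFoundError: com/nvidia/spark/rapids/Arm.
object HostConcatResultUtilSketch extends Arm {
  def readAll(in: java.io.InputStream): Array[Byte] =
    withResource(in) { s =>
      Iterator.continually(s.read()).takeWhile(_ != -1).map(_.toByte).toArray
    }
}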

@gerashegalov gerashegalov self-assigned this May 24, 2022
@gerashegalov gerashegalov added the P0 Must have for release label May 24, 2022
@gerashegalov gerashegalov added this to the May 23 - Jun 3 milestone May 24, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 24, 2022
… to be usable in different packages of spark-rapids to address NVIDIA/spark-rapids#5513

Signed-off-by: Gera Shegalov <gera@apache.org>

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10949
gerashegalov added a commit that referenced this issue May 25, 2022
Don't use the ai.rapids.cudf package for spark-rapids Scala classes. Otherwise such a class is loaded by the conventional classloader and fails to load referenced classes from the shimmed areas.
- Move the class
- Add a smoke test to premerge to prevent this sort of regression

Closes #5513. 

Depends on rapidsai/cudf#10949
    
Signed-off-by: Gera Shegalov <gera@apache.org>
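
To illustrate the smoke-test idea from that commit message, here is a hypothetical sketch (not the test that was actually added; the object name, the jar/class arguments, and the classloader setup are assumptions). Loading a conventional-area class with a classloader that cannot see the shimmed classes turns an accidental dependency on them into an immediate NoClassDefFoundError, instead of one that surfaces later on an executor:

import java.net.{URL, URLClassLoader}

// Hypothetical premerge-style smoke check, for illustration only.
object ConventionalClassLoaderSmokeSketch {
  def main(args: Array[String]): Unit = {
    // args(0): URL of the plugin jar, e.g. file:/.../rapids-4-spark_2.12.jar
    val pluginJar = new URL(args(0))
    // A null parent restricts the loader to bootstrap classes plus the jar
    // root, roughly what the "conventional" classloader can see.
    val conventionalOnly = new URLClassLoader(Array(pluginJar), null)
    // args(1): fully qualified name of a class that must stay loadable from
    // the conventional area. Defining it forces resolution of its superclass
    // and superinterfaces, so a stray mix-in of a shimmed trait such as
    // com.nvidia.spark.rapids.Arm fails right here with NoClassDefFoundError.
    val cls = Class.forName(args(1), false, conventionalOnly)
    println(s"OK: ${cls.getName} loads without the shimmed classes")
  }
}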