
[BUG] CDH integration tests ClassNotFoundException: com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager #6417

Closed
Tracked by #6444
tgravescs opened this issue Aug 25, 2022 · 7 comments
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@tgravescs (Collaborator)

Describe the bug
The Cloudera integration tests failed with:

09:03:43  22/08/25 14:03:43 INFO client.TransportClientFactory: Successfully created connection to rl-r7525-d32-u35.raplab.nvidia.com/10.150.166.217:33094 after 4 ms (0 ms spent in bootstraps)
09:03:43  Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
09:03:43  	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1916)
09:03:43  	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
09:03:43  	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
09:03:43  	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
09:03:43  	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
09:03:43  Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager

We need to investigate what is going on.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify P0 Must have for release labels Aug 25, 2022
@tgravescs (Collaborator, Author)

So a workaround for this is to specify the RAPIDS jar in the extraClassPath variable; on YARN it has to be ./. The weird thing is I thought that jar would already be on the classpath on YARN, so I'm not sure whether something changed on the cluster or whether I'm thinking about it wrong.
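A sketch of what that workaround looks like as a spark-submit invocation. The jar name and path are hypothetical (use the actual rapids-4-spark build); the key point from the comment above is that on YARN the executor classpath entry must be relative to the container working directory, hence `./`:

```shell
# Hypothetical jar location; adjust to the actual rapids-4-spark build.
# --jars localizes the jar into each YARN container's working directory,
# so extraClassPath can (and on YARN must) reference it relatively.
spark-submit \
  --master yarn \
  --jars /opt/rapids/rapids-4-spark_2.12.jar \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager \
  --conf spark.executor.extraClassPath=./rapids-4-spark_2.12.jar \
  ...
```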

@gerashegalov (Collaborator)

This is most likely caused by #5646.

And for Spark Standalone, the RapidsShuffleManager always requires extraClassPath, per #5796.

@zhanga5 (Contributor) commented Aug 30, 2022

Encountered the error below after including spark.executor.extraClassPath=xxx:

[2022-08-29T03:14:04.539Z] ../../src/main/python/hash_aggregate_test.py::test_hash_grpby_sum[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.hasNans': 'false', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', Long), ('b', Integer), ('c', Long)]][IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT] 22/08/29 03:14:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (rl-r7525-d32-u07.raplab.nvidia.com executor 2): java.lang.IllegalStateException: unread block data
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2900)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1701)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
[2022-08-29T03:14:04.539Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-08-29T03:14:04.539Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-08-29T03:14:04.539Z] 	at java.lang.Thread.run(Thread.java:748)

@abellina (Collaborator)

@zhanga5, the stack here seems to indicate that an executor failed to deserialize a task, so I don't think this is the first error. Can you provide a command we can use to reproduce it? Also, is there a bug filed for this finding?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 30, 2022
@abellina (Collaborator) commented Aug 31, 2022

This is a difference in 22.10 (from 22.08), and it started happening with this change: #6044. We now include extraClassPath only when PYSP_TEST_spark_shuffle_manager contains RapidsShuffleManager, but the scripts that run these tests are inconsistent about this variable. The scripts for the CDH cluster do not set it, so we do not set extraClassPath.

We should add extraClassPath, using a path relative to the container at least in the YARN case, and keep the absolute path for the other tests.
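The conditional described above might be sketched in the test scripts roughly like this. This is not the actual integration script: the function name, jar path, and master argument are hypothetical, while `PYSP_TEST_spark_shuffle_manager` is the variable named in the comment above.

```shell
#!/usr/bin/env bash
# Sketch (not the real integration script) of the described behavior:
# only emit an extraClassPath conf when the configured shuffle manager
# is a RapidsShuffleManager, and use a container-relative path on YARN.
build_classpath_conf() {
  local master="$1"       # e.g. "yarn" or "local[*]" (hypothetical input)
  local rapids_jar="$2"   # absolute path to the rapids-4-spark jar
  if [[ "${PYSP_TEST_spark_shuffle_manager:-}" == *RapidsShuffleManager* ]]; then
    if [[ "$master" == yarn* ]]; then
      # On YARN the jar is localized into the container working directory,
      # so the classpath entry must be relative ("./<jar>").
      echo "--conf spark.executor.extraClassPath=./${rapids_jar##*/}"
    else
      echo "--conf spark.executor.extraClassPath=${rapids_jar}"
    fi
  fi
}

PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager
build_classpath_conf yarn /opt/rapids/rapids-4-spark_2.12.jar
```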

@abellina (Collaborator)

I have made a change internally for our integration script. I think that's all that is needed here. I'll close once that change is confirmed to work.

@zhanga5 (Contributor) commented Sep 1, 2022

> I have made a change internally for our integration script. I think that's all that is needed here. I'll close once that change is confirmed to work.

It worked as expected in my quick testing. We can close this issue.
