
[BUG] CDH integration tests ClassNotFoundException: com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager #6417

Closed
Tracked by #6444
tgravescs opened this issue Aug 25, 2022 · 7 comments
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@tgravescs (Collaborator)

Describe the bug
The Cloudera integration tests failed with:

09:03:43  22/08/25 14:03:43 INFO client.TransportClientFactory: Successfully created connection to rl-r7525-d32-u35.raplab.nvidia.com/10.150.166.217:33094 after 4 ms (0 ms spent in bootstraps)
09:03:43  Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
09:03:43  	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1916)
09:03:43  	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
09:03:43  	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
09:03:43  	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
09:03:43  	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
09:03:43  Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager

We need to investigate what is going on.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify P0 Must have for release labels Aug 25, 2022
@tgravescs (Collaborator, Author)

So a workaround for this is to specify the RAPIDS jar in the extraClassPath variable; on YARN it has to be ./. The weird thing is I thought that jar would already be on the classpath on YARN, so I'm not sure whether something changed on the cluster or whether I'm thinking about it wrong.
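A sketch of what that workaround looks like as a spark-submit invocation. The jar name and path are hypothetical (use the actual rapids-4-spark build); the key point from the comment above is that on YARN the executor classpath entry must be relative to the container working directory, hence `./`:

```shell
# Hypothetical jar location; adjust to the actual rapids-4-spark build.
# --jars localizes the jar into each YARN container's working directory,
# so extraClassPath can (and on YARN must) reference it relatively.
spark-submit \
  --master yarn \
  --jars /opt/rapids/rapids-4-spark_2.12.jar \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager \
  --conf spark.executor.extraClassPath=./rapids-4-spark_2.12.jar \
  ...
```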

@gerashegalov (Collaborator)

This is most likely caused by #5646.

And for Spark Standalone, the RapidsShuffleManager always requires extraClassPath, per #5796.

@zhanga5 (Contributor) commented Aug 30, 2022

Encountered the error below after including spark.executor.extraClassPath=xxx:

[2022-08-29T03:14:04.539Z] ../../src/main/python/hash_aggregate_test.py::test_hash_grpby_sum[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.hasNans': 'false', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', Long), ('b', Integer), ('c', Long)]][IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT] 22/08/29 03:14:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (rl-r7525-d32-u07.raplab.nvidia.com executor 2): java.lang.IllegalStateException: unread block data
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2900)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1701)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
[2022-08-29T03:14:04.539Z] 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
[2022-08-29T03:14:04.539Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
[2022-08-29T03:14:04.539Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2022-08-29T03:14:04.539Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2022-08-29T03:14:04.539Z] 	at java.lang.Thread.run(Thread.java:748)

@abellina (Collaborator)

@zhanga5, the stack here seems to indicate that an executor failed to deserialize a task, so I don't think this is the first error. Can you provide a command we can use to reproduce it? Also, is there a bug filed for this finding?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 30, 2022
@abellina (Collaborator) commented Aug 31, 2022

This is a difference in 22.10 (from 22.08), and it started happening with this change: #6044. We now include extraClassPath only when PYSP_TEST_spark_shuffle_manager contains RapidsShuffleManager, but the scripts that run these tests are inconsistent about this variable. The scripts for the CDH cluster do not set it, so we do not set extraClassPath.

We should add extraClassPath, using a path relative to the container at least in the YARN case, and keep the absolute path for the other tests.
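The conditional described above might be sketched in the test scripts roughly like this. This is not the actual integration script: the function name, jar path, and master argument are hypothetical, while `PYSP_TEST_spark_shuffle_manager` is the variable named in the comment above.

```shell
#!/usr/bin/env bash
# Sketch (not the real integration script) of the described behavior:
# only emit an extraClassPath conf when the configured shuffle manager
# is a RapidsShuffleManager, and use a container-relative path on YARN.
build_classpath_conf() {
  local master="$1"       # e.g. "yarn" or "local[*]" (hypothetical input)
  local rapids_jar="$2"   # absolute path to the rapids-4-spark jar
  if [[ "${PYSP_TEST_spark_shuffle_manager:-}" == *RapidsShuffleManager* ]]; then
    if [[ "$master" == yarn* ]]; then
      # On YARN the jar is localized into the container working directory,
      # so the classpath entry must be relative ("./<jar>").
      echo "--conf spark.executor.extraClassPath=./${rapids_jar##*/}"
    else
      echo "--conf spark.executor.extraClassPath=${rapids_jar}"
    fi
  fi
}

PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager
build_classpath_conf yarn /opt/rapids/rapids-4-spark_2.12.jar
```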

@abellina (Collaborator)

I have made a change internally for our integration script. I think that's all that is needed here. I'll close once that change is confirmed to work.

@zhanga5 (Contributor) commented Sep 1, 2022

> I have made a change internally for our integration script. I think that's all that is needed here. I'll close once that change is confirmed to work.

It worked as expected in my quick testing. We can close this issue.
