This repository has been archived by the owner on Apr 19, 2023. It is now read-only.

Getting error when trying to run xgboost examples on Spark 3.1.2 #45

Open
rpnkj29 opened this issue Oct 25, 2021 · 11 comments

rpnkj29 commented Oct 25, 2021

I am getting an error when trying to run the NYC Taxi or Mortgage examples with the Spark 3.1.2 operator on Kubernetes. We submit our SparkApplication via kubectl and get the error below. I tried different versions of the spark-catalyst jar (3.0.0 and 3.1.2), but the error is the same.

Traceback (most recent call last):
File "/tmp/spark-a0673c21-9c04-4ba0-ae54-13b825af94e7/mortgage.py", line 78, in <module>
model = with_benchmark('Training', lambda: classifier.fit(train_data))
File "/tmp/spark-a0673c21-9c04-4ba0-ae54-13b825af94e7/mortgage.py", line 74, in with_benchmark
result = action()
File "/tmp/spark-a0673c21-9c04-4ba0-ae54-13b825af94e7/mortgage.py", line 78, in <lambda>
model = with_benchmark('Training', lambda: classifier.fit(train_data))
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 161, in fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 335, in _fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 332, in _fit_java
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o82.fit.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/TimeSub
at com.nvidia.spark.rapids.shims.spark300.Spark300Shims.getExprs(Spark300Shims.scala:251)
at com.nvidia.spark.rapids.shims.spark301.Spark301Shims.getExprs(Spark301Shims.scala:84)
at com.nvidia.spark.rapids.GpuOverrides$.<init>(GpuOverrides.scala:2544)
at com.nvidia.spark.rapids.GpuOverrides$.<clinit>(GpuOverrides.scala)
at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:477)
at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:47)
at com.nvidia.spark.rapids.ColumnarRdd.convert(ColumnarRdd.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuUtils$.toColumnarRdd(GpuUtils.scala:39)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpuInternal(GpuXGBoost.scala:240)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainDistributedOnGpu(GpuXGBoost.scala:186)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpu(GpuXGBoost.scala:91)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.fitOnGpu(GpuXGBoost.scala:52)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:170)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/TimeSub
... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.expressions.TimeSub
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 29 more
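
The last "Caused by:" entry in a JVM stack trace is usually the most specific failure. Here it is a ClassNotFoundException for a Catalyst class, reached through the plugin's Spark 3.0.x shims (Spark300Shims and Spark301Shims in the trace), which suggests the plugin jar was built against a different Spark version than the one running. As a minimal sketch, the hypothetical helper `root_cause` below (not part of any library) pulls that last cause out of a pasted trace:

```python
from typing import Optional


def root_cause(trace: str) -> Optional[str]:
    """Return the text of the last 'Caused by:' line in a JVM trace, or None."""
    cause = None
    for line in trace.splitlines():
        stripped = line.strip()
        if stripped.startswith("Caused by:"):
            # Later causes are nested deeper, so keep overwriting.
            cause = stripped[len("Caused by:"):].strip()
    return cause


# Abbreviated copy of the trace above.
sample = """\
py4j.protocol.Py4JJavaError: An error occurred while calling o82.fit.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/TimeSub
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/TimeSub
... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.expressions.TimeSub
"""

print(root_cause(sample))
# java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.expressions.TimeSub
```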

Collaborator

wbo4958 commented Oct 27, 2021

Hi @rpnkj29, which RAPIDS plugin version are you using? It looks like an old one. Could you try the latest?

Author

rpnkj29 commented Oct 27, 2021

Hi @wbo4958, thanks for replying.

I am using the following jar versions:
cudf-0.19.2-cuda11.jar
rapids-4-spark_2.12-0.5.0.jar
xgboost4j_3.0-1.4.2-0.1.0.jar
spark-catalyst_2.12-3.0.0.jar
xgboost4j-spark_3.0-1.4.2-0.1.0.jar

Earlier I tried the latest versions listed below, but they did not work. According to this xgboost example notebook, the latest versions are not supported (https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/examples/notebooks/python/taxi-gpu.ipynb):
cudf-21.08.2-cuda11.jar
rapids-4-spark_2.12-21.08.0.jar
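
One quick check on a jar list like the first one above is whether any extra spark-catalyst jar matches the Spark version actually running. The following is only a sketch under assumptions: `mismatched_catalyst_jars` is a hypothetical helper, the file-name pattern is inferred from the jar names in this thread, and in a real job the running version would come from `spark.version` on the SparkSession rather than a hard-coded string.

```python
import re
from typing import List

# Assumed naming scheme: spark-catalyst_<scala version>-<spark version>.jar
_CATALYST = re.compile(
    r"^spark-catalyst_(?P<scala>\d+\.\d+)-(?P<spark>\d+(?:\.\d+)+)\.jar$"
)


def mismatched_catalyst_jars(jar_names: List[str], spark_version: str) -> List[str]:
    """Return spark-catalyst jars whose version differs from spark_version."""
    mismatched = []
    for name in jar_names:
        match = _CATALYST.match(name)
        if match and match.group("spark") != spark_version:
            mismatched.append(name)
    return mismatched


# Jar names copied from the comment above; "3.1.2" is the Spark version in the issue title.
jars = [
    "cudf-0.19.2-cuda11.jar",
    "rapids-4-spark_2.12-0.5.0.jar",
    "xgboost4j_3.0-1.4.2-0.1.0.jar",
    "spark-catalyst_2.12-3.0.0.jar",
    "xgboost4j-spark_3.0-1.4.2-0.1.0.jar",
]

print(mismatched_catalyst_jars(jars, "3.1.2"))
# ['spark-catalyst_2.12-3.0.0.jar']
```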

Collaborator

wbo4958 commented Oct 27, 2021

Hi @rpnkj29, xgboost 1.4.2-0.1.0 should support cudf/rapids 21.08. Please try these jars.

Author

rpnkj29 commented Oct 27, 2021

Hi @wbo4958, I tried the newer jars as well. They give the error below for the example notebook (https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/examples/notebooks/python/taxi-gpu.ipynb) at the "Train the Data with Benchmark" stage:

21/10/20 12:23:05 INFO SparkContext: Successfully stopped SparkContext
Traceback (most recent call last):
File "/tmp/spark-f7af1c7b-a70b-4290-841b-6c52532f972d/nyc-xgboost.py", line 58, in <module>
model = with_benchmark('Training', lambda: regressor.fit(train_data))
File "/tmp/spark-f7af1c7b-a70b-4290-841b-6c52532f972d/nyc-xgboost.py", line 54, in with_benchmark
result = action()
File "/tmp/spark-f7af1c7b-a70b-4290-841b-6c52532f972d/nyc-xgboost.py", line 58, in <lambda>
model = with_benchmark('Training', lambda: regressor.fit(train_data))
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 161, in fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 335, in _fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 332, in _fit_java
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:753)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainDistributedOnGpu(GpuXGBoost.scala:198)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.trainOnGpu(GpuXGBoost.scala:150)
at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.fitOnGpu(GpuXGBoost.scala:111)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.fit(XGBoostRegressor.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

21/10/20 12:23:05 INFO ShutdownHookManager: Shutdown hook called

Collaborator

wbo4958 commented Oct 27, 2021

Hi @rpnkj29, could you provide more of the executor and driver logs?

Author

rpnkj29 commented Oct 27, 2021

xgboost-driverlogs.txt

Hi @wbo4958, I have attached the full log from the driver pod from when it went into the error state. Please let me know if you need more information. Thanks.

Collaborator

wbo4958 commented Oct 27, 2021

Thanks @rpnkj29. It seems you were running the xgboost sample in cluster mode (standalone?). The driver log looks fine, so the error is most likely on the executor side. Could you provide the executor log?
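
On Kubernetes, executor logs can usually be pulled with `kubectl logs` while the executor pods still exist. The sketch below only builds and prints the command. It assumes the executor pods carry the `spark-role=executor` label that Spark on Kubernetes normally sets; the helper name `executor_log_command` and the namespace `spark-jobs` are hypothetical placeholders.

```python
import shlex
from typing import List


def executor_log_command(namespace: str, tail: int = 500) -> List[str]:
    """Build a kubectl command that prints recent logs from Spark executor pods."""
    return [
        "kubectl", "logs",
        "--namespace", namespace,
        # Assumption: executors are labeled spark-role=executor.
        "--selector", "spark-role=executor",
        "--tail", str(tail),
        # Prefix each line with its pod name so executors can be told apart.
        "--prefix",
    ]


# "spark-jobs" is a placeholder namespace.
cmd = executor_log_command("spark-jobs")
print(shlex.join(cmd))
# kubectl logs --namespace spark-jobs --selector spark-role=executor --tail 500 --prefix
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where kubectl is configured for the cluster.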

wbo4958 self-assigned this Oct 28, 2021

Collaborator

wbo4958 commented Nov 1, 2021

@rpnkj29 Hi there, any update?

Author

rpnkj29 commented Nov 1, 2021

executor-logs.txt

Hi @wbo4958, apologies for the delay. The executor logs for this error are attached.

Author

rpnkj29 commented Nov 5, 2021

Hi @wbo4958, were you able to check the executor pod logs?

Author

rpnkj29 commented Nov 16, 2021

Hi @wbo4958, any update?
