[JVM-packages] XGBoostModel training failed with Rabit returns with exit code 1 #8003
Comments
Hi, could you please post the log from the executors?
Training failed in local mode as well. Output log:
[15:50:46] [691] train-rmse:0.00520692697203341
[15:50:46] [692] train-rmse:0.00520696681770753
[15:50:46] [693] train-rmse:0.00520700622236835
[15:50:46] [694] train-rmse:0.00520704498998811
[15:50:46] [695] train-rmse:0.00520708323702370
[15:50:47] [696] train-rmse:0.00520712108582036
[15:50:47] [697] train-rmse:0.00520715829172192
[15:50:47] [698] train-rmse:0.00520719506255524
[15:50:47] [699] train-rmse:0.00520723134969795
22/06/17 15:50:47 ERROR XGBoostSpark: the job was aborted due to
Executor log:
Simplified code:
@trivialfis could you please help me out with this problem?
@wbo4958 Could you please take a look when you are available?
Sorry for the late response, @Jasonzjj. Judging from the executor log, it looks like the driver sent the SHUTDOWN:
Driver commanded a shutdown
2022-06-17 15:08:24 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown
Could you please double-check? If that is the correct executor log, please post the part of it that contains the exceptions.
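When hunting for the part of an executor or driver log that actually contains exceptions, a small filter can help. The following is only a sketch: it assumes every log record starts with a header shaped like the lines quoted in this thread (`timestamp [thread] LEVEL [logger:line]: message`) and that stack frames follow as continuation lines. The names `LOG_LINE` and `error_blocks` are made up for this illustration and are not part of Spark or XGBoost.

```python
import re

# Header of one log record, e.g.
# "2022-06-17 06:10:11 [Driver] ERROR [XGBoostSpark:455]: the job was aborted due to"
# Assumption: all records in the file follow this layout.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<thread>.*?)\] +(?P<level>[A-Z]+) +"
    r"\[(?P<logger>[^\]]+)\]: (?P<msg>.*)$"
)


def error_blocks(lines, levels=("ERROR", "WARN")):
    """Return a list of records whose level is in `levels`.

    Each record is a list of strings: the header line followed by any
    continuation lines (exception message, stack frames) up to the next header.
    """
    blocks = []
    current = None  # record being collected, or None while skipping
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            # A new header closes the previous record, if we were keeping one.
            if current is not None:
                blocks.append(current)
            current = [line] if match.group("level") in levels else None
        elif current is not None:
            # Not a header: treat it as a continuation of the kept record.
            current.append(line)
    if current is not None:
        blocks.append(current)
    return blocks
```

A possible use is `for block in error_blocks(open("executor.log")): print("\n".join(block))`, where `executor.log` stands for whatever file the YARN log aggregation produced.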
@wbo4958 Thanks for the response. I found the executor log with exceptions; it looks like this:
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [Executor:57]: Running task 4.0 in stage 2.0 (TID 90)
Did you solve the problem in the end?
I got the same error when running on EMR, but it runs fine locally on a Mac.
Same error.
Same error:
Could you help check the Python tracker log? The Python tracker requires Python 3.8+.
Closing, as the tracker has been rewritten: #10112
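A quick way to test the Python version requirement mentioned above is to run a check with the interpreter on the driver node. This is a minimal sketch; it assumes the tracker is started by whichever `python` the driver process resolves, which may not hold for every deployment. The helper name `tracker_python_ok` and the constant `MIN_TRACKER_PYTHON` are made up for this illustration.

```python
import sys

# Minimum version stated in this thread for the Python tracker.
MIN_TRACKER_PYTHON = (3, 8)


def tracker_python_ok(version_info=sys.version_info, minimum=MIN_TRACKER_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    # Run this with the same interpreter the driver would pick up.
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if tracker_python_ok() else "too old for the tracker"
    print(f"python {found}: {status}")
```

If the check reports an old interpreter, the `Tracker Process ends with exit code 1` line in the driver log would be consistent with the tracker failing to start, though other causes are possible.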
I encountered the following problem when using xgboost4j-spark 1.6.1, Spark 3.2.1, and Scala 2.12.
Can someone help me out?
spark parameters:
--conf spark.sql.shuffle.partitions=200
--conf spark.executor.instances=8
--conf spark.driver.memory=4g \
xgboost parameters:
"num_workers"->8,
"nthread"->1
error:
2022-06-17 06:10:10 [task-result-getter-0] INFO [YarnClusterScheduler:57]: Removed TaskSet 2.0, whose tasks have all completed, from pool
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [DAGScheduler:57]: ResultStage 2 (collect at XGBoost.scala:431) finished in 2871.246 s
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [DAGScheduler:57]: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [YarnClusterScheduler:57]: Killing all running tasks in stage 2: Stage finished
2022-06-17 06:10:10 [Driver] INFO [DAGScheduler:57]: Job 1 finished: collect at XGBoost.scala:431, took 3022.842636 s
2022-06-17 06:10:10 [Driver] INFO [RabitTracker:213]: Tracker Process ends with exit code 1
2022-06-17 06:10:10 [Driver] INFO [XGBoostSpark:433]: Rabit returns with exit code 1
2022-06-17 06:10:11 [Driver] ERROR [XGBoostSpark:455]: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)
2022-06-17 06:10:11 [Driver] ERROR [ApplicationMaster:94]: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)
2022-06-17 06:10:11 [Driver] INFO [ApplicationMaster:57]: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)