
[JVM-packages] XGBoostModel training failed with Rabit returns with exit code 1 #8003

Closed
Jasonzjj opened this issue Jun 17, 2022 · 14 comments

@Jasonzjj

I encountered the following problem when using xgboost4j-spark 1.6.1, Spark 3.2.1, and Scala 2.12.

Can someone help me out?

Spark parameters:

```
--conf spark.sql.shuffle.partitions=200
--conf spark.executor.instances=8
--conf spark.driver.memory=4g
```

XGBoost parameters:

```
"num_workers" -> 8,
"nthread" -> 1
```
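As a side note, distributed XGBoost needs all `num_workers` tasks running at the same time, so a common first check with this kind of failure is whether the Spark core budget can actually host them. A minimal sketch of that sanity check, assuming `spark.executor.cores=1` (not shown in the submit above) and using the parameter values from this report:

```shell
# Hedged sketch: verify the executors can schedule num_workers * nthread
# concurrent cores. executor_cores=1 is an assumption; adjust to your config.
instances=8        # spark.executor.instances
executor_cores=1   # spark.executor.cores (assumed)
num_workers=8
nthread=1

if [ $((instances * executor_cores)) -lt $((num_workers * nthread)) ]; then
  echo "not enough executor cores for num_workers=$num_workers, nthread=$nthread" >&2
  exit 1
fi
echo "core budget ok: $((instances * executor_cores)) cores available, $((num_workers * nthread)) needed"
```

With the values above the budget is exactly sufficient, so a single lost or preempted executor would already starve the training job.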

error:
2022-06-17 06:10:10 [task-result-getter-0] INFO [YarnClusterScheduler:57]: Removed TaskSet 2.0, whose tasks have all completed, from pool
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [DAGScheduler:57]: ResultStage 2 (collect at XGBoost.scala:431) finished in 2871.246 s
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [DAGScheduler:57]: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
2022-06-17 06:10:10 [dag-scheduler-event-loop] INFO [YarnClusterScheduler:57]: Killing all running tasks in stage 2: Stage finished
2022-06-17 06:10:10 [Driver] INFO [DAGScheduler:57]: Job 1 finished: collect at XGBoost.scala:431, took 3022.842636 s
2022-06-17 06:10:10 [Driver] INFO [RabitTracker:213]: Tracker Process ends with exit code 1
2022-06-17 06:10:10 [Driver] INFO [XGBoostSpark:433]: Rabit returns with exit code 1
2022-06-17 06:10:11 [Driver] ERROR [XGBoostSpark:455]: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)
2022-06-17 06:10:11 [Driver] ERROR [ApplicationMaster:94]: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)
2022-06-17 06:10:11 [Driver] INFO [ApplicationMaster:57]: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at SparkPi$.main(SparkPi.scala:52)
at SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:737)

@trivialfis
Member

Hi, could you please post the log from executors?

@Jasonzjj
Author

Training failed in local mode as well.

Output log:
[15:50:46] [690] train-rmse:0.00520688644065732
[15:50:46] [691] train-rmse:0.00520692697203341
[15:50:46] [692] train-rmse:0.00520696681770753
[15:50:46] [693] train-rmse:0.00520700622236835
[15:50:46] [694] train-rmse:0.00520704498998811
[15:50:46] [695] train-rmse:0.00520708323702370
[15:50:47] [696] train-rmse:0.00520712108582036
[15:50:47] [697] train-rmse:0.00520715829172192
[15:50:47] [698] train-rmse:0.00520719506255524
[15:50:47] [699] train-rmse:0.00520723134969795

22/06/17 15:50:47 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(example:44)
at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(example:61)
at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw.(example:63)
at $line14.$read$$iw$$iw$$iw$$iw$$iw.(example:65)
at $line14.$read$$iw$$iw$$iw$$iw.(example:67)
at $line14.$read$$iw$$iw$$iw.(example:69)
at $line14.$read$$iw$$iw.(example:71)
at $line14.$read$$iw.(example:73)
at $line14.$read.(example:75)
at $line14.$read$.(example:79)
at $line14.$read$.(example)
at $line14.$eval$.$print$lzycompute(example:7)
at $line14.$eval$.$print(example:6)
at $line14.$eval.$print(example)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
at scala.tools.nsc.interpreter.ILoop.$anonfun$pasteCommand$11(ILoop.scala:795)
at scala.tools.nsc.interpreter.IMain.withLabel(IMain.scala:111)
at scala.tools.nsc.interpreter.ILoop.interpretCode$1(ILoop.scala:795)
at scala.tools.nsc.interpreter.ILoop.pasteCommand(ILoop.scala:801)
at org.apache.spark.repl.SparkILoop.$anonfun$process$8(SparkILoop.scala:177)
at org.apache.spark.repl.SparkILoop.$anonfun$process$8$adapted(SparkILoop.scala:176)
at scala.collection.immutable.List.foreach(List.scala:431)
at org.apache.spark.repl.SparkILoop.loadInitFiles$1(SparkILoop.scala:176)
at org.apache.spark.repl.SparkILoop.$anonfun$process$4(SparkILoop.scala:166)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.tools.nsc.interpreter.ILoop.$anonfun$mumly$1(ILoop.scala:166)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206)
at scala.tools.nsc.interpreter.ILoop.mumly(ILoop.scala:163)
at org.apache.spark.repl.SparkILoop.loopPostInit$1(SparkILoop.scala:153)
at org.apache.spark.repl.SparkILoop.$anonfun$process$10(SparkILoop.scala:221)
at org.apache.spark.repl.SparkILoop.withSuppressedSettings$1(SparkILoop.scala:189)
at org.apache.spark.repl.SparkILoop.startup$1(SparkILoop.scala:201)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:236)
at org.apache.spark.repl.Main$.doMain(Main.scala:78)
at org.apache.spark.repl.Main$.main(Main.scala:58)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190)
at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
... 60 elided

@Jasonzjj
Author

Executor log:
2022-06-17 14:04:32 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Got assigned task 92
2022-06-17 14:04:32 [Executor task launch worker for task 60.0 in stage 1.0 (TID 92)] INFO [Executor:57]: Running task 60.0 in stage 1.0 (TID 92)
2022-06-17 14:04:33 [Executor task launch worker for task 60.0 in stage 1.0 (TID 92)] INFO [FileScanRDD:57]: Reading File path: /train, range: 3976977000-4043259950, partition values: [empty row]
2022-06-17 14:04:53 [Executor task launch worker for task 60.0 in stage 1.0 (TID 92)] INFO [Executor:57]: Finished task 60.0 in stage 1.0 (TID 92). 1759 bytes result sent to driver
2022-06-17 14:04:54 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Got assigned task 102
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [Executor:57]: Running task 6.0 in stage 2.0 (TID 102)
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MapOutputTrackerWorker:57]: Updating epoch to 1 and clearing cache
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TorrentBroadcast:57]: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TransportClientFactory:310]: Successfully created connection to after 1 ms (0 ms spent in bootstraps)
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MemoryStore:57]: Block broadcast_4_piece0 stored as bytes in memory (estimated size 5.2 KiB, free 2.8 GiB)
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TorrentBroadcast:57]: Reading broadcast variable 4 took 21 ms
2022-06-17 14:04:54 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MemoryStore:57]: Block broadcast_4 stored as values in memory (estimated size 9.4 KiB, free 2.8 GiB)
2022-06-17 14:04:55 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MemoryStore:57]: Block rdd_20_6 stored as values in memory (estimated size 4.8 MiB, free 2.8 GiB)
2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MapOutputTrackerWorker:57]: Don't have map outputs for shuffle 0, fetching them
2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MapOutputTrackerWorker:57]: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@) 2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [MapOutputTrackerWorker:57]: Got the map output locations
2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [ShuffleBlockFetcherIterator:57]: Getting 64 (255.7 MiB) non-empty blocks including 8 (32.0 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 56 (223.7 MiB) remote blocks
2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TransportClientFactory:310]: Successfully created connection to after 11 ms (0 ms spent in bootstraps)
2022-06-17 14:04:56 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [ShuffleBlockFetcherIterator:57]: Started 5 remote fetches in 117 ms
2022-06-17 14:05:14 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TransportClientFactory:310]: Successfully created connection to after 3 ms (0 ms spent in bootstraps)
2022-06-17 14:05:26 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [TransportClientFactory:310]: Successfully created connection to after 2 ms (0 ms spent in bootstraps)
2022-06-17 14:56:12 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [Executor:57]: 1 block locks were not released by task 6.0 in stage 2.0 (TID 102)
[rdd_20_6]
2022-06-17 14:56:12 [Executor task launch worker for task 6.0 in stage 2.0 (TID 102)] INFO [Executor:57]: Finished task 6.0 in stage 2.0 (TID 102). 1177 bytes result sent to driver
2022-06-17 15:08:23 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver commanded a shutdown
2022-06-17 15:08:24 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown
2022-06-17 15:08:24 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown
2022-06-17 15:08:24 [CoarseGrainedExecutorBackend-stop-executor] INFO [MemoryStore:57]: MemoryStore cleared
2022-06-17 15:08:24 [CoarseGrainedExecutorBackend-stop-executor] INFO [BlockManager:57]: BlockManager stopped
2022-06-17 15:08:24 [CoarseGrainedExecutorBackend-stop-executor] INFO [TSDBReporter:222]: stop sending
2022-06-17 15:08:24 [Thread-2] INFO [ShutdownHookManager:57]: Shutdown hook called

@Jasonzjj
Author

Jasonzjj commented Jun 17, 2022

Simplified code:

```scala
val xgbReg = new XGBoostRegressor(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

val xgbInput = spark.read.format("libsvm").load("adata")
val xgbModel = xgbReg.fit(xgbInput)

val results = xgbModel.transform(xgbInput)

xgbModel.nativeBooster.saveModel(fs_out)
```

@trivialfis trivialfis changed the title XGBoostModel training failed with Rabit returns with exit code 1 [JVM-packages] XGBoostModel training failed with Rabit returns with exit code 1 Jun 17, 2022
@Jasonzjj
Author

@trivialfis could you please help me out with this problem?

@trivialfis
Member

@wbo4958 Could you please take a look when you are available?

@wbo4958
Contributor

wbo4958 commented Jun 27, 2022

Sorry for the late response, @Jasonzjj. It looks like the driver sent the shutdown, judging from the executor log:

 Driver commanded a shutdown
2022-06-17 15:08:24 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown

Could you please double-check?

If that executor log is the correct one, please attach the executor log containing the exceptions.

@Jasonzjj
Author

@wbo4958 Thanks for the response. I found the executor log with exceptions like this:

2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [Executor:57]: Running task 4.0 in stage 2.0 (TID 90)
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MapOutputTrackerWorker:57]: Updating epoch to 1 and clearing cache
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [TorrentBroadcast:57]: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MemoryStore:57]: Block broadcast_4_piece0 stored as bytes in memory (estimated size 5.2 KiB, free 2.8 GiB)
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [TorrentBroadcast:57]: Reading broadcast variable 4 took 39 ms
2022-06-28 15:21:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MemoryStore:57]: Block broadcast_4 stored as values in memory (estimated size 9.4 KiB, free 2.8 GiB)
2022-06-28 15:21:03 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MemoryStore:57]: Block rdd_20_4 stored as values in memory (estimated size 4.3 MiB, free 2.8 GiB)
2022-06-28 15:21:03 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MapOutputTrackerWorker:57]: Don't have map outputs for shuffle 0, fetching them
2022-06-28 15:21:03 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MapOutputTrackerWorker:57]: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@)
2022-06-28 15:21:03 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [MapOutputTrackerWorker:57]: Got the map output locations
2022-06-28 15:21:03 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [ShuffleBlockFetcherIterator:57]: Getting 56 (246.5 MiB) non-empty blocks including 7 (30.8 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 49 (215.7 MiB) remote blocks
2022-06-28 15:21:04 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [ShuffleBlockFetcherIterator:57]: Started 3 remote fetches in 197 ms
2022-06-28 16:14:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [Executor:57]: 1 block locks were not released by task 4.0 in stage 2.0 (TID 90)
[rdd_20_4]
2022-06-28 16:14:01 [Executor task launch worker for task 4.0 in stage 2.0 (TID 90)] INFO [Executor:57]: Finished task 4.0 in stage 2.0 (TID 90). 1177 bytes result sent to driver
2022-06-28 16:19:46 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver commanded a shutdown
2022-06-28 16:19:47 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown
2022-06-28 16:19:47 [dispatcher-Executor] INFO [YarnCoarseGrainedExecutorBackend:57]: Driver from disconnected during shutdown
2022-06-28 16:19:53 [netty-rpc-connection-1] INFO [TransportClientFactory:206]: Found inactive connection to , creating a new one.
2022-06-28 16:19:53 [executor-heartbeater] WARN [Executor:90]: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1037)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2048)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to xxx.com
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx.com
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
2022-06-28 16:19:53 [CoarseGrainedExecutorBackend-stop-executor] WARN [TSDBSender:129]: encounter exception in TSDBSender:
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:706)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at inf.TSDBSender$1.run(TSDBSender.java:89)
at inf.TSDBSender$1.run(TSDBSender.java:80)
at inf.TSDBSender$TSDBSenderAction.runWithCheck(TSDBSender.java:117)
at inf.TSDBSender$TSDBSenderAction.runWithRetries(TSDBSender.java:125)
at inf.TSDBSender.send(TSDBSender.java:98)
at inf.TSDBReporter.sendToActualDataView(TSDBReporter.java:149)
at inf.TSDBReporter.report(TSDBReporter.java:78)
at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:253)
at org.apache.spark.metrics.sink.TSDBSink.report(TSDBSink.scala:134)
at org.apache.spark.metrics.MetricsSystem.$anonfun$report$1(MetricsSystem.scala:118)
at org.apache.spark.metrics.MetricsSystem.$anonfun$report$1$adapted(MetricsSystem.scala:118)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.metrics.MetricsSystem.report(MetricsSystem.scala:118)
at org.apache.spark.executor.Executor.stop(Executor.scala:320)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1$$anon$1.run(CoarseGrainedExecutorBackend.scala:213)
2022-06-28 16:19:53 [CoarseGrainedExecutorBackend-stop-executor] INFO [TSDBSender:131]: Retrying operation in TSDBSender. Retry no.1
2022-06-28 16:19:57 [CoarseGrainedExecutorBackend-stop-executor] INFO [MemoryStore:57]: MemoryStore cleared
2022-06-28 16:19:57 [CoarseGrainedExecutorBackend-stop-executor] INFO [BlockManager:57]: BlockManager stopped
2022-06-28 16:20:01 [CoarseGrainedExecutorBackend-stop-executor] INFO [TSDBReporter:222]: stop sending
2022-06-28 16:20:01 [Thread-2] INFO [ShutdownHookManager:57]: Shutdown hook called

@jbcsimple

jbcsimple commented Dec 1, 2022

In the end, did you solve the problem?

@Wanke15

Wanke15 commented Dec 12, 2022

Got the same error running on EMR, but it works fine locally on a Mac.

@jxudata

jxudata commented Feb 12, 2023

Same error.

@diggzhang

diggzhang commented Feb 16, 2023

Same error:

```
sparkVersion = '3.2.1'

implementation('ml.dmlc:xgboost4j_2.12:1.6.2')
implementation('ml.dmlc:xgboost4j-spark_2.12:1.6.2')
implementation 'org.scala-lang:scala-library:2.12.15'
```

ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:435)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:196)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:35)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
        at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
        at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
        at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
        at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
        at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
        at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
        at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
        at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
        at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
        at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:93)
        at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
        at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$7(CrossValidator.scala:174)
        at scala.runtime.java8.JFunction0$mcD$sp.apply(JFunction0$mcD$sp.java:23)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
        at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
        at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete(Promise.scala:372)
        at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete$(Promise.scala:371)
        at scala.concurrent.impl.Promise$KeptPromise$Successful.onComplete(Promise.scala:379)
        at scala.concurrent.impl.Promise.transform(Promise.scala:33)
        at scala.concurrent.impl.Promise.transform$(Promise.scala:31)
        at scala.concurrent.impl.Promise$KeptPromise$Successful.transform(Promise.scala:379)
        at scala.concurrent.Future.map(Future.scala:292)
        at scala.concurrent.Future.map$(Future.scala:292)
        at scala.concurrent.impl.Promise$KeptPromise$Successful.map(Promise.scala:379)
        at scala.concurrent.Future$.apply(Future.scala:659)
        at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$6(CrossValidator.scala:182)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$4(CrossValidator.scala:172)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$1(CrossValidator.scala:166)
        at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
        at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:137)

@wbo4958
Contributor

wbo4958 commented Feb 22, 2023

Could you help check the Python tracker log? The Python tracker requires Python 3.8+.
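A quick way to rule out the version requirement is to check which Python the driver node would hand to the tracker. A minimal sketch (assuming the tracker picks up `python3` from the driver's PATH, which may differ per cluster):

```shell
# Hedged sketch: verify the Python on the driver node meets the 3.8+ requirement
# mentioned above. Exits non-zero if the interpreter is too old.
python3 - <<'EOF'
import sys
ok = sys.version_info >= (3, 8)
print("python", ".".join(map(str, sys.version_info[:3])), "ok" if ok else "TOO OLD")
sys.exit(0 if ok else 1)
EOF
```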

@trivialfis
Member

Closing, as the tracker has been rewritten in #10112.
