
[jvm-packages] XGBoost-Spark reported "ERROR RabitTracker ......" #3834

Closed
FredYao opened this issue Oct 26, 2018 · 10 comments

FredYao commented Oct 26, 2018

I was running Spark in local mode, following the example in the official tutorial (XGBoost4J-Spark Tutorial). When I called val xgbClassificationModel = xgbClassifier.fit(xgbInput), the following error occurred:

scala> val xgbClassificationModel = xgbClassifier.fit(xgbInput)
Tracker started, with env={}
2018-10-26 14:08:36 ERROR RabitTracker:91 - Uncaught exception thrown by worker:
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:233)
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[0] train-merror:0.026667
[0] train-merror:0.026667

.......
.......
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[14:08:36] /xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[99] train-merror:0.000000
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:283)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:240)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:222)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:221)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
... 49 elided

Does anyone have a clue about this issue?
Thanks in advance.

hcho3 changed the title from XGBoost-Spark reported "ERROR RabitTracker ......" to [jvm-packages] XGBoost-Spark reported "ERROR RabitTracker ......" on Oct 26, 2018

CodingCat commented Oct 27, 2018

Tracker started, with env={} means the tracker could not bind to your local IP address. Can you check whether your hosts file contains an entry for 127.0.0.1 or your computer's hostname?
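For reference, a hosts file that satisfies this check could look like the following, where my-machine stands in for your actual hostname:

127.0.0.1   localhost
127.0.0.1   my-machine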


FredYao commented Oct 27, 2018

@CodingCat, I am running on a YARN cluster. Everything about the cluster is at its default settings; I did not change anything explicitly.

@CodingCat

Yes, you need to contact your cluster admin to check whether the hosts files are written correctly.


FredYao commented Oct 28, 2018

Thank you, @CodingCat. Is there another way to bind the IP address to the tracker manually?

@CodingCat

See the discussion in #3833.

You can add xgboost-tracker.properties to your resources directory.


FredYao commented Oct 29, 2018

@CodingCat, I have checked the hosts file; its content is:
127.0.0.1 localhost.localdomain localhost
Can TrackerProperties read this correctly?

@henokyen
Has anyone got a solution for this? I am struggling with the same error.

@CodingCat

@FredYao @henokyen, add xgboost-tracker.properties with the content host-ip=0.0.0.0 to the resources folder of your jar.
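For illustration, assuming a standard Maven/sbt project layout, the file would live at src/main/resources/xgboost-tracker.properties so that it ends up on the classpath of the packaged jar, and would contain the single line:

host-ip=0.0.0.0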


FredYao commented Nov 18, 2018

@henokyen, the first thing is to check /etc/hosts on the job driver/workers and make sure it maps 127.0.0.1 to localhost.

If localhost is correct, then check the Python version you are using. If it is lower than Python 2.7, the error is expected, because the Rabit tracker requires Python 2.7+. In that case, you can switch the tracker to the Scala version when setting the parameters for XGBoostClassifier:
val xgbParam = Map("eta" -> 0.1f, "max_depth" -> 2, ..., "tracker_conf" -> TrackerConf(0L, "scala"))
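For context, here is a fuller sketch of such a parameter map, assuming the xgboost4j-spark 0.8x API where TrackerConf(workerConnectionTimeout, trackerImpl) is defined in ml.dmlc.xgboost4j.scala.spark; the objective, class count, and column names below are placeholders, not values from this thread:

import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

val xgbParam = Map(
  "eta" -> 0.1f,
  "max_depth" -> 2,
  "objective" -> "multi:softprob",   // placeholder objective
  "num_class" -> 3,                  // placeholder class count
  "num_round" -> 100,
  // 0L disables the worker-connection timeout; "scala" selects the Scala tracker
  "tracker_conf" -> TrackerConf(0L, "scala")
)
val xgbClassifier = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")        // placeholder column names
  .setLabelCol("classIndex")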

@henokyen
This is the content of my /etc/hosts:

# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost

So I do have 127.0.0.1 for localhost.

I am on a MacBook Pro. I am trying to follow this tutorial:
https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb
but I keep getting this error:
Tracker started, with env={}
[11:21:07] /Users/nanzhu/code/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 70 extra nodes, 0 pruned nodes, max_depth=6
[0] train-error:0.112202
[Stage 10:> (0 + 1) / 1]2018-11-19 11:21:08 ERROR RabitTracker:91 - Uncaught exception thrown by worker:

I am using Python version 2.7.10. Any ideas? Thanks.

lock bot locked as resolved and limited conversation to collaborators Feb 17, 2019