[Breaking][jvm-packages] add barrier execution mode #7836
Conversation
With introducing barrier execution mode. we don't need to kill SparkContext when some xgboost tasks failed, instead, Spark will do it for us. So in this PR, killSparkContextOnWorkerFailure parameter is deleted.
@trivialfis could you help review it?
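For readers unfamiliar with the mechanism, here is a minimal sketch of Spark's barrier execution mode (illustrative only, not code from this PR; the RDD contents and parallelism are placeholders): all tasks in a barrier stage are scheduled together, and if any task fails Spark aborts and retries the whole stage, which is what lets XGBoost drop its own SparkContext-killing logic.

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()

    // Four partitions -> four barrier tasks, all scheduled together.
    val rdd = spark.sparkContext.parallelize(1 to 8, numSlices = 4)

    val doubled = rdd.barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      // Global synchronization point across all tasks in the stage,
      // similar to how distributed XGBoost workers rendezvous.
      ctx.barrier()
      // If any task throws here, Spark aborts the whole barrier stage
      // (and retries it) instead of leaving the other tasks hanging.
      iter.map(_ * 2)
    }.collect()

    println(doubled.mkString(", "))
    spark.stop()
  }
}
```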
Thank you for working on this, being able to remove code and improve robustness is exciting!
- Could you please share more details on what the breaking change is and how users should adapt to it?
- Could you please keep some RABIT mock tests here to show that exceptions are indeed handled well?
test("throw exception for empty partition in trainingset") { | ||
val paramMap = Map("eta" -> "0.1", "max_depth" -> "6", "silent" -> "1", | ||
"objective" -> "multi:softmax", "num_class" -> "2", "num_round" -> 5, | ||
"num_workers" -> numWorkers, "tree_method" -> "auto") | ||
"objective" -> "binary:logistic", "num_class" -> "2", "num_round" -> 5, | ||
"num_workers" -> numWorkers, "tree_method" -> "auto", "allow_non_zero_for_missing" -> true) | ||
// The Dmatrix will be empty | ||
val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 1, Array(), Array()))) | ||
val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 4, | ||
Array(0, 1, 2, 3), Array(0, 1, 2, 3)))) | ||
val xgb = new XGBoostClassifier(paramMap) | ||
intercept[XGBoostError] { | ||
val model = xgb.fit(trainingDF) | ||
intercept[SparkException] { | ||
xgb.fit(trainingDF) | ||
} | ||
} |
Are these changes strictly related to the support of barrier mode?
Not strictly, no. I just made this test match its description more closely.
The `killSparkContextOnWorkerFailure` parameter is deleted. Users should no longer set this parameter.
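As a concrete illustration of the adaptation, here is a minimal sketch. It assumes the removed option was spelled as the snake_case params-map key `kill_spark_context_on_worker_failure`; adjust to however your code actually set it.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Before this change (the key name is assumed for illustration):
val oldParams: Map[String, Any] = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 10,
  "num_workers" -> 2,
  "kill_spark_context_on_worker_failure" -> false) // removed by this PR

// After this change: simply drop the option; Spark's barrier mode
// aborts the whole stage on worker failure instead.
val newParams: Map[String, Any] = oldParams - "kill_spark_context_on_worker_failure"

val classifier = new XGBoostClassifier(newParams)
```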
Not exactly. The deleted case test("test SparkContext should not be killed ") is only for
With the introduction of the barrier execution mode, we don't need to kill the SparkContext when some XGBoost tasks fail; instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.
* [jvm-packages] move the dmatrix building into rabit context (#7823). This fixes the QuantileDeviceDMatrix in a distributed environment.
* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)
* [Breaking][jvm-packages] Use barrier execution mode (#7836). With the introduction of the barrier execution mode, we don't need to kill the SparkContext when some XGBoost tasks fail; instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.
* [doc] remove the doc about killing SparkContext [skip ci] (#7840)
* [jvm-package] remove the coalesce in barrier mode (#7846)
* [jvm-packages] Fix model compatibility (#7845)
* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>
Co-authored-by: Michael Allman <msa@allman.ms>
To fix #7835. Credit to @WeichenXu123
By introducing the barrier execution mode, we no longer need to kill the SparkContext when some XGBoost tasks fail; instead, Spark aborts the whole barrier stage. Because this does not depend on a SparkListener, we will never encounter the XGBoost task hang issue. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.
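To make the behavior change concrete, here is a small sketch of how an application can now react to a failed distributed training job, consistent with the `intercept[SparkException]` change in the test above: the failure surfaces as a `SparkException` thrown by `fit()` while the SparkContext stays alive. The helper name `trainOrReport`, the DataFrame, and the parameter values are placeholders, not code from this PR.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.SparkException
import org.apache.spark.sql.DataFrame

// trainingDF is assumed to already carry "features" and "label" columns.
def trainOrReport(trainingDF: DataFrame): Unit = {
  val xgb = new XGBoostClassifier(Map(
    "objective" -> "binary:logistic",
    "num_round" -> 10,
    "num_workers" -> 2))
  try {
    val model = xgb.fit(trainingDF)
    println(s"training succeeded: ${model.uid}")
  } catch {
    case e: SparkException =>
      // The barrier stage was aborted, but the SparkContext is still
      // alive, so the application can log, clean up, or retry.
      println(s"training failed: ${e.getMessage}")
  }
}
```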
Since this PR deletes SparkParallelismTracker, the `timeoutRequestWorkers` parameter is not needed anymore. But one test,
cross-version model loading (0.82)
will fail if we just delete `timeoutRequestWorkers`, so we keep this parameter in this PR. I will file a follow-up PR to delete the `timeoutRequestWorkers` parameter.