[WIP][PySpark] Add XGBoost PySpark API support #7709
Conversation
@wbo4958 Is there a way to set the Scala tracker instead of the Python tracker? In Java/Scala I set:
Yeah, looks like we need to add support for it in Scala.
This is a typical snippet I can have in Java:
It looks like many parameters are missing (is there a way to set eval sets for early stopping?)
@candalfigomoro Yeah, I have not ported all the Scala params yet; for now, I am focusing on adding integration tests. We can add the remaining params in follow-up PRs.
Thank you for revising the original PR! This looks much more aligned with Python and PySpark ML.
Could you please write a tutorial in doc/tutorials/ for newbies like me to get started?
python-package/xgboost/spark.py (Outdated)
from pyspark import keyword_only
from pyspark.ml.common import inherit_doc

from xgboost.ml.dmlc.param import _XGBoostClassifierBase, _XGBoostClassificationModelBase
I think it's more appropriate to use xgboost.pyspark.PySparkXGBClassifier instead of this Java-style module path.
Following the PySpark convention, we should keep the same API name on the PySpark and JVM sides.
See https://github.com/apache/spark/blob/branch-3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L58 and https://github.com/apache/spark/blob/branch-3.2/python/pyspark/ml/classification.py#L1952
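The convention those two links illustrate can be sketched in plain Python. This is an illustrative sketch, not actual PySpark internals: the `_java_class` attribute below is a stand-in for however a wrapper records its JVM counterpart. The point is only that the Python class reuses the simple name of the JVM class it wraps.

```python
# Illustrative sketch of the PySpark naming convention: the Python
# wrapper class keeps the same simple name as the JVM class it wraps.
# `_java_class` is a hypothetical attribute for this example only.
class GBTClassifier:
    _java_class = "org.apache.spark.ml.classification.GBTClassifier"

# The simple JVM class name matches the Python class name.
simple_name = GBTClassifier._java_class.rsplit(".", 1)[-1]
assert simple_name == GBTClassifier.__name__
```

Under this convention, users see the same estimator name whether they work in Scala or Python, which is the argument for mirroring the JVM names here.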
That's not the PySpark convention. It's not pyspark.ml.apache.org.xxxx.
_spark = _spark__init()


def get_spark_i_know_what_i_am_doing():
So, what's the specific case where we should use this instead of the CPU session? Could you please document it?
The API was copied from spark-rapids, and I've now changed it according to the needs of XGBoost.
@wbo4958 Okay, but I'm not sure how that's relevant to the question?
from xgboost.ml.dmlc.param.internal import _XGBoostClassifierBase, _XGBoostClassificationModelBase, \
    _XGBoostRegressionModelBase, _XGBoostRegressorBase

__all__ = ['_XGBoostClassifierBase', '_XGBoostClassificationModelBase',
Is this necessary?
pass


@inherit_doc
I'm not entirely sure what the result of combining this decorator with a custom docstring is. Have you checked?
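For reference, here is a simplified sketch of what a docstring-inheriting decorator in the spirit of pyspark's `inherit_doc` does. This reimplements the idea, not pyspark's actual code: it only fills in docstrings that are missing, so a custom docstring written on the class or on an overriding method should be left untouched.

```python
def inherit_doc(cls):
    # Simplified reimplementation of the docstring-inheritance idea:
    # copy docstrings from base classes only where the subclass did
    # not provide its own.
    if cls.__doc__ is None:
        for base in cls.__mro__[1:]:
            if base.__doc__:
                cls.__doc__ = base.__doc__
                break
    for name, member in vars(cls).items():
        if callable(member) and getattr(member, "__doc__", None) is None:
            for base in cls.__mro__[1:]:
                parent = getattr(base, name, None)
                if parent is not None and parent.__doc__:
                    try:
                        member.__doc__ = parent.__doc__
                    except AttributeError:
                        # Some callables have read-only __doc__.
                        pass
                    break
    return cls
```

Under this behavior, combining the decorator with a custom docstring is a no-op for that docstring; only undocumented members pick up the parent's text. Whether pyspark's real implementation matches this exactly is the thing worth checking.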
""" | ||
Java Regressor for regression tasks. | ||
|
||
.. versionadded:: 3.0.0 |
Please double-check the copied code.
This PR closes #7578.
The XGBoost PySpark API is a wrapper of xgboost4j-spark and xgboost4j-spark-gpu. It will be packaged into the existing xgboost python-package.