[WIP][PySpark] Add XGBoost PySpark API support #7709
Conversation
@wbo4958 Is there a way to set the Scala tracker instead of the Python tracker? In Java/Scala I set:
Yeah, looks like we need to add support for it in Scala.
This is a typical snippet I can have in Java:
It looks like many parameters are missing (is there a way to set eval sets for early stopping?)
@candalfigomoro Yeah, I have not ported all the Scala params yet; for now, I am focusing on adding integration tests. We can add the remaining params in follow-up PRs.
Thank you for revising the original PR! This looks much more aligned with Python and PySpark ML.
Could you please write a tutorial in doc/tutorials/ for newbies like me to get started?
python-package/xgboost/spark.py (Outdated)
from pyspark import keyword_only
from pyspark.ml.common import inherit_doc

from xgboost.ml.dmlc.param import _XGBoostClassifierBase, _XGBoostClassificationModelBase
I think it's more appropriate to use xgboost.pyspark.PySparkXGBClassifier instead of this Java-style module path.
Following the PySpark convention, we should keep the same API name on the PySpark and JVM sides.
See https://github.com/apache/spark/blob/branch-3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L58 and https://github.com/apache/spark/blob/branch-3.2/python/pyspark/ml/classification.py#L1952
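The convention those two links illustrate can be sketched in plain Python. This is an illustrative sketch, not actual PySpark internals: the `_java_class` attribute below is a stand-in for however a wrapper records its JVM counterpart. The point is only that the Python class reuses the simple name of the JVM class it wraps.

```python
# Illustrative sketch of the PySpark naming convention: the Python
# wrapper class keeps the same simple name as the JVM class it wraps.
# `_java_class` is a hypothetical attribute for this example only.
class GBTClassifier:
    _java_class = "org.apache.spark.ml.classification.GBTClassifier"

# The simple JVM class name matches the Python class name.
simple_name = GBTClassifier._java_class.rsplit(".", 1)[-1]
assert simple_name == GBTClassifier.__name__
```

Under this convention, users see the same estimator name whether they work in Scala or Python, which is the argument for mirroring the JVM names here.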
That's not the PySpark convention. It's not pyspark.ml.apache.org.xxxx.
_spark = _spark__init()


def get_spark_i_know_what_i_am_doing():
So, what's the specific case where we should use this instead of the CPU session? Could you please document it?
The API was copied from spark-rapids, and I've now changed it according to the needs of XGBoost.
@wbo4958 Okay, but I'm not sure how that's relevant to the question?
from xgboost.ml.dmlc.param.internal import _XGBoostClassifierBase, _XGBoostClassificationModelBase, \
    _XGBoostRegressionModelBase, _XGBoostRegressorBase

__all__ = ['_XGBoostClassifierBase', '_XGBoostClassificationModelBase',
Is this necessary?
pass


@inherit_doc
I'm not entirely sure what the result of combining this decorator with a custom docstring is. Have you checked?
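For reference, here is a simplified sketch of what a docstring-inheriting decorator in the spirit of pyspark's `inherit_doc` does. This reimplements the idea, not pyspark's actual code: it only fills in docstrings that are missing, so a custom docstring written on the class or on an overriding method should be left untouched.

```python
def inherit_doc(cls):
    # Simplified reimplementation of the docstring-inheritance idea:
    # copy docstrings from base classes only where the subclass did
    # not provide its own.
    if cls.__doc__ is None:
        for base in cls.__mro__[1:]:
            if base.__doc__:
                cls.__doc__ = base.__doc__
                break
    for name, member in vars(cls).items():
        if callable(member) and getattr(member, "__doc__", None) is None:
            for base in cls.__mro__[1:]:
                parent = getattr(base, name, None)
                if parent is not None and parent.__doc__:
                    try:
                        member.__doc__ = parent.__doc__
                    except AttributeError:
                        # Some callables have read-only __doc__.
                        pass
                    break
    return cls
```

Under this behavior, combining the decorator with a custom docstring is a no-op for that docstring; only undocumented members pick up the parent's text. Whether pyspark's real implementation matches this exactly is the thing worth checking.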
""" | ||
Java Regressor for regression tasks. | ||
|
||
.. versionadded:: 3.0.0 |
Please double-check the copied code.
This PR closes #7578.
The XGBoost PySpark API is a wrapper of xgboost4j-spark and xgboost4j-spark-gpu. It will be packaged into the existing xgboost python-package.