udpate
wbo4958 committed Sep 25, 2024
1 parent 7cf6ec5 commit 1234701
Showing 2 changed files with 13 additions and 29 deletions.
18 changes: 3 additions & 15 deletions doc/install.rst
@@ -159,7 +159,7 @@ R
JVM
---

-* XGBoost4j/XGBoost4j-Spark
+* XGBoost4j-Spark

.. code-block:: xml
:caption: Maven
@@ -172,11 +172,6 @@ JVM
<dependencies>
...
<dependency>
-<groupId>ml.dmlc</groupId>
-<artifactId>xgboost4j_${scala.binary.version}</artifactId>
-<version>latest_version_num</version>
-</dependency>
-<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
@@ -188,11 +183,10 @@ JVM
:caption: sbt
libraryDependencies ++= Seq(
-"ml.dmlc" %% "xgboost4j" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark" % "latest_version_num"
)
-* XGBoost4j-GPU/XGBoost4j-Spark-GPU
+* XGBoost4j-Spark-GPU

.. code-block:: xml
:caption: Maven
@@ -205,11 +199,6 @@ JVM
<dependencies>
...
<dependency>
-<groupId>ml.dmlc</groupId>
-<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
-<version>latest_version_num</version>
-</dependency>
-<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
@@ -221,15 +210,14 @@ JVM
:caption: sbt
libraryDependencies ++= Seq(
-"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num"
)
This will check out the latest stable version from the Maven Central.

For the latest release version number, please check `release page <https://github.com/dmlc/xgboost/releases>`_.

-To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).
+To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).


.. note:: Windows not supported in the JVM package
24 changes: 10 additions & 14 deletions doc/jvm/xgboost4j_spark_gpu_tutorial.rst
@@ -71,7 +71,7 @@ To make the Iris dataset recognizable to XGBoost, we need to encode the String-t
label, i.e. "class", to the Double-typed label.

One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
-`StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
+`StringIndexer <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/StringIndexer.html>`_.
But this feature is not accelerated in RAPIDS Accelerator, which means it will fall back
to CPU. Instead, we use an alternative way to achieve the same goal with the following code:

@@ -107,10 +107,10 @@ With window operations, we have mapped the string column of labels to label indi
Training
========

-The GPU version of XGBoost-Spark supports both regression and classification
+XGBoost4j-Spark-Gpu supports regression, classification and ranking
models. Although we use the Iris dataset in this tutorial to show how we use
-``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
-usage in Regression is very similar to classification.
+``XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
+usage in Regression and Ranking is very similar to classification.

To train a XGBoost model for classification, we need to define a XGBoostClassifier first:

@@ -168,12 +168,13 @@ model can then be used in other tasks like prediction.
Prediction
==========

-When we get a model, either a XGBoostClassificationModel or a XGBoostRegressionModel, it takes a DataFrame as an input,
+When we get a model, a XGBoostClassificationModel or a XGBoostRegressionModel or a XGBoostRankerModel, it takes a DataFrame as an input,
reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
with the following columns by default:

* XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
* XGBoostRegressionModel will output prediction a label(``predictionCol``).
+* XGBoostRankerModel will output prediction a label(``predictionCol``).

.. code-block:: scala
@@ -226,25 +227,20 @@ would be ``"spark.task.resource.gpu.amount=1/spark.executor.cores"``. However, i
using a XGBoost version earlier than 2.1.0 or a Spark standalone cluster version below 3.4.0,
you still need to set ``"spark.task.resource.gpu.amount"`` equal to ``"spark.executor.resource.gpu.amount"``.

-.. note::
-
-As of now, the stage-level scheduling feature in XGBoost is limited to the Spark standalone cluster mode.
-However, we have plans to expand its compatibility to YARN and Kubernetes once Spark 3.5.1 is officially released.
-
Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",`
provided below is an instance demonstrating how to submit the xgboost application to an Apache
Spark Standalone cluster.

.. code-block:: bash
-rapids_version=23.10.0
-xgboost_version=2.0.1
+rapids_version=24.08.0
+xgboost_version=$LATEST_VERSION
main_class=Iris
app_jar=iris-1.0.0.jar
spark-submit \
--master $master \
---packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
+--packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--conf spark.executor.cores=12 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
@@ -255,7 +251,7 @@ Spark Standalone cluster.
--class ${main_class} \
${app_jar}
-* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
+* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-spark-gpu`` packages by ``--packages``
* Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``

For details about other ``RAPIDS Accelerator`` other configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.
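
Taken together, the doc/install.rst hunks in this commit leave one documented coordinate per flavor: ``xgboost4j-spark`` for CPU and ``xgboost4j-spark-gpu`` for GPU. A minimal sbt sketch of how a downstream build might read after the change is shown here. The ``xgboostVersion`` value name and the idea of commenting one flavor out are illustrative assumptions, not something the commit prescribes, and ``latest_version_num`` is the documentation's own placeholder rather than a real version.

```scala
// build.sbt (sketch): coordinates as documented after this commit.
// "latest_version_num" is the docs' placeholder; replace it with a real
// release number from the XGBoost release page before building.
val xgboostVersion = "latest_version_num" // hypothetical value name

// CPU flavor: only the Spark artifact is listed in the updated docs.
libraryDependencies += "ml.dmlc" %% "xgboost4j-spark" % xgboostVersion

// GPU flavor: per the updated docs, use this artifact instead of the CPU one.
// libraryDependencies += "ml.dmlc" %% "xgboost4j-spark-gpu" % xgboostVersion
```

Whether the plain ``xgboost4j`` or ``xgboost4j-gpu`` artifacts are still resolved transitively is not stated in the diff, so it is worth confirming against the published POM for the chosen version.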