From c8e0f45b1e8f3b5705b68e0f79f768dc4c9c60df Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Wed, 26 Oct 2022 15:45:54 +0800
Subject: [PATCH] [backport][pyspark] Improve tutorial on enabling GPU support. (#8385) [skip ci]

- Quote the databricks doc on how to manage dependencies.
- Some wording changes.

Co-authored-by: Philip Hyunsu Cho
---
 doc/tutorials/spark_estimator.rst | 95 +++++++++++++++++++------------
 1 file changed, 60 insertions(+), 35 deletions(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index acacada0b3eb..44e7a957513b 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -83,17 +83,52 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************
 
-XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
-need to install the XGBoost and the `cuDF `_
-package. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark fully supports GPU acceleration. Users can not only enable
+efficient training but also utilize their GPUs for the whole PySpark pipeline, including
+ETL and inference. In the sections below, we will walk through an example of training on
+a PySpark standalone GPU cluster. To get started, first we need to install some
+additional packages, and then we can set the `use_gpu` parameter to `True`.
 
-Below tutorial demonstrates how to train a model with XGBoost PySpark GPU on Spark
-standalone cluster.
+Prepare the necessary packages
+==============================
+
+Aside from the PySpark and XGBoost modules, we also need the `cuDF
+`_ package for handling Spark dataframes. We
+recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
+jobs. Please refer to `How to Manage Python Dependencies in PySpark
+`_
+for more details on PySpark dependency management.
+
+In short, to create a Python environment that can be sent to a remote cluster using
+virtualenv and pip:
+
+.. code-block:: bash
+
+  python -m venv xgboost_env
+  source xgboost_env/bin/activate
+  pip install pyarrow pandas venv-pack xgboost
+  # https://rapids.ai/pip.html#install
+  pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+  venv-pack -o xgboost_env.tar.gz
+
+With Conda:
+
+.. code-block:: bash
+
+  conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
+  conda activate xgboost_env
+  # Use conda when the supported version of xgboost (1.7) is released on conda-forge
+  pip install xgboost
+  conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
+  conda pack -f -o xgboost_env.tar.gz
 
 Write your PySpark application
 ==============================
 
+The snippet below shows a small example of training an XGBoost model with PySpark. Notice
+that we are using a list of feature names and the additional parameter ``use_gpu``:
+
 .. code-block:: python
 
   from xgboost.spark import SparkXGBRegressor
 
@@ -127,26 +162,11 @@ Write your PySpark application
     predict_df = model.transform(test_df)
     predict_df.show()
 
-Prepare the necessary packages
-==============================
-
-We recommend using Conda or Virtualenv to manage python dependencies
-in PySpark. Please refer to
-`How to Manage Python Dependencies in PySpark `_.
-
-.. code-block:: bash
-
-  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
-  conda activate xgboost-env
-  pip install xgboost
-  conda install cudf -c rapids -c nvidia -c conda-forge
-  conda pack -f -o xgboost-env.tar.gz
-
 Submit the PySpark application
 ==============================
 
-Assuming you have configured your Spark cluster with GPU support, if not yet, please
+We assume you have configured your Spark cluster with GPU support. Otherwise, please
 refer to `spark standalone configuration with GPU support
 `_.
 
 .. code-block:: bash
 
@@ -158,10 +178,13 @@ refer to `spark standalone configuration with GPU support
     --master spark://:7077 \
     --conf spark.executor.resource.gpu.amount=1 \
     --conf spark.task.resource.gpu.amount=1 \
-    --archives xgboost-env.tar.gz#environment \
+    --archives xgboost_env.tar.gz#environment \
     xgboost_app.py
 
+The submit command sends the Python environment created by pip or conda, along with the
+specification of GPU allocation. We will revisit this command later on.
+
 
 Model Persistence
 =================
 
@@ -186,26 +209,27 @@ To export the underlying booster model used by XGBoost:
     # the same booster object returned by xgboost.train
     booster: xgb.Booster = model.get_booster()
     booster.predict(...)
-    booster.save_model("model.json")
+    booster.save_model("model.json")  # or model.ubj, depending on your choice of format.
 
-This booster is shared by other Python interfaces and can be used by other language
-bindings like the C and R packages. Lastly, one can extract a booster file directly from
-saved spark estimator without going through the getter:
+This booster is not only shared by other Python interfaces but also used by all the
+XGBoost bindings, including the C, Java, and R packages. Lastly, one can extract the
+booster file directly from a saved Spark estimator without going through the getter:
 
 .. code-block:: python
 
   import xgboost as xgb
 
   bst = xgb.Booster()
+  # Load the model saved in the previous snippet
   bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")
 
-Accelerate the whole pipeline of xgboost pyspark
-================================================
-
-With `RAPIDS Accelerator for Apache Spark `_,
-you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
-without any code change by leveraging GPU.
+Accelerate the whole pipeline for xgboost pyspark
+=================================================
 
-Below is a simple example submit command for enabling GPU acceleration:
+With `RAPIDS Accelerator for Apache Spark `_, you
+can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
+pyspark without any Python code change. An example submit command is shown below with
+additional spark configurations and dependencies:
 
 .. code-block:: bash
 
@@ -219,8 +243,9 @@ Below is a simple example submit command for enabling GPU acceleration:
     --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
     --conf spark.plugins=com.nvidia.spark.SQLPlugin \
     --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
-    --archives xgboost-env.tar.gz#environment \
+    --archives xgboost_env.tar.gz#environment \
     xgboost_app.py
 
-When rapids plugin is enabled, both of the JVM rapids plugin and the cuDF Python are
-required for the acceleration.
+When the RAPIDS plugin is enabled, both the JVM RAPIDS plugin and the cuDF Python package
+are required. More configuration options can be found in the RAPIDS link above, along
+with details on the plugin.
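As a minimal, optional sketch complementing the last paragraph of the patch: the
``spark.rapids.sql.enabled`` and ``spark.rapids.sql.explain`` keys below come from the
RAPIDS Accelerator documentation and are not part of the tutorial text above; treat them
as assumptions rather than required settings.

.. code-block:: bash

  # Hypothetical additions to the submit command from the patch above; the
  # spark.rapids.* keys are taken from the RAPIDS Accelerator docs, not this tutorial.
  spark-submit \
    --master spark://<master-ip>:7077 \
    --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.sql.enabled=true \
    --conf spark.rapids.sql.explain=NOT_ON_GPU \
    --archives xgboost_env.tar.gz#environment \
    xgboost_app.py

Setting ``spark.rapids.sql.explain=NOT_ON_GPU`` makes the plugin log the operations that
fall back to the CPU, which is a quick way to check whether the ETL part of the pipeline
is actually being accelerated.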