Update docs for the 22.04 release[skip ci] #4997

Merged
36 commits merged on Apr 7, 2022
Commits
2e817ac
Update 2204 doc including add a download page section
viadea Mar 21, 2022
5a565fe
Update docs/FAQ.md
viadea Mar 21, 2022
4eba485
Update docs/FAQ.md
viadea Mar 21, 2022
50b8173
Update docs/FAQ.md
viadea Mar 21, 2022
5a5f0cf
Update docs/FAQ.md
viadea Mar 21, 2022
99df03d
reword on FAQ
viadea Mar 21, 2022
b9a0a6f
Change to CUDA 11.5 in FAQ guide
viadea Mar 22, 2022
247b7d0
Update docs/FAQ.md
viadea Mar 22, 2022
4c19d8b
Update docs/FAQ.md
viadea Mar 22, 2022
54ee0cb
Update docs/FAQ.md
viadea Mar 22, 2022
5844c99
Update docs/FAQ.md
viadea Mar 22, 2022
8fd9e2e
Update docs/FAQ.md
viadea Mar 22, 2022
a9ff4db
Update docs/FAQ.md
viadea Mar 22, 2022
b9f9d70
Update docs/FAQ.md
viadea Mar 22, 2022
5e39bea
Update docs/get-started/getting-started-databricks.md
viadea Mar 22, 2022
7957ba1
minor wording change in FAQ
viadea Mar 22, 2022
05bae05
Add some notes in GCP guide
viadea Mar 22, 2022
9acae90
Add avro reader
viadea Mar 22, 2022
10ae0bf
Add CDP /CDS versions in FAQ
viadea Mar 23, 2022
2c27550
resolve conflict
viadea Mar 24, 2022
6eb298f
add spark 3.3
viadea Mar 24, 2022
750af1a
resolve conflict
viadea Mar 24, 2022
1d8c1c4
Merge branch 'branch-22.04' into 2204-doc
viadea Mar 24, 2022
49119c6
Add 3.3.0
viadea Mar 24, 2022
3fdf64c
remove 330
viadea Mar 25, 2022
882af8b
Add support email
viadea Mar 25, 2022
b0bab03
Update docs/download.md
viadea Apr 1, 2022
b38b4b3
Update docs/download.md
viadea Apr 1, 2022
dca6600
delete generate-init-script-cuda11.ipynb
viadea Apr 1, 2022
6c76439
reformatted init script
viadea Apr 1, 2022
25ea4d6
MIG FAQ update
viadea Apr 1, 2022
756b7e0
Modified CLDR/EMR section in download
viadea Apr 2, 2022
353b8f4
Update docs/get-started/getting-started-databricks.md
viadea Apr 5, 2022
b46562e
Update docs/get-started/getting-started-gcp.md
viadea Apr 5, 2022
3d551f9
Update docs/get-started/getting-started-databricks.md
viadea Apr 5, 2022
9eb8765
Update docs/FAQ.md
viadea Apr 5, 2022
45 changes: 40 additions & 5 deletions docs/FAQ.md
@@ -20,9 +20,13 @@ process, we try to stay on top of these changes and release updates as quickly a
The RAPIDS Accelerator for Apache Spark officially supports:
- [Apache Spark](get-started/getting-started-on-prem.md)
- [AWS EMR 6.2+](get-started/getting-started-aws-emr.md)
- [Databricks Runtime 7.3, 9.1](get-started/getting-started-databricks.md)
- [Databricks Runtime 9.1, 10.4](get-started/getting-started-databricks.md)
- [Google Cloud Dataproc 2.0](get-started/getting-started-gcp.md)
- [Azure Synapse](get-started/getting-started-azure-synapse-analytics.md)
- Cloudera provides the plugin packaged through
[CDS 3.2](https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/cds-3/topics/spark-spark-3-overview.html)
which is supported on the following
[CDP Private Cloud Base releases](https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/cds-3/topics/spark-3-requirements.html).

Most distributions based on a supported Apache Spark version should work, but because the plugin
replaces parts of the physical plan that Apache Spark considers to be internal the code for those
@@ -40,7 +44,7 @@ The plugin is tested and supported on V100, T4, A2, A10, A30 and A100 datacenter
to run the plugin on GeForce desktop hardware with Volta or better architectures. GeForce hardware
does not support [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
and will need CUDA 11.2 installed. If not, the following error will be displayed:
and will need CUDA 11.5 installed. If not, the following error will be displayed:

```
ai.rapids.cudf.CudaException: forward compatibility was attempted on non supported HW
@@ -75,8 +79,9 @@ Turing or Ampere.

Currently a limited set of SQL and DataFrame operations are supported, please see the
[configs](configs.md) and [supported operations](supported_ops.md) for a more complete list of what
is supported. Some of structured streaming is likely to be accelerated, but it has not been an area
of focus right now. Other areas like MLLib, GraphX or RDDs are not accelerated.
is supported. Some MLlib functions, such as `PCA`, are supported.
Some of structured streaming is likely to be accelerated, but it has not been an area
of focus so far. Other areas like GraphX or RDDs are not accelerated.

### Is the Spark `Dataset` API supported?

@@ -370,7 +375,9 @@ There are multiple reasons why this is a problematic configuration:
### Is [Multi-Instance GPU (MIG)](https://docs.nvidia.com/cuda/mig/index.html) supported?

Yes, but it requires support from the underlying cluster manager to isolate the MIG GPU instance
for each executor (e.g.: by setting `CUDA_VISIBLE_DEVICES` or other means).
for each executor (e.g. by setting `CUDA_VISIBLE_DEVICES`, using
[YARN with Docker isolation](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.04/examples/MIG-Support),
or other means).
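
As a rough sketch of the `CUDA_VISIBLE_DEVICES` approach, the cluster manager (or a wrapper
script) can expose exactly one MIG instance to each executor process. The device UUIDs below are
hypothetical placeholders, and the exact device string format depends on the driver version:

```bash
# List the GPUs and MIG instances on this node (the UUIDs shown are placeholders).
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
#   MIG 3g.20gb Device 0: (UUID: MIG-11111111-aaaa-bbbb-cccc-dddddddddddd)
#   MIG 3g.20gb Device 1: (UUID: MIG-22222222-aaaa-bbbb-cccc-dddddddddddd)

# Before starting an executor, expose a single MIG instance to it so that the
# RAPIDS Accelerator in that executor only sees one isolated GPU instance.
export CUDA_VISIBLE_DEVICES=MIG-11111111-aaaa-bbbb-cccc-dddddddddddd
```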

Note that MIG is not recommended for use with the RAPIDS Accelerator since it significantly
reduces the amount of GPU memory that can be used by the Accelerator for each executor instance.
@@ -379,6 +386,9 @@ without MIG. Also note that the UCX-based shuffle plugin will not work as well i
configuration because
[MIG does not support direct GPU to GPU transfers](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#app-considerations).

However, MIG can be advantageous if the cluster is intended to be shared with other workloads
(such as ML / DL jobs).

### How can I run custom expressions/UDFs on the GPU?

The RAPIDS Accelerator provides the following solutions for running
@@ -468,3 +478,28 @@ later) finishes before the slow task that triggered speculation. If the speculat
finishes first then that's good, it is working as intended. If many tasks are speculating, but the
original task always finishes first then this is a pure loss, the speculation is adding load to
the Spark cluster with no benefit.

### Why is my query in GPU mode slower than CPU mode?

Below are some troubleshooting tips for GPU query performance issues:
* Identify the most time-consuming part of the query. You can use the
[Profiling tool](./spark-profiling-tool.md) to process the Spark event log and get more insight into
the query performance. For example, if I/O is the bottleneck, we suggest optimizing the backend
storage I/O performance, because the queries that benefit most from the GPU are computation bound
rather than I/O or network bound.

* Make sure that at least the most time-consuming part of the query runs on the GPU. Please refer to
[Getting Started on Spark workload qualification](./get-started/getting-started-workload-qualification.md)
for more details. Ideally the whole query runs on the GPU, but if some minor part of
the query, e.g. a small JDBC table scan, cannot run on the GPU, it will not add much performance
overhead. If there are CPU fallbacks, check whether they correspond to known features that can be
enabled by turning on RAPIDS Accelerator parameters. If the features you need do not exist in
the most recent release of the RAPIDS Accelerator, please file a
[feature request](https://github.com/NVIDIA/spark-rapids/issues) with a minimal reproducing example.

* Tune the Spark and RAPIDS Accelerator parameters such as `spark.sql.shuffle.partitions`,
`spark.sql.files.maxPartitionBytes` and `spark.rapids.sql.concurrentGpuTasks`, as these
configurations can significantly affect query performance. Please refer to the
[Tuning Guide](./tuning-guide.md) for more details, and see the sketch below for how these
settings can be passed to a Spark session.
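
As a minimal sketch, assuming the RAPIDS Accelerator jars are already on the classpath, these
configurations can be passed when launching Spark. The values shown are illustrative starting
points, not tuned recommendations:

```bash
# Illustrative values only; tune these for your own cluster and workload.
# spark.rapids.sql.explain=NOT_ON_GPU logs which operators fall back to the CPU.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  --conf spark.rapids.sql.concurrentGpuTasks=2 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.files.maxPartitionBytes=512m
```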



2 changes: 1 addition & 1 deletion docs/additional-functionality/ml-integration.md
@@ -40,7 +40,7 @@ access to any of the memory that RMM is holding.
## Spark ML Algorithms Supported by RAPIDS Accelerator

The [spark-rapids-examples repository](https://github.com/NVIDIA/spark-rapids-examples) provides a
[working example](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.02/examples/Spark-cuML/pca)
[working example](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.04/examples/Spark-cuML/pca)
of accelerating the `transform` API for
[Principal Component Analysis (PCA)](https://spark.apache.org/docs/latest/mllib-dimensionality-reduction#principal-component-analysis-pca).
The example leverages the [RAPIDS accelerated UDF interface](rapids-udfs.md) to provide a native
1 change: 0 additions & 1 deletion docs/demo/Databricks/generate-init-script-cuda11.ipynb

This file was deleted.

50 changes: 49 additions & 1 deletion docs/demo/Databricks/generate-init-script.ipynb
@@ -1 +1,49 @@
{"cells":[{"cell_type":"code","source":["dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-22.02.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.02.0/rapids-4-spark_2.12-22.02.0.jar\nsudo wget -O /databricks/jars/cudf-22.02.0-cuda11.jar https://repo1.maven.org/maven2/ai/rapids/cudf/22.02.0/cudf-22.02.0-cuda11.jar\"\"\", True)"],"metadata":{},"outputs":[],"execution_count":1},{"cell_type":"code","source":["%sh\ncd ../../dbfs/databricks/init_scripts\npwd\nls -ltr\ncat init.sh"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":3}],"metadata":{"name":"generate-init-script","notebookId":2645746662301564},"nbformat":4,"nbformat_minor":0}
{
"cells":[
{
"cell_type":"code",
"source":[
"dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-22.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar\nsudo wget -O /databricks/jars/cudf-22.04.0-cuda11.jar https://repo1.maven.org/maven2/ai/rapids/cudf/22.04.0/cudf-22.04.0-cuda11.jar\"\"\", True)"
],
"metadata":{

},
"outputs":[

],
"execution_count":1
},
{
"cell_type":"code",
"source":[
"%sh\ncd ../../dbfs/databricks/init_scripts\npwd\nls -ltr\ncat init.sh"
],
"metadata":{

},
"outputs":[

],
"execution_count":2
},
{
"cell_type":"code",
"source":[
""
],
"metadata":{

},
"outputs":[

],
"execution_count":3
}
],
"metadata":{
"name":"generate-init-script",
"notebookId":2645746662301564
},
"nbformat":4,
"nbformat_minor":0
}
96 changes: 89 additions & 7 deletions docs/download.md
@@ -18,6 +18,67 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or sub
that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started
guide](https://nvidia.github.io/spark-rapids/Getting-Started/) for more details.

## Release v22.04.0
Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA V100, T4 and A2/A10/A30/A100 GPUs

Software Requirements:

OS: Ubuntu 18.04, Ubuntu 20.04 or CentOS 7, CentOS 8

CUDA & NVIDIA Drivers*: 11.x & v450.80.02+

Apache Spark 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, Databricks 9.1 ML LTS or 10.4 ML LTS Runtime and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v22.04.0
* Download the [RAPIDS
Accelerator for Apache Spark 22.04.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar)
* Download the [RAPIDS cuDF 22.04.0 jar](https://repo1.maven.org/maven2/ai/rapids/cudf/22.04.0/cudf-22.04.0-cuda11.jar)

This package is built against CUDA 11.5 and has [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) enabled. It is tested
on V100, T4, A2, A10, A30 and A100 GPUs with CUDA 11.0-11.5. For those using other types of GPUs which
do not have CUDA forward compatibility (for example, GeForce), CUDA 11.5 is required. Users will
need to ensure the minimum driver (450.80.02) and CUDA toolkit are installed on each Spark node.
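
As a minimal sketch, a job can be submitted with the downloaded jars on the classpath. The jar
locations and the application jar name below are placeholders; see the getting-started guide for
the full setup:

```bash
# Placeholder paths; adjust to wherever the jars were downloaded on your nodes.
spark-submit \
  --jars /opt/sparkRapidsPlugin/rapids-4-spark_2.12-22.04.0.jar,/opt/sparkRapidsPlugin/cudf-22.04.0-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your-application.jar
```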

### Verify signature
* Download the [RAPIDS Accelerator for Apache Spark 22.04.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar)
and [RAPIDS Accelerator for Apache Spark 22.04.0 jars.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar.asc)
* Download the [PUB_KEY](https://keys.openpgp.org/search?q=sw-spark@nvidia.com).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature: `gpg --verify rapids-4-spark_2.12-22.04.0.jar.asc rapids-4-spark_2.12-22.04.0.jar`

If the signature verifies, the output will include:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <sw-spark@nvidia.com>"

### Release Notes
New functionality and performance improvements for this release include:
* Avro reader for primitive types
* ExistenceJoin support
* ArrayExists support
* GetArrayStructFields support
* Function str_to_map support
* Function percent_rank support
* Regular expression support for function split on string
* Support function approx_percentile in reduction context
* Support function element_at with non-literal index
* Spark cuSpatial UDF

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v22.02.0
Hardware Requirements:

@@ -31,13 +92,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.x & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, 3.2.1, Cloudera CDP 7.1.6, 7.1.7, Databricks 7.3 ML LTS or 9.1 ML LTS Runtime and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, 3.2.1, Databricks 7.3 ML LTS or 9.1 ML LTS Runtime and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v22.02.0
* Download the [RAPIDS
Accelerator for Apache Spark 22.02.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.02.0/rapids-4-spark_2.12-22.02.0.jar)
@@ -94,13 +158,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.x & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, Cloudera CDP 7.1.6, 7.1.7, Databricks 7.3 ML LTS or 9.1 ML LTS Runtime and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, Databricks 7.3 ML LTS or 9.1 ML LTS Runtime and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.12.0
* Download the [RAPIDS
Accelerator for Apache Spark 21.12.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.12.0/rapids-4-spark_2.12-21.12.0.jar)
@@ -158,13 +225,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.0-11.4 & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, Cloudera CDP 7.1.6, 7.1.7, Databricks 7.3 ML LTS or 8.2 ML Runtime, GCP Dataproc 2.0, and Azure Synapse
Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0, Databricks 7.3 ML LTS or 8.2 ML Runtime, GCP Dataproc 2.0, and Azure Synapse

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.10.0
* Download the [RAPIDS
Accelerator for Apache Spark 21.10.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.10.0/rapids-4-spark_2.12-21.10.0.jar)
@@ -214,13 +284,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.0-11.4 & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, Cloudera CDP 7.1.6, 7.1.7, Databricks 7.3 ML LTS or 8.2 ML Runtime, and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, Databricks 7.3 ML LTS or 8.2 ML Runtime, and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.08.0
* Download the [RAPIDS
Accelerator for Apache Spark 21.08.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.08.0/rapids-4-spark_2.12-21.08.0.jar)
@@ -267,13 +340,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.0 or 11.2 & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, Cloudera CDP 7.1.7, Databricks 7.3 ML LTS or 8.2 ML Runtime, and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, Databricks 7.3 ML LTS or 8.2 ML Runtime, and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.06.2
* Download the [RAPIDS
Accelerator for Apache Spark 21.06.2 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.06.2/rapids-4-spark_2.12-21.06.2.jar)
@@ -307,13 +383,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.0 or 11.2 & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, Cloudera CDP 7.1.7, and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.06.1
* Download the [RAPIDS
Accelerator for Apache Spark 21.06.1 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.06.1/rapids-4-spark_2.12-21.06.1.jar)
@@ -351,13 +430,16 @@ Software Requirements:

CUDA & NVIDIA Drivers*: 11.0 or 11.2 & v450.80.02+

Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, Cloudera CDP 7.1.7, Databricks 8.2 ML Runtime, and GCP Dataproc 2.0
Apache Spark 3.0.1, 3.0.2, 3.1.1, 3.1.2, Databricks 8.2 ML Runtime, and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v21.06.0
* Download the [RAPIDS
Accelerator for Apache Spark 21.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.06.0/rapids-4-spark_2.12-21.06.0.jar)