Remove the hasNans config and update the doc (#6512)
* remove the hasnan conf

Signed-off-by: remzi <13716567376yh@gmail.com>

* update docs

Signed-off-by: remzi <13716567376yh@gmail.com>

* update compatibility

Signed-off-by: remzi <13716567376yh@gmail.com>

* remove hasNan from spark2 and notebook

Signed-off-by: remzi <13716567376yh@gmail.com>

* remove the hasNan config from IT

Signed-off-by: remzi <13716567376yh@gmail.com>

Signed-off-by: remzi <13716567376yh@gmail.com>
HaoYang670 authored Sep 15, 2022
1 parent 3643749 commit 11da318
Showing 24 changed files with 75 additions and 203 deletions.
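
Job configurations that still set `spark.rapids.sql.hasNans` can drop the key, since this commit removes it from the docs, notebooks, and integration tests. The following is only a minimal sketch of that cleanup for a plain Python dict of Spark properties; the helper name `strip_removed_confs` and the `REMOVED_KEYS` set are hypothetical and are not part of the plugin or of this commit.

```python
# Hypothetical helper, not part of the RAPIDS plugin: drops config keys
# that this commit removes from a dict of Spark properties.
REMOVED_KEYS = {"spark.rapids.sql.hasNans"}


def strip_removed_confs(conf):
    """Return a copy of conf without the removed keys; the input is not modified."""
    return {key: value for key, value in conf.items() if key not in REMOVED_KEYS}


# Example properties taken from the notebook configs touched by this commit.
old_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.hasNans": "false",
    "spark.rapids.sql.batchSizeBytes": "512M",
}
new_conf = strip_removed_confs(old_conf)
print(sorted(new_conf))
```

The helper returns a new dict rather than mutating its argument, so the original properties stay available for comparison.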
13 changes: 0 additions & 13 deletions docs/compatibility.md
@@ -69,19 +69,6 @@ worked with plain Spark. Starting from 22.06 this is behavior is enabled by defa
the config
[`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled).

Additionally, some aggregations on floating point columns that contain `NaN` can produce results
different from Spark in versions prior to Spark 3.1.0. If it is known with certainty that the
floating point columns do not contain `NaN`, set
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled aggregations on
them.

In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue only shows up if
you have different `NaN` values. There are several different binary values that are all considered
to be `NaN` by floating point. The plugin treats all of these as the same value, where as Spark
treats them all as different values. Because this is considered to be rare we do not disable
distinct count for floating point values even if
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`.

### `0.0` vs `-0.0`

Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they
1 change: 0 additions & 1 deletion docs/configs.md
@@ -109,7 +109,6 @@ Name | Description | Default Value
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true
<a name="sql.format.parquet.writer.int96.enabled"></a>spark.rapids.sql.format.parquet.writer.int96.enabled|When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96|true
<a name="sql.hasExtendedYearValues"></a>spark.rapids.sql.hasExtendedYearValues|Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don't care about getting the correct values on values with the extended range.|true
<a name="sql.hasNans"></a>spark.rapids.sql.hasNans|Config to indicate if your data has NaN's. Cudf doesn't currently support NaN's properly so you can get corrupt data if you have NaN's in your data and it runs on the GPU.|true
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false
<a name="sql.improvedFloatOps.enabled"></a>spark.rapids.sql.improvedFloatOps.enabled|For some floating point operations spark uses one way to compute the value and the underlying cudf implementation can use an improved algorithm. In some cases this can result in cudf producing an answer when spark overflows.|true
<a name="sql.improvedTimeOps.enabled"></a>spark.rapids.sql.improvedTimeOps.enabled|When set to true, some operators will avoid overflowing by converting epoch days directly to seconds without first converting to microseconds|false
2 changes: 1 addition & 1 deletion docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
@@ -39,7 +39,7 @@
" \"driverMemory\": \"4000M\",\n",
" \"driverCores\": 2,\n",
" \"executorMemory\": \"4000M\",\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.hasNans\":\"false\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
"}"
]
},
39 changes: 19 additions & 20 deletions docs/demo/Databricks/Mortgage-ETL-db.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -90,7 +90,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -264,7 +264,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -409,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -508,7 +508,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -647,7 +647,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -712,7 +712,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -785,7 +785,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -859,7 +859,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -986,7 +986,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1081,7 +1081,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1153,7 +1153,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1236,7 +1236,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1312,7 +1312,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1378,7 +1378,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1447,7 +1447,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1527,7 +1527,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1597,7 +1597,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1647,7 +1647,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -1668,7 +1667,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
4 changes: 1 addition & 3 deletions docs/demo/GCP/Mortgage-ETL.ipynb
@@ -63,7 +63,6 @@
"conf.set(\"spark.executor.memory\", \"20g\")\n",
"conf.set(\"spark.sql.files.maxPartitionBytes\", \"512m\")\n",
"conf.set(\"spark.executor.resource.gpu.amount\", \"1\")\n",
"conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")\n",
"conf.set('spark.rapids.sql.batchSizeBytes', '512M')\n",
"conf.set('spark.rapids.sql.reader.batchSizeBytes', '768M')\n",
@@ -850,7 +849,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -917,4 +915,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
1 change: 0 additions & 1 deletion docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb
@@ -214,7 +214,6 @@
"source": [
"// Build the spark session and data reader as usual\n",
"val sparkSession = SparkSession.builder.appName(\"mortgage-gpu\").getOrCreate\n",
"sparkSession.conf.set(\"spark.rapids.sql.hasNans\", false)\n",
"val reader = sparkSession.read"
]
},
2 changes: 1 addition & 1 deletion docs/get-started/getting-started-gcp.md
@@ -216,7 +216,7 @@ gcloud dataproc jobs submit spark \
--region=$REGION \
--class=com.nvidia.spark.examples.mortgage.GPUMain \
--jars=gs://${GCS_BUCKET}/scala/sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.hasNans=false,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
-- \
-dataPath=train::${DATA_PATH}/train \
-dataPath=trans::${DATA_PATH}/eval \
1 change: 0 additions & 1 deletion docs/spark-profiling-tool.md
@@ -243,7 +243,6 @@ Compare Rapids Properties which are set explicitly:
|spark.rapids.sql.concurrentGpuTasks |null |2 |
|spark.rapids.sql.enabled |false |true |
|spark.rapids.sql.explain |null |NOT_ON_GPU|
|spark.rapids.sql.hasNans |null |FALSE |
|spark.rapids.sql.incompatibleOps.enabled |null |true |
+-------------------------------------------+----------+----------+
```
8 changes: 4 additions & 4 deletions docs/supported_ops.md
@@ -16666,8 +16666,8 @@ are limited.
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
@@ -16730,8 +16730,8 @@ are limited.
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
4 changes: 0 additions & 4 deletions docs/tuning-guide.md
@@ -286,10 +286,6 @@ from the main [columnar batch size](#columnar-batch-size) setting. Some transco
load CSV files then write Parquet files) need to lower this setting when using large task input
partition sizes to avoid GPU out of memory errors.

## Set hasNans flag to False
If your data has float values but doesn't contain NaNs set [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to
`false` to get the benefit of running on the GPU

## Metrics

Custom Spark SQL Metrics are available which can help identify performance bottlenecks in a query.
3 changes: 1 addition & 2 deletions integration_tests/src/main/python/array_test.py
@@ -332,8 +332,7 @@ def test_array_min_max_all_nulls(data_gen):
def test_array_concat_decimal(data_gen):
assert_gpu_and_cpu_are_equal_collect(
lambda spark : debug_df(unary_op_df(spark, ArrayGen(data_gen)).selectExpr(
'concat(a, a)')),
conf=no_nans_conf)
'concat(a, a)')))

@pytest.mark.parametrize('data_gen', orderable_gens + nested_gens_sample, ids=idfn)
def test_array_repeat_with_count_column(data_gen):
1 change: 0 additions & 1 deletion integration_tests/src/main/python/data_gen.py
@@ -1018,7 +1018,6 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
nested_gens_sample = array_gens_sample + struct_gens_sample_with_decimal128 + map_gens_sample + decimal_128_map_gens

ansi_enabled_conf = {'spark.sql.ansi.enabled': 'true'}
no_nans_conf = {'spark.rapids.sql.hasNans': 'false'}
legacy_interval_enabled_conf = {'spark.sql.legacy.interval.enabled': 'true'}

def copy_and_update(conf, *more_confs):