Remove the hasNans config and update the doc #6512

Merged · 6 commits · Sep 15, 2022
13 changes: 0 additions & 13 deletions docs/compatibility.md
@@ -69,19 +69,6 @@ worked with plain Spark. Starting from 22.06 this behavior is enabled by defa…
the config
[`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled).

Additionally, some aggregations on floating point columns that contain `NaN` can produce results
different from Spark in versions prior to Spark 3.1.0. If it is known with certainty that the
floating point columns do not contain `NaN`, set
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled aggregations on
them.

In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue only shows up if
you have different `NaN` values. There are several different binary values that are all considered
to be `NaN` by floating point. The plugin treats all of these as the same value, whereas Spark
treats them all as different values. Because this is considered to be rare we do not disable
distinct count for floating point values even if
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`.

### `0.0` vs `-0.0`

Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they
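To illustrate the change above, here is a minimal PySpark sketch (assuming the RAPIDS Accelerator plugin jar is on the classpath and a GPU is available; the session and column names are illustrative). A floating point aggregation over data containing `NaN` now runs without any NaN-related configuration, and `spark.rapids.sql.explain` can be used to confirm which operators are placed on the GPU.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch only: assumes the RAPIDS Accelerator jar is already on the classpath.
spark = (SparkSession.builder
         .appName("nan-agg-sketch")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.explain", "ALL")  # print GPU placement decisions
         .getOrCreate())

# Floating point data that includes NaN values.
df = spark.createDataFrame(
    [(1, 1.0), (1, float("nan")), (2, 2.5)], ["k", "v"])

# No spark.rapids.sql.hasNans setting is needed for this aggregation any more.
df.groupBy("k").agg(F.max("v"), F.min("v")).show()
```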
1 change: 0 additions & 1 deletion docs/configs.md
@@ -109,7 +109,6 @@ Name | Description | Default Value
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true
<a name="sql.format.parquet.writer.int96.enabled"></a>spark.rapids.sql.format.parquet.writer.int96.enabled|When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96|true
<a name="sql.hasExtendedYearValues"></a>spark.rapids.sql.hasExtendedYearValues|Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don't care about getting the correct values on values with the extended range.|true
<a name="sql.hasNans"></a>spark.rapids.sql.hasNans|Config to indicate if your data has NaN's. Cudf doesn't currently support NaN's properly so you can get corrupt data if you have NaN's in your data and it runs on the GPU.|true
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false
<a name="sql.improvedFloatOps.enabled"></a>spark.rapids.sql.improvedFloatOps.enabled|For some floating point operations spark uses one way to compute the value and the underlying cudf implementation can use an improved algorithm. In some cases this can result in cudf producing an answer when spark overflows.|true
<a name="sql.improvedTimeOps.enabled"></a>spark.rapids.sql.improvedTimeOps.enabled|When set to true, some operators will avoid overflowing by converting epoch days directly to seconds without first converting to microseconds|false
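Purely as a hedged sketch (the option names come from the table above; the values are illustrative, not tuning advice), the remaining options are set like any other Spark SQL conf, and the removed `spark.rapids.sql.hasNans` key simply never needs to be supplied:

```python
# Sketch only: assumes an existing SparkSession named `spark` with the RAPIDS plugin loaded.
spark.conf.set("spark.rapids.sql.format.parquet.write.enabled", "true")
spark.conf.set("spark.rapids.sql.hashOptimizeSort.enabled", "false")
spark.conf.set("spark.rapids.sql.improvedFloatOps.enabled", "true")

# The removed spark.rapids.sql.hasNans key is no longer set anywhere.
print(spark.conf.get("spark.rapids.sql.hashOptimizeSort.enabled"))  # -> "false"
```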
2 changes: 1 addition & 1 deletion docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
@@ -39,7 +39,7 @@
" \"driverMemory\": \"4000M\",\n",
" \"driverCores\": 2,\n",
" \"executorMemory\": \"4000M\",\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.hasNans\":\"false\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
"}"
]
},
39 changes: 19 additions & 20 deletions docs/demo/Databricks/Mortgage-ETL-db.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
Collaborator (author) comment: I don't know why this line was modified; all I did was remove the line containing hasNans.

"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -90,7 +90,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -264,7 +264,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -409,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -508,7 +508,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -647,7 +647,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -712,7 +712,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -785,7 +785,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -859,7 +859,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -986,7 +986,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1081,7 +1081,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1153,7 +1153,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1236,7 +1236,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1312,7 +1312,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1378,7 +1378,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1447,7 +1447,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1527,7 +1527,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1597,7 +1597,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1647,7 +1647,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -1668,7 +1667,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
4 changes: 1 addition & 3 deletions docs/demo/GCP/Mortgage-ETL.ipynb
@@ -63,7 +63,6 @@
"conf.set(\"spark.executor.memory\", \"20g\")\n",
"conf.set(\"spark.sql.files.maxPartitionBytes\", \"512m\")\n",
"conf.set(\"spark.executor.resource.gpu.amount\", \"1\")\n",
"conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")\n",
"conf.set('spark.rapids.sql.batchSizeBytes', '512M')\n",
"conf.set('spark.rapids.sql.reader.batchSizeBytes', '768M')\n",
@@ -850,7 +849,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -917,4 +915,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
1 change: 0 additions & 1 deletion docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb
@@ -214,7 +214,6 @@
"source": [
"// Build the spark session and data reader as usual\n",
"val sparkSession = SparkSession.builder.appName(\"mortgage-gpu\").getOrCreate\n",
"sparkSession.conf.set(\"spark.rapids.sql.hasNans\", false)\n",
"val reader = sparkSession.read"
]
},
2 changes: 1 addition & 1 deletion docs/get-started/getting-started-gcp.md
@@ -216,7 +216,7 @@ gcloud dataproc jobs submit spark \
--region=$REGION \
--class=com.nvidia.spark.examples.mortgage.GPUMain \
--jars=gs://${GCS_BUCKET}/scala/sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.hasNans=false,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
-- \
-dataPath=train::${DATA_PATH}/train \
-dataPath=trans::${DATA_PATH}/eval \
1 change: 0 additions & 1 deletion docs/spark-profiling-tool.md
@@ -243,7 +243,6 @@ Compare Rapids Properties which are set explicitly:
|spark.rapids.sql.concurrentGpuTasks |null |2 |
|spark.rapids.sql.enabled |false |true |
|spark.rapids.sql.explain |null |NOT_ON_GPU|
|spark.rapids.sql.hasNans |null |FALSE |
|spark.rapids.sql.incompatibleOps.enabled |null |true |
+-------------------------------------------+----------+----------+
```
8 changes: 4 additions & 4 deletions docs/supported_ops.md
@@ -16666,8 +16666,8 @@ are limited.
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
@@ -16730,8 +16730,8 @@
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
4 changes: 0 additions & 4 deletions docs/tuning-guide.md
@@ -286,10 +286,6 @@ from the main [columnar batch size](#columnar-batch-size) setting. Some transco…
load CSV files then write Parquet files) need to lower this setting when using large task input
partition sizes to avoid GPU out of memory errors.

## Set hasNans flag to False
If your data has float values but doesn't contain NaNs set [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to
`false` to get the benefit of running on the GPU

## Metrics

Custom Spark SQL Metrics are available which can help identify performance bottlenecks in a query.
3 changes: 1 addition & 2 deletions integration_tests/src/main/python/array_test.py
@@ -332,8 +332,7 @@ def test_array_min_max_all_nulls(data_gen):
def test_array_concat_decimal(data_gen):
assert_gpu_and_cpu_are_equal_collect(
lambda spark : debug_df(unary_op_df(spark, ArrayGen(data_gen)).selectExpr(
'concat(a, a)')),
conf=no_nans_conf)
'concat(a, a)')))

@pytest.mark.parametrize('data_gen', orderable_gens + nested_gens_sample, ids=idfn)
def test_array_repeat_with_count_column(data_gen):
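For readability, here is a hedged sketch of how the updated test reads as a whole. The helpers (`assert_gpu_and_cpu_are_equal_collect`, `debug_df`, `unary_op_df`, `ArrayGen`, `DecimalGen`, `idfn`) are the ones the integration tests already use; the import paths and the parametrized generator shown here are assumptions, not the exact list used upstream.

```python
import pytest

# Sketch only: helper locations are assumed; the real test file pulls them in
# from the integration test suite's shared modules.
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import ArrayGen, DecimalGen, debug_df, idfn, unary_op_df

@pytest.mark.parametrize('data_gen', [DecimalGen(precision=7, scale=3)], ids=idfn)
def test_array_concat_decimal(data_gen):
    # The extra conf=no_nans_conf argument is gone; the default conf is used.
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: debug_df(unary_op_df(spark, ArrayGen(data_gen)).selectExpr(
            'concat(a, a)')))
```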
1 change: 0 additions & 1 deletion integration_tests/src/main/python/data_gen.py
@@ -1018,7 +1018,6 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
nested_gens_sample = array_gens_sample + struct_gens_sample_with_decimal128 + map_gens_sample + decimal_128_map_gens

ansi_enabled_conf = {'spark.sql.ansi.enabled': 'true'}
no_nans_conf = {'spark.rapids.sql.hasNans': 'false'}
legacy_interval_enabled_conf = {'spark.sql.legacy.interval.enabled': 'true'}

def copy_and_update(conf, *more_confs):