Remove the hasNans config and update the doc #6512

Merged · 6 commits · Sep 15, 2022
13 changes: 0 additions & 13 deletions docs/compatibility.md
@@ -69,19 +69,6 @@ worked with plain Spark. Starting from 22.06 this behavior is enabled by defa…
the config
[`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled).

Additionally, some aggregations on floating point columns that contain `NaN` can produce results
different from Spark in versions prior to Spark 3.1.0. If it is known with certainty that the
floating point columns do not contain `NaN`, set
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled aggregations on
them.

In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue only shows up if
you have different `NaN` values. There are several different binary values that are all considered
to be `NaN` by floating point. The plugin treats all of these as the same value, whereas Spark
treats them all as different values. Because this is considered to be rare we do not disable
distinct count for floating point values even if
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`.

### `0.0` vs `-0.0`

Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they
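To illustrate the change above, here is a minimal PySpark sketch (assuming the RAPIDS Accelerator plugin jar is on the classpath and a GPU is available; the session and column names are illustrative). A floating point aggregation over data containing `NaN` now runs without any NaN-related configuration, and `spark.rapids.sql.explain` can be used to confirm which operators are placed on the GPU.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch only: assumes the RAPIDS Accelerator jar is already on the classpath.
spark = (SparkSession.builder
         .appName("nan-agg-sketch")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.explain", "ALL")  # print GPU placement decisions
         .getOrCreate())

# Floating point data that includes NaN values.
df = spark.createDataFrame(
    [(1, 1.0), (1, float("nan")), (2, 2.5)], ["k", "v"])

# No spark.rapids.sql.hasNans setting is needed for this aggregation any more.
df.groupBy("k").agg(F.max("v"), F.min("v")).show()
```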
1 change: 0 additions & 1 deletion docs/configs.md
@@ -109,7 +109,6 @@ Name | Description | Default Value
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true
<a name="sql.format.parquet.writer.int96.enabled"></a>spark.rapids.sql.format.parquet.writer.int96.enabled|When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96|true
<a name="sql.hasExtendedYearValues"></a>spark.rapids.sql.hasExtendedYearValues|Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don't care about getting the correct values on values with the extended range.|true
<a name="sql.hasNans"></a>spark.rapids.sql.hasNans|Config to indicate if your data has NaN's. Cudf doesn't currently support NaN's properly so you can get corrupt data if you have NaN's in your data and it runs on the GPU.|true
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false
<a name="sql.improvedFloatOps.enabled"></a>spark.rapids.sql.improvedFloatOps.enabled|For some floating point operations spark uses one way to compute the value and the underlying cudf implementation can use an improved algorithm. In some cases this can result in cudf producing an answer when spark overflows.|true
<a name="sql.improvedTimeOps.enabled"></a>spark.rapids.sql.improvedTimeOps.enabled|When set to true, some operators will avoid overflowing by converting epoch days directly to seconds without first converting to microseconds|false
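Purely as a hedged sketch (the option names come from the table above; the values are illustrative, not tuning advice), the remaining options are set like any other Spark SQL conf, and the removed `spark.rapids.sql.hasNans` key simply never needs to be supplied:

```python
# Sketch only: assumes an existing SparkSession named `spark` with the RAPIDS plugin loaded.
spark.conf.set("spark.rapids.sql.format.parquet.write.enabled", "true")
spark.conf.set("spark.rapids.sql.hashOptimizeSort.enabled", "false")
spark.conf.set("spark.rapids.sql.improvedFloatOps.enabled", "true")

# The removed spark.rapids.sql.hasNans key is no longer set anywhere.
print(spark.conf.get("spark.rapids.sql.hashOptimizeSort.enabled"))  # -> "false"
```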
2 changes: 1 addition & 1 deletion docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
@@ -39,7 +39,7 @@
" \"driverMemory\": \"4000M\",\n",
" \"driverCores\": 2,\n",
" \"executorMemory\": \"4000M\",\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.hasNans\":\"false\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
" \"conf\": {\"spark.sql.adaptive.enabled\": \"false\", \"spark.dynamicAllocation.enabled\": \"false\", \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.sql.cache.serializer\" : \"com.nvidia.spark.ParquetCachedBatchSerializer\"}\n",
"}"
]
},
39 changes: 19 additions & 20 deletions docs/demo/Databricks/Mortgage-ETL-db.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
Collaborator (author) comment: I don't know why this line was modified; all I did was remove the line containing hasNans.

"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -90,7 +90,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -264,7 +264,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -409,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -508,7 +508,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -647,7 +647,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -712,7 +712,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -785,7 +785,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -859,7 +859,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -986,7 +986,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1081,7 +1081,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1153,7 +1153,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1236,7 +1236,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1312,7 +1312,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1378,7 +1378,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1447,7 +1447,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1527,7 +1527,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1597,7 +1597,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
@@ -1647,7 +1647,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -1668,7 +1667,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"inputWidgets": {},
4 changes: 1 addition & 3 deletions docs/demo/GCP/Mortgage-ETL.ipynb
@@ -63,7 +63,6 @@
"conf.set(\"spark.executor.memory\", \"20g\")\n",
"conf.set(\"spark.sql.files.maxPartitionBytes\", \"512m\")\n",
"conf.set(\"spark.executor.resource.gpu.amount\", \"1\")\n",
"conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")\n",
"conf.set('spark.rapids.sql.batchSizeBytes', '512M')\n",
"conf.set('spark.rapids.sql.reader.batchSizeBytes', '768M')\n",
@@ -850,7 +849,6 @@
"# CPU run, set to false\n",
"# spark.conf.set('spark.rapids.sql.enabled', 'false')\n",
"spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n",
"spark.conf.set(\"spark.rapids.sql.hasNans\", \"false\")\n",
"# use GPU to read CSV\n",
"spark.conf.set(\"spark.rapids.sql.csv.read.double.enabled\", \"true\")"
]
@@ -917,4 +915,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
1 change: 0 additions & 1 deletion docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb
@@ -214,7 +214,6 @@
"source": [
"// Build the spark session and data reader as usual\n",
"val sparkSession = SparkSession.builder.appName(\"mortgage-gpu\").getOrCreate\n",
"sparkSession.conf.set(\"spark.rapids.sql.hasNans\", false)\n",
"val reader = sparkSession.read"
]
},
2 changes: 1 addition & 1 deletion docs/get-started/getting-started-gcp.md
@@ -216,7 +216,7 @@ gcloud dataproc jobs submit spark \
--region=$REGION \
--class=com.nvidia.spark.examples.mortgage.GPUMain \
--jars=gs://${GCS_BUCKET}/scala/sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.hasNans=false,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
--properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \
-- \
-dataPath=train::${DATA_PATH}/train \
-dataPath=trans::${DATA_PATH}/eval \
1 change: 0 additions & 1 deletion docs/spark-profiling-tool.md
@@ -243,7 +243,6 @@ Compare Rapids Properties which are set explicitly:
|spark.rapids.sql.concurrentGpuTasks |null |2 |
|spark.rapids.sql.enabled |false |true |
|spark.rapids.sql.explain |null |NOT_ON_GPU|
|spark.rapids.sql.hasNans |null |FALSE |
|spark.rapids.sql.incompatibleOps.enabled |null |true |
+-------------------------------------------+----------+----------+
```
8 changes: 4 additions & 4 deletions docs/supported_ops.md
@@ -16666,8 +16666,8 @@ are limited.
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
@@ -16730,8 +16730,8 @@
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td><em>PS<br/>Input must not contain NaNs and spark.rapids.sql.hasNans must be false.</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
4 changes: 0 additions & 4 deletions docs/tuning-guide.md
@@ -286,10 +286,6 @@ from the main [columnar batch size](#columnar-batch-size) setting. Some transco…
load CSV files then write Parquet files) need to lower this setting when using large task input
partition sizes to avoid GPU out of memory errors.

## Set hasNans flag to False
If your data has float values but doesn't contain NaNs set [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to
`false` to get the benefit of running on the GPU

## Metrics

Custom Spark SQL Metrics are available which can help identify performance bottlenecks in a query.
3 changes: 1 addition & 2 deletions integration_tests/src/main/python/array_test.py
@@ -332,8 +332,7 @@ def test_array_min_max_all_nulls(data_gen):
def test_array_concat_decimal(data_gen):
assert_gpu_and_cpu_are_equal_collect(
lambda spark : debug_df(unary_op_df(spark, ArrayGen(data_gen)).selectExpr(
'concat(a, a)')),
conf=no_nans_conf)
'concat(a, a)')))

@pytest.mark.parametrize('data_gen', orderable_gens + nested_gens_sample, ids=idfn)
def test_array_repeat_with_count_column(data_gen):
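For readability, here is a hedged sketch of how the updated test reads as a whole. The helpers (`assert_gpu_and_cpu_are_equal_collect`, `debug_df`, `unary_op_df`, `ArrayGen`, `DecimalGen`, `idfn`) are the ones the integration tests already use; the import paths and the parametrized generator shown here are assumptions, not the exact list used upstream.

```python
import pytest

# Sketch only: helper locations are assumed; the real test file pulls them in
# from the integration test suite's shared modules.
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import ArrayGen, DecimalGen, debug_df, idfn, unary_op_df

@pytest.mark.parametrize('data_gen', [DecimalGen(precision=7, scale=3)], ids=idfn)
def test_array_concat_decimal(data_gen):
    # The extra conf=no_nans_conf argument is gone; the default conf is used.
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: debug_df(unary_op_df(spark, ArrayGen(data_gen)).selectExpr(
            'concat(a, a)')))
```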
1 change: 0 additions & 1 deletion integration_tests/src/main/python/data_gen.py
@@ -1018,7 +1018,6 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
nested_gens_sample = array_gens_sample + struct_gens_sample_with_decimal128 + map_gens_sample + decimal_128_map_gens

ansi_enabled_conf = {'spark.sql.ansi.enabled': 'true'}
no_nans_conf = {'spark.rapids.sql.hasNans': 'false'}
legacy_interval_enabled_conf = {'spark.sql.legacy.interval.enabled': 'true'}

def copy_and_update(conf, *more_confs):