Support RaiseError [databricks] #5540
Conversation
// Take the first one as the error message
val msg = input.copyToHost().getUTF8String(0).toString
Is it possible to copy only the first row to get the error message, instead of copying the whole column vector?
This is on an error case; the entire job is going to fail, so I am not too concerned with failing faster than the CPU. Yes, it would be nice to not copy everything. You can do that with getScalarElement, which should keep the code small and clean:
withResource(input.getScalarElement(0)) { scalarMsg =>
if (!scalarMsg.isValid()) {
throw new RuntimeException()
} else {
throw new RuntimeException(scalarMsg.getJavaString())
}
}
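The `withResource` wrapper in the suggestion above guarantees the scalar is closed even though every path throws. A rough Python analogue of that close-on-both-paths behavior, using `contextlib.closing`; `FakeScalar` is a made-up stand-in for a cudf `Scalar`, not the real API:

```python
import contextlib

class FakeScalar:
    """Illustrative stand-in for a cudf Scalar; None models a null value."""
    def __init__(self, value):
        self.value = value
        self.closed = False
    def is_valid(self):
        return self.value is not None
    def close(self):
        self.closed = True

def raise_from_scalar(scalar):
    # contextlib.closing calls scalar.close() on exit, even when we raise,
    # mirroring what withResource does for the Scala Scalar.
    with contextlib.closing(scalar):
        if not scalar.is_valid():
            raise RuntimeError()          # null message: raise with no text
        raise RuntimeError(scalar.value)  # otherwise propagate the message
```

The point of the pattern is that resource cleanup does not depend on which branch throws.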
Good suggestion. Done.
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
build
override def hasSideEffects: Boolean = true

override protected def doColumnar(input: GpuColumnVector): ColumnVector = {
  if (input == null || input.getRowCount <= 0) {
input should never be null. If it is, that is an internal error. I am okay with what you are doing, but it would be good to know that something unexpected happened.
If input has no rows in it, then I don't want to throw an exception; just return an empty ColumnVector. This I can see actually happening if you have an IF/ELSE to check for error cases. I don't know if there are any corner cases where nothing matched and we got an empty ColumnVector, but I can see it happening.
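The behavior settled on here can be modeled in plain Python. This is only a sketch of the agreed semantics; `raise_error_semantics` and its list-of-rows input are illustrative, not the plugin's Scala/GPU code:

```python
def raise_error_semantics(rows):
    """Model of raise_error over a string column, given as a Python list."""
    if len(rows) == 0:
        # Empty input (e.g. no row matched an IF/ELSE branch):
        # return an empty result instead of raising.
        return []
    # Non-empty input: the first row becomes the error message.
    raise RuntimeError(rows[0])
```

So an empty batch flows through quietly, and any non-empty batch fails the job with the first row as the message.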
Yeah, I just changed this and added the related tests. Thx
// Take the first one as the error message
val msg = input.copyToHost().getUTF8String(0).toString
This leaks the host column vector, and I don't know if it is going to do the right thing if that first string is null. We should have an explicit test for when the first element is null; I think we will get an assertion error if assertions are turned on.
build
override protected def doColumnar(input: GpuColumnVector): ColumnVector = {
  if (input.getRowCount <= 0) {
    // For the case: when(condition, raise_error())
It would be nice to cover this specific case of raise_error() in the python tests; it doesn't seem like we are.
Actually, according to Spark, I don't think this raise_error() (no args) is possible:
pyspark.sql.utils.AnalysisException: Invalid number of arguments for function raise_error. Expected: 1; Found: 0; line 1 pos 7
This is possible and should be tested, e.g.:
>>> import pyspark.sql.functions as f
>>> df = spark.range(0)
>>> df.count()
0
>>> df.select(f.raise_error(f.col("id"))).explain()
== Physical Plan ==
*(1) Project [raise_error(cast(id#12L as string), NullType) AS raise_error(id)#20]
+- *(1) Range (0, 0, step=1, splits=12)
>>> df.select(f.raise_error(f.col("id"))).collect()
[]
Yeah, raise_error needs to accept the parameter and I just updated the comment.
> This is possible and should be tested, e.g.: ...
Done
LGTM
)
),
expr[RaiseError](
  "throw exception",
Nit: All other descriptions start with a capital letter and are a bit more descriptive, as seen in the generated configs.md docs.
- "throw exception",
+ "Throws an exception",
Thx, Done
lambda spark : unary_op_df(spark, short_gen, num_slices=2).select(
    f.raise_error(f.col('a'))).collect(),
conf={},
error_message="java.lang.RuntimeException")
The test should verify that we are properly conveying the specified error message into the exception, rather than just checking for the same exception type, and it should check both the null first element and non-null first element scenarios.
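The kind of assertion requested here can be sketched in plain Python. `run_query` is a hypothetical stand-in for collecting the raise_error query; the real tests would go through the repo's Python test harness instead:

```python
def run_query(msg):
    # Stand-in for df.select(f.raise_error(...)).collect(), which fails
    # the job with the given message.
    raise RuntimeError(msg)

def assert_error_message(expected, fn):
    """Assert fn raises and that the message text is actually conveyed."""
    try:
        fn()
    except RuntimeError as e:
        assert expected in str(e), f"expected {expected!r} in {str(e)!r}"
        return
    raise AssertionError("no exception was raised")

assert_error_message("custom error", lambda: run_query("custom error"))
```

Matching on the message text, not just the exception type, is what catches a GPU implementation that raises the right exception with the wrong (or dropped) message.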
Done
build
build |
This PR adds GpuRaiseError to replace the RaiseError expression. It fixes #5507.