
[BUG] test_cast_float_to_timestamp_ansi_for_nan_inf failed in spark330 #5719

Closed
pxLi opened this issue Jun 2, 2022 · 3 comments · Fixed by #5731
Assignees
Labels
bug Something isn't working

Comments


pxLi commented Jun 2, 2022

Describe the bug
spark330 shim run with Spark 3.3.1-SNAPSHOT (because the Spark 3.3.0 release is not out yet)

10:48:15  =========================== short test summary info ============================
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[inf-DoubleType()]
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[inf-FloatType()]
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[-inf-DoubleType()]
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[-inf-FloatType()]
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[nan-DoubleType()]
10:48:15  FAILED ../../src/main/python/cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf[nan-FloatType()]

The error-message assertion failed:

10:48:15  =================================== FAILURES ===================================
10:48:15  _______ test_cast_float_to_timestamp_ansi_for_nan_inf[inf-DoubleType()] ________
10:48:15  
10:48:15  type = DoubleType(), invalid_value = inf
10:48:15  
10:48:15      @pytest.mark.skipif(is_before_spark_330(), reason="ansi cast throws exception only in 3.3.0+")
10:48:15      @pytest.mark.parametrize('type', [DoubleType(), FloatType()], ids=idfn)
10:48:15      @pytest.mark.parametrize('invalid_value', [float("inf"), float("-inf"), float("nan")])
10:48:15      def test_cast_float_to_timestamp_ansi_for_nan_inf(type, invalid_value):
10:48:15          def fun(spark):
10:48:15              data = [invalid_value]
10:48:15              df = spark.createDataFrame(data, type)
10:48:15              return df.select(f.col('value').cast(TimestampType())).collect()
10:48:15  >       assert_gpu_and_cpu_error(fun, {"spark.sql.ansi.enabled": True}, "java.time.DateTimeException")
10:48:15  
10:48:15  ../../src/main/python/cast_test.py:375: 
10:48:15  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
10:48:15  ../../src/main/python/asserts.py:572: in assert_gpu_and_cpu_error
10:48:15      assert_py4j_exception(lambda: with_cpu_session(df_fun, conf), error_message)
10:48:15  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
10:48:15  
10:48:15  func = <function assert_gpu_and_cpu_error.<locals>.<lambda> at 0x7f0757b850d0>
10:48:15  error_message = 'java.time.DateTimeException'
10:48:15  
10:48:15      def assert_py4j_exception(func, error_message):
10:48:15          """
10:48:15          Assert that a specific Java exception is thrown
10:48:15          :param func: a function to be verified
10:48:15          :param error_message: a string such as the one produce by java.lang.Exception.toString
10:48:15          :return: Assertion failure if no exception matching error_message has occurred.
10:48:15          """
10:48:15          with pytest.raises(Py4JJavaError) as py4jError:
10:48:15              func()
10:48:15  >       assert error_message in str(py4jError.value.java_exception)
10:48:15  E       AssertionError
10:48:15  
10:48:15  ../../src/main/python/asserts.py:561: AssertionError
(The remaining five parameterizations, inf-FloatType(), -inf-DoubleType(), -inf-FloatType(), nan-DoubleType(), and nan-FloatType(), fail with identical tracebacks: the CPU run raises an exception, but 'java.time.DateTimeException' is not found in it.)
pxLi added the bug and ? - Needs Triage labels on Jun 2, 2022
pxLi changed the title from [BUG] test_cast_float_to_timestamp_ansi_for_nan_inf failed to [BUG] test_cast_float_to_timestamp_ansi_for_nan_inf failed in spark330 on Jun 2, 2022
nartal1 self-assigned this on Jun 2, 2022
@razajafri

This is the same as previous failures that we have fixed for Spark 3.3.0+.

Spark is moving away from standard Java exceptions to its own exceptions. This failure is happening because java.time.DateTimeException is now replaced by SparkDateTimeException. The tricky part is that SparkDateTimeException is a private class which takes an errorClass with a predefined error message. This message, and the other predefined ones, don't satisfy our requirement of having a general error message for a value in a column, as opposed to a message tailored to a single value. For example:

Spark can produce an error message along the lines of:
The value Infinity of the type "DOUBLE" cannot be cast to "TIMESTAMP" because it is malformed. Correct the value as per the syntax, or change its target type. To return NULL instead, use try_cast. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.

whereas we have one generic message for any invalid value in the column:
The column contains at least a single value that is NaN, Infinity or out-of-range values. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.

To match the above error we would have to know whether the column holds a positive or a negative Inf value, which will cost us an extra iteration over the entire column.
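The extra pass can be sketched in plain Python (a minimal illustration of the cost being discussed, not the plugin's actual GPU-side implementation; the function name is hypothetical):

```python
import math

# Sketch of the extra pass: to mirror Spark's tailored message, we would
# first have to scan the column to find which invalid value (NaN,
# Infinity, or -Infinity) it actually contains.
def classify_invalid(values):
    for v in values:
        if math.isnan(v):
            return "NaN"
        if math.isinf(v):
            return "Infinity" if v > 0 else "-Infinity"
    return None  # no invalid float found

print(classify_invalid([1.5, float("-inf")]))  # -Infinity
```

On the GPU this whole scan would be an additional kernel launch over the column, which is the overhead in question.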

Is matching the exception type really important to us as a product when it costs us this overhead?
@jlowe @revans2


nartal1 commented Jun 3, 2022

Thanks @razajafri for the detailed explanation of the issue. Previous failures were easy to fix because those exceptions were not private and did not take an errorClass, so we could print a custom/generic message along with the right exception type.

The test is failing after this change was backported: https://github.com/apache/spark/pull/36591/files, so it will be in the Spark 3.3 release. There, invalidInputInCastToDatetimeError is called where cannotCastToDateTimeError was called earlier.

I tried throwing SparkDateTimeException from TrampolineUtil, but it would still need the exact data type and value to print the error message.
In the short term we could XFAIL this test if that's okay (since the difference is only in the type of exception thrown).
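A conditional XFAIL could look like this (a minimal sketch; the is_before_spark_330 stub below is a hypothetical stand-in for the suite's real version-check helper, and the test body is a placeholder):

```python
import pytest

# Hypothetical stand-in for the suite's is_before_spark_330() helper
# (the real one inspects the active Spark version).
def is_before_spark_330():
    return False

# Short-term option: expect failure on 3.3.0+, where the CPU now throws
# SparkDateTimeException instead of java.time.DateTimeException.
@pytest.mark.xfail(condition=not is_before_spark_330(),
                   reason='Spark 3.3.0+ throws SparkDateTimeException (#5719)')
def test_cast_float_to_timestamp_ansi_for_nan_inf():
    ...  # placeholder for the real GPU/CPU error comparison
```

This keeps the test visible in reports as an expected failure instead of silently skipping it.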

Adding @tgravescs @gerashegalov if they have any thoughts on this.


nartal1 commented Jun 3, 2022

With the above PR, the exception is close to Spark's error message. Note that I am not adding an extra check to identify whether it's a NaN or an Inf.

org.apache.spark.SparkDateTimeException: The value Nan/Infinity of the type DOUBLE cannot be cast to TIMESTAMP because it is malformed. Correct the value as per the syntax, or change its target type. To return NULL instead, use `try_cast`. If necessary set spark.sql.ansi.enabled to "false" to bypass this error.
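Since assert_py4j_exception only does a substring check on the Java exception's string form (as shown in the traceback above), a message like this can satisfy the test once the expected string is updated. A sketch, with a hypothetical choice of fragment:

```python
# The suite's check is `error_message in str(java_exception)`, so matching
# on the new exception class name is enough once the plugin raises it.
new_exception_str = (
    'org.apache.spark.SparkDateTimeException: The value Nan/Infinity of the '
    'type DOUBLE cannot be cast to TIMESTAMP because it is malformed.')
expected = 'org.apache.spark.SparkDateTimeException'
assert expected in new_exception_str
print('substring match OK')
```

Matching on the class name rather than the full message keeps the test robust to minor wording changes between Spark versions.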

sameerz removed the ? - Needs Triage label on Jun 3, 2022