[BUG] Get different results when casting from timestamp to string #4205

Closed
HaoYang670 opened this issue Nov 24, 2021 · 6 comments · Fixed by #4274
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@HaoYang670
Collaborator

Describe the bug
When casting the timestamp "8669-07-22T04:45:57.73" to string, Spark (CPU) gives "8669-07-22 04:45:57.73", but Rapids (GPU) gives "8669-07-22 04:29:57.73". The minutes field is wrong.

Steps/Code to reproduce bug
In CastOpSuite.scala:

  def validTimestamps(session: SparkSession): DataFrame = {
    import session.sqlContext.implicits._
    val timestampStrings = Seq(
      "8669-07-22T04:45:57.73",
      "1920-12-31T11:59:59.999",
      "1969-12-31T23:59:59.999",
      "1969-12-31T23:59:59.999999",
      "1970-01-01T00:00:00.000",
      "1970-01-01T00:00:00.999",
      "1970-01-01T00:00:00.999111",
      "2020-12-31T11:59:59.990",
      "2020-12-31T11:59:59.900",
      "2020-12-31T11:59:59.000",
      "2020-12-31T11:59:50.000",
      "2020-12-31T11:59:00.000",
      "2020-12-31T11:50:00.000",
      "2020-12-31T11:00:00.000"
    )
    val timestamps = timestampStrings
      .map(s => Timestamp.valueOf(LocalDateTime.parse(s)))

    timestamps.toDF("c0")
  }

Run mvn test and get this result:

- Test all supported casts with in-range values *** FAILED ***
  Cast from TimestampType to StringType failed; ansi=false org.scalatest.exceptions.TestFailedException: 
  Running on the GPU and on the CPU did not match (relaxed float comparison)
  CPU: WrappedArray([1920-12-31 11:59:59.999], [1969-12-31 23:59:59.999], [1969-12-31 23:59:59.999999], [1970-01-01 00:00:00], [1970-01-01 00:00:00.999], [1970-01-01 00:00:00.999111], [2020-12-31 11:00:00], [2020-12-31 11:50:00], [2020-12-31 11:59:00], [2020-12-31 11:59:50], [2020-12-31 11:59:59], [2020-12-31 11:59:59.9], [2020-12-31 11:59:59.99], [8669-07-22 04:45:57.73])
  
  GPU: WrappedArray([1920-12-31 11:59:59.999], [1969-12-31 23:59:59.999], [1969-12-31 23:59:59.999999], [1970-01-01 00:00:00], [1970-01-01 00:00:00.999], [1970-01-01 00:00:00.999111], [2020-12-31 11:00:00], [2020-12-31 11:50:00], [2020-12-31 11:59:00], [2020-12-31 11:59:50], [2020-12-31 11:59:59], [2020-12-31 11:59:59.9], [2020-12-31 11:59:59.99], [8669-07-22 04:29:57.73]) (CastOpSuite.scala:311)

Expected behavior
Rapids should give the same result as Spark.

Environment details (please complete the following information)

  • Environment location: Standalone on my desktop
    Spark 3.2.0
    Rapids 22.02
  • Spark configuration settings related to the issue

Additional context
This bug impacts #4028

@HaoYang670 HaoYang670 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 24, 2021
@revans2
Collaborator

revans2 commented Nov 29, 2021

The output appears to switch from correct to wrong between "6053-01-23T02:07:59.99" and "6053-01-23T02:08:00.00". My guess is that there is an overflow happening somewhere in the calculation of the minutes.
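The switch-over point can be checked with plain java.time arithmetic. The sketch below assumes the timestamps are interpreted as UTC (an assumption, not stated in the issue) and uses hypothetical names (OverflowBoundary, minutesSinceEpoch). Under that assumption, the first bad timestamp is exactly 2^31 minutes after the epoch, one past the largest signed 32-bit value, which fits the overflow guess without confirming where in the code it happens.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object OverflowBoundary {
  // Assumption: timestamps are interpreted as UTC.
  // Whole minutes elapsed since 1970-01-01T00:00:00, rounded toward negative infinity.
  def minutesSinceEpoch(ts: LocalDateTime): Long =
    Math.floorDiv(ts.toEpochSecond(ZoneOffset.UTC), 60L)

  // Last timestamp reported as correct and first one reported as wrong.
  val lastGood: Long = minutesSinceEpoch(LocalDateTime.parse("6053-01-23T02:07:59.99"))
  val firstBad: Long = minutesSinceEpoch(LocalDateTime.parse("6053-01-23T02:08:00"))

  def main(args: Array[String]): Unit = {
    println(s"lastGood = $lastGood, Int.MaxValue = ${Int.MaxValue}")
    println(s"firstBad = $firstBad, 2^31         = ${1L << 31}")
  }
}
```
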

@revans2
Collaborator

revans2 commented Nov 29, 2021

This appears to be a bug in CUDF.

scala> spark.range(128849018879999995L, 128849018880000003L).selectExpr("*", "timestamp_micros(id) as ts").repartition(1).selectExpr("*", "minute(ts)").show(truncate = false)

+------------------+--------------------------+----------+
|id                |ts                        |minute(ts)|
+------------------+--------------------------+----------+
|128849018879999995|6053-01-23 02:07:59.999995|7         |
|128849018879999996|6053-01-23 02:07:59.999996|7         |
|128849018879999997|6053-01-23 02:07:59.999997|7         |
|128849018879999998|6053-01-23 02:07:59.999998|7         |
|128849018879999999|6053-01-23 02:07:59.999999|7         |
|128849018880000000|6053-01-23 02:52:00       |8         |
|128849018880000001|6053-01-23 02:52:00.000001|8         |
|128849018880000002|6053-01-23 02:52:00.000002|8         |
+------------------+--------------------------+----------+


scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.range(128849018879999995L, 128849018880000003L).selectExpr("*", "timestamp_micros(id) as ts").repartition(1).selectExpr("*", "minute(ts)").show(truncate = false)
+------------------+--------------------------+----------+
|id                |ts                        |minute(ts)|
+------------------+--------------------------+----------+
|128849018879999995|6053-01-23 02:07:59.999995|7         |
|128849018879999996|6053-01-23 02:07:59.999996|7         |
|128849018879999997|6053-01-23 02:07:59.999997|7         |
|128849018879999998|6053-01-23 02:07:59.999998|7         |
|128849018879999999|6053-01-23 02:07:59.999999|7         |
|128849018880000000|6053-01-23 02:08:00       |8         |
|128849018880000001|6053-01-23 02:08:00.000001|8         |
|128849018880000002|6053-01-23 02:08:00.000002|8         |
+------------------+--------------------------+----------+

minute works just fine on the GPU; only the string form of the timestamp is off (02:52:00 instead of 02:08:00).

I'll file an issue against CUDF

@revans2 revans2 added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Nov 29, 2021
@revans2
Collaborator

revans2 commented Nov 29, 2021

I filed rapidsai/cudf#9790 for this.

@revans2 revans2 self-assigned this Nov 29, 2021
@revans2 revans2 added this to the Nov 30 - Dec 10 milestone Nov 29, 2021
@revans2
Collaborator

revans2 commented Nov 29, 2021

I targeted the fix to 22.02 instead of a hotfix for 21.12 because the bug only shows up for dates in 6053 or later, and I don't think it warrants a hotfix.

@HaoYang670
Collaborator Author

CPU OUTPUT: [Row(a='6233-08-04 19:30:55.701'), Row(a='8220-02-25 10:01:15.106'), Row(a='9754-01-21 16:53:02.137'), Row(a='7649-11-16 15:56:04.996'), Row(a='7027-04-09 15:08:52.627')]
GPU OUTPUT: [Row(a='6233-08-04 19:14:55.701'), Row(a='8220-02-25 10:45:15.106'), Row(a='9754-01-21 16:37:02.137'), Row(a='7649-11-16 15:40:04.996'), Row(a='7027-04-09 15:52:52.627')]

Here are more results I got. The minutes in the GPU output always equal (minutes in CPU output - 16) mod 60, or equivalently CPU minutes = (GPU minutes + 16) mod 60. The difference is a constant, which might be caused by an overflow.
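A constant shift of 16 minutes is what a wrapped 32-bit minute counter would produce: wrapping subtracts 2^32 minutes, and 2^32 mod 60 = 16. The sketch below is a hypothetical model, not the cuDF code. It assumes UTC timestamps, truncates the minutes-since-epoch count to a signed 32-bit Int, and takes the minute of the hour from the truncated value. MinuteWrapModel and wrappedMinute are illustrative names. The model reproduces the GPU minutes reported in this issue, but it does not explain why the hour field stays correct, so the real code path may differ.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object MinuteWrapModel {
  // Hypothetical model (not the cuDF implementation): count whole minutes since
  // the epoch (UTC assumed), keep only the low 32 bits, then take minute-of-hour.
  def wrappedMinute(ts: LocalDateTime): Int = {
    val minutes: Long = Math.floorDiv(ts.toEpochSecond(ZoneOffset.UTC), 60L)
    Math.floorMod(minutes.toInt, 60) // .toInt discards everything above bit 31
  }

  // One wrap subtracts 2^32 minutes; this remainder is the resulting minute shift.
  val shift: Long = (1L << 32) % 60

  // (timestamp as shown by the CPU, minute field shown by the GPU), taken from this issue.
  val reported: Seq[(String, Int)] = Seq(
    "8669-07-22T04:45:57.73"  -> 29,
    "6233-08-04T19:30:55.701" -> 14,
    "8220-02-25T10:01:15.106" -> 45,
    "9754-01-21T16:53:02.137" -> 37,
    "7649-11-16T15:56:04.996" -> 40,
    "7027-04-09T15:08:52.627" -> 52
  )

  def main(args: Array[String]): Unit = {
    println(s"2^32 mod 60 = $shift")
    reported.foreach { case (s, gpuMinute) =>
      println(s"$s -> model ${wrappedMinute(LocalDateTime.parse(s))}, reported $gpuMinute")
    }
  }
}
```

Timestamps before the 2^31-minute boundary, such as those in 2020, pass through the model unchanged because the count still fits in an Int.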

@revans2
Collaborator

revans2 commented Nov 30, 2021

Yes, I have a patch up against CUDF for this: rapidsai/cudf#9793
