Add support for timeParserPolicy=LEGACY #2875

Merged

andygrove merged 8 commits into NVIDIA:branch-21.08 from andygrove:legacy-timeParserPolicy

Jul 9, 2021

Contributor

andygrove commented Jul 7, 2021


Closes #2860

This PR adds support for parsing strings to date/timestamp when spark.sql.legacy.timeParserPolicy=LEGACY for the following formats:

dd-MM-yyyy
dd/MM/yyyy
yyyy/MM/dd
yyyy-MM-dd
yyyy/MM/dd HH:mm:ss
yyyy-MM-dd HH:mm:ss

We are not 100% compatible with Spark on CPU in all cases, so this support is only enabled when spark.rapids.sql.incompatibleDateFormats.enabled is also set to true. We have the following limitations when running on the GPU:

Only 4-digit years are supported
The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar that Spark uses in legacy mode, so we produce different results for dates before the Gregorian calendar was introduced.
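
To illustrate the calendar limitation, here is a minimal JVM-only sketch that does not use Spark or the plugin. It assumes `java.text.SimpleDateFormat` (backed by the hybrid `java.util.GregorianCalendar`) is a fair stand-in for the legacy CPU path and `java.time.LocalDate` for a proleptic Gregorian parser. The object and method names are made up for this example.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.util.TimeZone

object CalendarDifference {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Days since 1970-01-01 via SimpleDateFormat, which is backed by the
  // hybrid Julian+Gregorian java.util.GregorianCalendar.
  def hybridEpochDay(date: String): Long = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    Math.floorDiv(fmt.parse(date).getTime, MillisPerDay)
  }

  // Days since 1970-01-01 via java.time, which follows the proleptic
  // Gregorian (ISO) calendar for all dates.
  def prolepticEpochDay(date: String): Long =
    LocalDate.parse(date).toEpochDay

  def main(args: Array[String]): Unit = {
    // A modern date and a date well before the Gregorian cutover.
    for (d <- Seq("2021-07-09", "1000-01-01")) {
      println(s"$d hybrid=${hybridEpochDay(d)} proleptic=${prolepticEpochDay(d)}")
    }
  }
}
```

The two day counts should agree for the modern date and differ for the year-1000 date, which is the kind of mismatch described in the limitation above.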

andygrove added the feature request label

andygrove added this to the July 5 - July 16 milestone

andygrove self-assigned this

revans2 reviewed

docs/compatibility.md Outdated

               - Supported on GPU but may produce different results to Spark
               - Unsupported on GPU
-              The formats which are supported on GPU and 100% compatible with Spark are :
+              The formats which are supported on GPU and 100% compatible with Spark vary depending on the setting

Collaborator

revans2 Jul 7, 2021

But from reading below, none of the formats are 100% compatible with Spark, because even for CORRECTED and EXCEPTION we do not detect trailing characters.

Contributor Author

andygrove Jul 8, 2021

This was badly worded. We do claim that some of the formats are 100% compatible. I have made some changes to try and make this clearer.

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala Outdated

+                              if (!conf.incompatDateFormats) {
+                                willNotWorkOnGpu(s"LEGACY format '$sparkFormat' on the GPU is not guaranteed " +
+                                  s"to produce the same results as Spark on CPU. Set " +
+                                  s"spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.")

Collaborator

revans2 Jul 7, 2021

nit: Instead of hard-coding the config here, can we use the KEY from the RapidsConf, just to avoid any possible misspellings?
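
A rough sketch of what this suggestion could look like. `ConfEntry` and `INCOMPATIBLE_DATE_FORMATS` here are simplified stand-ins defined only for this example, not the actual RapidsConf definitions; the point is just that the message is built from the entry's key rather than a repeated string literal.

```scala
object ConfKeySketch {
  // Hypothetical stand-in for a config entry; only the key matters here.
  final case class ConfEntry(key: String)

  // Assumed single definition of the config key, referenced everywhere else.
  val INCOMPATIBLE_DATE_FORMATS: ConfEntry =
    ConfEntry("spark.rapids.sql.incompatibleDateFormats.enabled")

  // Builds the fallback message from the entry's key, so the key cannot be
  // misspelled independently of its definition.
  def legacyFallbackMessage(sparkFormat: String): String =
    s"LEGACY format '$sparkFormat' on the GPU is not guaranteed " +
      "to produce the same results as Spark on CPU. Set " +
      s"${INCOMPATIBLE_DATE_FORMATS.key}=true to force onto GPU."
}
```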

andygrove changed the title ~~Add support for timeParserPolicy=LEGACY~~ WIP: Add support for timeParserPolicy=LEGACY

andygrove marked this pull request as draft

July 7, 2021 16:15

Contributor Author

andygrove commented Jul 7, 2021

During manual performance testing, I ran into some behavior that I don't understand yet, so I am changing this to a draft / WIP for now.

andygrove changed the title ~~WIP: Add support for timeParserPolicy=LEGACY~~ Add support for timeParserPolicy=LEGACY

andygrove marked this pull request as ready for review

July 7, 2021 20:04

Contributor Author

andygrove commented Jul 7, 2021

> During manual performance testing, I ran into some behavior that I don't understand yet, so I am changing this to a draft / WIP for now.

The issue was that I had ANSI mode enabled when manually testing, and hadn't implemented ANSI support as part of this PR. It is now updated to fall back to CPU if LEGACY + ANSI are both enabled. If we do want to support LEGACY + ANSI together then we can do that as a follow-on issue.
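
For readers following along, the fallback rule described above might be sketched roughly as follows. This is not the code in the PR: `ParserSettings` and `cpuFallbackReason` are hypothetical names, and the sketch only models the LEGACY-related checks mentioned in this conversation.

```scala
object LegacyFallbackSketch {
  // Hypothetical, simplified view of the relevant settings.
  final case class ParserSettings(
      legacyPolicy: Boolean,        // spark.sql.legacy.timeParserPolicy=LEGACY
      ansiEnabled: Boolean,         // ANSI mode enabled
      incompatDateFormats: Boolean) // spark.rapids.sql.incompatibleDateFormats.enabled

  // Returns Some(reason) when the expression should stay on the CPU,
  // or None when these checks do not block the GPU.
  def cpuFallbackReason(s: ParserSettings): Option[String] =
    if (!s.legacyPolicy) {
      None // non-LEGACY policies are outside the scope of this sketch
    } else if (s.ansiEnabled) {
      Some("LEGACY timeParserPolicy with ANSI mode is not supported on the GPU")
    } else if (!s.incompatDateFormats) {
      Some("LEGACY formats are not guaranteed to match Spark on CPU")
    } else {
      None
    }
}
```

The ANSI check comes first so that LEGACY + ANSI falls back even when incompatible date formats are enabled.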


          Add support for timeParserPolicy=LEGACY when parsing dates

3b6bc16

Signed-off-by: Andy Grove <andygrove@nvidia.com>

andygrove force-pushed the legacy-timeParserPolicy branch from 93d6b76 to 3b6bc16 Compare

July 7, 2021 20:09

andygrove added 2 commits

July 7, 2021 16:01


          Validate final string before conversion and handle more edge cases found during manual fuzzing

fac5b01

Signed-off-by: Andy Grove <andygrove@nvidia.com>


          Remove duplicate code and add more code comments

6f50d9f

Signed-off-by: Andy Grove <andygrove@nvidia.com>

andygrove linked an issue

that may be closed by this pull request

[BUG] to_date produces wrong value on GPU for some corner cases #2868

Closed

andygrove removed a link to an issue

[BUG] to_date produces wrong value on GPU for some corner cases #2868

Closed

Contributor Author

andygrove commented Jul 8, 2021

build

andygrove mentioned this pull request

Fix cast to date bug #2890

Merged


          Use raw strings for improved readability

a9d11f0

Signed-off-by: Andy Grove <andygrove@nvidia.com>

jlowe reviewed

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala Outdated (three resolved review threads)

andygrove added 3 commits

July 8, 2021 17:20


          Combine FIX_SINGLE_DIGIT_DAY patterns

2e6fe53

Signed-off-by: Andy Grove <andygrove@nvidia.com>


          Combine patterns for matching seconds and fix some incorrect comments

36ce995

Signed-off-by: Andy Grove <andygrove@nvidia.com>


          combine two more rules

aac06e3

Signed-off-by: Andy Grove <andygrove@nvidia.com>

gerashegalov reviewed

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala Outdated

+                // We are compatible with Spark for these formats when the timeParserPolicy is LEGACY. It
+                // is possible that other formats may be supported but these are the only ones that we have
+                // tests for.
+                val LEGACY_COMPATIBLE_FORMATS = Seq(

Collaborator

gerashegalov Jul 9, 2021

Since we are doing lookups in this Seq, can we make it a Set[LegacyParseFormat] or, even better, a Map[String, LegacyParseFormat]?

Contributor Author

andygrove Jul 9, 2021

Thanks. Fixed.
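
A sketch of the Map-based lookup discussed in this thread. The fields of `LegacyParseFormat` are not shown in the visible diff, so the `separator` and `isTimestamp` fields below are assumptions made for illustration; the format strings are the ones listed in the PR description.

```scala
object LegacyFormatLookupSketch {
  // Hypothetical shape; the real LegacyParseFormat fields may differ.
  final case class LegacyParseFormat(separator: Char, isTimestamp: Boolean)

  // Keyed by the Spark format string, so a lookup is a single hash access
  // instead of a linear scan over a Seq.
  val LEGACY_COMPATIBLE_FORMATS: Map[String, LegacyParseFormat] = Map(
    "dd-MM-yyyy" -> LegacyParseFormat('-', isTimestamp = false),
    "dd/MM/yyyy" -> LegacyParseFormat('/', isTimestamp = false),
    "yyyy/MM/dd" -> LegacyParseFormat('/', isTimestamp = false),
    "yyyy-MM-dd" -> LegacyParseFormat('-', isTimestamp = false),
    "yyyy/MM/dd HH:mm:ss" -> LegacyParseFormat('/', isTimestamp = true),
    "yyyy-MM-dd HH:mm:ss" -> LegacyParseFormat('-', isTimestamp = true)
  )

  // None means the format is not in the tested list.
  def lookup(sparkFormat: String): Option[LegacyParseFormat] =
    LEGACY_COMPATIBLE_FORMATS.get(sparkFormat)
}
```

Returning an `Option` lets the caller decide how to handle an unlisted format, for example by falling back to the CPU.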


          Change LEGACY_COMPATIBLE_FORMATS to Map

41208b4

Signed-off-by: Andy Grove <andygrove@nvidia.com>

jlowe approved these changes

Member

jlowe commented Jul 9, 2021

build

gerashegalov approved these changes

Collaborator

gerashegalov left a comment

🚀

andygrove merged commit f521bf9 into NVIDIA:branch-21.08

andygrove deleted the legacy-timeParserPolicy branch

July 9, 2021 21:03

This was referenced Jul 9, 2021

Fix compiler warning #2906

Merged

[BUG] No validation of format strings when formatting dates in legacy timeParserPolicy mode #2929

Closed

Labels

feature request