Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for timeParserPolicy=LEGACY #2875

Merged
merged 8 commits into from
Jul 9, 2021

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Jul 7, 2021

Closes #2860

This PR adds support for parsing strings to date/timestamp when spark.sql.legacy.timeParserPolicy=LEGACY for the following formats:

  • dd-MM-yyyy
  • dd/MM/yyyy
  • yyyy/MM/dd
  • yyyy-MM-dd
  • yyyy/MM/dd HH:mm:ss
  • yyyy-MM-dd HH:mm:ss

We are not 100% compatible with Spark on CPU in all cases so this support is only enabled when spark.rapids.sql.incompatibleDateFormats.enabled is also set to true. We have the following limitations when running on the GPU:

  • Only 4-digit years are supported
  • The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar that Spark uses in legacy mode, so we produce different results for dates prior to the Gregorian calendar starting.

@andygrove andygrove added the feature request New feature or request label Jul 7, 2021
@andygrove andygrove added this to the July 5 - July 16 milestone Jul 7, 2021
@andygrove andygrove self-assigned this Jul 7, 2021
@@ -353,7 +353,12 @@ the specified format string will fall into one of three categories:
- Supported on GPU but may produce different results to Spark
- Unsupported on GPU

The formats which are supported on GPU and 100% compatible with Spark are :
The formats which are supported on GPU and 100% compatible with Spark vary depending on the setting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But from reading below none of the formats are 100% compatible with Spark because even for CORRECTED and EXCEPTION we do not detect trailing characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was badly worded. We do claim that some of the formats are 100% compatible. I have made some changes to try and make this clearer.

if (!conf.incompatDateFormats) {
willNotWorkOnGpu(s"LEGACY format '$sparkFormat' on the GPU is not guaranteed " +
s"to produce the same results as Spark on CPU. Set " +
s"spark.rapids.sql.incompatibleDateFormats.enabled=true to force onto GPU.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Instead of hard coding the config here can we use the KEY form the RapidsConf just to avoid any possible misspellings?

@andygrove andygrove changed the title Add support for timeParserPolicy=LEGACY WIP: Add support for timeParserPolicy=LEGACY Jul 7, 2021
@andygrove andygrove marked this pull request as draft July 7, 2021 16:15
@andygrove
Copy link
Contributor Author

During manual performance testing, I ran into some behavior that I don't understand yet, so I am changing this to a draft / WIP for now.

@andygrove andygrove changed the title WIP: Add support for timeParserPolicy=LEGACY Add support for timeParserPolicy=LEGACY Jul 7, 2021
@andygrove andygrove marked this pull request as ready for review July 7, 2021 20:04
@andygrove
Copy link
Contributor Author

During manual performance testing, I ran into some behavior that I don't understand yet, so I am changing this to a draft / WIP for now.

The issue was that I had ANSI mode enabled when manually testing, and hadn't implemented ANSI support as part of this PR. It is now updated to fall back to CPU if LEGACY + ANSI are both enabled. If we do want to support LEGACY + ANSI together then we can do that as a follow-on issue.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove force-pushed the legacy-timeParserPolicy branch from 93d6b76 to 3b6bc16 Compare July 7, 2021 20:09
andygrove added 2 commits July 7, 2021 16:01
…und during manual fuzzing

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove mentioned this pull request Jul 8, 2021
Signed-off-by: Andy Grove <andygrove@nvidia.com>
andygrove added 3 commits July 8, 2021 17:20
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
// We are compatible with Spark for these formats when the timeParserPolicy is LEGACY. It
// is possible that other formats may be supported but these are the only ones that we have
// tests for.
val LEGACY_COMPATIBLE_FORMATS = Seq(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are doing lookups in this Seq can we make it a Set[LegacyParseFormat] or even better a Map[String, LegacyParseFormat]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@jlowe
Copy link
Member

jlowe commented Jul 9, 2021

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@andygrove andygrove merged commit f521bf9 into NVIDIA:branch-21.08 Jul 9, 2021
@andygrove andygrove deleted the legacy-timeParserPolicy branch July 9, 2021 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Minimal support for LEGACY timeParserPolicy
4 participants