
Implement java.time.{Duration, Instant, Period} type encoders #581

Merged: 6 commits merged into typelevel:master on Mar 5, 2022

Conversation

jgoday (Contributor) commented Nov 10, 2021

See #576 (Provide TypedEncoder for java.time.{Instant, Duration, Period} in Spark 3.2)

This PR implements the following implicit TypedEncoder instances in frameless.TypedEncoder:

  1. timeDuration: TypedEncoder[java.time.Duration]
  2. timeInstant: TypedEncoder[java.time.Instant]
  3. timePeriod: TypedEncoder[java.time.Period]

As DayTimeIntervalType and YearMonthIntervalType were only introduced in Spark 3.2,
to maintain compatibility across the 3.0, 3.1 and 3.2 Spark versions,
java.time.Period is represented as an IntegerType (days) catalyst type and java.time.Duration as a LongType (millis).
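The compatibility representation described above can be sketched with plain JDK calls (no Spark APIs; the class and method names below are illustrative, not the PR's actual code): Duration is flattened to a Long of milliseconds and round-tripped back.

```java
import java.time.Duration;

public class DurationMillisRepr {
    // Duration -> catalyst-side Long (milliseconds)
    static long toCatalyst(Duration d) {
        return d.toMillis();
    }

    // catalyst-side Long -> Duration
    static Duration fromCatalyst(long millis) {
        return Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        Duration d = Duration.ofSeconds(90);
        long encoded = toCatalyst(d);            // 90000
        Duration decoded = fromCatalyst(encoded);
        System.out.println(encoded + " " + decoded); // 90000 PT1M30S
    }
}
```

Note that millisecond precision discards sub-millisecond nanos, which is part of the trade-off of a compat representation.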

pomadchin (Member) commented Nov 10, 2021

Thanks for the PR 🎉! I'll look into it a bit later. A quick question though: isn't YearMonthIntervalType compatible with java.time.Period?

jgoday (Contributor, Author) commented Nov 10, 2021

> Thanks for the PR 🎉 , I'll look into it a bit later, I have a quick question though, is not YearMonthIntervalType compatible with java.time.Period?

I guess so, but I think YearMonthIntervalType is not available in 3.0 and 3.1; or maybe I'm misunderstanding the problem?

pomadchin (Member) commented Nov 10, 2021

@jgoday oh, what would the TypedEncoder look like for DayTimeIntervalType / YearMonthIntervalType (Spark 3.2)?

jgoday (Contributor, Author) commented Nov 10, 2021

> @jgoday oh, how the TypedEncoder may look like for DayTimeIntervalType / YearMonthIntervalType (Spark 3.2)?

I think that with DayTimeIntervalType and YearMonthIntervalType the period and duration TypedEncoders might look like this:

  import org.apache.spark.sql.catalyst.ScalaReflection
  import org.apache.spark.sql.catalyst.expressions.Expression
  import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
  import org.apache.spark.sql.catalyst.util.IntervalUtils
  import org.apache.spark.sql.types.{DataType, DayTimeIntervalType, ObjectType, YearMonthIntervalType}

  implicit val timeDuration: TypedEncoder[java.time.Duration] = new TypedEncoder[java.time.Duration] {
    def nullable: Boolean = false

    def jvmRepr: DataType = ScalaReflection.dataTypeFor[java.time.Duration]
    def catalystRepr: DataType = DayTimeIntervalType()

    // Duration <-> microseconds via IntervalUtils
    def toCatalyst(path: Expression): Expression =
      StaticInvoke(
        IntervalUtils.getClass,
        DayTimeIntervalType(),
        "durationToMicros",
        path :: Nil,
        returnNullable = false)

    def fromCatalyst(path: Expression): Expression =
      StaticInvoke(
        IntervalUtils.getClass,
        ObjectType(classOf[java.time.Duration]),
        "microsToDuration",
        path :: Nil,
        returnNullable = false)
  }

  implicit val timePeriod: TypedEncoder[java.time.Period] = new TypedEncoder[java.time.Period] {
    def nullable: Boolean = false

    def jvmRepr: DataType = ScalaReflection.dataTypeFor[java.time.Period]
    def catalystRepr: DataType = YearMonthIntervalType()

    // Period <-> months via IntervalUtils
    def toCatalyst(path: Expression): Expression =
      StaticInvoke(
        IntervalUtils.getClass,
        YearMonthIntervalType(),
        "periodToMonths",
        path :: Nil,
        returnNullable = false)

    def fromCatalyst(path: Expression): Expression =
      StaticInvoke(
        IntervalUtils.getClass,
        ObjectType(classOf[java.time.Period]),
        "monthsToPeriod",
        path :: Nil,
        returnNullable = false)
  }

pomadchin (Member) commented

Hm, my immediate thoughts are:

  1. Leave only Instant (could be good enough), and users can define their own codecs
  2. Or implement Spark 3.x compat codecs; in case they behave in an incompatible way, users may always redefine them

I'm okay with 2 if it is useful for users; it definitely should not be an issue to redefine them in a higher-priority scope.

jgoday (Contributor, Author) commented Nov 10, 2021

@pomadchin One question about ColumnTests and the failing CI test:

[info] - between *** FAILED *** (1 second, 587 milliseconds)
[info]   RuntimeException was thrown during property evaluation.
[info]     Message: Error while encoding: java.lang.ArithmeticException: long overflow
[info]   staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, assertnotnull(input[0, frameless.X3, false]).a, true, false) AS a#13804
[info]   staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, assertnotnull(input[0, frameless.X3, false]).b, true, false) AS b#13805
[info]   staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, assertnotnull(input[0, frameless.X3, false]).c, true, false) AS c#13806
[info]     Occurred when passed generated values (
[info]       arg0 = 1970-01-01T00:00:01Z,
[info]       arg1 = 1970-01-01T00:00:00Z,
[info]       arg2 = +688620642-12-13T00:30:35Z
[info]     )

It seems to be failing in the Instant conversion (DateTimeUtils.instantToMicros).
As I understand it, the problem is the Arbitrary[Instant] implicit:
it generates values from 0 up to Instant.MAX ('1000000000-12-31T23:59:59.999999999Z'),
so when DateTimeUtils tries to convert such an Instant to micros, the Long overflows.

Should we use a more realistic range (from Instant.EPOCH to Instant.now(), for example)?

Arbitrary(Gen.choose[Instant](Instant.EPOCH, Instant.now()))
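The overflow is easy to reproduce with plain JDK arithmetic. A hedged sketch: the method below only mirrors in spirit what an instant-to-micros conversion has to do (it is not Spark's DateTimeUtils code), using overflow-checked multiplication.

```java
import java.time.Instant;

public class InstantMicrosOverflow {
    // Epoch seconds -> microseconds with overflow-checked arithmetic,
    // roughly what DateTimeUtils.instantToMicros must compute.
    static long instantToMicros(Instant i) {
        long micros = Math.multiplyExact(i.getEpochSecond(), 1_000_000L);
        return Math.addExact(micros, i.getNano() / 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(instantToMicros(Instant.EPOCH)); // 0: safe
        System.out.println(instantToMicros(Instant.now())); // ~1e15: safe
        try {
            // Instant.MAX is ~3.2e16 seconds, i.e. ~3.2e22 micros,
            // far past Long.MAX_VALUE (~9.2e18) -> ArithmeticException
            instantToMicros(Instant.MAX);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Bounding the generator at Instant.now() keeps every generated value well inside the Long-micros range.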

pomadchin (Member) commented

@jgoday 👍

@pomadchin pomadchin self-requested a review November 10, 2021 22:38
codecov bot commented Nov 10, 2021

Codecov Report

Merging #581 (f3850af) into master (69a1b59) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #581   +/-   ##
=======================================
  Coverage   95.14%   95.15%           
=======================================
  Files          65       65           
  Lines        1134     1157   +23     
  Branches        8        7    -1     
=======================================
+ Hits         1079     1101   +22     
- Misses         55       56    +1     
Flag Coverage Δ
2.12.15 95.15% <100.00%> (+<0.01%) ⬆️
2.13.8 95.78% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ore/src/main/scala/frameless/CatalystOrdered.scala 94.44% <100.00%> (-5.56%) ⬇️
...ataset/src/main/scala/frameless/TypedEncoder.scala 97.37% <100.00%> (+0.25%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 69a1b59...f3850af.

pomadchin (Member) left a review

The Instant codec 100% makes sense; it has a separate type and is supported across all 3.x versions. I also left a couple of comments.

Duration and Period look more like Injections (since that is a conversion to the underlying type). I'm good with having those as default injections in the TypedEncoder companion object, but not strongly opinionated about it: it may cause a confusing fallback for those who forgot to implement it.

Thanks for looking into all corner cases! 👍
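The Injection view mentioned above is, in essence, a pair of total functions between java.time.Period and an underlying primitive. A hedged plain-Java sketch of a months-based injection (the names are illustrative, not frameless' API) also shows why encode/decode round-trips can surprise users:

```java
import java.time.Period;

public class PeriodMonthsInjection {
    // Period -> Int of total months; the days component is silently dropped,
    // which is one way a round-trip can be lossy.
    static int apply(Period p) {
        return Math.toIntExact(p.toTotalMonths());
    }

    // Int of months -> Period, normalized so 14 months becomes P1Y2M.
    static Period invert(int months) {
        return Period.ofMonths(months).normalized();
    }

    public static void main(String[] args) {
        System.out.println(invert(apply(Period.of(1, 2, 0))));  // P1Y2M (lossless)
        System.out.println(invert(apply(Period.of(0, 1, 15)))); // P1M (15 days lost)
    }
}
```

A round-trip is only the identity for day-free Periods, which is exactly the kind of inconsistency extra tests would need to pin down.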

dataset/src/main/scala/frameless/TypedEncoder.scala (review thread, outdated, resolved)
dataset/src/main/scala/frameless/TypedEncoder.scala (review thread, outdated, resolved)
dataset/src/test/scala/frameless/ColumnTests.scala (review thread, outdated, resolved)
@pomadchin pomadchin self-requested a review November 13, 2021 13:23
@pomadchin pomadchin force-pushed the feature/java_time_type_encoders branch 3 times, most recently from 17e0ec7 to d5a966b Compare December 12, 2021 03:41
pomadchin (Member) left a review

@jgoday hey 👋 sorry that I didn't have a chance to review it earlier.

I added some changes to clean it up a bit. However, it looks like Duration and Period need a few more tests; I tried to add them to frameless.CollectTests, but didn't have enough time to investigate the failures.

Could you take a look into that? I think there is some inconsistency in the encode/decode behavior. If you don't have time, I'll have another look at it later this week.

cchantep (Collaborator) commented

Hi, what's the status here?

pomadchin (Member) commented

Hey @cchantep, it needs a little more work and extra test coverage to ensure everything is consistent; sadly I haven't had much time lately.

@pomadchin pomadchin self-requested a review March 5, 2022 21:39
@pomadchin pomadchin force-pushed the feature/java_time_type_encoders branch from d5a966b to cf3fa3a Compare March 5, 2022 21:40
pomadchin (Member) left a review

Sorry that it took so long to get it merged!
Once it's green I'm merging it, thanks for your contribution 🚀

>     StaticInvoke(
>       DateTimeUtils.getClass,
>       TimestampType,
>       "instantToMicros",
pomadchin (Member) commented

@jgoday that's the main beast; it has some limitations 🤷 but we have to use it to stay compatible with Spark.

@pomadchin pomadchin force-pushed the feature/java_time_type_encoders branch 2 times, most recently from 761a4bc to 9273128 Compare March 5, 2022 22:10
@pomadchin pomadchin force-pushed the feature/java_time_type_encoders branch from 9273128 to f3850af Compare March 5, 2022 22:17
@pomadchin pomadchin merged commit 18fc6d2 into typelevel:master Mar 5, 2022