Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode Value class field using the inner type #546

Merged
merged 9 commits into from
Sep 6, 2021

Conversation

cchantep
Copy link
Collaborator

@cchantep cchantep commented Aug 8, 2021

Considering the type below:

class Name(val value: String) extends AnyVal

case class Person(name: Name, age: Int)

Encoding

The name field must be declared as StringType (not StructType(StructField("value", ...))) in the Catalyst representation for Person (see failing test).

Possible fix:

  • Introduce RecordFieldEncoder typeclass to wrap individual/field TypedEncoder for the implicit resolution required for RecordEncoderFields.
  • Provide a default RecordFieldEncoder instance for any T: TypedEncoder
  • Provide a RecordFieldEncoder specific for T <: AnyVal, with a higher resolution priority.

Literal

A literal created using lit function for a Value class instance generate a StructType (see failing test).

Introduce functions.litValue:

frameless.functions.litValue(valueClassInstance) // TypedColumn


encoder.jvmRepr shouldBe ObjectType(classOf[Person])

encoder.catalystRepr shouldBe StructType(Seq(
Copy link
Collaborator Author

@cchantep cchantep Aug 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failing on master: https://github.com/typelevel/frameless/pull/546/checks?check_run_id=3423040083#step:8:195

StructType(StructField(name,StructType(StructField(value,StringType,false)),false), StructField(age,IntegerType,false)) was not equal to StructType(StructField(name,StringType,false), StructField(age,IntegerType,false)) (RecordEncoderTests.scala:112)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pomadchin
Copy link
Member

pomadchin commented Aug 26, 2021

Yea, I can see how this can make sense, indeed this can be a nice improvement. I have two questions about the possible fix:

  1. why that should be a wrapped codec? It can live in the TypedEncoder, after the derivation implicit.
  implicit def usingDerivationValueClass[F <: AnyVal, G <: _ :: HNil, H <: _ :: HNil, V]
    (implicit
     i0: LabelledGeneric.Aux[F, G],
     i1: DropUnitValues.Aux[G, H],
     i2: IsHCons.Aux[H, _ <: FieldType[_, V], HNil],
     i3: TypedEncoder[V],
     i4: ClassTag[F]
    ): TypedEncoder[F] = new TypedEncoder[F] {

      def nullable: Boolean = i3.nullable

      def jvmRepr: DataType = i3.catalystRepr

      def catalystRepr: DataType = i3.catalystRepr

      def fromCatalyst(path: Expression): Expression = i3.fromCatalyst(path)

      @inline def toCatalyst(path: Expression): Expression = path
    }
  1. How such codec knows how to construct the value class (the codec above obviously would not be able to construct the value class from string)?

@pomadchin
Copy link
Member

pomadchin commented Aug 26, 2021

Oh I see, hm it looks like the scalar case should be handled differently (i got familiar with all issues we have now). It looks like the major issue for now is that we can't establish a relationship between AnyVal and its underlying type.

@pomadchin
Copy link
Member

That's cool btw, usingDerivationValueClass fixes derivation of calsses with nested value classes. I'm wondering what is exactly wrong with the current derivation (consider the case where AnyVal is a case class)

@cchantep
Copy link
Collaborator Author

That's cool btw, usingDerivationValueClass fixes derivation of calsses with nested value classes. I'm wondering what is exactly wrong with the current derivation (consider the case where AnyVal is a case class)

The issue is that at Java reflection level (and so in expressions), a Value class cannot be constructed the same way when used as a field. It's very specific to bytecode optimisation.

The drafted code fixes that : 0c3f618#diff-e752c554dd3e99366532d2e8ed8a406370715aee66beecb675c4ec38f1f9403aR173

Will push thereafter.

@pomadchin
Copy link
Member

pomadchin commented Aug 26, 2021

@cchantep I tried to poke around with your code a bit (0c3f618#diff-e752c554dd3e99366532d2e8ed8a406370715aee66beecb675c4ec38f1f9403aR173), and I don't think the following tests pass:

  // passes
  test("Scalar value class") {
    import RecordEncoderTests._

    val encoder = TypedEncoder[Name]

    encoder.jvmRepr shouldBe ObjectType(classOf[Name])

    encoder.catalystRepr shouldBe StructType(Seq(StructField("value", StringType, false)))

    val sqlContext = session.sqlContext
    import sqlContext.implicits._

    TypedDataset
      .createUnsafe[Name](Seq("Foo", "Bar").toDF)(encoder)
      .collect().run() shouldBe Seq(new Name("Foo"), new Name("Bar"))

  }

  // decodes schema but fails on the actual dataset
  //  - Case class with value class field *** FAILED *** (1 second, 117 milliseconds)
  // [info]   java.lang.UnsupportedOperationException: No Encoder found for frameless.RecordEncoderTests.Name
  // [info] - field (class: "frameless.RecordEncoderTests.Name", name: "name")
  // [info] - root class: "frameless.RecordEncoderTests.Person"
  // [info]   at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:591)
  test("Case class with value class field") {
    import RecordEncoderTests._

    val encoder = TypedEncoder[Person]

    encoder.jvmRepr shouldBe ObjectType(classOf[Person])

    encoder.catalystRepr shouldBe StructType(Seq(
      StructField("name", StringType, false),
      StructField("age", IntegerType, false)))

    val sqlContext = session.sqlContext
    import sqlContext.implicits._

    TypedDataset
      .createUnsafe[Person](Seq(Person(new Name("Foo"), 25), Person(new Name("Bar"), 26)).toDF)(encoder)
      .collect().run() shouldBe Seq(Person(new Name("Foo"), 25), Person(new Name("Bar"), 26))
  }

  // decodes schema but fails on the actual dataset
  
  test("Case class with value class as optional field") {
    import RecordEncoderTests._

    // Encode as a User field
    val encoder = TypedEncoder[User]

    encoder.jvmRepr shouldBe ObjectType(classOf[User])

    val expectedPersonStructType = StructType(Seq(
      StructField("id", LongType, false),
      StructField("name", StringType, true)))

    encoder.catalystRepr shouldBe expectedPersonStructType

    val sqlContext = session.sqlContext
    import sqlContext.implicits._

    TypedDataset
      .createUnsafe[User](Seq(User(0, Some(new Name("Foo"))), User(1, None)).toDF)(encoder)
      .collect().run() shouldBe Seq(User(0, Some(new Name("Foo"))), User(1, None))
  }

@pomadchin
Copy link
Member

btw, @cchantep why is an extra wrapper required? I discovered that putting that implicit directly into the TypedEncoder scope beahves better (though creates an issues with the scalar test)

@cchantep
Copy link
Collaborator Author

cchantep commented Aug 26, 2021

btw, @cchantep why is an extra wrapper required? I discovered that putting that implicit directly into the TypedEncoder scope beahves better (though creates an issues with the scalar t

Exactly for that reason, handling a Value class as scalar cannot be done in the same way, the having only TypedEncoder cannot distinguish those cases, while the record encoder typeclass can.

@pomadchin
Copy link
Member

@cchantep got it, than it makes sense to me! I'll look into it more as well to help you (as much as I can) and I think that's really a valuable PR. Even Spark 3.2 can handle AnyVals now! :D

@cchantep cchantep marked this pull request as ready for review August 28, 2021 15:32
@cchantep cchantep force-pushed the feature/valueclass-field branch 4 times, most recently from a4f8950 to 3ff773e Compare August 29, 2021 17:10

val lorem = new Name("Lorem")

ds.withColumnReplaced('name, lit(lorem)).
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing test as lit for Value class generates StructType: failing test)

[info] - support value class *** FAILED *** (89 milliseconds)
[info]   Array(Q(1,[0,1000000005,6d65726f4c]), Q(2,[0,1000000005,6d65726f4c])) was not equal to List(Q(1,Lorem), Q(2,Lorem)) (LitTests.scala:76)

@codecov
Copy link

codecov bot commented Aug 29, 2021

Codecov Report

Merging #546 (ab19018) into master (59caada) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   96.28%   96.35%   +0.06%     
==========================================
  Files          62       63       +1     
  Lines        1051     1071      +20     
  Branches        5        9       +4     
==========================================
+ Hits         1012     1032      +20     
  Misses         39       39              
Flag Coverage Δ
2.12.14 96.35% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../main/scala/frameless/TypedExpressionEncoder.scala 100.00% <ø> (ø)
...ataset/src/main/scala/frameless/IsValueClass.scala 100.00% <100.00%> (ø)
...taset/src/main/scala/frameless/RecordEncoder.scala 100.00% <100.00%> (ø)
dataset/src/main/scala/frameless/TypedColumn.scala 100.00% <100.00%> (ø)
...taset/src/main/scala/frameless/functions/Lit.scala 94.44% <100.00%> (-0.56%) ⬇️
...t/src/main/scala/frameless/functions/package.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59caada...ab19018. Read the comment docs.

@cchantep cchantep force-pushed the feature/valueclass-field branch from b346330 to 1a83610 Compare August 30, 2021 21:44
@cchantep
Copy link
Collaborator Author

@pomadchin I'm not 100% satisfied, but it can be reviewed

@pomadchin
Copy link
Member

Hey @cchantep, I will take a look later today or this weekend!

Copy link
Member

@pomadchin pomadchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Left a couple of comments, but I think it looks good and is a solid enhancement.


object IsValueClass {
/** Provides an evidence `A` is a Value class */
implicit def apply[A <: AnyVal, G <: ::[_, HNil], H <: ::[_ <: FieldType[_ <: Symbol, _], HNil]](
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsValueClass is required to disambiguate between Value Classes and Primitives. Being a subtype of AnyVal is not enough i.e. implicitly[Double <:< AnyVal] (that's a comment for myself).

If(IsNull(path), nullExpr, newExpr)
}
}

final class RecordFieldEncoder[T](
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the RecordFieldEncoder makes sense to me now 👍
I wanted to avoid intoroducing new extra wrappers for derivation, but it is pretty elegant.

dataset/src/main/scala/frameless/functions/Lit.scala Outdated Show resolved Hide resolved
dataset/src/main/scala/frameless/functions/Lit.scala Outdated Show resolved Hide resolved
dataset/src/main/scala/frameless/functions/package.scala Outdated Show resolved Hide resolved
dataset/src/main/scala/frameless/functions/package.scala Outdated Show resolved Hide resolved
import org.apache.spark.sql.types.DataType

@deprecated("Use `Lit[A]`", "0.10.2")
case class FramelessLit[A](obj: A, encoder: TypedEncoder[A]) extends Expression with NonSQLExpression {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much sense it makes to keep it in the codebase? It is not used anywhere and I doubt users would be interested in that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a breaking change, so it should only happen for a major version

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cchantep well, I have a feeling that this PR is in general very much binary breaking. It is API compatible though.
Let me set the MIMA plugin in a separate PR!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing a public class does break the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cchantep changing public API functions signatures is API breaking as well. I'm wondering what would be the MIMA report now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question of compromise. Removing this class is not required to fix the current issues, but would break the API (even if we can consider that the user should never directly use this public class), the others (mainly 1) updated signature are strictly required changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cchantep oh yea, but these are public API changes; they break bin compat as well; have you checked the plugin report btw?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reasons why we can't make the major release (and it looks like this PR, with or without this legacy class - is bin incompatible); value classes + Spark 3.2.0 could be good enough for the major version release.

I think @imarios opionion about the versioning scheme is important there as well ¯_(ツ)_/¯ I don't think all frameless minor releases a bin compatible.

@cchantep cchantep force-pushed the feature/valueclass-field branch 2 times, most recently from 66f29b3 to 7b460a8 Compare September 6, 2021 09:45
Copy link
Member

@pomadchin pomadchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 A cool PR; but before merging I'd like to see the MIMA plugin (I can generate it myself) output though, to decide on the FramelessLit fate, and to decide on versioning.

@pomadchin
Copy link
Member

pomadchin commented Sep 6, 2021

sbt:frameless> mimaReportBinaryIssues
[info] frameless: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[info] docs: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[error] frameless-dataset: Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error]  * static method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error]  * static method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error]  * method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error]  * method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error]  * static method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in class frameless.functions.package does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error]  * method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in object frameless.functions.package does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error] stack trace is suppressed; run last dataset / mimaReportBinaryIssues for the full output
[error] (dataset / mimaReportBinaryIssues) Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error] Total time: 2 s, completed Sep 6, 2021 7:51:08 AM

Are these less harmuful than the MissingClass (talking about the potenital removal of the legacy class)? (I'm ok with keeping it, just wondering - mb it makes sense to target to the major release and keep the codebase clean?)

@cchantep
Copy link
Collaborator Author

cchantep commented Sep 6, 2021

sbt:frameless> mimaReportBinaryIssues
[info] frameless: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[info] docs: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[error] frameless-dataset: Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error]  * static method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error]  * static method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error]  * method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error]  * method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error]  * static method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in class frameless.functions.package does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error]  * method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in object frameless.functions.package does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error] stack trace is suppressed; run last dataset / mimaReportBinaryIssues for the full output
[error] (dataset / mimaReportBinaryIssues) Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error] Total time: 2 s, completed Sep 6, 2021 7:51:08 AM

Are these less harmuful than the MissingClass (talking about the potenital removal of the legacy class)? (I'm ok with keeping it, just wondering - mb it makes sense to target to the major release and keep the codebase clean?)

MiMa results need to be interpreted.
The real breaking changes (unavoidable) are deriveRecord{Last,Cons}, but they occur at compile-time and should not breaking at runtime.

Anyway if we decide to bump to 0.11, that's fine for me (in which case there is no risk to keep FramelessLit, neither to clean it).

@pomadchin
Copy link
Member

@cchantep let's do 0.11 than!

@cchantep
Copy link
Collaborator Author

cchantep commented Sep 6, 2021

@cchantep let's do 0.11 than!

Than https://github.com/typelevel/frameless/pull/449/files could also be included at the same time (no breaking change, but would make a bump more valuable).

@cchantep cchantep force-pushed the feature/valueclass-field branch 2 times, most recently from a50a9d1 to ab7610f Compare September 6, 2021 13:56
@cchantep cchantep force-pushed the feature/valueclass-field branch from ab7610f to 9ebeca4 Compare September 6, 2021 14:01
@cchantep cchantep force-pushed the feature/valueclass-field branch from 9ebeca4 to ab19018 Compare September 6, 2021 15:48
@cchantep cchantep merged commit 3515785 into typelevel:master Sep 6, 2021
@cchantep cchantep deleted the feature/valueclass-field branch September 6, 2021 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants