Encode Value class field using the inner type #546
Conversation
encoder.jvmRepr shouldBe ObjectType(classOf[Person])
encoder.catalystRepr shouldBe StructType(Seq(
Test failing on master: https://github.com/typelevel/frameless/pull/546/checks?check_run_id=3423040083#step:8:195
StructType(StructField(name,StructType(StructField(value,StringType,false)),false), StructField(age,IntegerType,false)) was not equal to StructType(StructField(name,StringType,false), StructField(age,IntegerType,false)) (RecordEncoderTests.scala:112)
Yeah, I can see how this makes sense; indeed, it could be a nice improvement. I have two questions about the possible fix:
implicit def usingDerivationValueClass[F <: AnyVal, G <: ::[_, HNil], H <: ::[_, HNil], V]
  (implicit
    i0: LabelledGeneric.Aux[F, G],
    i1: DropUnitValues.Aux[G, H],
    i2: IsHCons.Aux[H, _ <: FieldType[_, V], HNil],
    i3: TypedEncoder[V],
    i4: ClassTag[F]
  ): TypedEncoder[F] = new TypedEncoder[F] {
    def nullable: Boolean = i3.nullable
    def jvmRepr: DataType = i3.catalystRepr
    def catalystRepr: DataType = i3.catalystRepr
    def fromCatalyst(path: Expression): Expression = i3.fromCatalyst(path)
    @inline def toCatalyst(path: Expression): Expression = path
  }
Oh I see, hm, it looks like the scalar case should be handled differently (I got familiar with all the issues we have now). It looks like the major issue for now is that we can't establish a relationship between AnyVal and its underlying type.
That's cool btw.
The issue is that, at the Java reflection level (and so in expressions), a Value class cannot be constructed the same way when it is used as a field. It's very specific to the bytecode optimisation. The drafted code fixes that: 0c3f618#diff-e752c554dd3e99366532d2e8ed8a406370715aee66beecb675c4ec38f1f9403aR173 Will push thereafter.
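To make the point concrete, here is a minimal standalone sketch (the `Name`/`Person` shapes mirror the tests quoted later in this thread; this is not the PR's code): when a Value class is used as a field of another case class, the compiler stores the underlying value directly, so Java reflection sees the inner type rather than the wrapper.

```scala
// Standalone sketch: a value class field is unboxed to its underlying type,
// so reflection (and hence expression construction) sees String, not Name.
final class Name(val value: String) extends AnyVal
final case class Person(name: Name, age: Int)

object ValueClassFieldReflection extends App {
  val field = classOf[Person].getDeclaredField("name")
  // Prints `class java.lang.String`: the wrapper is erased at the field level.
  println(field.getType)
}
```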
@cchantep I tried to poke around with your code a bit (0c3f618#diff-e752c554dd3e99366532d2e8ed8a406370715aee66beecb675c4ec38f1f9403aR173), and I don't think all of the following tests pass:
// passes
test("Scalar value class") {
import RecordEncoderTests._
val encoder = TypedEncoder[Name]
encoder.jvmRepr shouldBe ObjectType(classOf[Name])
encoder.catalystRepr shouldBe StructType(Seq(StructField("value", StringType, false)))
val sqlContext = session.sqlContext
import sqlContext.implicits._
TypedDataset
.createUnsafe[Name](Seq("Foo", "Bar").toDF)(encoder)
.collect().run() shouldBe Seq(new Name("Foo"), new Name("Bar"))
}
// decodes schema but fails on the actual dataset
// - Case class with value class field *** FAILED *** (1 second, 117 milliseconds)
// [info] java.lang.UnsupportedOperationException: No Encoder found for frameless.RecordEncoderTests.Name
// [info] - field (class: "frameless.RecordEncoderTests.Name", name: "name")
// [info] - root class: "frameless.RecordEncoderTests.Person"
// [info] at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:591)
test("Case class with value class field") {
  import RecordEncoderTests._

  val encoder = TypedEncoder[Person]
  encoder.jvmRepr shouldBe ObjectType(classOf[Person])
  encoder.catalystRepr shouldBe StructType(Seq(
    StructField("name", StringType, false),
    StructField("age", IntegerType, false)))

  val sqlContext = session.sqlContext
  import sqlContext.implicits._

  TypedDataset
    .createUnsafe[Person](Seq(Person(new Name("Foo"), 25), Person(new Name("Bar"), 26)).toDF)(encoder)
    .collect().run() shouldBe Seq(Person(new Name("Foo"), 25), Person(new Name("Bar"), 26))
}
// decodes schema but fails on the actual dataset
test("Case class with value class as optional field") {
  import RecordEncoderTests._

  // Encode as a User field
  val encoder = TypedEncoder[User]
  encoder.jvmRepr shouldBe ObjectType(classOf[User])

  val expectedPersonStructType = StructType(Seq(
    StructField("id", LongType, false),
    StructField("name", StringType, true)))
  encoder.catalystRepr shouldBe expectedPersonStructType

  val sqlContext = session.sqlContext
  import sqlContext.implicits._

  TypedDataset
    .createUnsafe[User](Seq(User(0, Some(new Name("Foo"))), User(1, None)).toDF)(encoder)
    .collect().run() shouldBe Seq(User(0, Some(new Name("Foo"))), User(1, None))
}
Btw, @cchantep, why is an extra wrapper required? I discovered that putting that implicit directly into the …
Exactly for that reason: handling a Value class as a scalar cannot be done in the same way, having only …
@cchantep got it, then it makes sense to me! I'll look into it more as well to help you (as much as I can), and I think that's really a valuable PR. Even Spark 3.2 can handle AnyVals now! :D
Force-pushed from a4f8950 to 3ff773e
val lorem = new Name("Lorem")
ds.withColumnReplaced('name, lit(lorem)).
Failing test, as `lit` for a Value class generates a StructType (see failing test):
[info] - support value class *** FAILED *** (89 milliseconds)
[info] Array(Q(1,[0,1000000005,6d65726f4c]), Q(2,[0,1000000005,6d65726f4c])) was not equal to List(Q(1,Lorem), Q(2,Lorem)) (LitTests.scala:76)
Codecov Report
@@ Coverage Diff @@
## master #546 +/- ##
==========================================
+ Coverage 96.28% 96.35% +0.06%
==========================================
Files 62 63 +1
Lines 1051 1071 +20
Branches 5 9 +4
==========================================
+ Hits 1012 1032 +20
Misses 39 39
Force-pushed from b346330 to 1a83610
@pomadchin I'm not 100% satisfied, but it can be reviewed.
Hey @cchantep, I will take a look later today or this weekend!
🚀 Left a couple of comments, but I think it looks good and is a solid enhancement.
object IsValueClass {
  /** Provides evidence that `A` is a Value class */
  implicit def apply[A <: AnyVal, G <: ::[_, HNil], H <: ::[_ <: FieldType[_ <: Symbol, _], HNil]](
IsValueClass is required to disambiguate between Value classes and primitives; being a subtype of AnyVal is not enough, e.g. `implicitly[Double <:< AnyVal]` compiles. (That's a comment for myself.)
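As an aside, a standalone sketch of the point above (not the PR's code; `Name` here is a hypothetical wrapper): an `A <: AnyVal` bound alone cannot tell a user-defined Value class apart from a primitive, which is why a shape-based evidence like `IsValueClass` is needed.

```scala
// Both of these compile, so an upper bound on AnyVal alone is ambiguous:
// primitives and user-defined Value classes satisfy it equally.
final class Name(val value: String) extends AnyVal

object AnyValBoundDemo extends App {
  implicitly[Double <:< AnyVal] // a primitive is an AnyVal
  implicitly[Name <:< AnyVal]   // so is a user-defined Value class
  println("Double and Name both satisfy `<: AnyVal`")
}
```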
If(IsNull(path), nullExpr, newExpr)
}
} | ||
final class RecordFieldEncoder[T]( |
I think the RecordFieldEncoder makes sense to me now 👍 I wanted to avoid introducing new extra wrappers for derivation, but it is pretty elegant.
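For context, a minimal, generic sketch of the implicit-priority layout behind such a wrapper (the hypothetical `Show`/`FieldEncoder` names stand in for `TypedEncoder`/`RecordFieldEncoder`; this is not the PR's actual code): the default instance lives in a low-priority parent trait, while the companion object provides a more specific instance that wins during implicit resolution.

```scala
trait Show[A] { def show(a: A): String }

// Field-level wrapper: how a value should be encoded when used as a field.
final class FieldEncoder[A](val encode: A => String)

trait LowPriorityFieldEncoder {
  // Default path: any A with an ordinary Show instance gets the generic treatment.
  implicit def fromShow[A](implicit s: Show[A]): FieldEncoder[A] =
    new FieldEncoder[A](a => s"field(${s.show(a)})")
}

object FieldEncoder extends LowPriorityFieldEncoder {
  // Higher-priority path: instances defined directly on the companion win over
  // the inherited default, which is how a special case can take precedence.
  implicit val intField: FieldEncoder[Int] =
    new FieldEncoder[Int](i => s"special($i)")
}
```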
import org.apache.spark.sql.types.DataType
@deprecated("Use `Lit[A]`", "0.10.2") | ||
case class FramelessLit[A](obj: A, encoder: TypedEncoder[A]) extends Expression with NonSQLExpression { |
How much sense does it make to keep it in the codebase? It is not used anywhere, and I doubt users would be interested in it.
That's a breaking change, so it should only happen for a major version
@cchantep well, I have a feeling that this PR is in general very much binary breaking. It is API compatible, though.
Let me set up the MiMa plugin in a separate PR!
Removing a public class does break the API.
@cchantep changing public API function signatures is API breaking as well. I'm wondering what the MiMa report would show now.
That's a question of compromise. Removing this class is not required to fix the current issues, but it would break the API (even if we can consider that users should never directly use this public class); the other updated signatures (mainly one) are strictly required changes.
@cchantep oh yeah, but these are public API changes; they break bin compat as well. Have you checked the plugin report, btw?
I see no reason why we can't make a major release (and it looks like this PR, with or without this legacy class, is bin incompatible); value classes + Spark 3.2.0 could be good enough for a major version release.
I think @imarios's opinion about the versioning scheme is important here as well ¯\_(ツ)_/¯ I don't think all frameless minor releases are bin compatible.
Force-pushed from 66f29b3 to 7b460a8
🚀 A cool PR; but before merging I'd like to see the MiMa plugin output (I can generate it myself), to decide on the FramelessLit fate and on versioning.
sbt:frameless> mimaReportBinaryIssues
[info] frameless: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[info] docs: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[error] frameless-dataset: Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error] * static method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error] filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error] * static method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in interface frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error] filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error] * method deriveRecordLast(shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder)frameless.RecordEncoderFields
[error] filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast")
[error] * method deriveRecordCons(shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields in object frameless.RecordEncoderFields's type is different in current version, where it is (shapeless.Witness,frameless.RecordFieldEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields instead of (shapeless.Witness,frameless.TypedEncoder,frameless.RecordEncoderFields)frameless.RecordEncoderFields
[error] filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons")
[error] * static method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in class frameless.functions.package does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error] * method litAggr(java.lang.Object,frameless.TypedEncoder)frameless.TypedAggregate in object frameless.functions.package does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
[error] stack trace is suppressed; run last dataset / mimaReportBinaryIssues for the full output
[error] (dataset / mimaReportBinaryIssues) Failed binary compatibility check against org.typelevel:frameless-dataset_2.12:0.10.1! Found 6 potential problems
[error] Total time: 2 s, completed Sep 6, 2021 7:51:08 AM
Are these less harmful than the …
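For reference, a hedged build.sbt sketch (assuming sbt-mima-plugin's standard keys; the coordinates and filter names are taken from the report above) of how these exclusions would be recorded if the breakage were accepted for the next version:

```scala
// build.sbt sketch: previous artifact to diff against, plus the filters the
// MiMa report above suggests. In the real multi-module build these settings
// would likely live on the frameless-dataset project rather than ThisBuild.
import com.typesafe.tools.mima.core._

ThisBuild / mimaPreviousArtifacts := Set("org.typelevel" %% "frameless-dataset" % "0.10.1")

ThisBuild / mimaBinaryIssueFilters ++= Seq(
  ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordCons"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem]("frameless.RecordEncoderFields.deriveRecordLast"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("frameless.functions.package.litAggr")
)
```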
MiMa results need to be interpreted. Anyway, if we decide to bump to 0.11, that's fine for me (in which case there is no risk to keep …)
@cchantep let's do 0.11 then!
Then https://github.com/typelevel/frameless/pull/449/files could also be included at the same time (no breaking change, but it would make a bump more valuable).
Force-pushed from a50a9d1 to ab7610f
```
- support value class *** FAILED *** (101 milliseconds)
[info] Array(Q(1,[0,1000000005,6d65726f4c]), Q(2,[0,1000000005,6d65726f4c])) was not equal to List(Q(1,Lorem), Q(2,Lorem)) (LitTests.scala:70)
```
Force-pushed from ab7610f to 9ebeca4
Force-pushed from 9ebeca4 to ab19018
Considering the type below:
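(The type definitions were not captured in this excerpt; the shapes below are reconstructed from the tests quoted earlier in this thread, so treat the exact modifiers as an assumption.)

```scala
// Reconstructed from the tests above: a Value class used as a field.
final class Name(val value: String) extends AnyVal

final case class Person(name: Name, age: Int)
```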
Encoding

The `name` field must be declared as `StringType` (not `StructType(StructField("value", ...))`) in the Catalyst representation for `Person` (see failing test).

Possible fix: a `RecordFieldEncoder` typeclass to wrap the individual/field `TypedEncoder`, for the implicit resolution required for `RecordEncoderFields`:

- a `RecordFieldEncoder` instance for any `T: TypedEncoder`;
- a `RecordFieldEncoder` specific for `T <: AnyVal`, with a higher resolution priority.

Literal
A literal created using the `lit` function for a Value class instance generates a `StructType` (see failing test).

Introduce `functions.litValue`:

frameless.functions.litValue(valueClassInstance) // TypedColumn
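A hedged usage sketch follows (the `Name`/`Person` shapes are reconstructed from the tests above, and `litValue` is assumed to resolve its implicits the same way `lit` does in the quoted LitTests snippet):

```scala
import frameless.TypedDataset
import frameless.functions.litValue

// Shapes reconstructed from the tests in this thread (assumptions, not the PR's files).
final class Name(val value: String) extends AnyVal
final case class Person(name: Name, age: Int)

// Replace every `name` with a Value class literal; with `litValue` the column
// keeps the flat (inner-type) representation instead of a one-field struct.
def renameAll(ds: TypedDataset[Person]): TypedDataset[Person] =
  ds.withColumnReplaced('name, litValue(new Name("Lorem")))
```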