
[SPARK-25789][SQL] Support for Dataset of Avro #22878

Closed
wants to merge 11 commits

Conversation

xuanyuanking
Member

@xuanyuanking xuanyuanking commented Oct 29, 2018

What changes were proposed in this pull request?

Please credit @bdrillard, since this is mainly based on his previous work.

This PR adds support for Datasets of Avro records via an API that allows the user to provide a class to an Encoder for Avro, analogous to the Bean encoder (a usage sketch follows the change list below).

  • Add the ObjectCast and InitializeAvroObject (analogous to InitializeJavaBean) expressions.
  • Add an AvroEncoder for Datasets of Avro records to Spark.
  • Add the type-inference utility AvroTypeInference for mapping Avro objects to SQL DataTypes (analogous to JavaTypeInference).
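
A minimal usage sketch of the proposed API, not the merged implementation. It assumes the AvroEncoder.of factory described in this PR and a hypothetical Avro-generated SpecificRecord class MyAvroRecord (e.g. compiled from an .avsc schema):

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
// AvroEncoder comes from this PR; MyAvroRecord and its builder methods are
// hypothetical, standing in for a class generated by the Avro compiler.

object AvroEncoderUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("avro-encoder-sketch").getOrCreate()

    // Analogous to Encoders.bean(classOf[SomeBean]) for Java beans.
    implicit val encoder: Encoder[MyAvroRecord] = AvroEncoder.of(classOf[MyAvroRecord])

    val records = Seq(MyAvroRecord.newBuilder().setId(1L).setName("a").build())
    val ds: Dataset[MyAvroRecord] = spark.createDataset(records)
    ds.filter(_.getId > 0).show()

    spark.stop()
  }
}
```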

How was this patch tested?

Added unit tests in AvroSuite.scala, plus manual testing with a modified SQLExample using the external Avro package.

@SparkQA

SparkQA commented Oct 29, 2018

Test build #98217 has finished for PR 22878 at commit c70ddb3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SerializableSchema(@transient var value: Schema) extends Externalizable
  • case class InitializeAvroObject(

@dongjoon-hyun
Member

@xuanyuanking, to give credit to @bdrillard correctly, you need to add his commits. The Apache Spark community officially recommends showing co-authorship in commit messages.

Please credit to @bdrillard cause this main

@gatorsmile
Member

cc @gengliangwang

@xuanyuanking
Member Author

@dongjoon-hyun Thanks for your comment; let me see how to achieve this. @bdrillard's commits are based on databricks/spark-avro.

@xuanyuanking
Member Author

also cc @bdrillard, linking this to #21348.

@gengliangwang
Member

gengliangwang commented Oct 30, 2018

@xuanyuanking , thanks for the work!
The following is not working, please ignore it. Instead, push a single commit with @bdrillard as the main author, as @HyukjinKwon suggested.

You can try editing the previous commit message (https://help.github.com/articles/creating-a-commit-with-multiple-authors/) and then push -f.

@HyukjinKwon
Member

HyukjinKwon commented Oct 30, 2018

I wonder if that can be handled by the merge script, though. I think it's okay to just pick up some commits from there and rebase them onto this branch, even if they become empty commits. That makes it easier for committers to put his name as the primary author when it's merged.

@HyukjinKwon
Member

Just quickly and roughly tested. The merge script only seems to recognise the main author of each commit in a PR. Let's just push a commit here.

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98244 has finished for PR 22878 at commit 697813a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Thanks @gengliangwang and @HyukjinKwon. Done in this commit.

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98251 has finished for PR 22878 at commit b06a888.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


@bdrillard bdrillard left a comment


Here's an initial review.

First, I appreciate the efforts of the group here to include me in the commit history of this PR. Given that this is a port of commits from a separate project, I wasn't anticipating that level of commitment, and I'm appreciative of that. Thanks everyone!

To summarize the two main things:

  1. It might be nice to add a test case over SpecificRecord. That would require either generating or importing some Java Avro classes (I link to some I'd made for the Spark-Avro PR in the comment on this topic).
  2. We can do some refactoring of the NewInstance expression to remove the need for separate InitializeAvroObject and InitializeJavaBean expressions. That refactor was prepared in [SPARK-22739][Catalyst] Additional Expression Support for Objects #21348 and I describe it more in a comment here, but if it's considered too large a change for the scope of this PR, I'm happy to create a followup PR for it.

val ds = rdd.toDS()
assert(ds.count() == genericRecords.size)
context.stop()
}
}


The above tests are all for GenericRecord Avro classes. It might be good to generate an Avro class having a schema similar to the GenericRecord described above, so that we can test an instance extending SpecificRecord (which will probably be the most commonly used Avro class for the encoder).

There was one such class in the Spark-Avro project, but I can understand why it may not have been copied over in this PR.

Member Author


Yep, I actually tested the cases you mentioned myself, but it requires adding a lot of Avro-generated code. Moreover, IIUC, testing SpecificRecord only exercises one extra piece of logic, avroClass.getMethod("getClassSchema"), so I don't think we need to add all that generated code for this test. If we really want to cover it, maybe we could add a small SpecificRecord example based on the existing test.avsc? Or, if we just want to show the usage, maybe adding corresponding documentation is enough. WDYT :)
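
A minimal sketch of that "one extra piece of logic": an Avro-generated SpecificRecord class exposes its schema through a static getClassSchema() method, retrievable by reflection. The class name in the example comment is hypothetical.

```scala
import org.apache.avro.Schema

object SpecificRecordSchemaLookup {
  // Generated Avro classes define `public static Schema getClassSchema()`,
  // so the schema can be obtained from the class itself via reflection.
  def schemaFor(avroClass: Class[_]): Schema =
    avroClass.getMethod("getClassSchema").invoke(null).asInstanceOf[Schema]

  // e.g. schemaFor(classOf[MyAvroRecord]) returns the schema embedded in the
  // generated class, whereas a GenericRecord carries its schema at runtime.
}
```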


Yeah, I'm comfortable with that. The rest of the encoder code path would be the same.

* @param args a sequence of expression pairs that will respectively evaluate to the index of
* the record in which to insert, and the argument value to insert
*/
case class InitializeAvroObject(

@bdrillard bdrillard Oct 30, 2018


It's possible to refactor the NewInstance expression, also in this objects class, to support construction of Avro classes, which would eliminate the need for a separate InitializeAvroObject. Interestingly, the same refactor would also generalize in such a way as to allow us to remove the need for a separate InitializeJavaBean expression.

To summarize the change: NewInstance would accept a Seq of Expression for the arguments to the instance's constructor, but also a Seq of (String, Seq[Expression]) tuples, being an ordered list of setter methods and the methods' respective arguments to call after the object has been constructed.

This covers both the construction of Java beans and the construction and instantiation of SpecificRecord.

See the necessary changes to NewInstance, here.

Also an additional clause to TreeNode, here.

And then the changes to JavaTypeInference, here.

If this refactor is considered a bit too complicated for this PR, we can start with an InitializeAvroObject and do some cleanup in a followup. As background, a refactor like this was initially suggested by @cloud-fan, see comment.
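
To make the shape of that generalization concrete, here is a standalone sketch in plain Scala reflection (not Catalyst codegen, and not the actual NewInstance signature): the object is built from constructor arguments and then an ordered list of (setterName, args) pairs is applied, which is what lets a single expression cover both Java beans and Avro SpecificRecord objects.

```scala
object NewInstanceWithSettersSketch {
  def construct(cls: Class[_],
                ctorArgs: Seq[AnyRef],
                setters: Seq[(String, Seq[AnyRef])]): AnyRef = {
    // Pick the constructor matching the argument count (simplified for the sketch).
    val ctor = cls.getConstructors.find(_.getParameterCount == ctorArgs.size).get
    val instance = ctor.newInstance(ctorArgs: _*).asInstanceOf[AnyRef]
    // Apply each setter call in order after construction.
    setters.foreach { case (name, args) =>
      val method = cls.getMethods
        .find(m => m.getName == name && m.getParameterCount == args.size).get
      method.invoke(instance, args: _*)
    }
    instance
  }
}
```

For a bean-style class this would look like construct(classOf[SomeBean], Nil, Seq(("setName", Seq("a")))), with SomeBean standing in for any class that has a no-arg constructor and setters.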

Member Author


Yep, as in my comment on #21348 (comment), AFAIK we can keep the two PRs separate for easier review. There's also some refactoring work on JavaTypeInference after #21348; we need more advice from Wenchen.

val record = new GenericRecordBuilder(schema).build
val row = expressionEncoder.toRow(record)
val recordFromRow = expressionEncoder.resolveAndBind().fromRow(row)
assert(record.toString == recordFromRow.toString)
Member Author


To avoid confusing reviewers, adding more notes here: after adding a map type to this case, record.get(15).equals(recordFromRow.get(15)) is false. This is because the map keys/values in record are Utf8 while those in recordFromRow are CharSequence, so calling map.equals directly returns false. That's why the result is checked via the string representation here.
Avro GenericData.compare():

https://github.com/apache/avro/blob/8d2a2ce10db3fdef107f834a0fe0c9297b043a94/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L965
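
A small sketch of the mismatch being described (Utf8 vs. plain CharSequence), assuming only that org.apache.avro is on the classpath:

```scala
import org.apache.avro.util.Utf8

object Utf8EqualityDemo extends App {
  val avroValue: CharSequence = new Utf8("key")
  val javaValue: CharSequence = "key"

  // Utf8.equals only matches other Utf8 instances, so a map holding Utf8
  // keys/values never equals a map holding plain String values.
  println(avroValue.equals(javaValue))              // false
  // Comparing string representations works, which is why the test above
  // asserts on record.toString == recordFromRow.toString.
  println(avroValue.toString == javaValue.toString) // true
}
```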

@SparkQA

SparkQA commented Oct 31, 2018

Test build #98323 has finished for PR 22878 at commit 3f80ce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2018

Test build #98391 has finished for PR 22878 at commit 9ee695c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@benmccann
Contributor

@gengliangwang thanks for your review on this PR. Do you have any other comments?

@gatorsmile
Member

cc @gengliangwang @cloud-fan

@gengliangwang
Member

retest this please.

@gengliangwang
Member

Overall LGTM. cc @cloud-fan

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100069 has finished for PR 22878 at commit 9ee695c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100093 has finished for PR 22878 at commit 00cb983.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100094 has started for PR 22878 at commit c45d0c1.

Member

@srowen srowen left a comment


General question: is this based on spark-avro, and/or intended to supersede it?

@xuanyuanking
Member Author

@srowen IMO, this isn't based on spark-avro; it's more of a supplement to it.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116318 has finished for PR 22878 at commit dfae1b0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto
Contributor

skonto commented Jan 17, 2020

@xuanyuanking gentle ping.

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117095 has finished for PR 22878 at commit e59e58c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117107 has finished for PR 22878 at commit e59e58c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@xuanyuanking
Member Author

The failed test, CSVSuite's "SPARK-23786: warning should be printed if CSV header doesn't conform to schema", passes locally.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117117 has finished for PR 22878 at commit e59e58c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@skonto
Contributor

skonto commented Jan 22, 2020

Thank you @xuanyuanking. Is this something we can backport to 2.4.x? Do you see any issues?

@github-actions

github-actions bot commented May 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 2, 2020
@github-actions github-actions bot closed this May 3, 2020
@anleib

anleib commented Jul 12, 2022

Can this be re-opened? Is there someone who can take it across the finish line? This PR seems really close. It would be hugely useful to those of us working with Kafka + Spark Structured Streaming + specific Avro types.

@sadikovi
Contributor

Yes, this looks very interesting and could be beneficial for the community. We may need to handle a few corner cases before merging it, but we surely can merge.

Since this is a new feature, it could be difficult to make a case for backporting it to any existing release branches. I will take over this work and see it through.

@dongjoon-hyun
Member

Feel free to take over by opening your own PR, @anleib. As we know, this PR is an ancient one.

@xkrogen
Contributor

xkrogen commented Aug 16, 2022

At LinkedIn we've been using a fork of this PR for many years, and we have a number of internal enhancements on top of it. I'm happy to put up a PR to open-source our work, but past conversations on this PR indicate some resistance to bringing it into the Spark project (e.g. this comment), and past dev-list discussions (here and here) haven't generated much interest.

@HyukjinKwon -- do you still have general concerns with this work being pulled into Spark?
@dongjoon-hyun -- I'm hesitant to put effort into this without support from a committer and/or PMC member who is willing to help shepherd it along. Would you be willing to help push it through if I were to actively pursue a new PR?

@dongjoon-hyun
Member

@xkrogen To be clear about your request, I don't actually have any preference for this one, so I don't think I can help you push it through.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
The reason for the move is that the next-gen DaliSpark reader needs some Spark Avro 2.4 features.
Basically moving apache#22878; also fixed some glitches during the move so that
AvroEncoder can recognize the existing internal SQL API.

Fixed an API change.

RB=1559999
BUG=LIHADOOP-44097
G=superfriends-reviewers
R=edlu,fli,yezhou,mshen
A=fli