
[SPARK-25789][SQL] Support for Dataset of Avro #22878

Closed
wants to merge 11 commits

Conversation

xuanyuanking
Member

@xuanyuanking xuanyuanking commented Oct 29, 2018

What changes were proposed in this pull request?

Please credit @bdrillard, since this is mainly based on his previous work.

This PR adds support for Datasets of Avro records via an API that allows the user to provide a class to an Encoder for Avro, analogous to the Bean encoder (a usage sketch follows the change list below).

  • Add the ObjectCast and InitializeAvroObject (analogous to InitializeJavaBean) expressions.
  • Add an AvroEncoder for Datasets of Avro records to Spark.
  • Add the type-inference utility AvroTypeInference for mapping Avro objects to SQL DataTypes (analogous to JavaTypeInference).
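
A minimal usage sketch of the proposed API, not the merged implementation. It assumes the AvroEncoder.of factory described in this PR and a hypothetical Avro-generated SpecificRecord class MyAvroRecord (e.g. compiled from an .avsc schema):

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
// AvroEncoder comes from this PR; MyAvroRecord and its builder methods are
// hypothetical, standing in for a class generated by the Avro compiler.

object AvroEncoderUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("avro-encoder-sketch").getOrCreate()

    // Analogous to Encoders.bean(classOf[SomeBean]) for Java beans.
    implicit val encoder: Encoder[MyAvroRecord] = AvroEncoder.of(classOf[MyAvroRecord])

    val records = Seq(MyAvroRecord.newBuilder().setId(1L).setName("a").build())
    val ds: Dataset[MyAvroRecord] = spark.createDataset(records)
    ds.filter(_.getId > 0).show()

    spark.stop()
  }
}
```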

How was this patch tested?

Added unit tests in AvroSuite.scala, plus manual testing with a modified SQLExample using the external Avro package.

@SparkQA

SparkQA commented Oct 29, 2018

Test build #98217 has finished for PR 22878 at commit c70ddb3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SerializableSchema(@transient var value: Schema) extends Externalizable
  • case class InitializeAvroObject(

@dongjoon-hyun
Member

@xuanyuanking, to give credit to @bdrillard correctly, you need to add his commits. The Apache Spark community officially recommends showing co-authorship in commit messages.

Please credit to @bdrillard cause this main

@gatorsmile
Member

cc @gengliangwang

@xuanyuanking
Member Author

@dongjoon-hyun Thanks for your comment; let me see how to achieve this. @bdrillard's commits are based on databricks/spark-avro.

@xuanyuanking
Member Author

also cc @bdrillard, linking this to #21348.

@gengliangwang
Member

gengliangwang commented Oct 30, 2018

@xuanyuanking , thanks for the work!
The following is not working, please ignore it. Instead, push a single commit with @bdrillard as the main author, as @HyukjinKwon suggested.

You can try editing the previous commit message (https://help.github.com/articles/creating-a-commit-with-multiple-authors/) and then push -f.

@HyukjinKwon
Member

HyukjinKwon commented Oct 30, 2018

I wonder if that can be handled by the merge script, though. I think it's okay to just pick up some commits from there and rebase them onto this branch, even if they become empty commits. That makes it easier for committers to put his name as the primary author when it's merged.

@HyukjinKwon
Member

Just quickly and roughly tested. The merge script only seems to recognise the main author of each commit in a PR. Let's just push a commit here.

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98244 has finished for PR 22878 at commit 697813a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Thanks @gengliangwang and @HyukjinKwon. Done in this commit.

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98251 has finished for PR 22878 at commit b06a888.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


@bdrillard bdrillard left a comment


Here's an initial review.

First, I appreciate the efforts of the group here to include me in the commit history of this PR. Given that this is a port of commits from a separate project, I wasn't anticipating that level of commitment, and I'm appreciative of that. Thanks everyone!

To summarize the two main things:

  1. It might be nice to add a test case over SpecificRecord. That would require either generating or importing some Java Avro classes (I link to some I'd made for the Spark-Avro PR in the comment on this topic).
  2. We can do some refactoring of the NewInstance expression to remove the need for separate InitializeAvroObject and InitializeJavaBean expressions. That refactor was prepared in [SPARK-22739][Catalyst] Additional Expression Support for Objects #21348 and I describe it more in a comment here, but if it's considered too large a change for the scope of this PR, I'm happy to create a followup PR for it.

val ds = rdd.toDS()
assert(ds.count() == genericRecords.size)
context.stop()
}
}


The above tests are all for GenericRecord Avro classes. It might be good to generate an Avro class having a schema similar to the GenericRecord described above, so that we can test an instance extending SpecificRecord (which will probably be the most commonly used Avro class for the encoder).

There was one such class in the Spark-Avro project, but I can understand why it may not have been copied over in this PR.

Member Author


Yep, I actually tested the cases you mentioned myself, but it requires adding a lot of Avro-generated code. Moreover, IIUC, testing SpecificRecord only exercises one extra piece of logic, avroClass.getMethod("getClassSchema"), so I don't think we need to add all that generated code for this test. If we really want to cover it, maybe we could add a small SpecificRecord example based on the existing test.avsc? Or, if we just want to show the usage, maybe adding corresponding documentation is enough. WDYT :)
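
A minimal sketch of that "one extra piece of logic": an Avro-generated SpecificRecord class exposes its schema through a static getClassSchema() method, retrievable by reflection. The class name in the example comment is hypothetical.

```scala
import org.apache.avro.Schema

object SpecificRecordSchemaLookup {
  // Generated Avro classes define `public static Schema getClassSchema()`,
  // so the schema can be obtained from the class itself via reflection.
  def schemaFor(avroClass: Class[_]): Schema =
    avroClass.getMethod("getClassSchema").invoke(null).asInstanceOf[Schema]

  // e.g. schemaFor(classOf[MyAvroRecord]) returns the schema embedded in the
  // generated class, whereas a GenericRecord carries its schema at runtime.
}
```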


Yeah, I'm comfortable with that. The rest of the encoder code path would be the same.

* @param args a sequence of expression pairs that will respectively evaluate to the index of
* the record in which to insert, and the argument value to insert
*/
case class InitializeAvroObject(

@bdrillard bdrillard Oct 30, 2018


It's possible to refactor the NewInstance expression, also in this objects class, to support construction of Avro classes, which would eliminate the need for a separate InitializeAvroObject. Interestingly, the same refactor would also generalize in such a way as to allow us to remove the need for a separate InitializeJavaBean expression.

To summarize the change: NewInstance would accept a Seq of Expression for the arguments to the instance's constructor, but also a Seq of (String, Seq[Expression]) tuples, being an ordered list of setter methods and the methods' respective arguments to call after the object has been constructed.

This covers both the construction of Java beans and the construction and instantiation of SpecificRecord.

See the necessary changes to NewInstance, here.

Also an additional clause to TreeNode, here.

And then the changes to JavaTypeInference, here.

If this refactor is considered a bit too complicated for this PR, we can start with an InitializeAvroObject and do some cleanup in a followup. As background, a refactor like this was initially suggested by @cloud-fan, see comment.
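
To make the shape of that generalization concrete, here is a standalone sketch in plain Scala reflection (not Catalyst codegen, and not the actual NewInstance signature): the object is built from constructor arguments and then an ordered list of (setterName, args) pairs is applied, which is what lets a single expression cover both Java beans and Avro SpecificRecord objects.

```scala
object NewInstanceWithSettersSketch {
  def construct(cls: Class[_],
                ctorArgs: Seq[AnyRef],
                setters: Seq[(String, Seq[AnyRef])]): AnyRef = {
    // Pick the constructor matching the argument count (simplified for the sketch).
    val ctor = cls.getConstructors.find(_.getParameterCount == ctorArgs.size).get
    val instance = ctor.newInstance(ctorArgs: _*).asInstanceOf[AnyRef]
    // Apply each setter call in order after construction.
    setters.foreach { case (name, args) =>
      val method = cls.getMethods
        .find(m => m.getName == name && m.getParameterCount == args.size).get
      method.invoke(instance, args: _*)
    }
    instance
  }
}
```

For a bean-style class this would look like construct(classOf[SomeBean], Nil, Seq(("setName", Seq("a")))), with SomeBean standing in for any class that has a no-arg constructor and setters.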

Member Author


Yep, as in my comment on #21348 (comment), AFAIK we can keep the two PRs separate for easier review. There's also some refactoring work on JavaTypeInference after #21348; we need more advice from Wenchen.

val record = new GenericRecordBuilder(schema).build
val row = expressionEncoder.toRow(record)
val recordFromRow = expressionEncoder.resolveAndBind().fromRow(row)
assert(record.toString == recordFromRow.toString)
Member Author


To avoid confusing reviewers, adding more notes here: after adding a map type to this case, record.get(15).equals(recordFromRow.get(15)) is false. This is because the map keys/values in record are Utf8 while those in recordFromRow are CharSequence, so calling map.equals directly returns false. That's why the result is checked via the string representation here.
Avro GenericData.compare():

https://github.com/apache/avro/blob/8d2a2ce10db3fdef107f834a0fe0c9297b043a94/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L965
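
A small sketch of the mismatch being described (Utf8 vs. plain CharSequence), assuming only that org.apache.avro is on the classpath:

```scala
import org.apache.avro.util.Utf8

object Utf8EqualityDemo extends App {
  val avroValue: CharSequence = new Utf8("key")
  val javaValue: CharSequence = "key"

  // Utf8.equals only matches other Utf8 instances, so a map holding Utf8
  // keys/values never equals a map holding plain String values.
  println(avroValue.equals(javaValue))              // false
  // Comparing string representations works, which is why the test above
  // asserts on record.toString == recordFromRow.toString.
  println(avroValue.toString == javaValue.toString) // true
}
```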

@SparkQA

SparkQA commented Oct 31, 2018

Test build #98323 has finished for PR 22878 at commit 3f80ce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2018

Test build #98391 has finished for PR 22878 at commit 9ee695c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@benmccann
Contributor

@gengliangwang thanks for your review on this PR. Do you have any other comments?

@gatorsmile
Member

cc @gengliangwang @cloud-fan

@gengliangwang
Member

retest this please.

@gengliangwang
Member

Overall LGTM. cc @cloud-fan

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100069 has finished for PR 22878 at commit 9ee695c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100093 has finished for PR 22878 at commit 00cb983.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100094 has started for PR 22878 at commit c45d0c1.

Member

@srowen srowen left a comment


General question: is this based on spark-avro, and/or intended to supersede it?

@xuanyuanking
Member Author

@srowen IMO, this isn't based on spark-avro; it's more of a supplement to it.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116318 has finished for PR 22878 at commit dfae1b0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto
Contributor

skonto commented Jan 17, 2020

@xuanyuanking gentle ping.

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117095 has finished for PR 22878 at commit e59e58c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117107 has finished for PR 22878 at commit e59e58c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@xuanyuanking
Member Author

The failed test, CSVSuite's "SPARK-23786: warning should be printed if CSV header doesn't conform to schema", passes locally.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117117 has finished for PR 22878 at commit e59e58c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroEncoderSuite extends SharedSparkSession

@skonto
Contributor

skonto commented Jan 22, 2020

Thank you @xuanyuanking. Is this something we can backport to 2.4.x? Do you see any issues?

@github-actions

github-actions bot commented May 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 2, 2020
@github-actions github-actions bot closed this May 3, 2020
@anleib

anleib commented Jul 12, 2022

Can this be re-opened? Is there someone who can take it across the finish line? This PR seems really close. It would be hugely useful to those of us working with Kafka + Spark Structured Streaming + specific Avro types.

@sadikovi
Contributor

Yes, this looks very interesting and could be beneficial for the community. We may need to handle a few corner cases before merging it, but we surely can merge.

Since this is a new feature, it could be difficult to make a case for backporting it to any existing release branches. I will take over this work and see it through.

@dongjoon-hyun
Member

Feel free to take over by opening your own PR, @anleib. As we know, this PR is an ancient one.

@xkrogen
Contributor

xkrogen commented Aug 16, 2022

At LinkedIn we've been using a fork of this PR for many years, and we have a number of internal enhancements on top of it. I'm happy to put up a PR to open-source our work, but past conversations on this PR indicate some resistance to bringing it into the Spark project (e.g. this comment), and past dev-list discussions (here and here) haven't generated much interest.

@HyukjinKwon -- do you still have general concerns with this work being pulled into Spark?
@dongjoon-hyun -- I'm hesitant to put effort into this without support from a committer and/or PMC member who is willing to help shepherd it along. Would you be willing to help push it through if I were to actively pursue a new PR?

@dongjoon-hyun
Member

@xkrogen To be clear about your request, I don't actually have any preference for this one, so I don't think I can help you push it through.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
The reason for the move is that the next-gen DaliSpark reader needs some Spark Avro 2.4 features.
Basically moving apache#22878; also fixed some glitches during the move so that
AvroEncoder can recognize the existing internal SQL API.

Fixed an API change.

RB=1559999
BUG=LIHADOOP-44097
G=superfriends-reviewers
R=edlu,fli,yezhou,mshen
A=fli