Added ParquetCachedBatchSerializer support for Databricks #2880
Conversation
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@NvTimLiu can you take an especially close look at run-tests.py?
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
@razajafri As per tests using the fix jar, for example,
Should we worry about that extra …?
@viadea This is definitely something that I will look into. Can we file an issue for it?
There is a mismatch in the number of params in the Databricks test script when running the Databricks nightly test job or nightly build job: the test job takes 1 to 3 params, while the build job requires only 1 to 2. This makes it hard to match up params between the test job and the build job. To fix the issue, we explicitly export vars for the test script instead of taking shell params. This approach also makes it easier to extend vars/params for the test script. Signed-off-by: Tim Liu <timl@nvidia.com>
LGTM. I can't approve because I am the author; @tgravescs will have to do the honors if he thinks this is good. @tgravescs, are we intentionally not calling …?
Yes, it was intentional. I'd have to go back and verify, but I don't think there was any overlap; I think Databricks overrides everything with its own version, so it didn't do any good.
Even if Databricks overrode everything, wouldn't we want to still try to pick up the base version in case a new shim for an exec appears that we can just reuse in the future?
Sure, then I think we can remove the spark311db version of ParquetCachedBatchSerializer altogether.
OK, I will push another update to this PR where we will get everything from the parent just like in all other shims |
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
Looks OK to me, assuming this was tested on Databricks.
import com.nvidia.spark.rapids.shims

class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
This file shouldn't be needed now, right? The 311 shim version of this works.
Or is the intention to keep the db one so that the user specifies this one? https://nvidia.github.io/spark-rapids/docs/additional-functionality/cache-serializer.html. If that is the case we need to update the docs.
Originally the intention was that, since the user has to specify it, the class should match the version of Spark they are using so that it's hopefully least confusing. I don't know how much it matters if they specify spark311 vs spark311db.
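For context on what "the user specifies this one" means in practice, here is a minimal sketch of the setup the linked cache-serializer doc describes, assuming the spark311 shim class name shown earlier in this PR. spark.sql.cache.serializer is a static conf, so it has to be set before the SparkSession starts:

    import org.apache.spark.sql.SparkSession

    // Enable the RAPIDS plugin and point Spark's cache framework at the
    // shim-specific serializer (spark311 here; the question in this thread is
    // which class name the user should have to write).
    val spark = SparkSession.builder()
      .appName("pcbs-example")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      .config("spark.sql.cache.serializer",
        "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")
      .getOrCreate()

    // df.cache() now stores the cached data as compressed Parquet batches.
    spark.range(0L, 1000000L).selectExpr("id", "id % 10 AS k").cache().count()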
If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.
This file shouldn't be needed now, right? The 311 shim version of this works. Or is the intention to keep the db one so that the user specifies this one? https://nvidia.github.io/spark-rapids/docs/additional-functionality/cache-serializer.html. If that is the case we need to update the docs.
Originally the intention was that, since the user has to specify it, the class should match the version of Spark they are using so that it's hopefully least confusing. I don't know how much it matters if they specify spark311 vs spark311db.
If we add spark311db to the documentation we will then also have to add spark311cdh. I almost feel like they are all Spark 311, and since we aren't doing anything specific in the extended versions of their PCBS, we should just get rid of them.
If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.
This is a good idea, I can look into it as a follow-on
@jlowe thoughts?
If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.
☝️ This. We should not have Spark-specific versions of user-visible classes unless they are truly required (e.g.: as in the shuffle case, unfortunately). If we know one class will work going forward, as is the case with the main executor plugin, then we should strive to use a common class name without a Spark version in it. If this is indeed possible, we should deprecate the old 311 version and eventually remove it.
So it really all comes down to that question. If we can have a common version, my vote is to use the one class. We can change the package name and deprecate the existing spark311 package version in a new PR if it's too tricky to do in this one.
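To make the single-class idea concrete, here is a rough, purely hypothetical sketch of a version-agnostic entry point. The resolver object, its method names, and the version matching are invented for illustration; only the shim packages are the ones discussed in this PR:

    // Hypothetical sketch: one user-facing name that loads the proper
    // shim-specific ParquetCachedBatchSerializer at runtime, so users never
    // have to choose between spark311, spark311db, and spark311cdh.
    object CachedSerializerShimResolver {
      // Map the running Spark version to the shim package assumed to provide
      // the serializer (illustrative, not exhaustive).
      private def shimPackage(sparkVersion: String, isDatabricks: Boolean): String =
        (sparkVersion, isDatabricks) match {
          case (v, true) if v.startsWith("3.1.1") => "spark311db"
          case (v, _) if v.startsWith("3.1.1") => "spark311"
          case (v, _) if v.startsWith("3.1.2") => "spark312"
          case (v, _) =>
            throw new IllegalArgumentException(s"No cache serializer shim for Spark $v")
        }

      // Instantiate the shim class reflectively so the user-facing class name
      // never changes when a new Spark version gets a shim.
      def loadSerializer(sparkVersion: String, isDatabricks: Boolean): AnyRef = {
        val pkg = shimPackage(sparkVersion, isDatabricks)
        val className =
          s"com.nvidia.spark.rapids.shims.$pkg.ParquetCachedBatchSerializer"
        Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
      }
    }

A common, unversioned ParquetCachedBatchSerializer could then implement Spark's CachedBatchSerializer API by delegating every call to the instance such a resolver returns, which is roughly the shape the deprecation plan above implies.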
OK. I can do this as a follow-on. In the interim, do we just update the doc, or remove the spark311db and spark311cdh versions of the serializer? I feel removing the db and cdh versions of the serializer is the way to go, as we will do more work as part of the follow-on.
If we're planning on removing these classes in the near future then we should not document them only to rip them out immediately afterward. Let's keep the number of classes to deprecate to a minimum.
Looks OK to me assuming the other Spark-specific cache serializers (e.g.: spark312) will be removed in a followup.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
.../spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala
shims/spark311db/src/main/scala/com/nvidia/spark/rapids/shims/spark311db/Spark311dbShims.scala
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
Looks like some more unneeded imports in the Databricks shim.
…spark311db/Spark311dbShims.scala Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…spark311db/Spark311dbShims.scala Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build
This PR adds ParquetCachedBatchSerializer support for Databricks and adds nightly tests.
fixes #2856