Added ParquetCachedBatchSerializer support for Databricks #2880
This file shouldn't be needed now, right? The 311 shim version of this works.
Or is the intention to keep the db one so that the user specifies this one? https://nvidia.github.io/spark-rapids/docs/additional-functionality/cache-serializer.html. If that is the case, we need to update the docs.
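For context, the linked doc has the user opt in by pointing Spark's cache serializer config at the shim-specific class. A minimal sketch of that setup, assuming the spark311 class name shown in the doc (a db-specific variant would differ only in the package segment):

```scala
import org.apache.spark.sql.SparkSession

// Opt in to the RAPIDS cached-batch serializer when the session is created.
// spark.sql.cache.serializer is a static conf, so it must be set at startup.
// The class name is the spark311 shim version from the linked doc; whether a
// separate spark311db name should also be documented is the question here.
val spark = SparkSession.builder()
  .config("spark.sql.cache.serializer",
    "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")
  .getOrCreate()
```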
Originally the intention was that, since the user has to specify it, the class should match the version of Spark they are using, so that it's hopefully the least confusing. I don't know how much it matters whether they specify spark311 vs spark311db.
If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.
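A rough sketch of that idea, with illustrative names only (this is not the plugin's actual shim-resolution machinery): a single user-facing class could detect the running Spark version and reflectively instantiate the matching shim implementation.

```scala
import org.apache.spark.SPARK_VERSION

// Illustrative sketch: one version-agnostic entry point that resolves the
// shim class at runtime. The shim-id derivation and package layout below are
// assumptions for this sketch; real logic would also have to distinguish
// db/cdh builds, which report the same base Spark version.
object ShimSerializerLoader {
  def load(): Any = {
    val shimId = "spark" + SPARK_VERSION.split('.').take(3).mkString // "3.1.1" -> "spark311"
    val className =
      s"com.nvidia.spark.rapids.shims.$shimId.ParquetCachedBatchSerializer"
    Class.forName(className).getDeclaredConstructor().newInstance()
  }
}
```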
If we add spark311db to the documentation, we will then also have to add spark311cdh. I almost feel like they are all Spark 3.1.1, and since we aren't doing anything specific in the extended versions of their PCBS, we should just get rid of them.
This is a good idea; I can look into it as a follow-on.
@jlowe thoughts?
☝️ This. We should not have Spark-specific versions of user-visible classes unless they are truly required (e.g.: as in the shuffle case, unfortunately). If we know one class will work going forward, as is the case with the main executor plugin, then we should strive to use a common class name without a Spark version in it. If this is indeed possible, we should deprecate the old 311 version and eventually remove it.
So it really all comes down to that question. If we can have a common version, my vote is to use the one class. We can change the package name and deprecate the existing spark311 package version in a new PR if it's too tricky to do in this one.
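If the rename does land in a follow-on PR, the deprecation step could look roughly like this, keeping the old versioned name for one release as a thin alias of an assumed common class (all names here are hypothetical):

```scala
package com.nvidia.spark.rapids.shims.spark311

// Hypothetical transitional shim: the old Spark-versioned class survives as a
// deprecated alias for an assumed version-agnostic parent, then gets removed.
@deprecated("Use com.nvidia.spark.ParquetCachedBatchSerializer instead",
  since = "21.10")
class ParquetCachedBatchSerializer
  extends com.nvidia.spark.ParquetCachedBatchSerializer
```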
OK, I can do this as a follow-on. In the interim, do we just update the docs, or remove the spark311db and spark311cdh versions of the serializer? I feel removing the db and cdh versions of the serializer is the way to go, since we will do more work as part of the follow-on.
If we're planning on removing these classes in the near future then we should not document them only to rip them out immediately afterward. Let's keep the number of classes to deprecate to a minimum.