[SPARK-33298][CORE] Introduce new API to FileCommitProtocol to allow flexible file naming #33012
Conversation
cc @cloud-fan could you help take a look once you have time? Thanks.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140119 has finished for PR 33012 at commit
```scala
 * partitioning. The "spec" parameter specifies the file name. The rest are left to the commit
 * protocol implementation to decide.
 *
 * Important: it is the caller's responsibility to add uniquely identifying content to "spec"
```
This is not true; the commit protocol needs to guarantee name uniqueness, as the caller side only gives a prefix and suffix.
The whole sentence is:

> it is the caller's responsibility to add uniquely identifying content to "spec" if a task is going to write out multiple files to the same dir.

I think this refers to the case when, in one task, the caller calls `newTaskTempFile()` multiple times with the same spec; the caller should not expect the commit protocol to return a different, unique file path every time. The current implementation of `HadoopMapReduceCommitProtocol` would return the same path if `ext` is the same. This is a copy-paste from the original comment of `newTaskTempFile`.
I see!
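To make the uniqueness point concrete, here is a minimal sketch (an assumed shape, not the exact Spark code) of how a `HadoopMapReduceCommitProtocol`-style implementation could derive the name. Because the name is a pure function of the task attempt id and the spec, two calls in the same task with the same spec return the same path:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// `FileNameSpec(prefix, ext)` is the case class introduced in this PR; `jobId`
// is assumed to be a job-unique id captured when the protocol is set up.
def getFilename(taskContext: TaskAttemptContext, jobId: String, spec: FileNameSpec): String = {
  val split = taskContext.getTaskAttemptID.getTaskID.getId
  // Pure function of (split, jobId, spec): same inputs => same file name.
  f"${spec.prefix}part-$split%05d-$jobId${spec.ext}"
}
```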
```scala
 * @param prefix Prefix of file.
 * @param ext Extension of file.
 */
final case class FileNameSpec(prefix: String, ext: String)
```
Shall we rename `ext` to `suffix`? In reality it includes more than the extension, such as the file count.
@cloud-fan - agree, updated.
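For reference, after the rename the spec would look like this (a sketch of the shape):

```scala
/**
 * @param prefix Prefix of file.
 * @param suffix Suffix of file (more than the extension, e.g. it can include a file count).
 */
final case class FileNameSpec(prefix: String, suffix: String)
```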
Hi, @c21 and @cloud-fan.
Why do we need to deprecate the old APIs? The Apache Spark community doesn't delete deprecated APIs in general. For a deprecation in Apache Spark 3.2.0, I believe we need another discussion on the community mailing list. Could you remove the deprecation part from this PR?
Also, cc @sunchao because this PR is about bucketing.
It's not a public API (it's under the package `org.apache.spark.internal.io`). The old API needs to be deprecated as it's not fully functional now. To support Hive bucketed tables, Spark needs to ask the commit protocol to add a certain prefix to the file name, and the old API can never know this prefix requirement.
@dongjoon-hyun - this is not a public API. With this PR, no existing third-party library will be broken. The new API is a superset of the existing API; third parties can move to the new API in the future, but they can also stay on the old API if they need to.
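One plausible shape of that compatibility story, sketched here as an assumption (the PR's actual default may differ): the new overload delegates to the old one when no prefix is requested, so implementations that only know the old method keep working.

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

abstract class FileCommitProtocol extends Serializable {
  // Old API: pre-3.2.0 subclasses implement this one.
  def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

  // New API (using the FileNameSpec sketched above) with a backward-compatible
  // default: fall back to the old method unless a prefix is actually required.
  def newTaskTempFile(
      taskContext: TaskAttemptContext,
      dir: Option[String],
      spec: FileNameSpec): String = {
    if (spec.prefix.isEmpty) {
      newTaskTempFile(taskContext, dir, spec.suffix)
    } else {
      throw new UnsupportedOperationException(
        s"$getClass does not support file name prefixes")
    }
  }
}
```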
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #140170 has finished for PR 33012 at commit
Ya, I know this is internal, but are we able to remove that API in the future, at Apache Spark 3.3.0?
Hmmm, shall we simply add
For the deprecation, I think either way is fine. We can simply remove them and wait for the discussion conclusion, as @dongjoon-hyun mentioned.
OK, since it's a semi-developer API, maybe we don't need to deprecate. We can have a discussion on the dev list if we need to remove the old APIs one day.
Then we need to break the API again if we need to customize the file name more in the future.
@cloud-fan It's just two functions for the temp file/path creation. Introducing a new class for this seems over-designed.
API design comes down to personal taste; I'll leave the decision to @c21.
@gengliangwang - I agree with @cloud-fan. The
@dongjoon-hyun - based on people's opinions here, I removed the deprecation.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #140218 has finished for PR 33012 at commit
@dongjoon-hyun - sorry for pinging again, as it's close to the branch cut. Could you help take a look again? Thanks.
+1, LGTM. Thank you, @c21, @cloud-fan, @gengliangwang.
Thank you all for review!
```scala
 *
 * This API should be implemented and called, instead of
 * [[newTaskTempFile(taskContext, dir, ext)]]. Provide a default implementation here to be
 * backward compatible with custom [[FileCommitProtocol]] implementations before Spark 3.2.0.
```
I think we should mark this as an API (e.g., `@Unstable`) ... this is currently internal.
It's a bit weird to talk about "backward compatible" here. If this isn't an API, we should explicitly mention that it isn't an API, at least here, to avoid giving a false impression to devs.
@HyukjinKwon - will it look better as below?

> This method should be implemented and called, instead of
> [[newTaskTempFile(taskContext, dir, ext)]]. Provide a default implementation here to be
> compatible with custom [[FileCommitProtocol]] implementations before Spark 3.2.0.
Shall we mark it as `@Unstable` for now, while explaining the context a bit? e.g., this class is exposed as an API considering the usage by many downstream custom implementations, but will be subject to change and/or move.
cc @cloud-fan to get more feedback
I'm fine to mark it `Unstable`. It doesn't hurt anyway.
Sounds good. Let me create a PR now.
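The resulting annotation would look roughly like this (`org.apache.spark.annotation.Unstable` is Spark's existing annotation; the doc wording here is illustrative):

```scala
import org.apache.spark.annotation.Unstable

/**
 * Exposed as an API because many downstream projects implement custom commit
 * protocols, but subject to change and/or move between releases.
 */
@Unstable
abstract class FileCommitProtocol extends Serializable {
  // ... methods as discussed above ...
}
```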
…rotocol`

### What changes were proposed in this pull request?

This is the followup from #33012 (comment), where we want to add `Unstable` to `FileCommitProtocol`, to give people a better idea of the API.

### Why are the changes needed?

Make it easier for people to follow and understand code. Clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, as there is no real logic change.

Closes #33148 from c21/bucket-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ormat with Hive hash)

### What changes were proposed in this pull request?

This is a re-work of #30003; here we add support for writing Hive bucketed tables with the Parquet/ORC file format (data source v1 write path and Hive hash as the hash function). Support for Hive's other file formats will be added in a follow-up PR.

The changes are mostly on:

* `HiveMetastoreCatalog.scala`: When converting a Hive table relation to a data source relation, pass bucket info (`BucketSpec`) and other Hive-related info as options into `HadoopFsRelation` and `LogicalRelation`, which can later be accessed by `FileFormatWriter` to customize the bucket id and file name.
* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` if it's writing to a Hive bucketed table. In addition, the Spark output file name should follow the Hive/Presto/Trino bucketed file naming convention. Introduce another parameter `bucketFileNamePrefix`, which introduces a subsequent change in `FileFormatDataWriter`.
* `HadoopMapReduceCommitProtocol`: Implement the new file name APIs introduced in #33012, and change its subclass `PathOutputCommitProtocol`, to make Hive bucketed table writing work with all commit protocols (including the S3A commit protocol).

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Currently a Spark bucketed table cannot be leveraged by other SQL engines like Hive and Presto, because it uses a different hash function (Spark murmur3hash) and a different file name scheme. With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, join, group-by, etc. This was and is blocking several companies (confirmed from Facebook, Lyft, etc.) from migrating bucketing workloads from Hive to Spark.

### Does this PR introduce _any_ user-facing change?

Yes, any Hive bucketed table (with Parquet/ORC format) written by Spark is properly bucketed and can be efficiently processed by Hive and Presto/Trino.

### How was this patch tested?

* Added a unit test in `BucketedWriteWithHiveSupportSuite.scala` to verify bucket file names and that each row in each bucket is written properly.
* Tested by the Lyft Spark team (Shashank Pedamallu) to read a Spark-written bucketed table from Trino, Spark and Hive.

Closes #33432 from c21/hive-bucket-v1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
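As a rough illustration of the naming difference that motivates `bucketFileNamePrefix` (the exact formats below are assumptions sketched from the convention, not the production code):

```scala
// Spark's default bucketed output encodes the bucket id inside the name, e.g.
//   part-00000-<uuid>_00003.c000.snappy.parquet   (bucket 3)
// while Hive/Presto/Trino expect a zero-padded bucket-id prefix, e.g.
//   00003_0_<uuid>
// so the writer can pass such a prefix through FileNameSpec:
def hiveBucketFilePrefix(bucketId: Int): String = f"$bucketId%05d_0_"
```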
### What changes were proposed in this pull request?

This PR is to introduce a new set of APIs `newTaskTempFile` and `newTaskTempFileAbsPath` inside `FileCommitProtocol`, to allow more flexible file naming of Spark output. The major change is to pass a `FileNameSpec` into `FileCommitProtocol`, instead of the original `ext` (currently having `prefix` and `ext`), to allow individual `FileCommitProtocol` implementations to come up with more flexible file names (e.g. with a custom `prefix`) for Hive/Presto bucketing - #30003. Default implementations of the added APIs are provided, so NO existing implementation of `FileCommitProtocol` is broken.

### Why are the changes needed?

To make the commit protocol more flexible in terms of Spark output file names.

Pre-requisite of #30003.

### Does this PR introduce _any_ user-facing change?

Yes, for developers who implement/run a custom implementation of `FileCommitProtocol`. They can choose to implement the newly added API.

### How was this patch tested?

Existing unit tests, as this is just adding an API.
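To make the developer-facing change concrete, here is a hypothetical caller-side usage (the `committer` and `taskAttemptContext` bindings and the literal prefix/suffix values are illustrative, not from the PR):

```scala
// Ask the commit protocol for a task-temporary file, now with an optional
// prefix in addition to the suffix/extension:
val spec = FileNameSpec(prefix = "00003_0_", suffix = ".c000.snappy.parquet")
val path: String = committer.newTaskTempFile(taskAttemptContext, dir = None, spec)
```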