
[SPARK-33298][CORE] Introduce new API to FileCommitProtocol allow flexible file naming #33012

Closed (wants to merge 4 commits)

Conversation

c21 (Contributor) commented Jun 22, 2021

What changes were proposed in this pull request?

This PR introduces a new set of APIs, newTaskTempFile and newTaskTempFileAbsPath, inside FileCommitProtocol to allow more flexible file naming of Spark output. The major change is to pass a FileNameSpec (currently holding a prefix and an ext) into FileCommitProtocol, instead of the original ext alone, so that individual FileCommitProtocol implementations can come up with more flexible file names (e.g. with a custom prefix) for Hive/Presto bucketing - #30003. Default implementations of the added APIs are provided, so NO existing implementation of FileCommitProtocol is broken.
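
For orientation, here is a minimal sketch of the proposed API shape. The FileNameSpec fields match the class reported by the test bot below; the stand-in class name FileCommitProtocolSketch and the body of the default implementation are illustrative assumptions, not the exact committed code:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// File name spec passed to the commit protocol ("ext" was later renamed
// "suffix" during review, see below).
final case class FileNameSpec(prefix: String, ext: String)

// Simplified stand-in for FileCommitProtocol (the real class has many more
// members, plus an analogous newTaskTempFileAbsPath pair).
abstract class FileCommitProtocolSketch {
  // Old API: only the suffix ("ext") of the file name is customizable.
  def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String

  // New API: both prefix and suffix are customizable. The default body below
  // is an assumption for illustration: it delegates to the old API, so
  // existing implementations that only override the old method keep working.
  def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], spec: FileNameSpec): String = {
    require(spec.prefix.isEmpty,
      "override the spec-based API to support file name prefixes")
    newTaskTempFile(taskContext, dir, spec.ext)
  }
}
```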

Why are the changes needed?

To make commit protocol more flexible in terms of Spark output file name.
Pre-requisite of #30003.

Does this PR introduce any user-facing change?

Yes, for developers who implement/run a custom implementation of FileCommitProtocol. They can choose to implement the newly added APIs.

How was this patch tested?

Existing unit tests, as this just adds an API.

c21 (Contributor, Author) commented Jun 22, 2021

cc @cloud-fan could you help take a look once you have time? Thanks.

@github-actions github-actions bot added the CORE label Jun 22, 2021
@c21 c21 changed the title from "Introduce new API to FileCommitProtocol allow flexible file naming" to "[SPARK-33298][CORE] Introduce new API to FileCommitProtocol allow flexible file naming" Jun 22, 2021
SparkQA commented Jun 22, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44646/

SparkQA commented Jun 22, 2021

Test build #140119 has finished for PR 33012 at commit 8cc4899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final case class FileNameSpec(prefix: String, ext: String)

* partitioning. The "spec" parameter specifies the file name. The rest are left to the commit
* protocol implementation to decide.
*
* Important: it is the caller's responsibility to add uniquely identifying content to "spec"
A Contributor commented:

This is not true; the commit protocol needs to guarantee name uniqueness, as the caller side only gives a prefix and a suffix.

c21 (Contributor, Author) replied Jun 23, 2021:

The whole sentence is:

it is the caller's responsibility to add uniquely identifying content to "spec"
if a task is going to write out multiple files to the same dir.

I think this refers to the case where, in one task, the caller calls newTaskTempFile() multiple times with the same spec; the caller should not expect the commit protocol to return a unique file path each time. The current implementation of HadoopMapReduceCommitProtocol returns the same path if ext is the same. This sentence is copied from the original comment of newTaskTempFile.
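
For illustration, a hypothetical caller-side sketch of that convention (the counter-in-suffix naming mirrors the "file count" mentioned later in this review; all names here are invented):

```scala
// Hypothetical helper: a task that writes several files to the same directory
// makes each spec unique itself, e.g. by embedding a per-task file counter in
// the suffix, since the protocol may return the same path for identical specs.
class TaskFileNamer {
  private var fileCounter = 0

  def nextSpec(): FileNameSpec = {
    val spec = FileNameSpec(prefix = "", ext = f"-c$fileCounter%03d.snappy.parquet")
    fileCounter += 1
    spec
  }
}
```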

The Contributor replied:

I see!

* @param prefix Prefix of file.
* @param ext Extension of file.
*/
final case class FileNameSpec(prefix: String, ext: String)
cloud-fan (Contributor) commented:

Shall we rename ext to suffix? In reality it includes more than the extension, such as the file count.

c21 (Contributor, Author) replied:

@cloud-fan - agree, updated.
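
Presumably the spec then reads roughly as follows (a sketch of the agreed rename, not the exact committed code):

```scala
/**
 * Spec of the file name to create, passed into the commit protocol.
 *
 * @param prefix Prefix of the file.
 * @param suffix Suffix of the file; in reality this carries more than the
 *               extension, e.g. the per-task file count.
 */
final case class FileNameSpec(prefix: String, suffix: String)
```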

dongjoon-hyun (Member) left a comment:

Hi, @c21 and @cloud-fan.
Why do we need to deprecate the old APIs? The Apache Spark community doesn't delete deprecated APIs in general. For a deprecation in Apache Spark 3.2.0, I believe we need another discussion on the community mailing list. Could you remove the deprecation part from this PR?

cc @gengliangwang

dongjoon-hyun (Member) commented:

Also, cc @sunchao because this PR is about bucketing.

cloud-fan (Contributor) commented:

It's not a public API (it's under the package org.apache.spark.internal), and we are conservative here simply because we know some third-party libraries are using it, e.g. https://github.com/steveloughran/zero-rename-committer

The old API needs to be deprecated as it's no longer fully functional. To support Hive bucketed tables, Spark needs to ask the commit protocol to add a certain prefix to the file name, and the old API can never learn about this prefix requirement.

c21 (Contributor, Author) commented Jun 22, 2021

@dongjoon-hyun - this is not a public API. With this PR, no existing third-party library will be broken. The new API is a superset of the existing API; any third party can move forward and implement the new API in the future, but they can also stay on the old API if they need to.

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44695/

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44697/

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44695/

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44697/

SparkQA commented Jun 23, 2021

Test build #140170 has finished for PR 33012 at commit b24f40d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) commented Jun 23, 2021

Ya, I know this is internal, but will we be able to remove that API in the future, in Apache Spark 3.3.0?
Why don't we remove the deprecation warnings from this PR?
After we file a JIRA and discuss it in an email thread, we can add them later.

gengliangwang (Member) commented Jun 23, 2021

Hmmm, shall we simply add:

def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], prefix: String, ext: String): String
def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, prefix: String, ext: String): String

For the deprecation, I think either way is fine. We can simply remove it and wait for the discussion's conclusion, as @dongjoon-hyun mentioned.

cloud-fan (Contributor) commented:

OK, since it's a semi-developer API, maybe we don't need to deprecate it. We can have a discussion on the dev list if we need to remove the old APIs one day.

cloud-fan (Contributor) commented:

taskContext: TaskAttemptContext, dir: Option[String], prefix: String, ext: String

Then we would need to break the API again if we want to customize the file name further in the future.

gengliangwang (Member) commented:

@cloud-fan It's just two functions for the temp file/path creation. Introducing a new class for this seems over-designed.

cloud-fan (Contributor) commented:

API design comes down to personal taste; I'll leave the decision to @c21.

c21 (Contributor, Author) commented Jun 23, 2021

taskContext: TaskAttemptContext, dir: Option[String], prefix: String, ext: String

@gengliangwang - I agree with @cloud-fan. The FileNameSpec class is introduced mainly for future-proofing. Whenever we need to pass more parameters to customize the file name in the future (e.g. requiring randomness/a UUID, etc.), we can just add more fields to FileNameSpec, as sketched below, without breaking the API again. Future-proofing is the main purpose here, more than encapsulation. How about keeping FileNameSpec as it is?
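
A hypothetical sketch of that future-proofing argument (the extra field is invented purely for illustration and is not part of this PR):

```scala
// Because callers construct a FileNameSpec instead of passing bare strings,
// a future naming requirement becomes one more field with a default value;
// existing call sites and API overrides keep compiling unchanged.
final case class FileNameSpec(
    prefix: String,
    suffix: String,
    includeUUID: Boolean = false) // hypothetical future knob, not in this PR
```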

Ya, I know this is internal, but are we able to remove that API in the future at Apache Spark 3.3.0?
Why don't we remove deprecation warnings from this PR?
After we file a JIRA and discuss in email thread, we can add them later.

@dongjoon-hyun - based on people's opinions here, I removed the deprecation annotations from the existing APIs. I agree with your plan. I will start a discussion on the mailing list for Spark 3.3 later, and we can add the deprecation later or just remove the existing APIs. Thanks.

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44746/

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44746/

SparkQA commented Jun 23, 2021

Test build #140218 has finished for PR 33012 at commit 2ec6fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 (Contributor, Author) commented Jun 24, 2021

@dongjoon-hyun - sorry for pinging again, as it's close to the branch cut. Could you help take a look again? Thanks.

dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @c21, @cloud-fan, @gengliangwang.

c21 (Contributor, Author) commented Jun 25, 2021

Thank you all for the review!

@c21 c21 deleted the commit-protocol-api branch June 25, 2021 00:36
*
* This API should be implemented and called, instead of
* [[newTaskTempFile(taskContext, dir, ext)]]. Provide a default implementation here to be
* backward compatible with custom [[FileCommitProtocol]] implementations before Spark 3.2.0.
HyukjinKwon (Member) commented:

I think we should mark this as an API (e.g., @Unstable) ... this is currently internal.

HyukjinKwon (Member) commented:

It's a bit weird to talk about "backward compatible" here. If this isn't an API, we should explicitly mention that it isn't an API, at least here, to avoid giving a false impression to devs.

c21 (Contributor, Author) replied:

@HyukjinKwon - would it look better as below?

This method should be implemented and called, instead of
[[newTaskTempFile(taskContext, dir, ext)]]. Provide a default implementation here to be
compatible with custom [[FileCommitProtocol]] implementations before Spark 3.2.0.

HyukjinKwon (Member) replied:

Shall we mark it as @Unstable for now, with the context explained a bit? e.g., this class is exposed as an API considering its usage in many downstream custom implementations, but it is subject to change and/or being moved.

HyukjinKwon (Member) commented:

cc @cloud-fan to get more feedback

cloud-fan (Contributor) replied:

I'm fine to mark it Unstable. It doesn't hurt anyway.

c21 (Contributor, Author) replied:

Sounds good. Let me create a PR now.

HyukjinKwon pushed a commit that referenced this pull request Jun 30, 2021
…rotocol`

### What changes were proposed in this pull request?

This is a follow-up to #33012 (comment), where we want to add `Unstable` to `FileCommitProtocol`, to give people a better idea of the API.

### Why are the changes needed?

Make it easier for people to follow and understand code. Clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, as no real logic change.

Closes #33148 from c21/bucket-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Sep 17, 2021
…ormat with Hive hash)

### What changes were proposed in this pull request?

This is a re-work of #30003; here we add support for writing Hive bucketed tables with the Parquet/ORC file formats (data source v1 write path and Hive hash as the hash function). Support for Hive's other file formats will be added in a follow-up PR.

The changes are mostly in:

* `HiveMetastoreCatalog.scala`: When converting a Hive table relation to a data source relation, pass bucket info (BucketSpec) and other Hive-related info as options into `HadoopFsRelation` and `LogicalRelation`, which can later be accessed by `FileFormatWriter` to customize the bucket id and file name.

* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` when writing to a Hive bucketed table. In addition, the Spark output file name should follow the Hive/Presto/Trino bucketed file naming convention. This introduces another parameter, `bucketFileNamePrefix`, and a subsequent change in `FileFormatDataWriter`.

* `HadoopMapReduceCommitProtocol`: Implement the new file name APIs introduced in #33012, and change its subclass `PathOutputCommitProtocol`, to make Hive bucketed table writing work with all commit protocols (including the S3A commit protocol).

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Currently, Spark bucketed tables cannot be leveraged by other SQL engines like Hive and Presto, because Spark uses a different hash function (Spark murmur3hash) and a different file name scheme. With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, joins, group-bys, etc. This was and is blocking several companies (confirmed with Facebook, Lyft, etc.) from migrating bucketing workloads from Hive to Spark.

### Does this PR introduce _any_ user-facing change?

Yes, any Hive bucketed table (in Parquet/ORC format) written by Spark is properly bucketed and can be efficiently processed by Hive and Presto/Trino.

### How was this patch tested?

* Added a unit test in BucketedWriteWithHiveSupportSuite.scala to verify bucket file names and that each row in each bucket is written properly.
* Tested by the Lyft Spark team (Shashank Pedamallu): reading Spark-written bucketed tables from Trino, Spark, and Hive.

Closes #33432 from c21/hive-bucket-v1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 8, 2021
…ormat with Hive hash)