PARQUET-2430: Add parquet joiner v2 #1335
This PR is the outcome of the simplification I mentioned in a comment a couple of weeks ago: #1273 (comment)
# Conflicts: # parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
…yInputFilesToJoin
@wgtmac @ConeyLiu Some clarifications. Some bugs.
Sorry for the delay. The current test cases are already very complicated so the refactoring work on the validation methods makes sense to me.
It's reasonable on my side, as long as the new feature is covered.
It would be great if you have time to create a PR to fix this. Thanks!
@wgtmac
LGTM. Thanks!
@wgtmac
I'm not sure if @ConeyLiu wants to take another look. BTW, could you fix the PR title and description? It is no longer a WIP.
+1, I have no further comments, thanks for the great work
Thanks @MaxNevermind and @ConeyLiu!
This is a simplified version of the originally proposed joiner functionality. See the description of the original idea and of the simplified design below.
Original design
See related original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner
The ParquetJoiner feature is similar to the ParquetRewriter class. ParquetRewriter allows stitching files with the same schema into a single file, while ParquetJoiner should enable stitching files with different schemas into a single file. That is possible when: 1) the number of rows in the main and extra files is the same, 2) the ordering of rows in the main and extra files is the same. The main benefit of ParquetJoiner is performance: when you join/stitch terabytes or petabytes of data, this seemingly simple low-level API can be up to 10x more resource efficient.
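The two preconditions above can be illustrated with a minimal sketch in plain Java. It does not use the parquet-java API: each "file" is modeled as a map from column name to column values, and the class and method names (PositionalStitchSketch, stitch, rowCount) are hypothetical. Rejecting duplicate column names is also an assumption made only to keep the sketch small.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not parquet-java: a "file" is column name -> column values,
// and row i of every column belongs to record i.
final class PositionalStitchSketch {

    // Stitches columns of `extra` next to columns of `main` purely by row position.
    static Map<String, List<?>> stitch(Map<String, List<?>> main, Map<String, List<?>> extra) {
        int mainRows = rowCount(main);
        int extraRows = rowCount(extra);
        // Precondition 1: same number of rows. Precondition 2 (same ordering) cannot be
        // checked without a key, so the caller is trusted to guarantee it.
        if (mainRows != extraRows) {
            throw new IllegalArgumentException(
                    "Row count mismatch: main=" + mainRows + ", extra=" + extraRows);
        }
        Map<String, List<?>> out = new LinkedHashMap<>(main);
        for (Map.Entry<String, List<?>> e : extra.entrySet()) {
            if (out.containsKey(e.getKey())) {
                throw new IllegalArgumentException("Duplicate column: " + e.getKey());
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    // Returns the common length of all columns, or 0 for a table without columns.
    static int rowCount(Map<String, List<?>> table) {
        int count = -1;
        for (List<?> column : table.values()) {
            if (count == -1) {
                count = column.size();
            } else if (count != column.size()) {
                throw new IllegalArgumentException("Columns have different lengths");
            }
        }
        return Math.max(count, 0);
    }

    public static void main(String[] args) {
        Map<String, List<?>> main = new LinkedHashMap<>();
        main.put("id", List.of(1, 2, 3));
        Map<String, List<?>> extra = new LinkedHashMap<>();
        extra.put("score", List.of(0.5, 0.7, 0.9));
        System.out.println(stitch(main, extra));
    }
}
```

No key lookup or shuffle happens here; the stitch relies entirely on both sides having the same rows in the same order, which is where the resource savings described above would come from.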
Implementation details
ParquetJoiner allows specifying the main input Parquet file and extra input Parquet files. It copies the main input as binary data and writes the extra input files with row groups adjusted to the main input. If the main input is much larger than the extra inputs, a lot of resources are saved by handling the main input as binary.
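One way to picture "row groups adjusted to the main input" is as a slicing plan: for every row group of the main input, decide which runs of rows from the extra input fill it. The sketch below is an illustration in plain Java under that assumption, not the PR's actual code; RowGroupAlignmentSketch, Slice and align are hypothetical names, and row groups are reduced to their row counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plans how rows of the extra input are re-chunked
// so that they follow the row group boundaries of the main input.
final class RowGroupAlignmentSketch {

    // A run of rows taken from one row group of the extra input.
    static final class Slice {
        final int extraRowGroup;
        final long offset;
        final long rows;

        Slice(int extraRowGroup, long offset, long rows) {
            this.extraRowGroup = extraRowGroup;
            this.offset = offset;
            this.rows = rows;
        }

        @Override
        public String toString() {
            return "Slice(rg=" + extraRowGroup + ", offset=" + offset + ", rows=" + rows + ")";
        }
    }

    // Returns, for each main row group, the slices of the extra input that fill it.
    static List<List<Slice>> align(long[] mainRowGroups, long[] extraRowGroups) {
        List<List<Slice>> plan = new ArrayList<>();
        int extraIdx = 0;
        long consumed = 0; // rows already taken from extraRowGroups[extraIdx]
        for (long needed : mainRowGroups) {
            List<Slice> slices = new ArrayList<>();
            long remaining = needed;
            while (remaining > 0) {
                if (extraIdx >= extraRowGroups.length) {
                    throw new IllegalArgumentException("Extra input has fewer rows than main input");
                }
                long available = extraRowGroups[extraIdx] - consumed;
                if (available == 0) { // current extra row group is exhausted, move on
                    extraIdx++;
                    consumed = 0;
                    continue;
                }
                long take = Math.min(available, remaining);
                slices.add(new Slice(extraIdx, consumed, take));
                consumed += take;
                remaining -= take;
            }
            plan.add(slices);
        }
        // Skip fully consumed (or empty) trailing row groups, then reject leftovers.
        while (extraIdx < extraRowGroups.length && consumed == extraRowGroups[extraIdx]) {
            extraIdx++;
            consumed = 0;
        }
        if (extraIdx < extraRowGroups.length) {
            throw new IllegalArgumentException("Extra input has more rows than main input");
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(align(new long[] {3, 2}, new long[] {2, 3}));
    }
}
```

In this model only the extra input is ever sliced; the main row groups are taken as given, which mirrors the idea of copying the main input as binary data.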
Use-case examples
A very large Parquet-based dataset (dozens or hundreds of fields, terabytes of data daily, petabytes of historical partitions). The task is to modify a column or add a new column for all the historical data. That is trivial with Spark, but given the sheer scale of the dataset it takes a lot of resources.
Side notes
Note that this class of problems could in theory be solved by storing the main input and extra inputs in HMS/Iceberg bucketed tables and using a view that joins those tables on the fly into the final version, but in practice there is often a requirement to merge Parquet files and have a single Parquet source in the file system.
Use-case implementation details using Apache Spark
You can use Apache Spark to prepare the inputs and then perform the join with ParquetJoiner. Read the large main input (the left side) and prepare the right side of the join so that each file on the left has a corresponding file on the right, with records on the right kept in the same order as on the left. As a result, the left and right sides have the same number of files, the same number of records in corresponding files, and the same ordering of records in each file pair. Then run ParquetJoiner in parallel for each file pair to perform the join. Example of code that uses this new feature: https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d.
A simplified design (this PR)
- inputFilesToJoin instead of List<List<>> as in the original PR
- inputFilesToJoin is expected to have the same rowGroups ordering as inputFiles; the number of files in inputFiles and inputFilesToJoin does not necessarily have to be the same, but the ordering of rowGroups and the rowCount of paired rowGroups must be the same
- joinColumnsOverwrite is used if inputFilesToJoin is expected to overwrite columns in inputFiles
- Features available for inputFiles, like pruning, nullification, and binary copy, should now be available for inputFilesToJoin too
Post PR action points
- Add a part on the ParquetRewriter to parquet-java's README.md as noted by @wgtmac in this comment
  - Was addressed by adding a ParquetRewriter description in JavaDoc
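The pairing rule and the joinColumnsOverwrite flag from the simplified design above can be sketched as follows. This is plain Java with hypothetical names (SimplifiedJoinSketch, validatePairedRowGroups, resolveColumns) and is not the code of this PR. Files are reduced to arrays of row group row counts and schemas to lists of column names. The behavior for a colliding column when the flag is off (the inputFiles column is kept) is an assumption of the sketch, not something the description states.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the simplified design; not the ParquetRewriter implementation.
final class SimplifiedJoinSketch {

    // Each long[] holds the row counts of the row groups of one file, in order.
    // The number of files may differ, but the flattened row groups must pair up.
    static void validatePairedRowGroups(List<long[]> inputFiles, List<long[]> inputFilesToJoin) {
        List<Long> left = flatten(inputFiles);
        List<Long> right = flatten(inputFilesToJoin);
        if (left.size() != right.size()) {
            throw new IllegalArgumentException(
                    "Row group count mismatch: " + left.size() + " vs " + right.size());
        }
        for (int i = 0; i < left.size(); i++) {
            if (!left.get(i).equals(right.get(i))) {
                throw new IllegalArgumentException("Row count mismatch in paired row group " + i);
            }
        }
    }

    private static List<Long> flatten(List<long[]> files) {
        List<Long> rowCounts = new ArrayList<>();
        for (long[] file : files) {
            for (long rowCount : file) {
                rowCounts.add(rowCount);
            }
        }
        return rowCounts;
    }

    // Maps each output column to the side it is read from.
    // Assumption: without joinColumnsOverwrite, a colliding column stays with inputFiles.
    static Map<String, String> resolveColumns(
            List<String> inputColumns, List<String> joinColumns, boolean joinColumnsOverwrite) {
        Map<String, String> source = new LinkedHashMap<>();
        for (String column : inputColumns) {
            source.put(column, "inputFiles");
        }
        for (String column : joinColumns) {
            if (!source.containsKey(column) || joinColumnsOverwrite) {
                source.put(column, "inputFilesToJoin");
            }
        }
        return source;
    }

    public static void main(String[] args) {
        // One file with two row groups on the left, two single-row-group files on the right.
        validatePairedRowGroups(
                List.of(new long[] {100, 50}), List.of(new long[] {100}, new long[] {50}));
        System.out.println(resolveColumns(List.of("id", "name"), List.of("name", "score"), true));
    }
}
```

Checking row counts up front lets a mismatch fail before any data is written, which matters when a single run touches many large files.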