Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-6909] Use isDelete instead of isDeleted function for Spark merger #12007

Merged

Conversation

linliu-code
Copy link
Contributor

@linliu-code linliu-code commented Sep 26, 2024

Change Logs

isDeleted() only checks if the data == null.

isDelete() returns true when

  1. data == null; or
  2. _hoodie_operation field exists and is "D"; or
  3. data contains a specified delete field.

Therefore, remove isDeleted() and its underlying variable.

Impact

Clean the logic of determining delete record for Spark.

Risk level (write none, low medium or high below)

Medium.
Existing tests should be sufficient.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@linliu-code linliu-code changed the title [HUUse isDelete instead of isDeleted function for merger [HUDI-6909] Use isDelete instead of isDeleted function for Spark merger Sep 26, 2024
@linliu-code linliu-code force-pushed the HUDI-6909_fix_a_bug_in_the_spark_merger branch from b416e96 to c08c321 Compare September 26, 2024 00:22
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Sep 26, 2024
@@ -50,7 +50,7 @@ public Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSc

if (newer instanceof HoodieSparkRecord) {
HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer;
if (newSparkRecord.isDeleted()) {
if (newSparkRecord.isDelete(newSchema, props)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We better add a UT for the merger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will add a UT for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the UT has been added.

@linliu-code linliu-code force-pushed the HUDI-6909_fix_a_bug_in_the_spark_merger branch from c08c321 to 4c37fd3 Compare September 27, 2024 01:31
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Sep 27, 2024
@linliu-code linliu-code force-pushed the HUDI-6909_fix_a_bug_in_the_spark_merger branch from 4c37fd3 to 6c7a5c1 Compare September 27, 2024 02:02

class TestDefaultSparkRecordMerger {
static final String RECORD_KEY_FIELD_NAME = "record_key";
static final String PARTITION_PATH_FIELD_NAME = "partition_path";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there are also IT tests: TestHoodieMergeHandleWithSparkMerger, do we also need to add a test case there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. the delete record with hoodieoperation is tested, and the fg reader has been used to read.

* If the input records are not Spark record, it throws.
*/
@Test
void testMergerWithArvroRecord() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avro

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* If the new record is a delete record, the merged record is empty.
*/
@Test
void testMergerWithNewRecordIsDeleteRecord() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testMergerWithNewRecordAsDelete ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* If the old record is a delete record, the merged record is the new record.
*/
@Test
void testMergerWithOldRecordIsDeleteRecord() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testMergerWithOldRecordAsDelete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 7bbe1a4 into apache:master Sep 30, 2024
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:M PR with lines of changes in (100, 300]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants