Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] make data block balance before importing data #5026 #5044

Closed
wants to merge 7 commits into from

Conversation

ddna1021
Copy link
Contributor

@ddna1021 ddna1021 commented Jul 7, 2023

make data block balance before importing data #5026

When partition_balance is set to true in the env,in the sink process, a repartition will be performed first to ensure that the size of each partition is roughly the same, which can avoid problems caused by data skew, but it will consume some extra time. The default value is false.

support spark and flink engine

support spark and flink engine

If you need to use this feature, add "partition_balance = true" in the env configuration, the default value is false
@ddna1021 ddna1021 changed the title make data block balance before importing data #5026 [Feature] make data block balance before importing data #5026 Jul 8, 2023
@Hisoka-X
Copy link
Member

Hisoka-X commented Jul 9, 2023

Please add new config it into document. Also please add warning for it only work for flink or spark. cc @TyrantLucifer

@@ -19,6 +20,16 @@ When `parallelism` is not specified, the `parallelism` in env is used by default

When parallelism is specified, it will override the parallelism in env.

### partition_balance [boolean]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment that this option only supported by Flink/Spark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Hisoka-X
Hisoka-X previously approved these changes Jul 12, 2023
@TyrantLucifer
Copy link
Member

It seems that CI does not passed. @Hisoka-X

Copy link
Member

@TyrantLucifer TyrantLucifer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1, please fix CI

@ddna1021
Copy link
Contributor Author

I checked the logs, the CI error is caused by related modules of seatunnel-connecotr-e2e and seatunnel-transforms-v2-e2e, it should have nothing to do with this PR

@hailin0
Copy link
Member

hailin0 commented Jul 19, 2023

image

@TyrantLucifer
Copy link
Member

I checked the logs, the CI error is caused by related modules of seatunnel-connecotr-e2e and seatunnel-transforms-v2-e2e, it should have nothing to do with this PR

Your code modification caused the failure of these module tests.

@ic4y
Copy link
Contributor

ic4y commented Aug 5, 2023

Waiting for CI/CD.

@EricJoy2048
Copy link
Member

Please fix the CI error.

@EricJoy2048
Copy link
Member

@Carl-Zhou-CN PTAL

@Carl-Zhou-CN
Copy link
Member

This does seem to alleviate some of the data skew problems, but it doesn't really improve the reading,Can you merge dev again and check CI @ddna1021

@ddna1021 ddna1021 closed this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants