[Feature] make data block balance before importing data #5026 #5044
Supports the Spark and Flink engines. To use this feature, add "partition_balance = true" to the env configuration. The default value is false.
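A minimal sketch of the setting described above, assuming the option is accepted as proposed in this PR. Only `partition_balance` comes from this change; the rest of the job config (source, sink, and any other env options) is omitted here:

```hocon
env {
  # Proposed in this PR: repartition before the sink so partitions are roughly equal in size.
  # Defaults to false; per the description, only the Spark and Flink engines honor it.
  partition_balance = true
}
```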
Please add the new config option to the documentation. Also, please add a warning that it only works with Flink or Spark. cc @TyrantLucifer
@@ -19,6 +20,16 @@ When `parallelism` is not specified, the `parallelism` in env is used by default

When parallelism is specified, it will override the parallelism in env.

### partition_balance [boolean]
Please add a comment that this option is only supported by Flink/Spark.
done
It seems that CI did not pass. @Hisoka-X
-1, please fix CI
I checked the logs. The CI error is caused by the seatunnel-connector-e2e and seatunnel-transforms-v2-e2e modules, so it should have nothing to do with this PR.
Your code modification caused the tests in these modules to fail.
Waiting for CI/CD.
Please fix the CI error.
@Carl-Zhou-CN PTAL
This does seem to alleviate some of the data skew problems, but it doesn't really improve reading. Can you merge dev again and check CI? @ddna1021
make data block balance before importing data #5026
When partition_balance is set to true in the env, the sink process first performs a repartition so that each partition is roughly the same size. This avoids problems caused by data skew, at the cost of some extra time. The default value is false.
support spark and flink engine
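
To make the effect of the repartition step concrete, here is a minimal, engine-independent sketch. It is not the code from this PR and does not use the Spark or Flink APIs; the `rebalance` function and the round-robin strategy are assumptions chosen for illustration, and the actual engines may distribute rows differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def rebalance(partitions: Sequence[Sequence[T]], num_partitions: int) -> List[List[T]]:
    """Redistribute rows round-robin so partition sizes differ by at most one.

    Hypothetical helper: illustrates the idea of balancing before the sink,
    not the implementation used by SeaTunnel, Spark, or Flink.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    balanced: List[List[T]] = [[] for _ in range(num_partitions)]
    index = 0
    for partition in partitions:
        for row in partition:
            # Assign each row to the next target partition in turn.
            balanced[index % num_partitions].append(row)
            index += 1
    return balanced


# A skewed input: one partition holds 97 of the 100 rows.
skewed = [list(range(97)), [97], [98, 99]]
balanced = rebalance(skewed, 3)
print([len(p) for p in balanced])  # prints [34, 33, 33]
```

The extra pass over every row is the "extra time" mentioned above: each row is moved once before any sink task starts writing, which is why the option is off by default.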