[VL] Pass partition id to velox functions #4344

Merged: 3 commits merged on Mar 11, 2024

@zhli1142015 zhli1142015 commented Jan 10, 2024

What changes were proposed in this pull request?

Pass partition id to velox functions.

How was this patch tested?

UT.

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Run Gluten Clickhouse CI

@zhli1142015
Contributor Author

We should use the RDD partition ID instead of the task partition ID for this function.

- SPARK-14393: values generated by non-deterministic functions shouldn't change after coalesce or union *** FAILED ***
  Array([0,0], [1,1], [2,2], [3,3]) did not equal Array([0,0], [1,0], [2,1], [3,1]) Values changed after coalesce when codegenFallback=false and wholeStage=false. (DataFrameFunctionsSuite.scala:3591)

@zhli1142015 zhli1142015 marked this pull request as ready for review January 10, 2024 10:30
@zhli1142015
Contributor Author

@PHILO-HE @rui-mo could you help review this PR?

Contributor

@rui-mo rui-mo left a comment

Thanks.

cpp/velox/operators/functions/SparkPartitionId.cc (outdated, resolved)
@PHILO-HE
Contributor

We should use the RDD partition ID instead of the task partition ID for this function.

- SPARK-14393: values generated by non-deterministic functions shouldn't change after coalesce or union *** FAILED ***
  Array([0,0], [1,1], [2,2], [3,3]) did not equal Array([0,0], [1,0], [2,1], [3,1]) Values changed after coalesce when codegenFallback=false and wholeStage=false. (DataFrameFunctionsSuite.scala:3591)

How was this issue produced? This spark_partition_id() function should fall back in this test. Could you clarify a bit? Thanks!

Contributor

@PHILO-HE PHILO-HE left a comment

Thanks!

cpp/velox/operators/functions/SparkPartitionId.cc (outdated, resolved)
@zhli1142015
Contributor Author

zhli1142015 commented Jan 15, 2024

We should use the RDD partition ID instead of the task partition ID for this function.

- SPARK-14393: values generated by non-deterministic functions shouldn't change after coalesce or union *** FAILED ***
  Array([0,0], [1,1], [2,2], [3,3]) did not equal Array([0,0], [1,0], [2,1], [3,1]) Values changed after coalesce when codegenFallback=false and wholeStage=false. (DataFrameFunctionsSuite.scala:3591)

How was this issue produced? This spark_partition_id() function should fall back in this test. Could you clarify a bit? Thanks!

I found this issue after offloading spark_partition_id to Velox. The native spark_partition_id is simple: it returns a config value, as you mentioned, and that config is the partition ID we set for each Velox task. Previously, the partition ID we passed to native code was the task partition ID, not the RDD partition ID. The two differ when there are coalesce or union operators. I mentioned it here just as a reminder for myself.
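The difference between the two IDs can be illustrated with a small, self-contained C++ sketch. It is a simplified model, not Gluten or Spark code: the `Row` struct and `evaluate` function are hypothetical, and the coalesce grouping (contiguous parent partitions mapped to one task) is an assumption made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: one row per RDD partition, as in a 4-partition range.
struct Row {
  long id;
  int partitionId;
};

// Evaluates a spark_partition_id()-like column for every row.
// useRddPartitionId selects which ID is handed to the function.
std::vector<Row> evaluate(int numRddPartitions, int numTasks, bool useRddPartitionId) {
  assert(numRddPartitions > 0 && numTasks > 0);
  std::vector<Row> rows;
  for (int rddPartition = 0; rddPartition < numRddPartitions; ++rddPartition) {
    // Assumed coalesce behavior: contiguous parent partitions share one task.
    int taskPartition = rddPartition * numTasks / numRddPartitions;
    rows.push_back({rddPartition, useRddPartitionId ? rddPartition : taskPartition});
  }
  return rows;
}
```

With 4 RDD partitions coalesced into 2 tasks, `evaluate(4, 2, false)` yields partition IDs 0, 0, 1, 1, while `evaluate(4, 2, true)` yields 0, 1, 2, 3, which mirrors the two arrays in the failing SPARK-14393 assertion.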

@zhli1142015 zhli1142015 changed the title from "[VL] Add spark_partition_id support" to "[VL] Pass partition id to velox functions" Mar 7, 2024
@zhli1142015 zhli1142015 requested review from rui-mo and PHILO-HE March 7, 2024 23:16
Contributor

@PHILO-HE PHILO-HE left a comment

Looks good! Just two comments, could you help check and clarify? BTW, could you let us know how get_partition_id() function is used on your side? Directly used in sql? Thanks! cc @zhouyuan

@@ -111,6 +111,8 @@ Java_io_glutenproject_vectorized_PlanEvaluatorJniWrapper_nativeValidateWithFailu

// A query context used for function validation.
velox::core::QueryCtx queryCtx;
std::unordered_map<std::string, std::string> configs{{velox::core::QueryConfig::kSparkPartitionId, "0"}};
queryCtx.testingOverrideConfigUnsafe(std::move(configs));
Contributor

It seems testingOverrideConfigUnsafe should be used in test code only.

Contributor Author

Updated, thanks.

@@ -111,6 +111,8 @@ Java_io_glutenproject_vectorized_PlanEvaluatorJniWrapper_nativeValidateWithFailu

// A query context used for function validation.
velox::core::QueryCtx queryCtx;
std::unordered_map<std::string, std::string> configs{{velox::core::QueryConfig::kSparkPartitionId, "0"}};
Contributor

Do we need to pass this config for the validator? It seems that omitting it has no impact.

Contributor Author

Added, thanks
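The review thread concerns how a partition ID reaches a function through a string-keyed config map. A minimal C++ sketch of that pattern follows. It does not use the Velox API: `TaskConfig`, `kPartitionIdKey`, the key string, and `partitionId` are all hypothetical names, and the choice to throw when the key is missing is an assumption made for illustration, not a statement about how Velox behaves.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a per-task query config; not the Velox class.
class TaskConfig {
 public:
  explicit TaskConfig(std::unordered_map<std::string, std::string> values)
      : values_(std::move(values)) {}

  // Returns a pointer to the stored value, or nullptr when the key is absent.
  const std::string* find(const std::string& key) const {
    auto it = values_.find(key);
    return it == values_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<std::string, std::string> values_;
};

// Assumed key name, for illustration only.
const char* const kPartitionIdKey = "spark.partition_id";

// Sketch of a spark_partition_id()-style lookup.
// Assumption: a missing key is treated as an error rather than defaulted.
int partitionId(const TaskConfig& config) {
  const std::string* value = config.find(kPartitionIdKey);
  if (value == nullptr) {
    throw std::runtime_error("partition id is not set");
  }
  return std::stoi(*value);
}
```

Under these assumptions, a context that never executes the function could leave the key unset, while one that does execute it would need at least a placeholder such as "0".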

@zhli1142015
Contributor Author

Looks good! Just two comments, could you help check and clarify? BTW, could you let us know how get_partition_id() function is used on your side? Directly used in sql? Thanks! cc @zhouyuan

This function is used for data profiling, thanks.

Contributor

@PHILO-HE PHILO-HE left a comment

Thanks for your work! cc @rui-mo

@zhli1142015 zhli1142015 merged commit e78ee43 into apache:main Mar 11, 2024
17 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_4344_time.csv | log/native_master_03_10_2024_3ad58ce14_time.csv | difference | percentage |
|---|---|---|---|---|
| q1 | 34.98 | 35.86 | 0.880 | 102.52% |
| q2 | 25.81 | 25.95 | 0.140 | 100.54% |
| q3 | 36.42 | 36.82 | 0.394 | 101.08% |
| q4 | 39.56 | 38.96 | -0.603 | 98.48% |
| q5 | 69.51 | 70.83 | 1.312 | 101.89% |
| q6 | 5.78 | 7.41 | 1.633 | 128.27% |
| q7 | 86.64 | 83.10 | -3.544 | 95.91% |
| q8 | 86.39 | 85.35 | -1.042 | 98.79% |
| q9 | 117.56 | 118.73 | 1.168 | 100.99% |
| q10 | 43.29 | 42.98 | -0.310 | 99.28% |
| q11 | 21.36 | 21.60 | 0.234 | 101.10% |
| q12 | 27.99 | 25.06 | -2.929 | 89.53% |
| q13 | 49.45 | 46.55 | -2.905 | 94.13% |
| q14 | 20.60 | 19.73 | -0.876 | 95.75% |
| q15 | 30.55 | 32.39 | 1.835 | 106.01% |
| q16 | 14.13 | 13.72 | -0.408 | 97.11% |
| q17 | 101.59 | 100.46 | -1.134 | 98.88% |
| q18 | 142.11 | 142.43 | 0.323 | 100.23% |
| q19 | 14.72 | 13.62 | -1.100 | 92.53% |
| q20 | 29.06 | 26.87 | -2.191 | 92.46% |
| q21 | 226.59 | 225.05 | -1.539 | 99.32% |
| q22 | 13.91 | 14.04 | 0.129 | 100.93% |
| total | 1238.01 | 1227.48 | -10.532 | 99.15% |
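
The difference and percentage columns appear to follow from the two timing columns: the difference is the master baseline time minus this PR's time, and the percentage is the baseline time divided by this PR's time. A minimal C++ sketch of that arithmetic is shown here. The `Comparison` struct and `compare` function are hypothetical names, and the relationship is inferred from the reported rows rather than taken from the benchmark tooling.

```cpp
#include <cassert>

// Hypothetical helper; names are illustrative, not part of the benchmark tooling.
struct Comparison {
  double difference;  // baseline seconds minus PR seconds
  double percentage;  // baseline time as a percentage of PR time
};

// Inferred relationship: a positive difference (percentage above 100)
// means the PR run was faster than the baseline run.
Comparison compare(double prSeconds, double baselineSeconds) {
  assert(prSeconds > 0.0);
  return {baselineSeconds - prSeconds, baselineSeconds / prSeconds * 100.0};
}
```

For q1, `compare(34.98, 35.86)` gives a difference of about 0.88 and a percentage of about 102.52, in line with the reported row. Small mismatches on other rows are expected because the published timings are rounded.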

taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Mar 25, 2024
[VL] Pass partition id to velox functions.
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 8, 2024
[VL] Pass partition id to velox functions.
taiyang-li pushed a commit to bigo-sg/gluten that referenced this pull request Oct 9, 2024
[VL] Pass partition id to velox functions.