Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE] Support submit subqueries concurrently to improve scalar subquery performance #1097

Merged
merged 1 commit into from
Jun 9, 2023

Conversation

WangGuangxin
Copy link
Contributor

@WangGuangxin WangGuangxin commented Mar 9, 2023

What changes were proposed in this pull request?

In this MR, I proposed the following two changes to improve subquery performance.

  1. In WholeStageTransformerExec.doWholestageTransform, invoke SparkPlan.prepare to trigger Subquery execution concurrently before we do expression transform.
  2. don't trigger Subquery execution when do native validation

Currently, subquery is supported in ScalarSubqueryTransformer.doTransform, by invoking ScalarSbuquery.plan.executeCollect, which will block the whole code path before the subquery really returns.

But in vanilla spark, subquery is submitted concurrently in SparkPlan.prepare,so that we can fully utilize spark cores.

Take TPC-DS Q9 as an example,

vanilla spark submits subquery calculation concurrently,
image

while gluten submits subquery one by one
image

How was this patch tested?

It's tested against TPC-DS 1T Q9

before this patch, it costs 5.3min, after this patch cost 39s

@github-actions
Copy link

github-actions bot commented Mar 9, 2023

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[Gluten-${ISSUES_ID}] ${detailed message}

See also:

@WangGuangxin
Copy link
Contributor Author

@zhztheplayer @rui-mo

@rui-mo rui-mo requested a review from zzcclp March 9, 2023 05:09
Copy link
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I have a question on the validation parameter.

@WangGuangxin WangGuangxin changed the title [Gluten-core] Support submit subqueries concurrently to improve scala subquery performance [Gluten-core] Support submit subqueries concurrently to improve scalar subquery performance Mar 9, 2023
// the first column in first row from `query`.
val rows = query.plan.executeCollect()
if (rows.length > 1) {
sys.error(s"more than one row returned by a subquery used as an expression:\n${query.plan}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw runtime exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -26,7 +26,7 @@ import org.apache.spark.sql.catalyst.optimizer.NormalizeNaNAndZero
import org.apache.spark.sql.execution._
import org.apache.spark.sql.execution.exchange.BroadcastExchangeExec

object ExpressionConverter extends Logging {
case class ExpressionConverter(validation: Boolean) extends Logging {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some notes here for parameter validation? By looking into ExpressionConverter's code one doesn't realize where validation is happened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zhouyuan zhouyuan added the velox backend works for Velox backend label Mar 20, 2023
@zhztheplayer zhztheplayer changed the title [Gluten-core] Support submit subqueries concurrently to improve scalar subquery performance [CORE] Support submit subqueries concurrently to improve scalar subquery performance Mar 24, 2023
@zhouyuan
Copy link
Contributor

zhouyuan commented Apr 7, 2023

hi @WangGuangxin, can you please do a rebase?

@kyligence-git
Copy link
Contributor

Can one of the admins verify this patch?

@WangGuangxin
Copy link
Contributor Author

hi @WangGuangxin, can you please do a rebase?

ok

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 8, 2023

@WangGuangxin, could you please update this patch?
As you know, there is already a fix in main branch to avoid scalar subquery executed during validation. So you can just keep the first proposed fix in this PR. We are expecting this patch to be merged soon for bringing perf. gain on subquery. Thanks!

@WangGuangxin
Copy link
Contributor Author

@WangGuangxin, could you please update this patch? As you know, there is already a fix in main branch to avoid scalar subquery executed during validation. So you can just keep the first proposed fix in this PR. We are expecting this patch to be merged soon for bringing perf. gain on subquery. Thanks!

ok

@github-actions
Copy link

github-actions bot commented Jun 9, 2023

Run Gluten Clickhouse CI

@github-actions
Copy link

github-actions bot commented Jun 9, 2023

Run Gluten Clickhouse CI

1 similar comment
@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 9, 2023

Run Gluten Clickhouse CI

Copy link
Contributor

@zhouyuan zhouyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zhouyuan zhouyuan merged commit fdf6d71 into apache:main Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
velox backend works for Velox backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants