Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExpandExecTransformer optimization #842

Closed
JkSelf opened this issue Jan 12, 2023 · 8 comments
Closed

ExpandExecTransformer optimization #842

JkSelf opened this issue Jan 12, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request velox backend works for Velox backend

Comments

@JkSelf
Copy link
Contributor

JkSelf commented Jan 12, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Velox backend need the aggregation and grouping expression when creating the GroupId node. But Expand operator will have the projection expression and not distinguishing the aggregation and grouping expression.
At present, ExpandExecTransformer is based on the assumption, columns in projections could be divided into two parts, aggregating columns and grouping columns(including grouping id column). For the following query

select msgtype, hour, count(1) from t group by msgtype, hour with rollup
The CustomExpandExec.projections header is
[msgtype#90L, msgtype#110L, hour#111, spark_grouping_id#109L]
The CustomExpandExec.groupingExpressions is
[msgtype#110L, hour#111, spark_grouping_id#109L]
The CustomExpandExec.aggregateExpressions is
[msgtype#90L]
The above assumption looks correct.

But this failed on test case Gluten null count in GlutenDataFrameAggregateSuite

checkAnswer(
      testData3
        .agg(count($"a"), count($"b"), count(lit(1)), count_distinct($"a"), count_distinct($"b")),
      Row(2, 1, 2, 2, 1))

The CustomExpandExec.projections header is
[a#12773, b#12774, gid#12772, b#12775]
The CustomExpandExec.groupingExpressions is
[a#12773, b#12774, gid#12772]
The CustomExpandExec.aggregateExpressions is
[partial_count(1), partial_count(b#12775)]
Describe the solution you'd like
Will use the projections directly in ExpandExecTransformer after velox backend can creating the GroupID using the projections.

@JkSelf JkSelf added the enhancement New feature or request label Jan 12, 2023
@weiting-chen weiting-chen moved this to 🆕 New in Gluten 0.5.0 Mar 1, 2023
@weiting-chen weiting-chen added the velox backend works for Velox backend label Mar 1, 2023
@zhli1142015
Copy link
Contributor

zhli1142015 commented Apr 12, 2023

Here are project expressions we observed in expand operation:

  1. agg exprs + group by exprs + gid.
  2. agg exprs + group by exprs + gid + _gen_grouping_pos. --> The last column is for handling duplicate grouping sets.
  3. group by exprs + gid + agg exprs. --> gid is calculated by different way from above two cases. it's assigned with the sequence number of project set.

Group id operation in velox doesn't cover all three cases above. I think maybe add a spark version group id op in velox, which accepts the projection sets from spark and dose expand operation like spark does. In this way we don't need to distinguish the group expressions and aggregate expressions. and also gid / g_pos are passed from spark, we don't need to calucaulte them in velox. This would make things easier.
And we may need a pre projection, as we saw some cast expression is used when expand is handling some decimal type data.
Please let me know your points.
Thanks.

@zhli1142015
Copy link
Contributor

@JkSelf
Copy link
Contributor Author

JkSelf commented Apr 12, 2023

@zhli1142015 Agree with your thoughts. I also discussed with @zhouyuan before that the ExpandTransformer part needs to implement a GroupID adapted to vanilla spark on the velox side. In this way, many restrictions can be removed, and more use cases can be adapted.

@zhli1142015
Copy link
Contributor

Thanks for your input, I would like to work on this.

@zhli1142015
Copy link
Contributor

Raise the PR for velox change: oap-project/velox#199.

@zhli1142015
Copy link
Contributor

zhli1142015 commented Apr 14, 2023

PR for gluten: https://github.com/oap-project/gluten/pull/1361/files, not sure if this change is ok for CH, as proto file is changed, who can help for CH.

@FelixYBW
Copy link
Contributor

PR for gluten: #1361 (files), not sure if this change is ok for CH, as proto file is changed, who can help for CH.

@zzcclp

@zzcclp
Copy link
Contributor

zzcclp commented Apr 17, 2023

@lgbo-ustc please help to check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request velox backend works for Velox backend
Projects
None yet
Development

No branches or pull requests

5 participants