-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExpandExecTransformer optimization #842
Comments
Here are project expressions we observed in expand operation:
Group id operation in velox doesn't cover all three cases above. I think maybe add a spark version group id op in velox, which accepts the projection sets from spark and dose expand operation like spark does. In this way we don't need to distinguish the group expressions and aggregate expressions. and also gid / g_pos are passed from spark, we don't need to calucaulte them in velox. This would make things easier. |
@zhli1142015 Agree with your thoughts. I also discussed with @zhouyuan before that the ExpandTransformer part needs to implement a GroupID adapted to vanilla spark on the velox side. In this way, many restrictions can be removed, and more use cases can be adapted. |
Thanks for your input, I would like to work on this. |
Raise the PR for velox change: oap-project/velox#199. |
PR for gluten: https://github.com/oap-project/gluten/pull/1361/files, not sure if this change is ok for CH, as proto file is changed, who can help for CH. |
|
@lgbo-ustc please help to check. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Velox backend need the aggregation and grouping expression when creating the
GroupId
node. But Expand operator will have the projection expression and not distinguishing the aggregation and grouping expression.At present,
ExpandExecTransformer
is based on the assumption, columns in projections could be divided into two parts, aggregating columns and grouping columns(including grouping id column). For the following queryselect msgtype, hour, count(1) from t group by msgtype, hour with rollup
The CustomExpandExec.projections header is
[msgtype#90L, msgtype#110L, hour#111, spark_grouping_id#109L]
The CustomExpandExec.groupingExpressions is
[msgtype#110L, hour#111, spark_grouping_id#109L]
The CustomExpandExec.aggregateExpressions is
[msgtype#90L]
The above assumption looks correct.
But this failed on test case Gluten null count in GlutenDataFrameAggregateSuite
The CustomExpandExec.projections header is
[a#12773, b#12774, gid#12772, b#12775]
The CustomExpandExec.groupingExpressions is
[a#12773, b#12774, gid#12772]
The CustomExpandExec.aggregateExpressions is
[partial_count(1), partial_count(b#12775)]
Describe the solution you'd like
Will use the projections directly in ExpandExecTransformer after velox backend can creating the GroupID using the projections.
The text was updated successfully, but these errors were encountered: