Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-842][VL] convert expand op to expand exec in velox #1361

Merged
merged 11 commits into from
Apr 19, 2023

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Apr 14, 2023

What changes were proposed in this pull request?

Here are project expressions we observed in expand operation:

agg exprs + group by exprs + gid.
agg exprs + group by exprs + gid + _gen_grouping_pos. --> The last column is for handling duplicate grouping sets.
group by exprs + gid + agg exprs. --> gid is calculated by different way from above two cases. it's assigned with the sequence number of project set.

Original ExpandExecTrandofrmer can only handle the first case. I'm adding the expand exec in velox side by the PR: https://github.com/oap-project/velox/pull/199/files. This PR is for converting spark ExpandExec to Expand OP in Velox, We don't need to do the columns' mapping in Gluten.

The original ExpandExecTrandofrmer is renamed to GroupIdExecTrandofrmer to not break ClicKHouse.

How was this patch tested?

Unit test.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@JkSelf
Copy link
Contributor

JkSelf commented Apr 14, 2023

@baibaichen

This PR is to implement ExpandTransformer similar with vanilla spark. And then we no need to distinguish the agg cols and group cols in Gluten. It seems the Expand PR to substrait community is ready to merge except the doc update. How about directly use PR in Gluten to support Expand in this PR?

@FelixYBW
Copy link
Contributor

@baibaichen

This PR is to implement ExpandTransformer similar with vanilla spark. And then we no need to distinguish the agg cols and group cols in Gluten. It seems the Expand PR to substrait community is ready to merge except the doc update. How about directly use PR in Gluten to support Expand in this PR?

We should follow substrait's solution. Does the PR need change to substrait?

@JkSelf
Copy link
Contributor

JkSelf commented Apr 14, 2023

@baibaichen
This PR is to implement ExpandTransformer similar with vanilla spark. And then we no need to distinguish the agg cols and group cols in Gluten. It seems the Expand PR to substrait community is ready to merge except the doc update. How about directly use PR in Gluten to support Expand in this PR?

We should follow substrait's solution. Does the PR need change to substrait?

Yes. This PR add new ExpandRel message in algebra.proto, which follow up the substrait community solution except the definition in PR.

@zhli1142015
Copy link
Contributor Author

@baibaichen
This PR is to implement ExpandTransformer similar with vanilla spark. And then we no need to distinguish the agg cols and group cols in Gluten. It seems the Expand PR to substrait community is ready to merge except the doc update. How about directly use PR in Gluten to support Expand in this PR?

We should follow substrait's solution. Does the PR need change to substrait?

Yes. This PR add new ExpandRel message in algebra.proto, which follow up the substrait community solution except the definition in PR.

Thanks, updated.

@zhli1142015 zhli1142015 changed the title [WIP][GLUTEN-842][VL] convert expand op to expand exec in velox [GLUTEN-842][VL] convert expand op to expand exec in velox Apr 17, 2023
@zhli1142015 zhli1142015 marked this pull request as ready for review April 17, 2023 12:24
@zhli1142015 zhli1142015 requested a review from JkSelf April 17, 2023 12:24
@github-actions
Copy link

#842

@zhli1142015 zhli1142015 requested a review from zzcclp April 17, 2023 12:24
@zhli1142015
Copy link
Contributor Author

@JkSelf , @zhouyuan and @lgbo-ustc, could you please help to take a look?

JkSelf
JkSelf previously approved these changes Apr 18, 2023
Copy link
Contributor

@JkSelf JkSelf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except small comments.

FelixYBW
FelixYBW previously approved these changes Apr 18, 2023
@zhli1142015 zhli1142015 dismissed stale reviews from FelixYBW and JkSelf via 91c8666 April 18, 2023 06:03
@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Apr 18, 2023

You mention that ExpandExecTrandofrmer has benn kept but the the ExpandRel has been changed, is it backward compatible with ClickhHouse?

@zhli1142015
Copy link
Contributor Author

It seems the ExpandRel has been changed, is it backward compatible with ClickhHouse?

new 'ExpandRel' is different and can't be compatible with CH. The original Rel is renamed to 'GroupIdRel' and changes are made in CH also.
Thanks.

@lgbo-ustc
Copy link
Contributor

It's greate. This implementation is simple. I think it could solve some problems we have meet.

LGTM.

lgbo-ustc

This comment was marked as duplicate.

@zhli1142015
Copy link
Contributor Author

It's greate. This implementation is simple. I think it could solve some problems we have meet.

LGTM.

Thanks for review @lgbo-ustc , will you work on CH side to consume new ExpandRel contract?

@lgbo-ustc
Copy link
Contributor

It's greate. This implementation is simple. I think it could solve some problems we have meet.
LGTM.

Thanks for review @lgbo-ustc , will you work on CH side to consume new ExpandRel contract?

We will do it soon

Copy link
Contributor

@zhouyuan zhouyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zhli1142015 zhli1142015 merged commit c156453 into apache:main Apr 19, 2023
Yohahaha pushed a commit to Yohahaha/gluten that referenced this pull request Apr 19, 2023
* init change

* convert expand op to expand exec in velox

* add pre-project & add ut

* minor change

* fix ut

* update algebra.proto

* fix build

* fix build

* fix build

* add ut

* revert velox branch

---------

Co-authored-by: zhli1142015 <zhli@pczhlich.fareast.corp.microsoft.com>
marin-ma pushed a commit to marin-ma/gluten that referenced this pull request Apr 25, 2023
* init change

* convert expand op to expand exec in velox

* add pre-project & add ut

* minor change

* fix ut

* update algebra.proto

* fix build

* fix build

* fix build

* add ut

* revert velox branch

---------

Co-authored-by: zhli1142015 <zhli@pczhlich.fareast.corp.microsoft.com>
@zhanglistar
Copy link
Contributor

@zhli1142015

agg exprs + group by exprs + gid + _gen_grouping_pos. --> The last column is for handling duplicate grouping sets.
group by exprs + gid + agg exprs. --> gid is calculated by different way from above two cases. it's assigned with the sequence number of project set.

I have a question, the two cases, could you give some sql? We don't know when spark will generate the two cases.
Thanks.

@zhli1142015
Copy link
Contributor Author

@zhli1142015

agg exprs + group by exprs + gid + _gen_grouping_pos. --> The last column is for handling duplicate grouping sets.
group by exprs + gid + agg exprs. --> gid is calculated by different way from above two cases. it's assigned with the sequence number of project set.

I have a question, the two cases, could you give some sql? We don't know when spark will generate the two cases. Thanks.

Hello @zhanglistar ,
Please check below sample code:

case class TestData3(a: Int, b: Option[Int])
val df = spark.sparkContext.parallelize(
      TestData3(1, None) ::
      TestData3(2, Some(2)) :: Nil).toDF()
import org.apache.spark.sql.functions._
df.agg(count($"a"), count($"b"), count(lit(1)), count_distinct($"a"), count_distinct($"b")).collect // case 3
df.createOrReplaceTempView("df")
spark.sql("select count(a) from df group by grouping sets((a), (a), (b))").collect // case 2

@zhli1142015 zhli1142015 deleted the expand-change-4-12 branch April 26, 2023 04:44
liuneng1994 pushed a commit that referenced this pull request Apr 26, 2023
What changes were proposed in this pull request?
support new ExpandRel introduced by #1361

(Fixes: #1392)

How was this patch tested?
unit tests
@zhanglistar
Copy link
Contributor

zhanglistar commented Apr 26, 2023

@zhli1142015

agg exprs + group by exprs + gid + _gen_grouping_pos. --> The last column is for handling duplicate grouping sets.
group by exprs + gid + agg exprs. --> gid is calculated by different way from above two cases. it's assigned with the sequence number of project set.

I have a question, the two cases, could you give some sql? We don't know when spark will generate the two cases. Thanks.

Hello @zhanglistar , Please check below sample code:

case class TestData3(a: Int, b: Option[Int])
val df = spark.sparkContext.parallelize(
      TestData3(1, None) ::
      TestData3(2, Some(2)) :: Nil).toDF()
import org.apache.spark.sql.functions._
df.agg(count($"a"), count($"b"), count(lit(1)), count_distinct($"a"), count_distinct($"b")).collect // case 3
df.createOrReplaceTempView("df")
spark.sql("select count(a) from df group by grouping sets((a), (a), (b))").collect // case 2

@zhli1142015 Thanks! For the sql spark.sql("select count(a) from df group by grouping sets((a), (a), (b))").collect // case 2, just curious, why not just duplicate the two grouping sets (a) for optimization?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants