Minor: Add comment on input_schema from AggregateExec #7727
Interesting. All tests passed, but
Ah, I see. It is used when generating the physical plan from proto nodes. Doing so requires the input schema of the partial aggregation to initialize the aggregate expressions.
For that reason, I'm not sure it's possible to remove the parameter now. Let me think about it more.
This reverts commit 5096883.
88eee3b to 070c5a2
cc @alamb
@@ -285,7 +285,9 @@ pub struct AggregateExec {
     schema: SchemaRef,
     /// Input schema before any aggregation is applied. For partial aggregate this will be the
     /// same as input.schema() but for the final aggregate it will be the same as the input
-    /// to the partial aggregate
+    /// to the partial aggregate, i.e., partial and final aggregates have same `input_schema`.
I think this change is fine, but I think #7741 would be even better
I am away from my laptop, but I tried to remove it at first, and the schema
seems necessary when deserializing from protobuf to a physical aggregate plan,
because aggregate expression initialization needs it.
I left a comment above explaining it and where/why it failed in the CI
pipeline.
@alamb You can see my previous commits which removed the It is because, when initializing the aggregate expressions for the final aggregate, their constructors all need their input data types from the input schema of the partial aggregate (i.e., the input schema before aggregation). That is why
Sorry @viirya -- I should have looked at your comments more closely.
No worries @alamb. Thanks for the approval. 😄 I guess it is still possible to remove it if we embed the data type info into each individual aggregate expression protobuf node. But this would involve a lot of changes just for removing the
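To make the "embed the data type info" idea concrete, here is a rough sketch using only the standard library. Everything in it is hypothetical: `SerializedAgg`, `AggExpr`, and `deserialize_agg` are invented for illustration and do not correspond to DataFusion's actual protobuf messages or Rust API. It only shows that, if each serialized aggregate node carried its argument type, deserialization would no longer need a separate pre-aggregation schema.

```rust
// Hypothetical sketch: none of these types exist in DataFusion or its protobuf definitions.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
}

/// Stand-in for a serialized aggregate expression node.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializedAgg {
    pub func: String,
    pub column: String,
    /// The proposed addition: the argument type travels with the node.
    pub input_type: Option<DataType>,
}

/// Stand-in for an initialized aggregate expression.
#[derive(Debug, Clone, PartialEq)]
pub struct AggExpr {
    pub func: String,
    pub column: String,
    pub input_type: DataType,
}

/// Rebuild the expression. `input_schema` plays the role of the schema the
/// operator keeps today; it is consulted only when the node carries no type.
pub fn deserialize_agg(
    node: &SerializedAgg,
    input_schema: Option<&[(&str, DataType)]>,
) -> Result<AggExpr, String> {
    // Fallback path: look the column up in the pre-aggregation schema, if given.
    let from_schema = || {
        input_schema.and_then(|schema| {
            schema
                .iter()
                .find(|(name, _)| *name == node.column.as_str())
                .map(|(_, ty)| ty.clone())
        })
    };
    let input_type = node
        .input_type
        .clone()
        .or_else(from_schema)
        .ok_or_else(|| format!("cannot resolve input type of column '{}'", node.column))?;
    Ok(AggExpr {
        func: node.func.clone(),
        column: node.column.clone(),
        input_type,
    })
}
```

In this toy model, a node with `input_type: Some(..)` deserializes without any schema, while a node without it fails unless a schema is supplied. That is the trade-off discussed here: every aggregate expression node would have to change in order to drop one operator-level field.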
Makes sense. I was thinking we could effectively recover the intermediate schema somehow (though I am not quite sure how)
Thanks @alamb. I merged this first. We may investigate later whether it is possible to remove it completely.
* Remove input_schema
* Revert "Remove input_schema"
  This reverts commit 5096883.
* Add comment
Which issue does this PR close?
None
Rationale for this change
As we integrate `AggregateExec`, one thing that confuses me is why `AggregateExec` keeps a separate `input_schema`, which is not actually the schema of the input operator but is always the schema of the input to the aggregation. In other words, the partial and final aggregation operators have the same `input_schema` parameter. The `input_schema` and `input` functions therefore look inconsistent, both on this operator and relative to other operators. On a closer look, the `input_schema` parameter did not appear to have any meaningful use. This is a bit awkward when we integrate Spark with DataFusion's aggregation operator.
I first tried to remove it in this patch. It is confirmed that `input_schema` is not used during query execution, as all unit tests and end-to-end tests passed without it.
However, the `verify benchmark results` pipeline has some failures. After looking into it, I found that the parameter is used when generating the physical plan from protobuf nodes. Doing so requires the input schema of the partial aggregation to initialize the aggregate expressions.
I changed this patch to add a few comments on it to help me and others understand it in the future.
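A small toy model may help show why the final aggregate cannot rely on its child's schema. This is only a sketch under stated assumptions: `ToyAggregate`, `plan_two_stage`, `partial_avg_output`, and the intermediate column names are all invented for illustration and are not DataFusion's types or naming. The point is just that the final stage's child schema contains intermediate state columns, so the original argument column can be resolved only through the shared `input_schema`.

```rust
// Toy model only: hypothetical stand-ins, not DataFusion's actual types or API.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

pub type Schema = Vec<Field>;

pub fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Look up the type of an aggregate's argument column by name.
pub fn arg_type(schema: &Schema, column: &str) -> Option<DataType> {
    schema.iter().find(|f| f.name == column).map(|f| f.data_type.clone())
}

/// Output schema of a partial AVG(column) grouped by `group`: the group key
/// plus intermediate state. The state column names are invented for this sketch.
pub fn partial_avg_output(input: &Schema, group: &str, column: &str) -> Option<Schema> {
    let group_type = arg_type(input, group)?;
    let value_type = arg_type(input, column)?;
    Some(vec![
        field(group, group_type),
        field(&format!("avg({})[count]", column), DataType::Int64),
        field(&format!("avg({})[sum]", column), value_type),
    ])
}

/// Toy aggregate operator carrying the two schemas discussed above.
pub struct ToyAggregate {
    /// Schema of the direct child operator (what `input.schema()` would return).
    pub child_schema: Schema,
    /// Schema before any aggregation; identical for the partial and final stages.
    pub input_schema: Schema,
}

/// Build a partial stage over `raw` and a final stage over the partial output.
pub fn plan_two_stage(
    raw: &Schema,
    group: &str,
    column: &str,
) -> Option<(ToyAggregate, ToyAggregate)> {
    let partial_out = partial_avg_output(raw, group, column)?;
    let partial = ToyAggregate { child_schema: raw.clone(), input_schema: raw.clone() };
    let final_agg = ToyAggregate { child_schema: partial_out, input_schema: raw.clone() };
    Some((partial, final_agg))
}
```

With a raw schema of `region: Utf8, price: Float64`, `arg_type(&final_agg.input_schema, "price")` resolves to `Float64`, while the same lookup against `final_agg.child_schema` returns `None`, because the child only exposes the group key and the intermediate state columns.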
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?