Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Add comment on input_schema from AggregateExec #7727

Merged
merged 3 commits into from
Oct 6, 2023

Conversation

viirya
Copy link
Member

@viirya viirya commented Oct 3, 2023

Which issue does this PR close?

None

Rationale for this change

As we integrate AggregateExec, one thing confusing me is why AggregateExec keeps a separate input_schema which is not actually schema of input operator, but is always the schema of input of aggregation. In other words, partial and final aggregation operators have the same input schema parameter. It doesn't look consistent on input_schema and input functions on the operator and other operators. And another look at this input_schema parameter, it doesn't look like there is any meaningful usage of it. It causes a bit strange when we integrate Spark with DataFusion's aggregation operator.

I've tried to remove it first in this patch. It is confirmed that input_schema is not used during query execution as all unit tests and end-to-end tests passed without it.

However, verify benchmark results pipeline has some failures. After looking into it, I found that the parameter is used when generating physical plan from protobuf nodes. When doing that, it needs input schema of partial aggregation to initiate aggregate expressions.

I changed this patch to add a few comment on it to help me and others to understand it in the future.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft October 3, 2023 02:57
@github-actions github-actions bot added the core Core DataFusion crate label Oct 3, 2023
@viirya viirya changed the title Remove input_schema Remove input_schema from AggregateExec Oct 3, 2023
@viirya
Copy link
Member Author

viirya commented Oct 3, 2023

Interesting. All tests are passed but verify benchmark results has failure.

@viirya
Copy link
Member Author

viirya commented Oct 3, 2023

Ah, I see. It is used when generating physical plan from proto nodes. When doing that, it needs input schema of partial aggregation to initiate aggregate expressions.

@viirya
Copy link
Member Author

viirya commented Oct 3, 2023

For the above reason, I'm not sure if it's possible to remove the parameter now. Let me think about it more.

@viirya viirya changed the title Remove input_schema from AggregateExec Minor: Add comment on input_schema from AggregateExec Oct 4, 2023
@viirya viirya marked this pull request as ready for review October 4, 2023 06:18
@github-actions github-actions bot removed the core Core DataFusion crate label Oct 4, 2023
@viirya
Copy link
Member Author

viirya commented Oct 4, 2023

cc @alamb

@@ -285,7 +285,9 @@ pub struct AggregateExec {
schema: SchemaRef,
/// Input schema before any aggregation is applied. For partial aggregate this will be the
/// same as input.schema() but for the final aggregate it will be the same as the input
/// to the partial aggregate
/// to the partial aggregate, i.e., partial and final aggregates have same `input_schema`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I went to double check this (being inspired by the thoroughness of your own reviews @viirya ) and I was able to remove the entire field: #7741

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is fine, but I think #7741 would be even better

@viirya
Copy link
Member Author

viirya commented Oct 4, 2023 via email

@viirya
Copy link
Member Author

viirya commented Oct 4, 2023 via email

@viirya
Copy link
Member Author

viirya commented Oct 4, 2023

@alamb You can see my previous commits which removed the input_schema field. In CI, all unit tests and end-to-end tests can pass but verify benchmark results gets failures when deserializing protobuf to physical aggregate.

It is because during initializing aggregate expressions for final aggregate, their constructors all needs their input data type from input schema of partial aggregate (i.e., input schema before aggregate). That is why input_schema is there.

@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

@alamb You can see my previous commits which removed the input_schema field. In CI, all unit tests and end-to-end tests can pass but verify benchmark results gets failures when deserializing protobuf to physical aggregate.

It is because during initializing aggregate expressions for final aggregate, their constructors all needs their input data type from input schema of partial aggregate (i.e., input schema before aggregate). That is why input_schema is there.

Sorry @viirya -- I should have looked at your comments more closely.

@viirya
Copy link
Member Author

viirya commented Oct 4, 2023

No worries @alamb . Thanks for approval. 😄

I guess it is still possible to remove it, if we embed the data type info in to individual aggregate expression protobuf node. But this will involve a lot of changes just for removing the input_schema. 😄 So I stopped there and turned to add a few comment to explain why the parameter is there.

@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

I guess it is still possible to remove it, if we embed the data type info in to individual aggregate expression protobuf node. But this will involve a lot of changes just for removing the input_schema. 😄 So I stopped there and turned to add a few comment to explain why the parameter is there.

Makes sense. I was thinking we could effectively recover the intermediate schema somehow (though I not quite sure how)

@viirya viirya merged commit a8c01de into apache:main Oct 6, 2023
23 checks passed
@viirya
Copy link
Member Author

viirya commented Oct 6, 2023

Thanks @alamb. I merged this first. We may investigate if it is possible to remove it completely later.

Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
* Remove input_schema

* Revert "Remove input_schema"

This reverts commit 5096883.

* Add comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants