
Make Collect, first and last as deterministic aggregate functions for Spark-3.3 #4677

Merged
merged 5 commits into from
Feb 3, 2022

Conversation


@nartal1 nartal1 commented Feb 2, 2022

This fixes #4286. Spark made these functions deterministic in Spark 3.3, and this PR does the same. For previous versions of Spark (i.e., prior to Spark 3.3) we still keep them as non-deterministic.
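The gist of the change can be sketched as a version gate: the same aggregate functions report a different `deterministic` value depending on the Spark version being shimmed. This is an illustrative Python sketch, not the plugin's actual Scala code; the function name and version tuple are made up for the example.

```python
# Hypothetical sketch of the version-gated behavior this PR introduces:
# Collect/First/Last report deterministic only on Spark 3.3 and later.
def is_deterministic_agg(func_name: str, spark_version: tuple) -> bool:
    """Return the `deterministic` flag for an aggregate on a given Spark version."""
    version_gated = {"collect_list", "collect_set", "first", "last"}
    if func_name in version_gated:
        # Prior to Spark 3.3 these were marked non-deterministic.
        return spark_version >= (3, 3)
    return True  # other aggregates keep their usual default here

print(is_deterministic_agg("first", (3, 3)))  # True
print(is_deterministic_agg("first", (3, 2)))  # False
```

The real plugin achieves this through per-version shim classes rather than a runtime check, but the observable behavior is the same.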

@nartal1 nartal1 added task Work required that improves the product but is not user facing audit_3.3.0 Audit related tasks for 3.3.0 labels Feb 2, 2022
@nartal1 nartal1 added this to the Jan 31 - Feb 11 milestone Feb 2, 2022
@nartal1 nartal1 self-assigned this Feb 2, 2022

nartal1 commented Feb 2, 2022

build

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@gerashegalov gerashegalov left a comment


LGTM, just minor comments


abellina commented Feb 2, 2022

It is not entirely clear to me what deterministic is used for when expressions are aggregate functions. I see some optimizations, like the one that triggered the change, but I don't fully understand each case. This is arguably separate from @nartal1's change, but I think now that we agree with Spark that these functions are deterministic, do we know that the GPU is as deterministic as the CPU, and does it matter?

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@gerashegalov

but I think now that we agree with Spark that these functions are deterministic, do we know that the GPU is as deterministic as the CPU, and does it matter?

If you mean we=Plugin, then the agreement with Spark is that determinism depends on the determinism of the children, which is the default definition in Expression that Collect, First, and Last inherit. So we should double-check that we correctly compute deterministic if, e.g., the input is something like a non-stable out-of-core sort.
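The default definition being referred to can be sketched as: an expression is deterministic only if it has no non-deterministic behavior of its own and all of its children are deterministic. This is a hedged Python sketch of that rule, not Catalyst's actual Scala code; the class and field names are illustrative.

```python
# Sketch of Catalyst's default `deterministic` rule: an expression is
# deterministic iff all of its children are. Names here are made up.
from dataclasses import dataclass, field

@dataclass
class Expr:
    children: list = field(default_factory=list)
    _self_deterministic: bool = True  # the node's own behavior

    @property
    def deterministic(self) -> bool:
        # Default definition: propagate determinism up from the children.
        return self._self_deterministic and all(
            c.deterministic for c in self.children
        )

rand = Expr(_self_deterministic=False)  # e.g. rand() is non-deterministic
col = Expr()                            # a plain column reference
first_of_col = Expr(children=[col])     # First(col): deterministic
first_of_rand = Expr(children=[rand])   # First(rand()): not deterministic
print(first_of_col.deterministic)   # True
print(first_of_rand.deterministic)  # False
```

Under this rule, removing the explicit `deterministic = false` override from Collect/First/Last makes them fall back to child-based determinism rather than being unconditionally deterministic.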


nartal1 commented Feb 2, 2022

It is not entirely clear to me what deterministic is used for when expressions are aggregate functions. I see some optimizations, like the one that triggered the change, but I don't fully understand each case. This is arguably separate from @nartal1's change, but I think now that we agree with Spark that these functions are deterministic, do we know that the GPU is as deterministic as the CPU, and does it matter?

@abellina IIUC from the discussion in Spark's PR, these functions were mistakenly marked as non-deterministic. Deterministic in this context means the result will be the same within an ordered group. Optimizer rules apply to these functions once they are no longer marked non-deterministic. From the GPU's point of view, I think the same rules would apply, right? And per the default definition of deterministic, it would be set to false if the child expressions are not deterministic.
More context on why these have been made deterministic is in this comment: apache/spark#29810 (comment)
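The point about "same result within an ordered group" can be illustrated with a toy first-per-group aggregation: the answer is only as deterministic as the order of the input rows. This is a plain-Python sketch of the concept, not Spark's implementation; the function name is made up.

```python
# Sketch: first() is only as deterministic as its input ordering.
# With a fixed ordering the result repeats across runs; with an
# arbitrary (e.g. shuffle-dependent) order it may differ.
rows = [("a", 3), ("a", 1), ("b", 2)]

def first_per_group(rows):
    out = {}
    for key, value in rows:
        out.setdefault(key, value)  # keep the first value seen per key
    return out

print(first_per_group(rows))            # {'a': 3, 'b': 2}
print(first_per_group(reversed(rows)))  # {'b': 2, 'a': 1}
```

Same data, different traversal order, different "first" values per group, which is why the flag must account for whether the upstream ordering is itself deterministic.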

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

abellina commented Feb 2, 2022

Having a flag that is deterministic=true for a function whose header comments say "non-deterministic" all over (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L35 and https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L35) is really confusing.

That said, our behavior hasn't changed, and we are still as non-deterministic as we were before. We are following Spark's lead in setting this, and that seems like the right thing to do (and they are also non-deterministic). I think there is follow-on research to figure out what this flag means in all scenarios, and perhaps to update the comments in these expressions.


nartal1 commented Feb 2, 2022

Having a flag that is deterministic=true for a function whose header comments say "non-deterministic" all over (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L35 and https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L35) is really confusing.

Agreed that the header is confusing. It looks like the headers/comments were not updated when deterministic=false was removed.


nartal1 commented Feb 2, 2022

build


nartal1 commented Feb 3, 2022

@abellina Please take another look and let me know if we could merge this PR.


abellina commented Feb 3, 2022

I filed #4684 to see if we can find more info around this flag. In terms of this PR, it adheres to the value used in Spark 3.3, so that seems OK; that said, I don't know enough about this right now to say I understand the side effects. If you or @gerashegalov are pretty convinced this is OK, then by all means please merge.


nartal1 commented Feb 3, 2022

Thanks @abellina for your input! Merging this as it fixes the original issue.

@nartal1 nartal1 merged commit 8380df8 into NVIDIA:branch-22.04 Feb 3, 2022
@nartal1 nartal1 deleted the first_deterministic branch February 3, 2022 16:49
Successfully merging this pull request may close these issues.

[Audit] [FEA] [SPARK-32940][SQL] Collect, first and last should be deterministic aggregate functions