Make Collect, First, and Last deterministic aggregate functions for Spark-3.3 #4677
Conversation
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
LGTM, just minor comments
It is not entirely clear to me what `deterministic` is used for when expressions are aggregate functions. I see some optimizations, like the one that triggered the change, but I don't fully understand each case. This is arguably separate from @nartal1's change, but now that we agree with Spark that these functions are deterministic, do we know whether the GPU is as deterministic as the CPU, and does it matter?
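For concreteness, one way to inspect the flag from user code is shown below. This is an illustrative snippet, assuming a local Spark 3.3+ session; the expected output for earlier versions is inferred from the discussion in this thread rather than guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, rand}

object DeterministicFlagDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("deterministic-flag-demo")
    .getOrCreate()
  import spark.implicits._

  // Column.expr exposes the underlying Catalyst expression; `deterministic`
  // is the flag the optimizer consults before reordering or collapsing plans.
  println(first($"v").expr.deterministic) // true on Spark 3.3+, false on earlier versions
  println(rand().expr.deterministic)      // false on all versions

  spark.stop()
}
```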
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
if you mean we=Plugin, then the agreement with Spark is that determinism depends on the determinism of the children, which is the default definition in `Expression` inherited by Collect, First, and Last. So we should double-check that we correctly compute `deterministic` if, e.g., the input is something like a non-stable out-of-core sort.
@abellina IIUC from the discussion in Spark's PR, these functions were mistakenly marked as non-deterministic. Deterministic in this context means the result will be the same within an (ordered) group. The optimizer rule is applied to these functions once we stop marking them as non-deterministic. From the GPU point of view, I think the same rule would apply, right? And based on the default definition of `deterministic`, it would be set to `true` only if all the children are deterministic.
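To make the "default definition" in this exchange concrete, here is a minimal, self-contained sketch using simplified stand-in types, not Spark's actual classes: an expression is deterministic iff all of its children are, unless it overrides the flag itself.

```scala
// Simplified stand-ins for Catalyst expressions, mirroring the default
// child-based definition of `deterministic` discussed above.
trait Expr {
  def children: Seq[Expr]
  def deterministic: Boolean = children.forall(_.deterministic)
}

case class Literal(value: Int) extends Expr {
  val children: Seq[Expr] = Nil // leaves are deterministic by default
}

case class Rand() extends Expr {
  val children: Seq[Expr] = Nil
  override def deterministic: Boolean = false // explicitly non-deterministic
}

// With no override, First inherits the child-based default, which is the
// Spark 3.3 behavior this PR matches.
case class First(child: Expr) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

object Demo extends App {
  println(First(Literal(1)).deterministic) // true
  println(First(Rand()).deterministic)     // false: non-determinism propagates up
}
```

This is also why the point about a non-stable out-of-core sort matters: under the default definition, determinism of the aggregate is only as good as the determinism of its input.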
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Having a flag whose meaning we don't fully understand is confusing. That said, our behavior hasn't changed, and we are still as non-deterministic as we were before. We are following Spark's lead in setting this, and that seems like the right thing to do (and they are also non-deterministic). I think this calls for follow-on research to figure out what this flag means in all scenarios, and perhaps some updated comments in these expressions.
Agreed that the header is confusing. It looks like the headers/comments were not updated when Spark changed these functions to deterministic.
build
@abellina Please take another look and let me know whether we can merge this PR.
I filed #4684 to see if we can find more info around this flag. In terms of this PR, it adheres to the value used in Spark 3.3, so that seems OK; that said, I don't know enough about this right now to say I understand the side effects. If you or @gerashegalov are pretty convinced this is OK, then by all means please merge.
Thanks @abellina for your input! Merging this as it fixes the original issue. |
This fixes #4286. Spark has made these functions deterministic in Spark 3.3, and this PR is intended to do the same. For previous versions of Spark (i.e., prior to Spark 3.3) we keep them as non-deterministic.
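As a rough illustration of that version split (hypothetical names below, not the plugin's real shim classes), the shims effectively pin the flag to false before 3.3 and fall back to the child-based default on 3.3+:

```scala
// Hypothetical sketch of per-version gating; GpuExpr, GpuFirst, and
// sparkIsAtLeast330 are illustrative names, not the plugin's actual API.
trait GpuExpr {
  def children: Seq[GpuExpr]
  def deterministic: Boolean = children.forall(_.deterministic)
}

case class GpuColumnRef(name: String) extends GpuExpr {
  val children: Seq[GpuExpr] = Nil
}

case class GpuFirst(child: GpuExpr, sparkIsAtLeast330: Boolean) extends GpuExpr {
  val children: Seq[GpuExpr] = Seq(child)
  // Before Spark 3.3: pinned non-deterministic to match Spark's old flag.
  // Spark 3.3+: the default child-based definition, matching Spark's change.
  override def deterministic: Boolean =
    sparkIsAtLeast330 && children.forall(_.deterministic)
}

object ShimDemo extends App {
  val col = GpuColumnRef("v")
  println(GpuFirst(col, sparkIsAtLeast330 = false).deterministic) // false (pre-3.3)
  println(GpuFirst(col, sparkIsAtLeast330 = true).deterministic)  // true (3.3+)
}
```

In the real plugin the version choice is resolved at build time through the shim source trees rather than a runtime flag, which is why the change touches the `301until330-all` and `330+` shim files referenced in the review threads above.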