Make Collect, First, and Last deterministic aggregate functions for Spark-3.3 #4677
Conversation
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
LGTM, just minor comments
It is not entirely clear to me what `deterministic` is used for when expressions are aggregate functions. I see some optimizations, like the one that triggered the change, but I don't fully understand each case. This is arguably separate from @nartal1's change, but now that we agree with Spark that these functions are deterministic, do we know whether the GPU is as deterministic as the CPU, and does it matter?
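For concreteness, one way to inspect the flag from user code is shown below. This is an illustrative snippet, assuming a local Spark 3.3+ session; the expected output for earlier versions is inferred from the discussion in this thread rather than guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, rand}

object DeterministicFlagDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("deterministic-flag-demo")
    .getOrCreate()
  import spark.implicits._

  // Column.expr exposes the underlying Catalyst expression; `deterministic`
  // is the flag the optimizer consults before reordering or collapsing plans.
  println(first($"v").expr.deterministic) // true on Spark 3.3+, false on earlier versions
  println(rand().expr.deterministic)      // false on all versions

  spark.stop()
}
```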
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
if you mean we=Plugin, then the agreement with Spark is that determinism depends on the determinism of the children, which is the default definition in `Expression` inherited by Collect, First, and Last. So we should double-check that we correctly compute `deterministic` if, e.g., the input is something like a non-stable out-of-core sort.
@abellina IIUC from the discussion in Spark's PR, these functions were mistakenly marked as non-deterministic. Deterministic in this context means the result will be the same within an (ordered) group. The optimizer rule is applied to these functions once we stop marking them as non-deterministic. From the GPU point of view, I think the same rule would apply, right? And based on the default definition of `deterministic`, it would be set to `true` only if all the children are deterministic.
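To make the "default definition" in this exchange concrete, here is a minimal, self-contained sketch using simplified stand-in types, not Spark's actual classes: an expression is deterministic iff all of its children are, unless it overrides the flag itself.

```scala
// Simplified stand-ins for Catalyst expressions, mirroring the default
// child-based definition of `deterministic` discussed above.
trait Expr {
  def children: Seq[Expr]
  def deterministic: Boolean = children.forall(_.deterministic)
}

case class Literal(value: Int) extends Expr {
  val children: Seq[Expr] = Nil // leaves are deterministic by default
}

case class Rand() extends Expr {
  val children: Seq[Expr] = Nil
  override def deterministic: Boolean = false // explicitly non-deterministic
}

// With no override, First inherits the child-based default, which is the
// Spark 3.3 behavior this PR matches.
case class First(child: Expr) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

object Demo extends App {
  println(First(Literal(1)).deterministic) // true
  println(First(Rand()).deterministic)     // false: non-determinism propagates up
}
```

This is also why the point about a non-stable out-of-core sort matters: under the default definition, determinism of the aggregate is only as good as the determinism of its input.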
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Having a flag whose meaning we don't fully understand is confusing. That said, our behavior hasn't changed, and we are still as non-deterministic as we were before. We are following Spark's lead in setting this, and that seems like the right thing to do (and they are also non-deterministic). I think this calls for follow-on research to figure out what this flag means in all scenarios, and perhaps some updated comments in these expressions.
Agreed that the header is confusing. It looks like the headers/comments were not updated when Spark changed these functions to deterministic.
build
@abellina Please take another look and let me know whether we can merge this PR.
I filed #4684 to see if we can find more info around this flag. In terms of this PR, it adheres to the value used in Spark 3.3, so that seems OK; that said, I don't know enough about this right now to say I understand the side effects. If you or @gerashegalov are pretty convinced this is OK, then by all means please merge.
Thanks @abellina for your input! Merging this as it fixes the original issue. |
This fixes #4286. Spark has made these functions deterministic in Spark 3.3, and this PR is intended to do the same. For previous versions of Spark (i.e., prior to Spark 3.3) we keep them as non-deterministic.
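As a rough illustration of that version split (hypothetical names below, not the plugin's real shim classes), the shims effectively pin the flag to false before 3.3 and fall back to the child-based default on 3.3+:

```scala
// Hypothetical sketch of per-version gating; GpuExpr, GpuFirst, and
// sparkIsAtLeast330 are illustrative names, not the plugin's actual API.
trait GpuExpr {
  def children: Seq[GpuExpr]
  def deterministic: Boolean = children.forall(_.deterministic)
}

case class GpuColumnRef(name: String) extends GpuExpr {
  val children: Seq[GpuExpr] = Nil
}

case class GpuFirst(child: GpuExpr, sparkIsAtLeast330: Boolean) extends GpuExpr {
  val children: Seq[GpuExpr] = Seq(child)
  // Before Spark 3.3: pinned non-deterministic to match Spark's old flag.
  // Spark 3.3+: the default child-based definition, matching Spark's change.
  override def deterministic: Boolean =
    sparkIsAtLeast330 && children.forall(_.deterministic)
}

object ShimDemo extends App {
  val col = GpuColumnRef("v")
  println(GpuFirst(col, sparkIsAtLeast330 = false).deterministic) // false (pre-3.3)
  println(GpuFirst(col, sparkIsAtLeast330 = true).deterministic)  // true (3.3+)
}
```

In the real plugin the version choice is resolved at build time through the shim source trees rather than a runtime flag, which is why the change touches the `301until330-all` and `330+` shim files referenced in the review threads above.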