
[FEA] Hash Aggregate Cleanup to make closer to spark #3

Closed
revans2 opened this issue May 28, 2020 · 2 comments
Labels: feature request (New feature or request), SQL (part of the SQL/Dataframe plugin)

Comments

revans2 (Collaborator) commented May 28, 2020

Is your feature request related to a problem? Please describe.
A lot of the processing in the GPU aggregate operations is not as close to Spark as it could be. It would be nice to make it more similar, to help reduce the possibility of bugs and to increase the possibility of code reuse.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify SQL part of the SQL/Dataframe plugin labels May 28, 2020
@revans2 revans2 changed the title [FEA] Hash Aggregate Cleanup to mke closer to spark [FEA] Hash Aggregate Cleanup to make closer to spark May 28, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 20, 2020
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
Update README and remove useless file
gerashegalov referenced this issue in gerashegalov/spark-rapids Sep 1, 2021
abellina (Collaborator) commented:

It seems like this issue can be closed by #3910.

We have made ourselves closer (re: setupReferences), and our interface calls out the relationship between our work and aggBufferAttributes in Spark. I am not sure how much closer we could get to the CPU. I am adding this comment to the issue to see whether we are satisfied with #3910 or need more work to get us closer.

abellina (Collaborator) commented Feb 1, 2022

This issue should have been closed by the work in #3910 and #4272.

In a nutshell, the aggregate.scala code was made simpler and more consistent between group-by and reduction, in addition to removing complicated code in setupReferences that was not required. In AggregateFunctions.scala, the GpuAggregateFunction and CudfAggregate interfaces were better defined, documented, and made flexible enough for complicated operations like overflow checking for decimals.
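As a rough illustration only (this is NOT the actual spark-rapids code, just a hypothetical toy model of the interface split the comment describes), the relationship between a Spark-level aggregate function and the cuDF-level aggregates that operate on its buffer columns might be sketched like this:

```scala
// Hypothetical, simplified sketch. Names echo GpuAggregateFunction /
// CudfAggregate from the comment above, but the bodies are illustrative
// assumptions, not the real spark-rapids implementation.

// A cuDF-level aggregate: how one buffer column is reduced.
trait CudfAggregate {
  def reduce(values: Seq[Long]): Long
}

object CudfSum extends CudfAggregate {
  def reduce(values: Seq[Long]): Long = values.sum
}

object CudfMax extends CudfAggregate {
  def reduce(values: Seq[Long]): Long = values.max
}

// A Spark-level aggregate function, expressed as the cuDF aggregates that
// update and merge its aggregation buffer (loosely analogous to how Spark's
// aggBufferAttributes describe the buffer schema).
trait GpuAggregateFunction {
  def updateAggregates: Seq[CudfAggregate]
  def mergeAggregates: Seq[CudfAggregate]
}

// A toy Average keeps two buffer columns (sum, count); both the update and
// merge phases reduce each column with a sum in this simplified model.
object GpuAverage extends GpuAggregateFunction {
  val updateAggregates: Seq[CudfAggregate] = Seq(CudfSum, CudfSum)
  val mergeAggregates: Seq[CudfAggregate]  = Seq(CudfSum, CudfSum)
}
```

The point of such a split is that group-by and reduction paths can share one description of the buffer and its per-column aggregates, rather than duplicating that logic.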

@abellina abellina closed this as completed Feb 1, 2022
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Dec 6, 2022
1. Fixed indentation.
2. Hardcode for supported date format.
3. Added tests for timestamp strings read as dates.
4. Fixed behaviour for NVIDIA#3 above.
mythrocks added a commit that referenced this issue Dec 7, 2022
* Hive Text parsing of invalid date strings should not cause exceptions.

Fixes #7089. There were two problems:
  1. Strings between field delimiters should not be trimmed before casting to dates.
  2. Invalid date strings should not be causing exceptions. They should return null
     values, as is the convention in Hive's `LazySimpleSerDe`.

Signed-off-by: MithunR <mythrocks@gmail.com>

* Fixed verify errors.

* Fixed merge duplication.

* Review fixes:

1. Fixed indentation.
2. Hardcode for supported date format.
3. Added tests for timestamp strings read as dates.
4. Fixed behaviour for #3 above.

Signed-off-by: MithunR <mythrocks@gmail.com>
Co-authored-by: Robert (Bobby) Evans <bobby@apache.org>
3 participants