Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable sort for single-level nesting struct columns on GPU #1883

Merged
merged 35 commits into from
Mar 26, 2021

Conversation

gerashegalov
Copy link
Collaborator

@gerashegalov gerashegalov commented Mar 5, 2021

This PR contributes to #1605. It depends on rapidsai/cudf#7422

  • adds single-level struct type

Signed-off-by: Gera Shegalov gera@apache.org

Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Copy link
Collaborator Author

gerashegalov commented Mar 5, 2021

Status:

TEST_PARALLEL=1 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 integration_tests/run_pyspark_from_build.sh \
  -k "test_single_orderby and struct"

we observe the failure: UPDATE: ✔️ resolved

ai.rapids.cudf.CudfException: cuDF failure at: /home/gshegalov/gits/cudf/cpp/src/sort/sort_column.cu:112: Column type must be relationally comparable
at ai.rapids.cudf.Table.orderBy(Native Method)
at ai.rapids.cudf.Table.orderBy(Table.java:1441)
at com.nvidia.spark.rapids.GpuSorter.$anonfun$sort$2(SortUtils.scala:183)

@jlowe jlowe added the SQL part of the SQL/Dataframe plugin label Mar 5, 2021
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's here looks good, but there are some things missing.

At some point this needs to be built with mvn verify to generate the changes to supported_ops.md and those changes should be in this PR.

It would be nice to have some negative testing for this, i.e.: testing that we don't try to place sorts of deeply nested struct key columns or non-default struct sort ordering on the GPU.

Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Copy link
Collaborator Author

build

@gerashegalov gerashegalov marked this pull request as ready for review March 11, 2021 09:14
@gerashegalov gerashegalov changed the title [WIP] Enable sort for single-level nesting struct columns on GPU Enable sort for single-level nesting struct columns on GPU Mar 11, 2021
@gerashegalov
Copy link
Collaborator Author

$ TEST_PARALLEL=1 COVERAGE_SUBMIT_FLAGS=$JDBSTR SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 integration_tests/run_pyspark_from_build.sh -k "test_single_struct and orderby"

=========================== short test summary info ============================
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>0-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  second-level structs are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>0-Array(String)]
  arrays are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>0-Map(String(not_null),Map(String(not_null),String))]
  maps are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>1-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  second-level structs are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>1-Array(String)]
  arrays are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS FIRST'>1-Map(String(not_null),Map(String(not_null),String))]
  maps are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS LAST'>-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS LAST'>-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS LAST'>-Array(String)]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a ASC NULLS LAST'>-Map(String(not_null),Map(String(not_null),String))]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>0-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  second-level structs are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>0-Array(String)]
  arrays are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>0-Map(String(not_null),Map(String(not_null),String))]
  maps are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS FIRST'>-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS FIRST'>-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS FIRST'>-Array(String)]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS FIRST'>-Map(String(not_null),Map(String(not_null),String))]
  opposite null order not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>1-Struct(['child0', Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])])]
  second-level structs are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>1-Array(String)]
  arrays are not supported
XFAIL ../../src/main/python/sort_test.py::test_single_nested_order_orderby[Column<'a DESC NULLS LAST'>1-Map(String(not_null),Map(String(not_null),String))]
  maps are not supported
=============== 4 passed, 4848 deselected, 20 xfailed in 43.43s ================

@jlowe jlowe marked this pull request as draft March 11, 2021 15:23
Signed-off-by: Gera Shegalov <gera@apache.org>
- Create a single config for RAT in parent.
- Update RAT exlusion list to filter out hidden diretories

Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPU Sort uses upperbound, but not in all cases.

withResource(sorter.upperBound(mergedBatch, cutoffCb)) { result =>

without rapidsai/cudf#7690 that means we have a ticking time bomb in the code.

Also the single partition work around makes no since to me at all. Until that is also fixed I see no way for us to merge this in.

@gerashegalov
Copy link
Collaborator Author

Also the single partition work around makes no since to me at all. Until that is also fixed I see no way for us to merge this in.

I understand the concern about GPU Sort.

As for the single partition workaround, It's just incremental delivery which mimics the progress we have on the cuDF side. It would make sense to me as long as it works as documented.

@gerashegalov gerashegalov requested review from revans2 and jlowe March 25, 2021 18:41
revans2
revans2 previously approved these changes Mar 25, 2021
@gerashegalov
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better, but I didn't see a test added for the ability to sort a table containing arrays but not using arrays as the key. Is that going to be done as a followup then?

@gerashegalov
Copy link
Collaborator Author

@jlowe these existing tests cover the scenarios you are calling out. they are passing now locally

@gerashegalov
Copy link
Collaborator Author

build

@jlowe
Copy link
Member

jlowe commented Mar 26, 2021

these existing tests cover the scenarios you are calling out. they are passing now locally

Great, thanks @gerashegalov!

@gerashegalov gerashegalov merged commit 64f9d9b into NVIDIA:branch-0.5 Mar 26, 2021
@gerashegalov gerashegalov deleted the issue-1605-struct-sort branch March 26, 2021 22:04
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Adds single-level struct columns to sort. This PR contributes to NVIDIA#1605 

The following limitations apply with this PR for a total sort, and will be resolved in follow-up PR's
- only if the number of partitions is 1 
- only if spark.rapids.sql.stableSort.enabled is true
 
Signed-off-by: Gera Shegalov <gera@apache.org>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Adds single-level struct columns to sort. This PR contributes to NVIDIA#1605 

The following limitations apply with this PR for a total sort, and will be resolved in follow-up PR's
- only if the number of partitions is 1 
- only if spark.rapids.sql.stableSort.enabled is true
 
Signed-off-by: Gera Shegalov <gera@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SQL part of the SQL/Dataframe plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants