Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] groupby on STRUCT / LIST input passes silently, returns incorrect grouping. #8905

Closed
mythrocks opened this issue Jul 29, 2021 · 4 comments · Fixed by #9227
Closed

[BUG] groupby on STRUCT / LIST input passes silently, returns incorrect grouping. #8905

mythrocks opened this issue Jul 29, 2021 · 4 comments · Fixed by #9227
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@mythrocks
Copy link
Contributor

mythrocks commented Jul 29, 2021

When a groupby is constructed on a STRUCT or LIST input, it does not currently fail, as one might expect of an unsupported type.

The groupby.aggregate() seems to return the input groupby column unchanged, implying that each row is its own group. I think this also contributes to issues like #8887, where the grouped_rolling_window() returns unexpected output.

It might be worth considering a CUDF_FAIL() for unsupported input types, at least until support is added.

The repro case for STRUCT input follows:

TEST_F(UtilitiesTest, groupby_struct_bork)
{
  using namespace cudf;
  using ints = fixed_width_column_wrapper<int32_t>;

  auto child       = ints{1, 2, 3, 1, 2, 2, 1, 3, 3, 2};
  auto struct_keys = structs_column_wrapper{child};
  ints vals{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

  auto expect_child       = ints{1, 2, 3};
  auto expect_struct_keys = structs_column_wrapper{expect_child};
  ints expect_vals{9, 19, 17};

  std::cout << "Struct input to groupby (keys)!" << std::endl;
  print(struct_keys);

  std::cout << "Expected output (grouped) struct keys!" << std::endl;
  print(expect_struct_keys);

  auto requests = std::vector<groupby::aggregation_request>{};
  requests.emplace_back(groupby::aggregation_request{});
  auto& agg_request  = requests.front();
  agg_request.values = vals;
  agg_request.aggregations.push_back(cudf::make_sum_aggregation());
  auto gby = groupby::groupby{table_view{{struct_keys}}, null_policy::EXCLUDE, sorted::NO, {}, {}};
  auto result = gby.aggregate(requests);

  std::cout << "Actual output (grouped) struct keys: " << std::endl;
  print(result.first->view().column(0));
}
@mythrocks mythrocks added bug Something isn't working Needs Triage Need team to review and classify labels Jul 29, 2021
@mythrocks mythrocks self-assigned this Jul 29, 2021
@mythrocks
Copy link
Contributor Author

Here's the corresponding repro case for LIST input:

TEST_F(UtilitiesTest, groupby_lists_bork)
{
  using namespace cudf;
  using ints = fixed_width_column_wrapper<int32_t>;
  using int_lists = lists_column_wrapper<int32_t>;

  auto lists = int_lists{ {1,2,3}, {4,5,6}, {1,2,3}, {4,5,6}, {1,2,3}, {4,5,6} };
  auto vals  = ints     {      0,       1,       2,       3,       4,       5};

  auto expect_lists_keys       = int_lists{{1, 2, 3}, {4,5,6}};
  ints expect_vals{9, 19, 17};

  std::cout << "Lists input to groupby (keys)!" << std::endl;
  print(lists);

  std::cout << "Expected output (grouped) lists keys!" << std::endl;
  print(expect_lists_keys);

  auto requests = std::vector<groupby::aggregation_request>{};
  requests.emplace_back(groupby::aggregation_request{});
  auto& agg_request  = requests.front();
  agg_request.values = vals;
  agg_request.aggregations.push_back(cudf::make_sum_aggregation());
  auto gby = groupby::groupby{table_view{{lists}}, null_policy::EXCLUDE, sorted::NO, {}, {}};
  auto result = gby.aggregate(requests);

  std::cout << "Actual output (grouped) lists keys: " << std::endl;
  print(result.first->view().column(0));
}

@mythrocks mythrocks changed the title [BUG] groupby on STRUCT and LIST input passes silently, and returns incorrect grouping. [BUG] groupby on STRUCT / LIST input passes silently returns incorrect grouping. Jul 29, 2021
@mythrocks mythrocks changed the title [BUG] groupby on STRUCT / LIST input passes silently returns incorrect grouping. [BUG] groupby on STRUCT / LIST input passes silently, returns incorrect grouping. Jul 29, 2021
@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Aug 23, 2021
@mythrocks
Copy link
Contributor Author

Update: groupby::aggregate() works with STRUCT grouping columns, as evidenced by the structs_tests.cpp added in #9024.

The remaining sticking point is grouping on LISTS.

@mythrocks
Copy link
Contributor Author

The remaining sticking point is grouping on LISTS.

With a little bit of digging, one sees that groupby's sort-based aggregations fail on LISTS, with the following error:

unknown file: Failure
C++ exception with description "cuDF failure at: /home/mithunr/workspace/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:362: Attempted to compare elements of uncomparable types." thrown in the test body.

However, hash-based aggregations still pass silently, except in debug-builds:

/home/mithunr/workspace/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:447: unsigned int cudf:
:element_hasher_with_seed<hash_function, has_nulls>::operator()(cudf::column_device_view, signed in
t) const [with T = cudf::list_view; void *<anonymous> = (void *)nullptr; hash_function = default_ha
sh; __nv_bool has_nulls = false]: block: [0,0,0], thread: [0,0,0] Assertion `false && "Unsupported
type in hash."` failed.

It only fails when it attempts to construct a hash value for a LISTS row.

Rather than depend on the benevolence of flatten_nested_columns() or the hash functions throwing an unsupported exception, it would be preferable to have groupby bail early on LISTS input.

mythrocks added a commit to mythrocks/cudf that referenced this issue Sep 14, 2021
Fixes rapidsai#8905.

Attempting groupby aggregations with LIST keys leads to silent
failures and bad results.
For instance, attempting hash-based groupby aggregations with LIST
keys only fails on DEBUG builds, thus:
```
/home/myth/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:447: unsigned int cudf:
:element_hasher_with_seed<hash_function, has_nulls>::operator()(cudf::column_device_view, signed in
t) const [with T = cudf::list_view; void *<anonymous> = (void *)nullptr; hash_function = default_ha
sh; __nv_bool has_nulls = false]: block: [0,0,0], thread: [0,0,0] Assertion `false && "Unsupported
type in hash."` failed.
```
In RELEASE builds, a copy of the input LIST column is returned, causing
each output row to be interpreted as its own group.

This commit adds an explicit failure for unsupported LIST groupby keys.
@mythrocks
Copy link
Contributor Author

hash-based aggregations still pass silently

After #9024, flatten_nested_columns() does throw a visible error, preventing grouping by LIST columns. It would still be preferable to make the changes proposed in #9227:

  1. There are tests now, for groupby aggregations with LIST keys.
  2. The error is now explicitly a grouping failure.

rapids-bot bot pushed a commit that referenced this issue Sep 20, 2021
Fixes #8905.

Attempting groupby aggregations with `LIST` keys leads to silent
failures and bad results.
For instance, attempting hash-based `groupby` aggregations with `LIST`
keys only fails on DEBUG builds, thus:
```
/home/myth/dev/cudf/2/cpp/include/cudf/table/row_operators.cuh:447: unsigned int cudf:
:element_hasher_with_seed<hash_function, has_nulls>::operator()(cudf::column_device_view, signed in
t) const [with T = cudf::list_view; void *<anonymous> = (void *)nullptr; hash_function = default_ha
sh; __nv_bool has_nulls = false]: block: [0,0,0], thread: [0,0,0] Assertion `false && "Unsupported
type in hash."` failed.
```
In RELEASE builds, a copy of the input `LIST` column is returned, causing
each output row to be interpreted as its own group.

This commit adds an explicit failure for unsupported groupby key types,
i.e. those that don't support equality comparisons (like `LIST`).

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9227
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants