Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly set bitmask size in from_column_view #13315

Merged
merged 9 commits into from
May 17, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented May 9, 2023

Description

An empty struct column (dtype of StructDtype({})) has no children, and
hence a base_size of zero. However, it may still have a non-zero size
and non-empty null mask. When slicing such a column, the mask size
must be transferred over correctly by inspecting the size and offset
of the owning column. Previously, we incorrectly determined the sliced
column to have a mask buffer of zero bytes in this case.

Closes #13305.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

wence- added 2 commits May 9, 2023 13:04
An empty struct column (dtype of StructDtype({})) has no children, and
hence a base_size of zero. However, it may still have a non-zero size
and non-empty null mask. When slicing such a column, the mask size
must be transferred over correctly by inspecting the size and offset
of the owning column. Previously, we incorrectly determined the sliced
column to have a mask buffer of zero bytes in this case.

Closes rapidsai#13305.
@wence- wence- requested a review from a team as a code owner May 9, 2023 13:06
@wence- wence- requested review from vyasr and shwina May 9, 2023 13:06
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 9, 2023
@wence- wence- added bug Something isn't working non-breaking Non-breaking change labels May 9, 2023
Copy link
Contributor Author

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another self-review, please check my working here...

python/cudf/cudf/_lib/column.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/column.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/column.pyx Outdated Show resolved Hide resolved
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that in the future someone may use base_size when they should be using size + offset, and I really would like for those both to be equal always.

Would redefining the base_size of a struct column as follows fix this bug? (and would it break anything else?)

class StructColumn:

    ...

    @property
    def base_size(self):
        if not self.base_children:
            # this can happen if the column is empty
            # OR if the column is all nulls
            return self.null_count
        else:
            return len(self.base_children[0])

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

    @property
    def base_size(self):
        if not self.base_children:
            # this can happen if the column is empty
            # OR if the column is all nulls
            return self.null_count

This branch is wrong, it should be self.size + self.offset

        else:
            return len(self.base_children[0])

I think this branch is right.

So perhaps this would do the trick:

diff --git a/python/cudf/cudf/core/column/struct.py b/python/cudf/cudf/core/column/struct.py
index 6838d71164..f8ae4ab846 100644
--- a/python/cudf/cudf/core/column/struct.py
+++ b/python/cudf/cudf/core/column/struct.py
@@ -29,9 +29,11 @@ class StructColumn(ColumnBase):
     @property
     def base_size(self):
         if not self.base_children:
-            return 0
+            return self.size + self.offset
         else:
-            return len(self.base_children[0])
+            size = len(self.base_children[0])
+            assert size == self.size + self.offset, "Unpossible"
+            return size
 
     def to_arrow(self):
         children = [

def test_struct_empty_children_nulls_slice(indices):
values = [None, {}, {}, None]

s = cudf.Series([None, {}, {}, None])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse the input from the previous line if it’s equivalent.

Suggested change
s = cudf.Series([None, {}, {}, None])
s = cudf.Series(values)

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me! Do you need to cover test cases with non-contiguous indices for iloc or is it only affecting “proper slices” with step=1?

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

assert size == self.size + self.offset, "Unpossible"

Turns out this is regularly possible (if one slices a column, s.iloc[10:20] then size + offset is 20, but len(self.base_children[0]) is the length of the original s)

@shwina
Copy link
Contributor

shwina commented May 17, 2023

if one slices a column, s.iloc[10:20] then size + offset is 20, but len(self.base_children[0]) is the length of the original s

Hmm, that's right. Sorry, I got this very wrong.

That being said, would you agree that the base_size of an all-null struct column should be non-zero? Going back to:

   @property
   def base_size(self):
       if not self.base_children:
           # this can happen if the column is empty
           # OR if the column is all nulls
           return self.null_count

This branch is wrong, it should be self.size + self.offset

When does this branch return the wrong answer?

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

That being said, would you agree that the base_size of an all-null struct column should be non-zero? Going back to:

Yes.

   @property
   def base_size(self):
       if not self.base_children:
           # this can happen if the column is empty
           # OR if the column is all nulls
           return self.null_count

This branch is wrong, it should be self.size + self.offset

When does this branch return the wrong answer?

The column doesn't have to be all nulls:

In [1]: import cudf

In [2]: s = cudf.Series([{}, None, None, {}])

In [3]: s
Out[3]: 
0      {}
1    None
2    None
3      {}
dtype: struct

In [4]: s._column
Out[4]: 
<cudf.core.column.struct.StructColumn object at 0x7fc81813c3c0>
-- is_valid:
  [
    true,
    false,
    false,
    true
  ]
dtype: struct

In [5]: s._column.base_children
Out[5]: ()

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

This looks fine to me! Do you need to cover test cases with non-contiguous indices for iloc or is it only affecting “proper slices” with step=1?

It is only "proper slices" where the result of the slice shares data with the sliced column, so yes, only "proper" indices need covered.

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

It is only "proper slices" where the result of the slice shares data with the sliced column, so yes, only "proper" indices need covered.

I added coverage of non-stride-1 slices just to be sure.

@wence-
Copy link
Contributor Author

wence- commented May 17, 2023

/merge

@rapids-bot rapids-bot bot merged commit 8b7c210 into rapidsai:branch-23.06 May 17, 2023
@wence- wence- deleted the wence/fix/issue-13305 branch May 17, 2023 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
3 participants