Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nan_as_null not being respected when passing cudf object #14687

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 11 additions & 13 deletions python/cudf/cudf/core/column/column.py
Original file line number Diff line number Diff line change
Expand Up @@ -1866,7 +1866,7 @@ def as_memoryview(arbitrary: Any) -> Optional[memoryview]:

def as_column(
arbitrary: Any,
nan_as_null: Optional[bool] = None,
nan_as_null: Optional[bool] = False,
dtype: Optional[Dtype] = None,
length: Optional[int] = None,
):
Expand Down Expand Up @@ -1918,20 +1918,18 @@ def as_column(
if dtype is not None:
column = column.astype(dtype)
return column
elif isinstance(arbitrary, ColumnBase):
if dtype is not None:
return arbitrary.astype(dtype)
elif isinstance(arbitrary, (ColumnBase, cudf.Series, cudf.BaseIndex)):
if isinstance(arbitrary, cudf.Series):
column = arbitrary._column
elif isinstance(arbitrary, cudf.BaseIndex):
column = arbitrary._values
else:
return arbitrary
elif isinstance(arbitrary, cudf.Series):
data = arbitrary._column
if dtype is not None:
data = data.astype(dtype)
elif isinstance(arbitrary, cudf.BaseIndex):
data = arbitrary._values
column = arbitrary
if column.dtype.kind == "f" and (nan_as_null is None or nan_as_null):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I tend to prefer some parameter handling at the start of the function, such as:

if nan_as_null is None:
    nan_as_null = True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke - The CI failure is a result of us hitting this check when we call compute() in the dask-cudf test.

When a dask collection is "computed", the partitions need to be concatenated together. It seems that calling cudf.concat uses the cudf.Series._concat, which in turn calls this as_column function with a default nan_as_null=True setting. This means that dask-cudf correctly uses the user-provided nan_as_null=False option when each partition is converted from pandas to cudf, but then the nans are lost when the partitions are concatenated together in compute.

One possible fix may be to modify this line to something like:

 return cls(data=col, index=index, name=name, nan_as_null=False)

However, I'm not immediately sure if there are other corner cases that this doesn't catch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I suppose the best way to test this outside of dask is to add a cudf.concat test that checks that existing nan values are preserved. In the case that there are both nan and null values, then we would probably need to raise an error or "promote" everything to null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insight! The Series._concat piece is what I was missing. I am going to see what breaks if nan_as_null is set to False. In theory, I would imagine nans and nulls should not be mangled together when merged together

column = column.nans_to_nulls()
if dtype is not None:
data = data.astype(dtype)

column = column.astype(dtype)
return column
elif hasattr(arbitrary, "__cuda_array_interface__"):
desc = arbitrary.__cuda_array_interface__
shape = desc["shape"]
Expand Down
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/reshape.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2018-2023, NVIDIA CORPORATION.
# Copyright (c) 2018-2024, NVIDIA CORPORATION.

import itertools
import warnings
Expand Down Expand Up @@ -770,7 +770,7 @@ def get_dummies(
result_data.update(col_enc_data)
return cudf.DataFrame._from_data(result_data, index=df._index)
else:
ser = cudf.Series(df)
ser = cudf.Series(df, nan_as_null=False)
unique = _get_unique(column=ser._column, dummy_na=dummy_na)
data = _one_hot_encode_column(
column=ser._column,
Expand Down
2 changes: 1 addition & 1 deletion python/cudf/cudf/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -1705,7 +1705,7 @@ def _concat(cls, objs, axis=0, index=True):
if len(objs):
col = col._with_type_metadata(objs[0].dtype)

return cls(data=col, index=index, name=name)
return cls._from_data({name: col}, index=index)

@property # type: ignore
@_cudf_nvtx_annotate
Expand Down
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_dlpack.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2019-2022, NVIDIA CORPORATION.
# Copyright (c) 2019-2023, NVIDIA CORPORATION.

import itertools
from contextlib import ExitStack as does_not_raise
Expand Down
8 changes: 8 additions & 0 deletions python/cudf/cudf/tests/test_series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2616,6 +2616,14 @@ def test_series_error_nan_non_float_dtypes():
s[0] = np.nan


@pytest.mark.parametrize("klass", [cudf.Index, cudf.Series])
def test_nan_as_null_from_cudf_objects(klass):
data = klass(pa.array([float("nan")]))
result = klass(data, nan_as_null=True)
expected = klass(pa.array([None], type=pa.float64()))
assert_eq(result, expected)


@pytest.mark.parametrize(
"dtype",
[
Expand Down
Loading