
Changes to support Numpy >= 1.24 #13325

Merged (22 commits) on May 25, 2023

Conversation

@shwina (Contributor) commented May 10, 2023

Description

Closes #13301

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the "Python (Affects Python cuDF API)" label on May 10, 2023
@shwina added the "tech debt" and "non-breaking (Non-breaking change)" labels and removed the "Python (Affects Python cuDF API)" label on May 10, 2023
@github-actions bot added the "Python (Affects Python cuDF API)" label on May 10, 2023
@@ -150,8 +150,8 @@ def make_all_numeric_extremes_dataframe():
         np_type = pdf_dtypes[gdf_dtype]
         if np.issubdtype(np_type, np.integer):
             itype = np.iinfo(np_type)
-            extremes = [0, +1, -1, itype.min, itype.max]
-            df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
+            extremes = [itype.min, itype.max]
+            df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
shwina (Contributor, Author):

Need to change the comments at the beginning of these tests

Contributor:

Could you elaborate? Is this a task you want to accomplish in this PR?

Contributor:

I know that 0, +1, and -1 aren't extrema for integer types, but is there a reason you remove them from these tests? I suppose perhaps that np.uint8(-1) now raises OverflowError or something?

shwina (Contributor, Author):

np.uint8(-1) in particular raises a deprecation warning. But the reason I gave up on this test was that I couldn't figure out why this was happening:

In [2]: np.array([-1]).astype("uint64")
Out[2]: array([18446744073709551615], dtype=uint64)

In [3]: np.array([18446744073709551615]).astype("uint64")
Out[3]: array([18446744073709551615], dtype=uint64)

In [4]: np.array([-1, 18446744073709551615]).astype("uint64")
<ipython-input-4-03014ed268fc>:1: RuntimeWarning: invalid value encountered in cast
  np.array([-1, 18446744073709551615]).astype("uint64")
Out[4]: array([18446744073709551615,                    0], dtype=uint64)

I've gone ahead and filtered out that warning from this test.

Contributor:

> But the reason I gave up on this test was that I couldn't figure out why this was happening:

It's because np.array([-1, 2**64 - 1]).dtype == "float64", which is lossy.
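For concreteness, a short illustration of the lossiness under NumPy's value-based promotion (the rounding detail below is inferred from float64 precision, not stated elsewhere in this thread):

import numpy as np

arr = np.array([-1, 2**64 - 1])    # mixed-sign values force promotion to float64
print(arr.dtype)                   # float64
print(int(arr[1]))                 # 18446744073709551616, i.e. 2**64
print(int(arr[1]) == 2**64 - 1)    # False: 2**64 - 1 is not representable in float64

Casting that rounded-up 2**64 back to uint64 is then out of range, which presumably produces the "invalid value encountered in cast" warning and the stray 0 seen in Out[4] above.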

@shwina added the "improvement (Improvement / enhancement to an existing function)" label on May 10, 2023
@shwina shwina marked this pull request as ready for review May 11, 2023 21:49
@shwina shwina requested a review from a team as a code owner May 11, 2023 21:49
@@ -96,6 +97,7 @@

 _setup_numba_linker(_PTX_FILE)

+patch_numba_codegen_if_needed()
shwina (Contributor, Author):

This hack will go away

Contributor:

I think it can be removed from this PR itself. Now that we've verified numpy 1.24 support, I would recommend removing the numba-related changes in this PR that you're using in order to allow running numba 0.57 (which is necessary to use numpy 1.24). We'll still get it tested because our CUDA 12 wheel builds will patch us to use 0.57 anyway (but with CUDA 12 we don't use cubinlinker/ptxcompiler so we don't need any edits for those). Then when we bump our numba to 0.57 tests should pass thanks to this PR.

Contributor:

I have one other question on this PR; I would wait to make changes here until everything else is resolved, in case you need to run more tests locally.

@vyasr (Contributor) commented May 11, 2023

Can you remove the numpy upper bound pinnings in dependencies.yaml and the cudf meta.yaml? Then in the CUDA 12.0 CI let's make sure that at least one run installs numpy 1.24 and still passes tests.

@shwina shwina requested a review from a team as a code owner May 11, 2023 21:54
@github-actions github-actions bot added the conda label May 11, 2023
@vyasr vyasr requested a review from a team as a code owner May 11, 2023 22:33
@vyasr (Contributor) left a comment

Success! https://github.com/rapidsai/cudf/actions/runs/4953167580/jobs/8860521944?pr=13325#step:10:828 shows that numpy 1.24 is now getting installed into the 12.0.1 wheel build, and tests are passing


@@ -398,8 +398,8 @@ def test_column_view_string_slice(slc):
         cudf.core.column.as_column([], dtype="uint8"),
     ),
     (
-        cp.array([453], dtype="uint8"),
-        cudf.core.column.as_column([453], dtype="uint8"),
+        cp.array([255], dtype="uint8"),
Contributor:

I guess it doesn't matter what value we choose here? Just wondering if it's important to use 453 - 256 (i.e. the value 453 would wrap to).

shwina (Contributor, Author):

I don't think it matters. I use 255 just because it's `np.iinfo(uint8).max`.

@@ -194,6 +194,7 @@ def test_to_numeric_downcast_int(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
Contributor:

Instead of applying this to the whole test, can we just wrap the pd.to_numeric call? This doesn't affect the cudf.to_numeric call, does it?

Also, should we be handling the warning conditionally? i.e. I assume this happens when trying to downcast a signed to an unsigned type or something?
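A minimal sketch of that narrower approach (the series below is hypothetical; the real test's parametrized data and downcast argument would be used instead), suppressing the warning only around the pandas reference call:

import warnings

import pandas as pd

data = pd.Series([-1.5, 2.0 ** 53])  # hypothetical input

with warnings.catch_warnings():
    # Ignore only NumPy's cast warning while computing the pandas reference result
    warnings.filterwarnings(
        "ignore", "invalid value encountered in cast", RuntimeWarning
    )
    expected = pd.to_numeric(data, downcast="unsigned")

# The cudf.to_numeric call would run outside the filter, so its warnings still surface.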


@wence- (Contributor) left a comment

It seems like there are some strange things going on in the changes in the tests?

-            extremes = [0, +1, -1, itype.min, itype.max]
-            df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
+            extremes = [itype.min, itype.max]
+            df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
Contributor:

Any reason to change from 4 pairs of extrema to 10?


     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
Contributor:

All of the columns in this dataframe now have type int64, no? Since they are never downcast with astype.
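A small check of that observation (the types list here is a hypothetical stand-in for the fixture's list): np.random.randint returns the platform default integer dtype regardless of the column name suffix.

import numpy as np
import pandas as pd

types = ["uint8", "float32"]  # hypothetical, shorter than the fixture's list
nrows = 5
df = pd.DataFrame({f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types})
print(df.dtypes)  # both columns are int64 on most platforms (int32 on Windows)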

     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
Contributor:

This one in contrast is cast to the appropriate type.

python/cudf/cudf/tests/test_parquet.py (two resolved review threads)
@@ -125,7 +125,7 @@ def test_rank_error_arguments(pdf):
     np.full((3,), np.inf),
     np.full((3,), -np.inf),
 ]
-sort_dtype_args = [np.int32, np.int64, np.float32, np.float64]
+sort_dtype_args = [np.float32, np.float64]
Contributor:

This means we now don't run some tests with integer dtypes. Is it that they don't make sense any more?
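My guess (not stated in the thread) is that they no longer make sense because, under NumPy 1.24, casting ±inf into an integer dtype emits the same cast warning discussed above:

import warnings

import numpy as np

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.full((3,), np.inf).astype(np.int32)  # non-finite values have no integer equivalent
print([str(w.message) for w in caught])  # ['invalid value encountered in cast'] on NumPy 1.24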

    slr_device = cudf.Scalar(slr, dtype=dtype)

    if op.__name__ == "neg" and np.dtype(dtype).kind == "u":
        # TODO: what do we want to do here?
        return
Contributor:

Numpy is fine with this right? Right? Negation of unsigned integers is totally well-defined.

shwina (Contributor, Author):

Well:

In [2]: -np.uint16(1)
<ipython-input-2-5156426c8f88>:1: RuntimeWarning: overflow encountered in scalar negative
  -np.uint16(1)
Out[2]: 65535

Should we just go ahead and ignore that warning in this test? (I've resorted to doing that in most other cases)
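A sketch of what ignoring it could look like, as a hypothetical standalone test using the same filterwarnings pattern applied elsewhere in this PR:

import numpy as np
import pytest


@pytest.mark.filterwarnings("ignore:overflow encountered in scalar negative")
def test_neg_unsigned_scalar():
    # NumPy still wraps the result modulo 2**16; only the RuntimeWarning is silenced.
    assert -np.uint16(1) == np.uint16(65535)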

@@ -137,7 +137,7 @@ def query_compile(expr):
    key "args" is a sequence of name of the arguments.
    """

-    funcid = f"queryexpr_{np.uintp(hash(expr)):x}"
+    funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
Contributor:

This is incorrect: hash returns values in the semi-open interval [-2**63, 2**63), but abs folds this to the closed interval [0, 2**63] (so you alias just shy of 50% of the values). Instead, you want to shift, I suspect, and then you don't need numpy in the loop at all:

Suggested change:
-    funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
+    funcid = f"queryexpr_{hash(expr) + 2**63:x}"
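A quick illustration of the folding and of the proposed shift (the hash values below are made up, not produced by any particular expr):

h1, h2 = -12345, 12345                         # two distinct possible hash() outputs
assert f"{abs(h1):x}" == f"{abs(h2):x}"        # abs() folds them onto the same funcid
assert f"{h1 + 2**63:x}" != f"{h2 + 2**63:x}"  # the shift keeps them distinct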

That said, strings are hashable, so this seems like a weird way of constructing a cache key (it's somehow deliberately making it more likely that you get hash collisions and produce the wrong value).

I would have thought that this would do the trick:

import functools

from numba import cuda


@functools.cache
def query_compile(expr):
    name = "queryexpr"  # only looked up locally, so name collisions are harmless
    info = query_parser(expr)
    fn = query_builder(info, name)
    args = info["args"]
    devicefn = cuda.jit(device=True)(fn)  # cuda.jit, not cudf.jit
    kernel = _wrap_query_expr(f"kernel_{name}", devicefn, args)
    info["kernel"] = kernel
    return info

@shwina shwina self-assigned this May 18, 2023
@vyasr (Contributor) commented May 23, 2023

@shwina Is this PR waiting on a re-review from me and/or @wence- ?

@ajschmidt8 (Member) left a comment

Approving ops-codeowner file changes

@shwina (Contributor, Author) commented May 23, 2023

@vyasr just resolved the conflicts here. Yeah I think it could use a quick skim

@wence- (Contributor) left a comment

LGTM, thanks.

@shwina (Contributor, Author) commented May 25, 2023

/merge

Labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API)

Projects: Archived in project

Development: Successfully merging this pull request may close these issues: [BUG] Avoid out of bounds integers in tests

4 participants