Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder tests in maybe_downcast_numeric #55825

Merged
merged 7 commits into from
Nov 22, 2023

Conversation

MichaelTiemannOSC
Copy link
Contributor

The comment # if we have any nulls, then we are done is not consistent with the test if isna(arr).any() because arr is constructed only from the first element (r[0]) not the full ravel'd list of values. Moreover, calling np.array() on some random type can have surprising consequences.

So instead, do the early-out test as intended, just using r[0] without going through np.array(). Then test other things about r[0]. Only then should we test all the values (and if we have any nulls, then we are done).

See #55824

  • [55824] closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values.  Moreover, calling `np.array()` on some random type can have surprising consequences.

So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`.  Then test other things about `r[0]`.  Only then should we test all the values (and if we have any nulls, then we are done).

See pandas-dev#55824

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as ready for review November 6, 2023 09:31
@MichaelTiemannOSC
Copy link
Contributor Author

While there is a conversation going on in #55824, these changes introduce no new functionality nor user-visible changes (other than tolerating Pint's behavior). All these changes do is remove a dependency on the behavior of np.array() that can be accomplished as well by other means, and to make the implementation match the comments regarding the results of encountering NaN/NA values in the values to be possibly downcasted.

@WillAyd
Copy link
Member

WillAyd commented Nov 6, 2023

Cool thanks for the PR. Can you add a test and a whatsnew note?

@MichaelTiemannOSC
Copy link
Contributor Author

Would be happy to, but...the test case would require loading Pint. Last time I tried to do some test cases with alien packages I got back a big "No thanks!" from the pandas maintainers. Also, another PR I submitted (enabling many complex test cases) rejected the WhatsNew changes I wrote because there were no user-visible changes. Would you like to guide me further?

Comment on lines 362 to 363
if isna(r[0]):
# do a test on the first element, if it fails then we are done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test on the first element goes back to dc73315#diff-fb8a9c322624b0777f3ff7e3ef8320d746b15a2a0b80b7cab3dfbe2e12e06daa in core.common. It appears to me it is no longer necessary.

Copy link
Contributor Author

@MichaelTiemannOSC MichaelTiemannOSC Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I constructed a simple test case and I think I understand what the earlier code was trying to do, which was to ravel only the first element of result. Consider:

import numpy as	np
import pandas as pd

aa = np.array([1.0, np.nan, 2.0])
bb = np.array([aa, aa])

print(aa, pd.isna(aa))
# [ 1. nan  2.] [False  True False]
print(bb, pd.isna(bb))
#  [[ 1. nan  2.] [ 1. nan  2.]] [[False  True False] [False  True False]]
print(pd.isna(bb).any())
# True

The fact that pd.isna(bb).any() sees the nulls inside the sub-arrays means we don't need to ravel the whole thing to look for nulls. The reason to ravel, instead, is to look at the type of the elements, and we only need to ravel the first element to look into that. (We have already tested that there are elements to ravel.)
But...decimal.Decimal doesn't have a ravel, which explains perhaps why the original code tried to wrap just the first element back into an array to ravel that.

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
If the first element of `result` is an array, ravel that to get element we will test.  Otherwise use it as is.

We only need to check whether `result` is all non-null once.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't use deprecated array indexing on ExtensionArrays.  We need to now us `iloc`.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

elif not isinstance(r[0], (np.integer, np.floating, int, float, bool)):
if isinstance(result, np.ndarray):
element = result[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would result.item(0) work here? might avoid potential copies from ravel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion here turns out to cause a different problem bc .item(0) is wrong for dt64 ndarrays

result = np.arange(5).view("M8[ns]")
>>> result.item(0)
0

When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way.  We can also eliminates an additional conditional check.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@MichaelTiemannOSC
Copy link
Contributor Author

So to wrap this up...I don't think it's easy to add a test case because that would involve installing Pint, which might not be very CI/CD-friendly. If there's a good way to do that, please point me to a pattern I can follow. It's easy for me to describe the change in CHANGES, but its not really user-visible. But because there is an issue (though no test case), I could write it up anyway. Thoughts?

@rhshadrach
Copy link
Member

rhshadrach commented Nov 10, 2023

@WillAyd

Can you add a test and a whatsnew note?

With the current implementation in this PR, I wouldn't expect any behavior changes. You good here?

It does change for Pint because there np.array([e]) will raise for certain Pint objects e, but I don't think we should be testing for this.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 22, 2023
@mroeschke mroeschke added this to the 2.2 milestone Nov 22, 2023
@mroeschke mroeschke merged commit 4ee86a7 into pandas-dev:main Nov 22, 2023
44 of 46 checks passed
@mroeschke
Copy link
Member

Thanks @MichaelTiemannOSC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants