-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) #37499
REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) #37499
Conversation
…ing regression in read_csv)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm merge on green
@@ -441,7 +441,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |||
if len(comps) > 1_000_000 and not is_object_dtype(comps): | |||
# If the the values include nan we need to check for nan explicitly | |||
# since np.nan it not equal to np.nan | |||
if np.isnan(values).any(): | |||
if isna(values).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How bad is it if you skip the check and always use the logical_or? Is it possible that the isna is more expensive than the logical_or and isnan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a quick check with the simple example from the test I don't really see much difference.
But, for backporting this, I would personally prefer to keep the change as minimal as possible. For master / 1.2, we should maybe re-evaluate this full branch, but see also #36611 which is basically already doing this (but more drastically, by potentially removing the use of np.in1d entirely)
Note that in the if
clause here it is isna(values)
(so on the values passed to the isin()
method), which avoids isnan(comps)
(the values of the Series), where values
is typically much smaller than comps
.
@simonjayhawkins ok for you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche lgtm
@meeseeksdev backport 1.1.x |
…n and mixed object dtype (causing regression in read_csv)
…d object dtype (causing regression in read_csv) (#37517) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…ing regression in read_csv) (pandas-dev#37499)
…ing regression in read_csv) (pandas-dev#37499)
Closes #37094