REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) #37499

jorisvandenbossche
Member

Closes #37094

@jorisvandenbossche added the labels Regression (functionality that used to work in a prior pandas version), IO CSV (read_csv, to_csv), and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) on Oct 29, 2020
@jorisvandenbossche added this to the 1.1.4 milestone on Oct 29, 2020
@jorisvandenbossche mentioned this pull request on Oct 29, 2020
Contributor

@jreback left a comment

lgtm merge on green

@@ -441,7 +441,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
     if len(comps) > 1_000_000 and not is_object_dtype(comps):
         # If the the values include nan we need to check for nan explicitly
         # since np.nan it not equal to np.nan
-        if np.isnan(values).any():
+        if isna(values).any():
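
A minimal sketch of why this one-line change matters, assuming `values` arrives as an object-dtype array that mixes strings and NaN (the array contents below are illustrative and not taken from the PR's test). `np.isnan` is only defined for numeric dtypes and raises `TypeError` on such an array, whereas `pandas.isna` handles it elementwise.

```python
import numpy as np
import pandas as pd

# Illustrative mixed object-dtype values (an assumption, not the PR's test data).
values = np.array(["foo", np.nan], dtype=object)

try:
    np.isnan(values)
    isnan_raised = False
except TypeError:
    # np.isnan has no loop for object dtype, so it raises here.
    isnan_raised = True

# pd.isna works elementwise on object arrays.
mask = pd.isna(values)
has_nan = bool(mask.any())

print(isnan_raised, mask.tolist(), has_nan)
```
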
Contributor

How bad is it if you skip the check and always use the logical_or? Is it possible that the isna is more expensive than the logical_or and isnan?

Member Author

From a quick check with the simple example from the test I don't really see much difference.
But, for backporting this, I would personally prefer to keep the change as minimal as possible. For master / 1.2, we should maybe re-evaluate this full branch, but see also #36611 which is basically already doing this (but more drastically, by potentially removing the use of np.in1d entirely)

Note that the if clause here calls isna(values), i.e. on the values passed to the isin() method, which are typically much smaller than comps (the values of the Series). That check is what lets us skip isnan(comps) when values contains no NaN.
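
A simplified sketch of the trade-off discussed above, not the actual pandas implementation: `isin_large_sketch` is a hypothetical helper, and `np.isin` stands in for the `np.in1d` call mentioned in this thread. It assumes `comps` is numeric, as the `not is_object_dtype(comps)` guard in the diff implies.

```python
import numpy as np
import pandas as pd


def isin_large_sketch(comps: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Hypothetical, simplified stand-in for the large-array branch."""
    # Cheap check: scans only `values`, which is usually the short array.
    if pd.isna(values).any():
        # NaN != NaN, so plain membership misses NaN entries in comps;
        # OR in an explicit NaN mask over comps (the extra full-length pass).
        return np.logical_or(np.isin(comps, values), np.isnan(comps))
    # No NaN requested: skip the extra pass over comps.
    return np.isin(comps, values)


comps = np.array([1.0, 2.0, np.nan])
with_nan = isin_large_sketch(comps, np.array([1.0, np.nan]))
without_nan = isin_large_sketch(comps, np.array([2.0]))
print(with_nan.tolist())     # [True, False, True]
print(without_nan.tolist())  # [False, True, False]
```

Always taking the `logical_or` path, as the review question suggests, would give the same result here, at the price of one `np.isnan(comps)` pass even when `values` holds no NaN.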

@jorisvandenbossche
Member Author

@simonjayhawkins ok for you?

Member

@simonjayhawkins left a comment

Thanks @jorisvandenbossche lgtm

@simonjayhawkins merged commit 4582c1c into pandas-dev:master on Oct 30, 2020
@jorisvandenbossche deleted the fix-read_csv-isin-object-large branch on October 30, 2020 10:52
@simonjayhawkins
Member

@meeseeksdev backport 1.1.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 30, 2020
…n and mixed object dtype (causing regression in read_csv)
simonjayhawkins pushed a commit that referenced this pull request Oct 30, 2020
…d object dtype (causing regression in read_csv) (#37517)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
ukarroum pushed a commit to ukarroum/pandas that referenced this pull request Nov 2, 2020
Development

Successfully merging this pull request may close these issues.

BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype, and index_col are provided, and file has >1M rows
4 participants