REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) #37499

jorisvandenbossche · 2020-10-29T21:33:13Z


jreback

lgtm merge on green

bashtage · 2020-10-30T08:37:59Z

pandas/core/algorithms.py

@@ -441,7 +441,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
    if len(comps) > 1_000_000 and not is_object_dtype(comps):
        # If the the values include nan we need to check for nan explicitly
        # since np.nan it not equal to np.nan
-        if np.isnan(values).any():
+        if isna(values).any():
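A minimal sketch of why the one-line change matters, assuming `values` reaches this branch as an object-dtype array mixing strings and NaN (the array below is made up for illustration, not taken from the PR's test): `np.isnan` has no implementation for object arrays and raises `TypeError`, whereas `pd.isna` handles them.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed object-dtype input: a string plus a missing value.
values = np.array(["a", np.nan], dtype=object)

try:
    np.isnan(values)
    isnan_ok = True
except TypeError:
    # np.isnan is a ufunc with no loop for object dtype, so it raises here.
    isnan_ok = False

# pd.isna checks element-wise and works on object arrays.
mask = pd.isna(values)

print(isnan_ok)       # False
print(mask.tolist())  # [False, True]
```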


How bad is it if you skip the check and always use the logical_or? Is it possible that the isna check is more expensive than always doing the logical_or with isnan?

From a quick check with the simple example from the test I don't really see much difference.
But for backporting this, I would personally prefer to keep the change as minimal as possible. For master / 1.2, we should maybe re-evaluate this full branch, but see also #36611, which is basically already doing this (more drastically, by potentially removing the use of np.in1d entirely).

Note that the if clause here calls isna(values), i.e. it checks the values passed to the isin() method. That avoids calling isnan(comps) on the values of the Series when it is not needed, and values is typically much smaller than comps.
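To make the branch under discussion concrete, here is a rough sketch of the pattern, not the actual pandas implementation. It uses `np.isin` in place of `np.in1d` (an assumption made for compatibility with newer NumPy), and the arrays are tiny illustrative stand-ins for a Series with more than 1,000,000 elements.

```python
import numpy as np
import pandas as pd

# Stand-ins: comps plays the role of the (large) Series values,
# values the (typically small) collection passed to isin().
comps = np.array([1.0, 2.0, np.nan, 4.0])
values = np.array([2.0, np.nan])

# NaN never compares equal to NaN, so a plain membership test misses it.
plain = np.isin(comps, values)

# The cheap check runs on `values`; only when it contains NaN do we pay
# for a second pass over `comps` and combine the two masks.
if pd.isna(values).any():
    result = np.logical_or(np.isin(comps, values), np.isnan(comps))
else:
    result = np.isin(comps, values)

print(plain.tolist())   # [False, True, False, False]
print(result.tolist())  # [False, True, True, False]
```

Skipping the check and always applying the `logical_or` would give the wrong answer when `values` has no NaN, because every NaN in `comps` would then be reported as a match.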

jorisvandenbossche · 2020-10-30T10:50:32Z

@simonjayhawkins ok for you?

simonjayhawkins

Thanks @jorisvandenbossche lgtm

simonjayhawkins · 2020-10-30T11:02:59Z

@meeseeksdev backport 1.1.x

REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv)

a62e4c0

jorisvandenbossche added the Regression, IO CSV, and Algos labels Oct 29, 2020

jorisvandenbossche added this to the 1.1.4 milestone Oct 29, 2020

jorisvandenbossche mentioned this pull request Oct 29, 2020

RLS: 1.1.4 #37397

Closed

add whatsnew

d516b43

jreback approved these changes Oct 29, 2020

jorisvandenbossche added 2 commits October 29, 2020 22:51

also add csv test

520cf10

use pd.isna instead of np.isnan

93c79cc

bashtage reviewed Oct 30, 2020

simonjayhawkins approved these changes Oct 30, 2020

simonjayhawkins merged commit 4582c1c into pandas-dev:master Oct 30, 2020

jorisvandenbossche deleted the fix-read_csv-isin-object-large branch October 30, 2020 10:52

meeseeksmachine mentioned this pull request Oct 30, 2020

Backport PR #37499 on branch 1.1.x (REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv)) #37517

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 30, 2020

Backport PR pandas-dev#37499: REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv)

2b072fb

simonjayhawkins pushed a commit that referenced this pull request Oct 30, 2020

Backport PR #37499: REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) (#37517) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

b317a48

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) (pandas-dev#37499)

630efd6

ukarroum pushed a commit to ukarroum/pandas that referenced this pull request Nov 2, 2020

REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) (pandas-dev#37499)

1214754
