API: ExtensionArray.isin treatment of missing values #42545

Closed
mzeitlin11 opened this issue Jul 15, 2021 · 2 comments
Labels
API - Consistency Internal Consistency of API/Behavior Duplicate Report Duplicate issue or pull request ExtensionArray Extending pandas with custom dtypes or arrays. isin isin method Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@mzeitlin11
Member

The current behavior singles out pd.NA:

In [3]: ser = pd.Series([pd.NA], dtype="Int64")

In [4]: ser.isin([pd.NA])
Out[4]:
0    True
dtype: boolean

In [5]: ser.isin([np.nan])
Out[5]:
0    False
dtype: boolean

StringArray.isin also follows this behavior, so it is not MaskedArray specific.

There was discussion about this being problematic, since _from_sequence treats other missing values exactly like pd.NA (#42473 (comment)). By that reasoning, the second call (isin([np.nan])) should also return True.
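To make the construction side of that argument concrete, here is a minimal sketch using the public pd.array constructor. It assumes pd.array goes through the same missing-value coercion as _from_sequence for the Int64 dtype, and it only shows construction, not isin:

```python
import numpy as np
import pandas as pd

# np.nan, None and pd.NA are all accepted as "missing" when building
# a masked Int64 array; each of them ends up as <NA> in the result.
arr = pd.array([1, np.nan, None, pd.NA], dtype="Int64")

# isna() reports which positions are masked as missing.
missing_mask = [bool(flag) for flag in arr.isna()]

print(arr)
print(missing_mask)
```

Since all three inputs become indistinguishable once stored, it is at least arguable that isin should not distinguish them in its values argument either.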

As a final option, both outputs could be pd.NA. In #38379 there was discussion of propagating pd.NA instead of True/False depending on the presence of missing values in the values argument. A nice description of that debate is here #38379 (comment).

@mzeitlin11 mzeitlin11 added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate ExtensionArray Extending pandas with custom dtypes or arrays. API - Consistency Internal Consistency of API/Behavior Needs Triage Issue that has not been reviewed by a pandas team member isin isin method labels Jul 15, 2021
@realead
Contributor

realead commented Jul 15, 2021

This behavior of isin was agreed upon before pd.NA existed: #22296.

The idea back then was that, to be consistent across all algorithms (isin, unique, mode, count and so on), the most robust approach is to build the desired equivalence relation directly into the khash tables/maps. For two elements a and b to be in the same equivalency class (i.e. to be equivalent), the following should hold:

  • a == b
  • hash(a) == hash(b)

The decision back then was:

  • all float-nans are equivalent (thus Python's default == and hash need to be overridden)
  • pd.NaT is its own equivalency class
  • None is its own equivalency class

It was the safer choice: it is easier to implement and has fewer performance hits. Preprocessing can still be used to ensure that e.g. None and np.nan are mapped to the same value, and there are probably scenarios where None and np.nan shouldn't be treated as the same.
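The decision described above can be sketched in plain Python. This is not the khash implementation. It is only an illustration of why the default == and hash are not enough for float NaNs; the names equivalence_key and isin_sketch are hypothetical:

```python
import math

nan_a = float("nan")
nan_b = float("nan")

# With Python's defaults, two distinct NaN objects are never equal,
# so a plain set lookup does not find one NaN via another.
default_equal = nan_a == nan_b          # False
default_lookup = nan_b in {nan_a}       # False

# Hypothetical canonical key: every float NaN maps to one shared sentinel,
# which gives all NaNs the same hash and makes them compare equal.
_NAN_KEY = object()


def equivalence_key(value):
    """Map a value to a representative of its equivalency class."""
    if isinstance(value, float) and math.isnan(value):
        return _NAN_KEY
    # None (and everything else) keeps its own identity/equality.
    return value


def isin_sketch(values, candidates):
    """Membership test under the equivalence relation defined above."""
    lookup = {equivalence_key(c) for c in candidates}
    return [equivalence_key(v) in lookup for v in values]


nan_result = isin_sketch([nan_a, None, 1.0], [nan_b])
none_result = isin_sketch([nan_a, None], [None])
```

Here all float NaNs fall into one class, while None stays in a class of its own, so looking up NaN does not match None and vice versa.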

This behavior was extrapolated to pd.NA, i.e. pd.NA became its own equivalency class.

The question is: what should the behavior of pd.NA be? Should it be in the same equivalency class as np.nan, or should it be in the same equivalency class as np.nan, pd.NaT and None, i.e. should all of these become a single equivalency class (which is quite a change and could alter behavior in subtle ways)?

Having pd.NA and np.nan in the same equivalency class is probably quite straightforward; putting everything into the same equivalency class is probably more work (and might have more impact on behavior and performance).
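To compare the options side by side, here is a rough pure-Python sketch. NA and NAT are hypothetical stand-ins for pd.NA and pd.NaT so that the snippet runs without pandas, and the three key functions are illustrative names, not existing pandas APIs:

```python
import math


class _MissingSentinel:
    """Hypothetical stand-in for a pandas missing-value singleton."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _MissingSentinel("<NA>")    # stand-in for pd.NA
NAT = _MissingSentinel("NaT")    # stand-in for pd.NaT
_MISSING = object()              # shared representative of a merged class


def _is_float_nan(value):
    return isinstance(value, float) and math.isnan(value)


def key_separate(value):
    # Status quo: float NaNs are merged; NA, NaT and None stay separate.
    return _MISSING if _is_float_nan(value) else value


def key_na_with_nan(value):
    # Option 1: NA joins the float-NaN class; NaT and None stay separate.
    return _MISSING if _is_float_nan(value) or value is NA else value


def key_all_missing(value):
    # Option 2: NaN, NA, NaT and None all form one equivalency class.
    merged = _is_float_nan(value) or value is NA or value is NAT or value is None
    return _MISSING if merged else value


def isin_sketch(values, candidates, key):
    lookup = {key(c) for c in candidates}
    return [key(v) in lookup for v in values]


nan = float("nan")
status_quo = isin_sketch([NA], [nan], key_separate)
option_1_nan = isin_sketch([NA], [nan], key_na_with_nan)
option_1_none = isin_sketch([NA], [None], key_na_with_nan)
option_2_none = isin_sketch([NA], [None], key_all_missing)
```

Only the key function changes between the variants, which is roughly why the first option looks like the smaller change: it adds one more value to an existing class instead of collapsing several classes.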

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2021
@mzeitlin11
Member Author

Closing as a duplicate in favor of #31990

@mzeitlin11 mzeitlin11 added the Duplicate Report Duplicate issue or pull request label Sep 5, 2021