BUG: df.duplicated treats None as np.nan in object columns #21720
Comments
May be related to the discussion in #20442 |
As far as I can tell, the difference is due to the call to
|
This is documented behaviour for
However, |
As also commented on the PR, there is a similar difference between
Factorize treats them all as identical (since it needs to substitute all missing values with -1), while unique treats them as separate values.
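The two treatments described here can be illustrated with a small pure-Python sketch. It is not pandas' implementation: the helper names and the sample values are assumptions chosen for illustration. One function keeps None and NaN as separate values (the "unique"-style treatment), the other maps every missing value to the code -1 before comparing (the "factorize"-style treatment).

```python
import math


def _is_nan(value):
    # True only for float NaN; None is handled separately.
    return isinstance(value, float) and math.isnan(value)


def duplicated_distinct_missing(values):
    """Flag repeats while keeping None and NaN as two different values."""
    seen = set()
    flags = []
    for value in values:
        if value is None:
            key = ("none",)
        elif _is_nan(value):
            key = ("nan",)  # all NaNs compare equal to each other here
        else:
            key = ("value", value)
        flags.append(key in seen)
        seen.add(key)
    return flags


def duplicated_merged_missing(values):
    """Flag repeats after mapping every missing value to the code -1."""
    codes = {}
    seen = set()
    flags = []
    for value in values:
        if value is None or _is_nan(value):
            code = -1  # None and NaN collapse into one sentinel
        else:
            code = codes.setdefault(value, len(codes))
        flags.append(code in seen)
        seen.add(code)
    return flags


# Hypothetical sample data, not taken from the issue text.
sample = [float("nan"), 3, 3, None, float("nan")]
print(duplicated_distinct_missing(sample))  # [False, False, True, False, True]
print(duplicated_merged_missing(sample))    # [False, False, True, True, True]
```

Under these assumptions, the only position where the two approaches disagree is the None entry: it is a duplicate only when missing values are merged.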
|
The difference between (The discrepancy for |
Yes, it is true that for |
For the case in the OP, the DataFrame case is now consistent with the Series result. "Fixed" in commit [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534).
cc @phofl The issue remains for a DataFrame with more than one column or more than one subset.
s.duplicated()
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# dtype: bool
s.to_frame().duplicated()
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# dtype: bool
s.to_frame().assign(dup=lambda x: x[0]).duplicated()
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# dtype: bool
|
Found out while writing tests for .duplicated in #21645 (so far, .duplicated was almost exclusively tested implicitly through .drop_duplicates).
At first I thought this is intended behaviour for DataFrame.duplicated(), but Series.duplicated() does not treat it equally. This makes sense to me, since as objects, None is not np.nan - I therefore labelled this as a bug.