PERF: Improve performance for df.duplicated with one column subset #45534

Merged
merged 3 commits into from
Jan 21, 2022


phofl
Member

@phofl phofl commented Jan 21, 2022

%timeit df["a"].duplicated()
3.16 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.duplicated(subset=['a'])
3.1 ms ± 75.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The Index creation and the get_group_index were both at fault for the slowdown. Removing the Index call should improve the performance for longer subsets too.
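A minimal sketch of how the two timings above could be reproduced. The PR does not show how `df` was built, so the frame below (size, value range, column `b`) is an assumption, and absolute timings will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame; the construction of `df` is not shown in the PR.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame(
    {
        "a": rng.integers(0, 1_000, size=n_rows),
        "b": rng.integers(0, 1_000, size=n_rows),
    }
)

# The two code paths being compared in the PR description.
series_result = df["a"].duplicated()
frame_result = df.duplicated(subset=["a"])

# For a plain integer column both paths should flag the same rows.
same_flags = bool(np.array_equal(series_result.to_numpy(), frame_result.to_numpy()))

# Rough per-call timings, averaged over a small number of repetitions.
repeats = 20
series_time = timeit.timeit(lambda: df["a"].duplicated(), number=repeats) / repeats
frame_time = timeit.timeit(lambda: df.duplicated(subset=["a"]), number=repeats) / repeats

print(f"same flags: {same_flags}")
print(f"Series.duplicated:            {series_time * 1e3:.2f} ms")
print(f"DataFrame.duplicated(subset): {frame_time * 1e3:.2f} ms")
```

On a pandas version that includes this change, the two timings should be close to each other, as in the numbers quoted above.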

@phofl phofl added the Performance (Memory or execution speed performance) and duplicated (duplicated, drop_duplicates) labels Jan 21, 2022
@jreback jreback added this to the 1.5 milestone Jan 21, 2022
@jreback jreback merged commit 235113e into pandas-dev:main Jan 21, 2022
@jreback
Contributor

jreback commented Jan 21, 2022

thanks @phofl

@simonjayhawkins
Member

@phofl there are several cases where the changes may be viewed as breaking changes by users. We may need to expand the release note or revert this PR.

@phofl
Member Author

phofl commented Jun 13, 2022

I'd rather expand the release note than revert, because the new behavior seems to be correct?

@simonjayhawkins
Member

It is now only "correct" where we have a single-column DataFrame, and previously the DataFrame methods had bugs. I think that an inconsistency between a single-column DataFrame (or single-column subset) and the multi-column case is less desirable than the existing inconsistency between the Series and DataFrame methods.

Of course, this applies only to the few cases reported, and all the bugs reported apply to object-dtype columns only.

So probably not necessary to revert.
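As a rough illustration of the kind of consistency being discussed, the sketch below compares the Series path with the single-column-subset DataFrame path on an object-dtype column. The helper name `duplicated_paths_agree` and the sample data are hypothetical; the reported failing cases are not shown in this thread, so the data here is deliberately benign and does not reproduce them.

```python
import pandas as pd


def duplicated_paths_agree(frame: pd.DataFrame, column: str) -> bool:
    """Check whether Series.duplicated and DataFrame.duplicated(subset=[column])
    flag the same rows for the given column."""
    via_series = frame[column].duplicated().to_numpy()
    via_frame = frame.duplicated(subset=[column]).to_numpy()
    return bool((via_series == via_frame).all())


# Hypothetical object-dtype column with ordinary string values.
sample = pd.DataFrame(
    {
        "a": pd.Series(["x", "y", "x", "z", "y"], dtype=object),
        "b": [1, 2, 3, 4, 5],
    }
)

print(duplicated_paths_agree(sample, "a"))
```

A check like this could be run against the object-dtype inputs from the reported cases to see where the single-column and multi-column paths diverge.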

Successfully merging this pull request may close these issues.

PERF: DataFrame.duplicated with subset= for 1 column is slower than Series.duplicated