PERF: Improve performance for df.duplicated with one column subset #45534

phofl · 2022-01-21T17:03:13Z

closes PERF: DataFrame.duplicated with subset= for 1 column is slower than Series.duplicated #45236
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

%timeit df["a"].duplicated()
3.16 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.duplicated(subset=['a'])
3.1 ms ± 75.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
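
The `df` used for these timings is not shown. A comparison along the same lines could be set up as in this minimal sketch, where the frame size, the column contents, and the repeat count are all assumptions chosen for illustration:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed benchmark frame: the PR description does not show how df was built.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.integers(0, 1_000, size=100_000),
        "b": rng.integers(0, 1_000, size=100_000),
    }
)

# Time the Series method and the DataFrame method on the same single column.
series_time = timeit.timeit(lambda: df["a"].duplicated(), number=20)
frame_time = timeit.timeit(lambda: df.duplicated(subset=["a"]), number=20)

print(f"Series.duplicated:            {series_time / 20 * 1e3:.2f} ms per call")
print(f"DataFrame.duplicated(subset): {frame_time / 20 * 1e3:.2f} ms per call")
```

Absolute numbers will vary with hardware and pandas version; only the ratio between the two calls is of interest.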

The Index creation and the get_group_index were both at fault for the slowdown. Removing the Index call should improve the performance for longer subsets too.
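
The idea can be sketched outside of pandas internals roughly as follows. This is an illustration and not the actual diff: `duplicated_with_fast_path` is a hypothetical helper, and the `columns.is_unique` guard is an assumption. Resetting the name mirrors the "Set Name to None" commit in this PR.

```python
import pandas as pd


def duplicated_with_fast_path(df, subset, keep="first"):
    """Hypothetical helper illustrating a single-column fast path."""
    subset = list(subset)
    # Assumed guard: only take the shortcut when column labels are unique,
    # so that df[label] is guaranteed to return a Series.
    if len(subset) == 1 and df.columns.is_unique:
        # Delegate to Series.duplicated and skip the multi-column machinery.
        result = df[subset[0]].duplicated(keep=keep)
        # DataFrame.duplicated returns an unnamed Series, so drop the column name.
        result.name = None
        return result
    # General path: defer to pandas for multi-column subsets.
    return df.duplicated(subset=subset, keep=keep)


df = pd.DataFrame({"a": [1, 2, 1, 3, 2], "b": list("vwxyz")})
fast = duplicated_with_fast_path(df, ["a"])
reference = df.duplicated(subset=["a"])
print(fast.equals(reference))
```

For this integer column both paths flag the same rows; the later discussion in this thread concerns object dtype columns, where the two paths did not always agree.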

pandas/core/frame.py

jreback · 2022-01-21T22:22:09Z

thanks @phofl

simonjayhawkins · 2022-06-11T16:18:30Z

@phofl there are several cases where the changes may be viewed as breaking changes by users. We may need to expand the release note or revert this PR.

phofl · 2022-06-13T07:16:44Z

I'd rather expand the release note than revert, because the new behavior seems to be correct?

simonjayhawkins · 2022-06-13T08:25:01Z

It is now only "correct" where we have a single-column DataFrame (or a single-column subset); previously the DataFrame methods had bugs in every case. I think that an inconsistency between the single-column and multi-column cases is less desirable than the existing inconsistency between the Series and DataFrame methods.

Of course, this applies only to the few cases reported, and all of the reported bugs affect object dtype columns only.

So probably not necessary to revert.
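
One way to inspect the inconsistency described above is to run the three code paths side by side on an object column. The data below is an assumption loosely modelled on the `None` versus `np.nan` report (#21720), not taken from it, and the flag for the `np.nan` row may differ between paths and between pandas versions, so no expected output is given:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): an object column mixing None and np.nan,
# plus a constant second column so that a multi-column subset is possible.
df = pd.DataFrame(
    {
        "a": pd.Series([None, np.nan, None], dtype=object),
        "b": [1, 1, 1],
    }
)

via_series = df["a"].duplicated()                # Series method
single_subset = df.duplicated(subset=["a"])      # single-column subset
multi_subset = df.duplicated(subset=["a", "b"])  # multi-column subset

# Show the three results next to each other for comparison.
print(
    pd.DataFrame(
        {
            "Series": via_series,
            "subset=['a']": single_subset,
            "subset=['a', 'b']": multi_subset,
        }
    )
)
```

If the columns of the printed frame differ on the `np.nan` row, the installed pandas version exhibits the single-column versus multi-column inconsistency discussed here.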

PERF: Improve performance for df.duplicated with one column subset

2a8c961

phofl added the Performance (Memory or execution speed performance) and duplicated (duplicated, drop_duplicates) labels Jan 21, 2022

jbrockmendel reviewed Jan 21, 2022

pandas/core/frame.py (review thread resolved)

phofl added 2 commits January 21, 2022 21:11

Set Name to None

3cea888

Add comment

61b3a65

jreback added this to the 1.5 milestone Jan 21, 2022

jreback merged commit 235113e into pandas-dev:main Jan 21, 2022

phofl deleted the 45236 branch January 21, 2022 22:51

This was referenced Jun 10, 2022

df.duplicated and drop_duplicates raise TypeError with unhashable values. #12693

Open

BUG: df.duplicated treats None as np.nan in object columns #21720

Open

BUG: DataFrame.drop_duplicates confuses NULL bytes #34551

Open

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

PERF: Improve performance for df.duplicated with one column subset (p…

93dde85

…andas-dev#45534)

tehunter mentioned this pull request Sep 23, 2022

ENH/PERF: ExtensionArray should offer a duplicated function #48747

Closed

3 tasks
