ENH: adding .unique() to DF (or return_inverse for duplicated) #21357
Comments
Can you clarify what you are looking for with a small example? |
Here's a small example - I'm looking for the quantity
PS. As an unrelated remark, |
Still a pretty large example so I can't say I've read through every detail of it, but why don't you just reset the index on |
Well, I tried to be thorough with the example (twice)... Merging a couple of million records simultaneously on 6 object columns in the presence of NaNs might get to the result, but it's a huge waste of time because the indices are effectively just lying there in the penultimate step of |
Also, it's established functionality in numpy. It's often useful to be able to go back and forth easily between duplicated-deduplicated. |
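For readers unfamiliar with the numpy behaviour referred to here, a minimal sketch of the round trip that `return_inverse` gives on a plain numeric array (the array values are made up for illustration):

```python
import numpy as np

values = np.array([3, 1, 3, 2, 1])

# With return_inverse=True, np.unique returns a tuple instead of a single array.
uniques, inverse = np.unique(values, return_inverse=True)

# uniques is sorted; inverse[i] is the position of values[i] within uniques,
# so indexing uniques with inverse rebuilds the original array.
reconstructed = uniques[inverse]
```

Note that `uniques` comes back sorted rather than in order of first appearance.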
OK - PRs are always welcome if you have an idea on how to do it. Probably best served in |
@WillAyd PS. Can you replace the "Needs Info" tag with something more reflective of the situation? :) |
@WillAyd @jreback @TomAugspurger I'm interested in tackling this. Is there any precedent in pandas for changing the output signature of a function depending on whether a boolean is True/False (say, from Series to a tuple of Series)? This is how With these considerations, I'd suggest a new function along the lines of
With such a function we could then do (abstractly):
In terms of the example above (without the processing step):
|
-1 on adding any new methods, but would be ok with adding a |
@jreback Do I understand you correctly that you would output a tuple if In other words:
? |
yes |
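To make the idea discussed above concrete, here is a rough sketch of a function whose return type depends on a boolean flag. Everything in it is an assumption for illustration: `duplicated_with_inverse` is a hypothetical helper, not a pandas API, and the choice to express the inverse as the index label of the kept row is just one possible convention.

```python
import pandas as pd


def duplicated_with_inverse(df, subset, keep="first", return_inverse=False):
    """Hypothetical wrapper around DataFrame.duplicated (not part of pandas)."""
    dup = df.duplicated(subset=subset, keep=keep)
    if not return_inverse:
        # Same output as plain DataFrame.duplicated: a single boolean Series.
        return dup
    if keep not in ("first", "last"):
        # Assumption of this sketch: no inverse is defined for keep=False.
        raise ValueError("return_inverse=True requires keep='first' or 'last'")
    labels = pd.Series(df.index, index=df.index)
    keys = [df[col].to_numpy() for col in subset]
    # For every row, look up the index label of the first/last row of its group.
    # dropna=False (pandas >= 1.1) is meant to keep NaN keys as their own group.
    inverse = labels.groupby(keys, sort=False, dropna=False).transform(keep)
    return dup, inverse


df = pd.DataFrame({"a": ["x", "y", "x", "y"], "b": [1, 2, 1, 2]},
                  index=[10, 11, 12, 13])
dup, inverse = duplicated_with_inverse(df, ["a", "b"], return_inverse=True)
```

With `return_inverse=False` the caller gets a Series, with `return_inverse=True` a tuple of two Series, which mirrors how `np.unique` behaves.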
When deduplicating, I need to get the mapping from the deduplicated records back to the originals.
Unfortunately:

- `drop_duplicates` and `duplicated` provide no information about this mapping
- `np.unique` has a `return_inverse` parameter, but does not work with `object` data due to numpy/numpy#641 ("python3: regression for unique on dtype=object arrays with varying items types", Trac #2188)

Let's say I have a large DF (`df`) with several object columns (say, `sig`) that are the deduplication signature (the `subset` parameter of `duplicated`).

Then I have (here only for a subset of 100k records):
resulting in:
The performance gets worse with size, so for a few million records, I need to wait ~20min to get the inverse, whereas a pure `duplicated` call only takes around 10 sec.

This caused me to hunt down the inner workings of `duplicated`, to see where the inverse gets lost. Turns out that until almost the very end of `duplicated`, the indices are there:

pandas/pandas/core/frame.py
Line 4384 in 3147a86
I extracted and applied the guts of `duplicated`, and once I had `ids` for the whole set (~10sec), I have the following:

yielding
So, exposing a `return_inverse` argument (either in `.duplicated`, `.drop_duplicates`, or in a separate `.unique` function for DF) would accelerate my problem from 20min to around 11sec.

I would imagine the following signature:
If `return_inverse` were added to `drop_duplicates`, it would have to raise an error for the combination `keep=False, return_inverse=True`, because no inverse can be returned in this case.

I wouldn't know how to adapt the cython code for `duplicate_int64`, but since it's 5x slower than `np.unique` anyway, there's no point, IMHO. I would know how to call the numpy function at the appropriate point...

Edit: just saw that `np.unique` sorts, which makes the `'first'|'last'` thing a fair bit harder...
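A possible way around the sorting, sketched under stated assumptions: this is not the code from the issue and not how pandas computes its group ids internally. It builds one integer id per row with `pd.factorize` (a simple mixed-radix combination), calls `np.unique` with both `return_index` and `return_inverse`, and then renumbers the groups by first appearance so the result lines up with `keep='first'`. The name `unique_with_inverse` is hypothetical.

```python
import numpy as np
import pandas as pd


def unique_with_inverse(df, subset):
    """Hypothetical helper (not a pandas API); keep='first' semantics only."""
    ids = np.zeros(len(df), dtype=np.int64)
    for col in subset:
        codes, uniques = pd.factorize(df[col])
        # factorize marks missing values with -1; shift so NaN becomes level 0.
        codes = codes.astype(np.int64) + 1
        # Mixed-radix combination of the per-column codes. This can overflow
        # int64 when there are many high-cardinality columns.
        ids = ids * (len(uniques) + 1) + codes
    # np.unique sorts by id; first_idx holds the first occurrence of each id.
    _, first_idx, inverse = np.unique(ids, return_index=True, return_inverse=True)
    # Renumber the groups in order of first appearance instead of sorted order.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return df.iloc[first_idx[order]], rank[inverse]


df = pd.DataFrame({"a": ["x", "y", "x", np.nan, "y", np.nan],
                   "b": [1, 2, 1, 3, 2, 3]})
dedup, inverse = unique_with_inverse(df, ["a", "b"])
restored = dedup.iloc[inverse]  # one row per original record
```

Handling `keep='last'` would need the last occurrence per group, which `np.unique` does not return directly; one option would be to run the same steps on the reversed ids and map the positions back.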