Allow to select index in drop_duplicates and duplicated #9708

Closed
flying-sheep opened this issue Mar 23, 2015 · 8 comments

Comments

@flying-sheep
Contributor

flying-sheep commented Mar 23, 2015

there’s no way to drop rows with a duplicated index using drop_duplicates.

we’d have to add a copy of the index as a column, or do this:

df[np.logical_not(df.index.duplicated(take_last=True).values)]
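In later pandas releases the `take_last` argument was replaced by `keep`, and the boolean mask can be negated with `~`. A minimal sketch of the same workaround, assuming a recent pandas version and a small made-up frame:

```python
import pandas as pd

# Illustrative frame with a repeated index label (made up for this sketch).
df = pd.DataFrame({"A": range(4)}, index=list("aabb"))

# Index.duplicated(keep="last") marks every occurrence except the last one,
# so negating the mask keeps only the last row for each index label.
deduped = df[~df.index.duplicated(keep="last")]
print(deduped)
```

Passing `keep="first"` instead would retain the first row per label.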
@TomAugspurger
Contributor

Typically I'll use a df.groupby(level=0).last() (or more typically .first()). It works fine, but a groupby isn't necessarily the first thought for deduplication.

I'm +0 on whether we should have a dedicated method for this.
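For reference, the groupby approach mentioned above might look like the following sketch. The frame is made up for illustration; note that `first()` and `last()` pick the first or last non-null value per column, so with missing data the result can differ from dropping whole rows.

```python
import pandas as pd

# Illustrative frame whose index "B" contains repeated labels.
df = pd.DataFrame(
    {"A": [0, 1, 2, 3]},
    index=pd.Index(list("aabb"), name="B"),
)

# Group on the (single) index level and take one value per group.
first_rows = df.groupby(level=0).first()
last_rows = df.groupby(level=0).last()

print(first_rows)
print(last_rows)
```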

@jreback
Contributor

jreback commented Mar 23, 2015

As @TomAugspurger indicates, the following are equivalent.

I suppose the drop_duplicates section could have this as an alternative example. If you would like to submit a pull request for a doc update, that would be ok.

In [6]: df = pd.DataFrame({'A' : range(4), 'B' : list('aabb')})

In [7]: df
Out[7]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b

In [9]: df2 = df.set_index('B')

In [10]: df2
Out[10]:
   A
B
a  0
a  1
b  2
b  3

In [13]: df2.groupby(level=0).first()
Out[13]:
   A
B
a  0
b  2

In [16]: df2.reset_index().drop_duplicates(subset='B',take_last=False).set_index('B')
Out[16]:
   A
B
a  0
b  2
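The session above uses `take_last=False`, which later pandas versions spell `keep='first'`. A sketch of the same comparison as a standalone script, assuming a recent pandas release:

```python
import pandas as pd

df = pd.DataFrame({"A": range(4), "B": list("aabb")})
df2 = df.set_index("B")

# Approach 1: group on the index level and take the first row of each group.
via_groupby = df2.groupby(level=0).first()

# Approach 2: round-trip through a column so drop_duplicates can see the labels.
via_drop = (
    df2.reset_index()
    .drop_duplicates(subset="B", keep="first")
    .set_index("B")
)

print(via_groupby)
print(via_drop)
```

On this data, which has no missing values, both approaches keep the rows with `A` equal to 0 and 2.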

jreback added this to the Next Major Release milestone Mar 23, 2015
@flying-sheep
Contributor Author

sorry, i don’t get it. you mean i should add the second code block as an example to the docs?

@jreback
Contributor

jreback commented Mar 23, 2015

I would add the groupby method as an alternative, as it is another common way of performing this task.

@flying-sheep
Contributor Author

to which file? indexing.rst?

@jreback
Contributor

jreback commented Mar 23, 2015

@zydariv

zydariv commented Mar 10, 2020

What is the problem with just adding this functionality?

df2.reset_index().drop_duplicates(subset='B',take_last=False).set_index('B')
does not look very clean to me.

df2.drop_duplicates(subset='index', take_last=False)
would look much cleaner, and the reset_index() and set_index() calls could be moved inside drop_duplicates().

cheers
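Until something like that exists, the request can be approximated in user code. The sketch below uses a hypothetical helper named `drop_duplicated_index`; it is not part of pandas and only wraps the index-mask workaround from earlier in the thread.

```python
import pandas as pd


def drop_duplicated_index(frame: pd.DataFrame, keep="first") -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): drop rows whose index label repeats.

    `keep` is forwarded to Index.duplicated: "first", "last", or False.
    """
    return frame[~frame.index.duplicated(keep=keep)]


df2 = pd.DataFrame({"A": range(4)}, index=pd.Index(list("aabb"), name="B"))

kept_first = drop_duplicated_index(df2, keep="first")
kept_none = drop_duplicated_index(df2, keep=False)  # drops every repeated label

print(kept_first)
print(kept_none)
```

This avoids the `reset_index()` / `set_index()` round trip and leaves the index name untouched.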

@jreback
Contributor

jreback commented Mar 11, 2020

something like this was already added: #30405
