Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join result takes the index order of the other (right) DataFrame instead of the calling's (left) one #15582

Closed
albertvillanova opened this issue Mar 5, 2017 · 11 comments
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@albertvillanova
Copy link
Contributor

albertvillanova commented Mar 5, 2017

Code Sample, a copy-pastable example if possible

# Your code here
df1 = pd.DataFrame({'a': [0, 10, 20]})
df2 = pd.DataFrame({'b': [200, 100]}, index=[2,1])

print(df1.join(df2, how='inner'))
print(df2.join(df1, how='inner'))

print(df1.join(df2, how='inner', sort=True))

Problem description

Contrary to what is stated in the documentation of DataFrame.join(), when using the default sort=False, the return DataFrame preserves the index order of the other (right) DataFrame, instead of the index order of the calling (left) DataFrame.

Besides, the sort=True argument does not work.

Expected Output

The expected output is that the return DataFrame should preserve the index order of the calling (left) DataFrame.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 3.13.0-108-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: ca_ES.UTF-8 LOCALE: ca_ES.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0.post20161106
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

hmm, I think this needs a suite of tests. There are very few for sort=True (and only a single dtype anyhow). It appears you are correct that this violates the guarantee.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 5, 2017
@sleepdeprivation
Copy link

sleepdeprivation commented Mar 5, 2017

@jreback noob here, where to write these tests?

edit: found /tests , I guess I'm just looking for guidance ?

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

pandas/tests/tools/test_join.py (there is a single sorted test now)

@sleepdeprivation
Copy link

@jreback OK cool, thanks for the quick reply

@sleepdeprivation
Copy link

@jreback So if I just swap self and other in the call to merge in pandas/core/frame.py > _join_compat
, wouldn't that adequately solve the indexing problem?

And then I guess you want tests written for join to test the sort param?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2017

i think this might break a large amount of tests
see what happens

@sleepdeprivation
Copy link

Old:

>>> import pandas as pd
>>> # Your code here
... df1 = pd.DataFrame({'a': [0, 10, 20]})
>>> df2 = pd.DataFrame({'b': [200, 100]}, index=[2,1])
>>> 
>>> print(df1.join(df2, how='inner'))
    a    b
2  20  200
1  10  100
>>> print(df2.join(df1, how='inner'))
     b   a
1  100  10
2  200  20
>>> 
>>> print(df1.join(df2, how='inner', sort=True))
    a    b
2  20  200
1  10  100

New:

>>> # Your code here
... df1 = pd.DataFrame({'a': [0, 10, 20]})
>>> df2 = pd.DataFrame({'b': [200, 100]}, index=[2,1])
>>> 
>>> print(df1.join(df2, how='inner'))
DATA FRAME JOIN
     b   a
1  100  10
2  200  20
>>> print(df2.join(df1, how='inner'))
DATA FRAME JOIN
    a    b
2  20  200
1  10  100
>>> 
>>> print(df1.join(df2, how='inner', sort=True))
DATA FRAME JOIN
     b   a
1  100  10
2  200  20

my pytest all succeeds except these fail even with clean code

========================  test_traitsdoc.py  =========================
======================  test_phantom_import.py  ======================
======================  test_plot_directive.py  ======================
========================  test_docscrape.py  =========================
=========================  test_linkcode.py  =========================

They all fail with ImportError, and as I said fail even with clean code so I assume this has nothing to do with the modification...?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2017

note that this is only going to matter for `how='inner'. how='right'|'left'`` will take on the orderings of those respective indexes. ``outer`` I think we cannot guarantee an ordering.

In any event need tests for all of these.

@albertvillanova
Copy link
Contributor Author

The bug is not at the DataFrame level but at the Index level.

The swap of arguments in the call to merge proposed by @sleepdeprivation is going to generate a new bug with swapped lsuffix and rsuffix. Moreover the order of the columns in the result is also exchanged (first columns from the right DataFrame, then columns from the left one).

The real bug is at the Index level, concretely in the Index.intersection() method. The get_indexer must be called on the right index and with left index as argument, so that it returns a mask which transforms the caller index (right) into the passed index (left), and not the other way around. This is the desired output: a mask on the right index which picks its elements in such a way that it gets aligned with the left index.

I have implemented/updated some tests in my PR.

@jreback jreback added this to the Next Major Release milestone Mar 7, 2017
@albertvillanova
Copy link
Contributor Author

One question about the tests: I have implemented/updated some tests in files tests/indexes/test_base.py and tests/frame/test_misc_api.py (test_join_index), but as you said there are also tests on join in file tests/tools/test_join.py. I do not understand the logic in the directory structure of the tests.

@jreback
Copy link
Contributor

jreback commented Mar 7, 2017

I think it should be self-evident. you have tests on the indexes themselves, and the api on frames (which only is a dispatch from pd.merge), and the actual impl of merge/join in tools.

albertvillanova pushed a commit to albertvillanova/pandas that referenced this issue Mar 12, 2017
albertvillanova pushed a commit to albertvillanova/pandas that referenced this issue Mar 15, 2017
Issue pandas-dev#15582: fix the sort=True bug in DataFrame.join
jreback pushed a commit to albertvillanova/pandas that referenced this issue Mar 25, 2017
jreback pushed a commit to albertvillanova/pandas that referenced this issue Mar 25, 2017
Issue pandas-dev#15582: fix the sort=True bug in DataFrame.join
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 27, 2017
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017
closes pandas-dev#15582

Author: Albert Villanova del Moral <albert.villanova@gmail.com>
Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15583 from albertvillanova/fix-15582 and squashes the following commits:

2d4e143 [Albert Villanova del Moral] Fix pytest fixture name collision
64e86a4 [Albert Villanova del Moral] Fix test on right join
73df69e [Albert Villanova del Moral] Address requested changes
8d2e9cc [Albert Villanova del Moral] Address requested changes
968c7f1 [Jeff Reback] DOC/TST: change to use parameterization
9e39794 [Albert Villanova del Moral] Address requested changes
5bf1508 [Albert Villanova del Moral] Address requested changes
654288b [Albert Villanova del Moral] Fix Travis errors
33eb740 [Albert Villanova del Moral] Address requested changes
3c200fe [Albert Villanova del Moral] Add new tests
ef2581e [Albert Villanova del Moral] Fix Travis error
f0d9d03 [Albert Villanova del Moral] Add whatsnew
c96306d [Albert Villanova del Moral] Add sort argument to Index.join
047b513 [Albert Villanova del Moral] Address requested changes
ec836bd [Albert Villanova del Moral] Fix Travis errors
b977278 [Albert Villanova del Moral] Address requested changes
784fe75 [Albert Villanova del Moral] Fix error: line too long
1197b99 [Albert Villanova del Moral] Fix DataFrame column order when read from HDF file
d9e29f8 [Albert Villanova del Moral] Create new DatetimeIndex from the Index.intersection result
e7bcd28 [Albert Villanova del Moral] Fix typo in documentation
a4ead99 [Albert Villanova del Moral] Fix typo
c2a8dc3 [Albert Villanova del Moral] Implement tests
c12bb3f [Albert Villanova del Moral] BUG: Fix index order for Index.intersection()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants