pd.merge regression when doing a left-join with missing data on the right. Result has a Float64Index #28220
Comments
Yea, I think your expected output makes sense. If you'd like to debug and see where the regression was introduced, we'd certainly take a PR to fix it.
Looks like a bug was introduced with commit 41166c6 (which first introduces a failure to do the join correctly). The current incorrect behaviour was introduced with commit cc3b2f0. Both commits are from the same author. Do you think it would be a better use of time if I reached out to the author to better understand the two commits listed above and get some colour on a fix before digging in myself?
cc @JustinZhengBC if you could take a look.
Taking a deeper dive, I see that the original intent was to fix #24212, where missing values in the index got the last value because of fancy indexing (-1). To solve that problem, the index was expanded to include the NA value of the dtype before doing the fancy indexing. That has the unfortunate side effect of promoting the dtype if it's an integer index.

I'm not familiar with this part of the code, but it looks like the indexers returned by _get_join_indexers and their respective indexes should be the same shape. If that's the case, we can do an np.where to select the correct fill value from the respective result index (i.e. the left index if how='left', or the right index when how='right'). I think I can put together a PR (please let me know if my solution is not the most efficient) and some test cases.

I suspect my PR won't keep the exact output I listed in my description, but I think keeping the left index is also an acceptable solution in this case. However, it may be an interface change between 0.24 and 0.25. Just wanted to post an update before I go and run the rest of the test suite. I've gotten the test case in #24212 and this one to pass.
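Roughly, the np.where idea could look like the standalone sketch below. This is only an illustration of the approach described above, not the actual pandas internals; the function name is made up and it assumes the "other" indexer contains no -1 entries. For how='left', `index` would be the right index and `other_index` the left index.

```python
import numpy as np
import pandas as pd

def create_join_index(index, other_index, indexer, other_indexer):
    # Build the result index from `index` via fancy indexing, but wherever
    # the indexer is -1 (no match on that side), fall back to the label at
    # the same position in the other side's index instead of inserting an
    # NA value that would promote an integer index to float.
    mask = indexer == -1
    if not mask.any():
        return index.take(indexer)
    # Clip -1 to 0 so take() doesn't silently grab the last element,
    # then pick per-position between the two sides with np.where.
    taken = index.to_numpy().take(np.where(mask, 0, indexer))
    fallback = other_index.to_numpy().take(other_indexer)
    return pd.Index(np.where(mask, fallback, taken))
```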
I'm not sure it's possible to make everyone happy with my PR; it feels like #17257, #24212, and myself (although I'm pretty flexible one way or the other) want mutually exclusive index behaviours for the same operation. However, in the spirit of keeping consistency with a right/left join from SQL, what are everyone's thoughts about returning the join keys as the index in all four cases (inner, outer, left, and right)? The data would remain in the same range as the columns being joined (i.e. no unexpected values). This change causes the test cases expecting a RangeIndex to fail, so it's probably a pretty breaking change (although one could argue the current behaviour is also a quasi-breaking change). Let me know what you think.
Can you give an example of what changes?
This is an example of some of the breaking behaviour, based on the unit tests in test_merge.py:
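(The original code block did not survive the page capture; the following is an illustrative reconstruction of the kind of change being described, with made-up frame contents, rather than the actual test from test_merge.py.)

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

result = pd.merge(left, right, on="key", how="inner")

# Existing unit tests expect the merge result to carry a fresh RangeIndex:
#   RangeIndex(start=0, stop=2, step=1)
# With the join keys as the index, they would instead see:
#   Index(['a', 'b'], dtype='object', name='key')
print(result.index)
```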
This example would be my PR's solution to #17257:
This converts the resulting index to be in the set of keys expected by the domain of the join (i.e. years). If the user wishes to keep the index of left in the resulting join, I admit it's a bit more tedious, as one would have to duplicate the index as a column (thinking out loud, a possible mitigation may be to fold the index into the resulting df as "index" + lsuffix in this case). Although suboptimal, it may be an acceptable compromise compared to NA and a possible dtype promotion in the index.

This example would be my PR's solution to #24212:
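(Again an illustrative reconstruction rather than the original snippet; the frames are made up, but the behaviour follows the description of #24212 above.)

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "lval": [10, 20, 30]})
right = pd.DataFrame({"rval": ["a", "b"]}, index=[1, 2])

result = pd.merge(left, right, how="left", left_on="key", right_index=True)

# Before the #24212 fix, the unmatched row (key 3) picked up the last
# right-index value via -1 fancy indexing; after the fix it receives NaN
# and the index is promoted to Float64Index. With the join keys as the
# index, the result would stay Int64Index([1, 2, 3]) -- the user's domain.
print(result.index)
```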
Solves the left join taking the last key and keeps the index in the user's expected domain.

Finally, here's the sample output from different types of merges from some unit tests I'm putting together (for brevity, I've excluded parts of the cases as they would flow into the same code block; swapping left and right should be commutative, except for row ordering). Apologies for my edit, the test cases weren't quite correct (but the gist should be the same):
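(Illustrative sketch in place of the original output; the frames and the expected indexes in the comments are assumptions based on the keys-as-index proposal above.)

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "lval": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "rval": ["b", "c", "d"]})

for how in ("inner", "outer", "left", "right"):
    result = pd.merge(left, right, on="key", how=how)
    # Under the proposal the index would always be the join keys, e.g.
    #   inner -> Int64Index([2, 3])         left  -> Int64Index([1, 2, 3])
    #   outer -> Int64Index([1, 2, 3, 4])   right -> Int64Index([2, 3, 4])
    print(how, list(result["key"]))
```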
The PR would also work in the case of a MultiIndex:
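(Illustrative sketch with made-up frames; under the proposal, multiple join keys would produce a MultiIndex of those keys.)

```python
import pandas as pd

left = pd.DataFrame({"k1": [1, 1, 2], "k2": ["x", "y", "x"], "lval": [1, 2, 3]})
right = pd.DataFrame({"k1": [1, 2], "k2": ["x", "x"], "rval": [10, 20]})

result = pd.merge(left, right, on=["k1", "k2"], how="left")

# With multiple join keys, the proposed index would be a MultiIndex:
#   MultiIndex([(1, 'x'), (1, 'y'), (2, 'x')], names=['k1', 'k2'])
print(result[["k1", "k2"]])
```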
Sorry this reply has been long-winded; please let me know what you think.
I've tried a more conservative approach which hopefully works for enough people. My thought is that by adding an additional parameter to pd.merge (I've called it index_na_value, but I'm not wedded to the name) which, if specified, is used instead of na_value_for_dtype(index.dtype, compat=False), the user can either accept the current default behaviour of a dtype promotion or supply their own NA value specific to their domain, without breaking too much older code. I've updated the pull request.
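A usage sketch of the proposal (index_na_value is the name suggested above; it is not an existing pandas keyword, so the exact signature is an assumption):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "lval": [10, 20, 30]})
right = pd.DataFrame({"rval": ["a", "b"]}, index=[1, 2])

# Default: the unmatched row's index slot is filled with
# na_value_for_dtype(index.dtype, compat=False), i.e. NaN, which promotes
# the integer index to Float64Index.
default = pd.merge(left, right, how="left", left_on="key", right_index=True)

# Proposed: the caller supplies a domain-specific sentinel and the dtype
# is preserved, e.g. Int64Index([1, 2, -1]).  (Hypothetical call -- the
# keyword does not exist in released pandas.)
#   proposed = pd.merge(left, right, how="left", left_on="key",
#                       right_index=True, index_na_value=-1)
```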
Code Sample, a copy-pastable example if possible
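(The original copy-pastable snippet did not survive the page capture; the following is a minimal reproduction consistent with the issue title, with made-up frame contents.)

```python
import pandas as pd

# Left frame keyed by an integer column; the right frame is indexed by the
# same keys but is missing key 3, so the left join has an unmatched row.
left = pd.DataFrame({"key": [1, 2, 3], "lval": [10, 20, 30]})
right = pd.DataFrame({"rval": ["a", "b"]}, index=[1, 2])

result = pd.merge(left, right, how="left", left_on="key", right_index=True)

# pandas 0.24.x: result.index is an Int64Index
# pandas 0.25.x: result.index comes back as a Float64Index, because the
#                unmatched row is filled with NaN and the dtype is promoted
print(result.index)
```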
Problem description
I looked on the GitHub tracker for similar issues, but the closest I found was #24897. In previous versions of pandas this merge would return an Int64Index, but it now returns a Float64Index. I didn't see this behaviour documented in the release notes of 0.25, but please let me know if I've missed it. This bug is easy to reproduce in a virtualenv.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.utf-8
LANG : en_US.utf-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None