BUG-24212 fix when other_index has incompatible dtype #25009

JustinZhengBC · 2019-01-29T19:32:15Z

closes REGR: re-evaluate merge fix of PR #24916 #25001
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Follow-up to #24916. This addresses the case where the other index has an incompatible dtype, so we cannot take values directly from it. Currently, this PR ~~naively replaces the missing index values with the number of the rows in the other index that caused them~~ replaces the missing index values with the appropriate NA value.

~~Still working on adding cases when it is possible to combine indices of sparse/categorical dtypes without densifying.~~
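
To make the described behaviour concrete, here is a minimal sketch of a merge where the right frame has more rows than the left one. The frame contents and variable names are made up for illustration, and the exact index dtype in the result may vary between pandas versions.

```python
import pandas as pd

# Left frame with a DatetimeIndex; only keys 0..2 exist here.
left = pd.DataFrame(
    {"a": [0, 1, 2], "key": [0, 1, 2]},
    index=pd.DatetimeIndex(["2001-01-01", "2002-02-02", "2003-03-03"]),
)
# Right frame has more rows than the left one.
right = pd.DataFrame({"b": [0, 1, 2, 3, 4, 5]})

# Join the left "key" column against the right frame's index.
result = left.merge(right, left_on="key", right_index=True, how="right")

# Rows that exist only in `right` have no label in the left index,
# so their index entries are expected to be missing (NaT here).
print(result)
```

The last three rows come only from `right`, so their index labels should be missing values rather than labels borrowed from another row.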

codecov · 2019-01-29T20:17:31Z

Codecov Report

Merging #25009 into master will decrease coverage by 49.49%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   92.38%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52401    52407       +6     
==========================================
- Hits        48409    22475   -25934     
- Misses       3992    29932   +25940

Flag	Coverage Δ
#multiple	`?`
#single	`42.88% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/merge.py	`9.43% <0%> (-85.05%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 124 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abf0824...0e6de81.

codecov · 2019-01-29T20:17:31Z

Codecov Report

Merging #25009 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.01%     
==========================================
  Files         175      175              
  Lines       52368    52365       -3     
==========================================
- Hits        48164    48157       -7     
- Misses       4204     4208       +4

Flag	Coverage Δ
#multiple	`90.52% <100%> (-0.01%)`	⬇️
#single	`40.7% <0%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/merge.py	`94.45% <100%> (-0.03%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9feb3ad...88cdf8b.

jreback · 2019-01-30T13:00:43Z

pandas/core/reshape/merge.py

+                    join_list[mask] = other_list[mask]
+                    join_index = Index(join_list, dtype=other_index.dtype,
+                                       name=other_index.name)
+                except ValueError:


we really don't want to do a try/except here. What is falling into the except?

JustinZhengBC · 2019-01-30

The except is hit when 'other_index' has a different dtype that causes an exception to be raised when values from it are inserted into the current index. I did not compare dtypes because some differing dtypes are still compatible (for example, an int can be added to a Float64Index).

jreback · 2019-01-30

can you:

always just do this (what you have in the except)

or use is_dtype_equal to test?

JustinZhengBC · 2019-01-30

I don't think is_dtype_equal would work because some combinations of different dtypes are still usable.

I agree with the first option because joining on the index of the other frame kind of makes the row order of the other frame arbitrary anyway.
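
The point about is_dtype_equal can be illustrated with a short sketch. It uses the public `pandas.api.types.is_dtype_equal` and plain `pd.Index` objects (not the internals touched by this PR); the variable names are made up for the example.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal

float_index = pd.Index([0.5, 1.5, 2.5])   # float64
int_index = pd.Index([1, 2, 3])           # int64
str_index = pd.Index(["a", "b", "c"])     # strings

# A strict dtype comparison rejects int vs float ...
dtypes_match = is_dtype_equal(int_index.dtype, float_index.dtype)

# ... even though integer values fit into a float index without trouble.
combined = pd.Index(list(int_index), dtype=float_index.dtype)

# Strings cannot be coerced to float, so this combination does fail.
try:
    pd.Index(list(str_index), dtype=float_index.dtype)
    strings_fit = True
except (TypeError, ValueError):
    strings_fit = False

print(dtypes_match, combined.dtype, strings_fit)
```

A dtype equality check would therefore treat the int/float case and the string/float case the same way, although only the second one is a real incompatibility.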

jreback

can you merge master and let's have a look at this

jreback · 2019-02-01T19:30:45Z

pandas/core/reshape/merge.py

-                                   name=join_index.name)
-        return join_index
+                # if values missing (-1) from target index, replace missing
+                # values by their column position or NA if not applicable


we don't want to dispatch on the index type here at all, other than calling a method on the index. This is just ripe for errors. Need to make this much more generic.

jreback · 2019-02-01T19:31:25Z

pandas/tests/reshape/merge/test_merge.py

@@ -940,11 +941,56 @@ def test_merge_two_empty_df_no_division_error(self):
            merge(a, a, on=('a', 'b'))

    @pytest.mark.parametrize('how', ['right', 'outer'])
-    def test_merge_on_index_with_more_values(self, how):
+    @pytest.mark.parametrize('index,expected_index',


can you format this a bit better. start like

@pytest.mark.parametrize( 'index,expected_index', [(......), .....

JustinZhengBC · 2019-03-28T04:24:37Z

@jreback I've modified the logic so the same behavior is used for all dtypes: missing indices are filled with an appropriate NA value
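
One way to picture "the same behavior for all dtypes" is the sketch below. It is not the code from this PR: `take_with_na` is a hypothetical helper that only handles numeric and datetime indexes, and it relies on the fact that `Index.take` treats `-1` as "last element" when no fill value is given.

```python
import numpy as np
import pandas as pd


def take_with_na(index, indexer):
    """Hypothetical helper: take from `index`, treating -1 as 'missing'."""
    indexer = np.asarray(indexer)
    # Datetime indexes use NaT as their missing value; NaN otherwise.
    fill = pd.NaT if index.dtype.kind == "M" else np.nan
    if (indexer == -1).any():
        # Put the fill value last, so position -1 resolves to it rather
        # than to the final real label.
        index = index.append(pd.Index([fill]))
    return index.take(indexer)


ints = take_with_na(pd.Index([10, 20, 30]), [0, 1, 2, -1, -1])
dates = take_with_na(pd.DatetimeIndex(["2001-01-01", "2002-02-02"]), [1, -1])
print(ints)
print(dates)
```

Without the appended fill value, `pd.Index([10, 20, 30]).take([0, -1])` would silently return the last label (30) for the missing position, which is the kind of mislabelled index this PR is concerned with.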

jreback · 2019-03-28T23:34:02Z

pandas/tests/reshape/merge/test_merge.py

+        'index,expected_index',
+        [(CategoricalIndex([1, 2, 4]),
+          CategoricalIndex([1, 2, 4, None, None, None])),
+         (DatetimeIndex(['2001-01-01',


it's ok to put multiple values on a line, e.g. in the DTI and other constructions, to make this a bit shorter

jreback · 2019-03-28T23:34:09Z

pandas/tests/reshape/merge/test_merge.py

+         (TimedeltaIndex(['1d',
+                          '2d',
+                          '3d']),
+          TimedeltaIndex(['1d',


jreback · 2019-03-29T12:25:23Z

pandas/core/reshape/merge.py

+                if is_integer_dtype(index.dtype):
+                    fill_value = np.nan
+                else:
+                    fill_value = na_value_for_dtype(index.dtype)


use na_value_for_dtype for all dtypes, passing compat=False
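
A small sketch of why `compat=False` removes the need for the integer special case. Note that `na_value_for_dtype` is an internal pandas helper, not public API; the import path below is an assumption that may not hold in every pandas version.

```python
import numpy as np
import pandas as pd
# Internal helper (assumed location); not part of the public pandas API.
from pandas.core.dtypes.missing import na_value_for_dtype

int_dtype = np.dtype("int64")

# With the default compat=True, an integer dtype gets a dtype-compatible
# placeholder (0) rather than a true missing value.
compat_fill = na_value_for_dtype(int_dtype)

# With compat=False, the helper returns NaN for integers as well,
# so the is_integer_dtype branch becomes redundant.
strict_fill = na_value_for_dtype(int_dtype, compat=False)

# Datetime dtypes get a NaT-like value.
datetime_fill = na_value_for_dtype(np.dtype("datetime64[ns]"), compat=False)

print(compat_fill, strict_fill, datetime_fill)
```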

jreback · 2019-04-20T16:54:30Z

can you merge master

JustinZhengBC · 2019-04-21T07:42:18Z

@jreback done

jreback · 2019-04-21T16:20:16Z

@JustinZhengBC so we are calling this an internal impl change, this has no outward effects? IOW user code will be unchanged? or does this have any cases that now work that didn't?

JustinZhengBC · 2019-04-22T00:05:21Z

@jreback #24916 had a whatsnew note like "bug in merge when merging by index name would sometimes result in an incorrectly numbered index," which is the same problem this addresses. There is an outward change in that merge now fills in missing index values with NA values, whereas previously it would try to infer index values based on the other index. I have modified the whatsnew to clarify the new behaviour.

jreback · 2019-04-28T18:42:08Z

lgtm. can you merge master; ping on green.

JustinZhengBC · 2019-05-05T17:36:40Z

@jreback done

jreback · 2019-05-05T21:21:59Z

thanks @JustinZhengBC

JustinZhengBC added 13 commits January 24, 2019 12:57

BUG-24212 fix usage of Index.take in pd.merge

b04cee7

BUG-24212 add comment

a64b8fe

BUG-24212 clarify test

022643d

BUG-24212 make _create_join_index function

e99dece

BUG-24212 add docstring and comments

b95e1fe

BUG-24212 fix regression

73be0d0

BUG-24212 alter old test

de3e2c7

fix typo

1287758

BUG-24212 remove print and move whatsnew note

bdce7ac

BUG-24212 fix when other_index has incompatible dtype

83ae393

Merge branch 'master' into BUG-24212

4cb3ab0

merge issue

66f6fe4

fix whatsnew

cf6fa14

JustinZhengBC changed the title ~~BUG-24212 fix when other_index has incompatible dtype~~ [WIP] BUG-24212 fix when other_index has incompatible dtype Jan 29, 2019

BUG-24212 fix test

0e6de81

JustinZhengBC added 2 commits January 29, 2019 12:29

BUG-24212 fix test

1da789a

Merge branch 'BUG-24212' of https://github.com/justinzhengbc/pandas i…

a0e5ffc

…nto BUG-24212

jschendel added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Regression (Functionality that used to work in a prior pandas version) labels Jan 29, 2019

jreback requested changes Jan 30, 2019

BUG-24212 simplify take logic

27cdbc8

JustinZhengBC changed the title ~~[WIP] BUG-24212 fix when other_index has incompatible dtype~~ BUG-24212 fix when other_index has incompatible dtype Jan 31, 2019

fix import order

cd326b2

jreback requested changes Mar 26, 2019

JustinZhengBC added 4 commits March 25, 2019 19:01

Merge branch 'master' into BUG-24212

2c65ebf

make logic more generic

d8d3cdf

make logic more generic

f9e7386

Merge branch 'BUG-24212' of https://github.com/justinzhengbc/pandas i…

8a36130

…nto BUG-24212

jreback requested changes Mar 28, 2019

clean up test

7da3655

jreback added this to the 0.25.0 milestone Mar 29, 2019

jreback reviewed Mar 29, 2019

use compat=False for na_value_for_dtype

17c5497

jreback requested changes Mar 30, 2019

Merge branch 'master' into BUG-24212

720dfbb

jreback approved these changes Apr 21, 2019

clarify whatsnew

6772618

JustinZhengBC added 3 commits April 21, 2019 17:06

Merge branch 'master' into BUG-24212

dacb4bc

add PR number to whatsnew

cad4398

Merge branch 'BUG-24212' of https://github.com/justinzhengbc/pandas i…

5e2eb0f

…nto BUG-24212

Merge branch 'master' into BUG-24212

88cdf8b

jreback merged commit cc3b2f0 into pandas-dev:master May 5, 2019
