BUG-24212 fix when other_index has incompatible dtype #25009

Merged: 30 commits, May 5, 2019

@JustinZhengBC JustinZhengBC commented Jan 29, 2019

Follow-up to #24916. Addresses the case where the other index has an incompatible dtype, so we cannot take values directly from it. Currently, this PR replaces the missing index values with the appropriate NA value.

Still working on adding cases when it is possible to combine indices of sparse/categorical dtypes without densifying.

@JustinZhengBC JustinZhengBC changed the title BUG-24212 fix when other_index has incompatible dtype [WIP] BUG-24212 fix when other_index has incompatible dtype Jan 29, 2019

codecov bot commented Jan 29, 2019

Codecov Report

Merging #25009 into master will decrease coverage by 49.49%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   92.38%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52401    52407       +6     
==========================================
- Hits        48409    22475   -25934     
- Misses       3992    29932   +25940
Flag         Coverage Δ
#multiple    ?
#single      42.88% <0%> (-0.01%) ⬇️

Impacted Files                        Coverage Δ
pandas/core/reshape/merge.py          9.43% <0%> (-85.05%) ⬇️
pandas/io/formats/latex.py            0% <0%> (-100%) ⬇️
pandas/core/categorical.py            0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py        0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py            0% <0%> (-100%) ⬇️
pandas/tseries/converter.py           0% <0%> (-100%) ⬇️
pandas/io/formats/html.py             0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py    0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py             0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py            0% <0%> (-90.15%) ⬇️
... and 124 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abf0824...0e6de81.


codecov bot commented Jan 29, 2019

Codecov Report

Merging #25009 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25009      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.01%     
==========================================
  Files         175      175              
  Lines       52368    52365       -3     
==========================================
- Hits        48164    48157       -7     
- Misses       4204     4208       +4
Flag         Coverage Δ
#multiple    90.52% <100%> (-0.01%) ⬇️
#single      40.7% <0%> (-0.15%) ⬇️

Impacted Files                  Coverage Δ
pandas/core/reshape/merge.py    94.45% <100%> (-0.03%) ⬇️
pandas/io/gbq.py                78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py            96.9% <0%> (-0.12%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9feb3ad...88cdf8b.

@jschendel jschendel added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Regression (functionality that used to work in a prior pandas version) labels Jan 29, 2019
    join_list[mask] = other_list[mask]
    join_index = Index(join_list, dtype=other_index.dtype,
                       name=other_index.name)
except ValueError:
Contributor

we really don't want to do a try/except here. What is falling into the except?

Contributor Author

The except branch is hit when 'other_index' has a different dtype that causes an exception to be raised when its values are inserted into the current index. I did not compare dtypes because in some cases differing dtypes are still compatible (example: int values can be added to a Float64Index).

Contributor

can you:

  • always just do this (what you have in the except)
  • or use is_dtype_equal to test?

Contributor Author

I don't think is_dtype_equal would work because some combinations of different dtypes are still usable.

I agree with the first option because joining on the index of the other frame kind of makes the row order of the other frame arbitrary anyway.
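To illustrate the point about `is_dtype_equal`, here is a minimal sketch (not code from this PR). It assumes the helper can be imported from the internal module `pandas.core.dtypes.common`; the index values are made up for the example.

```python
import numpy as np
import pandas as pd

# Internal pandas module (assumption: this import path is available in your version).
from pandas.core.dtypes.common import is_dtype_equal

float_index = pd.Index([1.0, 2.0, 3.0])  # float64 index
int_index = pd.Index([4, 5, 6])          # int64 index

# A strict dtype comparison reports a mismatch...
dtypes_match = is_dtype_equal(float_index.dtype, int_index.dtype)

# ...yet the integer values can still be combined with the float index.
combined = float_index.append(int_index)

print(dtypes_match)
print(combined.dtype)
```

So a check based purely on dtype equality would reject combinations that are in fact usable, which is the concern raised above.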

@JustinZhengBC JustinZhengBC changed the title [WIP] BUG-24212 fix when other_index has incompatible dtype BUG-24212 fix when other_index has incompatible dtype Jan 31, 2019
Contributor

@jreback jreback left a comment

can you merge master and let's have a look at this

name=join_index.name)
return join_index
# if values missing (-1) from target index, replace missing
# values by their column position or NA if not applicable
Contributor

we don't want to dispatch on the index type here at all, other than calling a method on the index. This is just ripe for errors. Need to make this much more generic.

@@ -940,11 +941,56 @@ def test_merge_two_empty_df_no_division_error(self):
merge(a, a, on=('a', 'b'))

@pytest.mark.parametrize('how', ['right', 'outer'])
def test_merge_on_index_with_more_values(self, how):
@pytest.mark.parametrize('index,expected_index',
Contributor

can you format this a bit better. start like

@pytest.mark.parametrize(
   'index,expected_index',
    [(......),
.....

@JustinZhengBC
Contributor Author

@jreback I've modified the logic so the same behavior is used for all dtypes: missing indices are filled with an appropriate NA value
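One way to get dtype-independent behaviour is sketched below. This is an illustration under assumptions, not necessarily the code merged in this PR: `take_with_na` is a hypothetical helper, and it relies on `Index.take` treating `-1` as "last element" by default, so appending a single NA entry makes every `-1` position resolve to NA.

```python
import numpy as np
import pandas as pd


def take_with_na(index, indexer, fill_value=np.nan):
    """Hypothetical helper: positions marked -1 in `indexer` become NA."""
    indexer = np.asarray(indexer)
    if (indexer == -1).any():
        # With one NA entry appended, position -1 (the last slot) is NA.
        index = index.append(pd.Index([fill_value]))
    return index.take(indexer)


# The three trailing -1 entries stand for rows with no match in `index`.
result = take_with_na(pd.Index([1, 2, 3]), [0, 1, 2, -1, -1, -1])
print(result)
```

For non-numeric indexes a different `fill_value` (for example `pd.NaT` for datetimes) would be passed in; choosing that value per dtype is what the rest of the review discusses.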

'index,expected_index',
[(CategoricalIndex([1, 2, 4]),
CategoricalIndex([1, 2, 4, None, None, None])),
(DatetimeIndex(['2001-01-01',
Contributor

it's ok to put multiple values on a line, e.g. in the DTI and other constructions, to make this a bit shorter

(TimedeltaIndex(['1d',
'2d',
'3d']),
TimedeltaIndex(['1d',
Contributor

e.g. here

pandas/tests/reshape/merge/test_merge.py (resolved)
@jreback jreback added this to the 0.25.0 milestone Mar 29, 2019
if is_integer_dtype(index.dtype):
    fill_value = np.nan
else:
    fill_value = na_value_for_dtype(index.dtype)
Contributor

use this for all, passing compat=False
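A small sketch of why `compat=False` makes the integer special case unnecessary. It assumes `na_value_for_dtype` is importable from the internal module `pandas.core.dtypes.missing` and accepts NumPy dtype objects; being internal, the path and behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Internal pandas helper, not public API (assumption: import path is stable here).
from pandas.core.dtypes.missing import na_value_for_dtype

int_dtype = np.dtype("int64")

# Default (compat=True): integers get a "compatible" placeholder, not a missing value.
compat_fill = na_value_for_dtype(int_dtype)

# compat=False: integers get NaN, so no is_integer_dtype branch is needed.
na_fill = na_value_for_dtype(int_dtype, compat=False)

# Datetime-like dtypes get NaT either way.
dt_fill = na_value_for_dtype(np.dtype("datetime64[ns]"), compat=False)

print(compat_fill, na_fill, dt_fill)
```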

@jreback
Contributor

jreback commented Apr 20, 2019

can you merge master

@JustinZhengBC
Contributor Author

@jreback done

@jreback
Contributor

jreback commented Apr 21, 2019

@JustinZhengBC so we are calling this an internal impl change, this has no outward effects? IOW user code will be unchanged? or does this have any cases that now work that didn't?

@JustinZhengBC
Contributor Author

@jreback #24916 had a whatsnew note like "bug in merge when merging by index name would sometimes result in an incorrectly numbered index," which is the same problem this addresses. There is an outwards change in that merge now fills in missing index values with NA values, whereas previously it would try to infer index values based on the other index. I have modified the whatsnew to clarify the new behaviour
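The outward change can be sketched from the user's side as follows. The frames are made up for illustration and mirror the shape of the case discussed in this PR (a left frame with a custom index merged against a right frame with more rows).

```python
import pandas as pd

# Left frame has a custom index; right frame has more rows than left.
left = pd.DataFrame({"a": [0, 1, 2], "key": [0, 1, 2]}, index=[1, 2, 3])
right = pd.DataFrame({"b": [0, 1, 2, 3, 4, 5]})

# Join left's "key" column against right's index, keeping every row of right.
result = left.merge(right, left_on="key", right_index=True, how="right")
print(result)
```

With the behaviour described above, the three rows of `right` that have no match in `left` should carry NA index labels instead of labels inferred from the other index.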

@jreback
Contributor

jreback commented Apr 28, 2019

lgtm. can you merge master; ping on green.

@JustinZhengBC
Contributor Author

@jreback done

@jreback jreback merged commit cc3b2f0 into pandas-dev:master May 5, 2019
@jreback
Contributor

jreback commented May 5, 2019

thanks @JustinZhengBC

Labels
Regression (functionality that used to work in a prior pandas version), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Development

Successfully merging this pull request may close these issues.

REGR: re-evaluate merge fix of PR #24916
3 participants