BUG-24212 fix when other_index has incompatible dtype #25009
Codecov Report
@@ Coverage Diff @@
## master #25009 +/- ##
==========================================
- Coverage 92.38% 42.88% -49.5%
==========================================
Files 166 166
Lines 52401 52407 +6
==========================================
- Hits 48409 22475 -25934
- Misses 3992 29932 +25940
Codecov Report
@@ Coverage Diff @@
## master #25009 +/- ##
==========================================
- Coverage 91.97% 91.96% -0.01%
==========================================
Files 175 175
Lines 52368 52365 -3
==========================================
- Hits 48164 48157 -7
- Misses 4204 4208 +4
pandas/core/reshape/merge.py
Outdated
join_list[mask] = other_list[mask]
join_index = Index(join_list, dtype=other_index.dtype,
name=other_index.name)
except ValueError:
we really don't want to do a try/except here. What is falling into the except?
When 'other_index' has a different dtype, inserting values from it into the current index raises an exception. I did not compare dtypes because in some cases differing dtypes are compatible (example: an int can be added to a Float64Index).
can you:
- always just do this (what you have in the except)
- or use is_dtype_equal to test?
I don't think is_dtype_equal would work because some combinations of different dtypes are still usable.
I agree with the first option because joining on the index of the other frame kind of makes the row order of the other frame arbitrary anyway.
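To make the objection to a strict dtype comparison concrete, here is a minimal sketch. It is not code from the PR; the variable names are illustrative, and it assumes a pandas version where `pandas.api.types.is_dtype_equal` is available. It shows a pair of dtypes that are not equal yet still combine cleanly, next to a pair that does not.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal

int_dtype = np.dtype("int64")
float_dtype = np.dtype("float64")

# The dtypes differ, so an equality test would reject this pair...
dtypes_equal = is_dtype_equal(int_dtype, float_dtype)

# ...yet integer values fit into a float64 index without error.
float_index = pd.Index([1, 2, 4], dtype=float_dtype)

# Strings cannot be coerced to float64, so this construction fails.
try:
    pd.Index(["a", "b"], dtype=float_dtype)
    incompatible_raised = False
except (ValueError, TypeError):
    incompatible_raised = True
```

An equality check would therefore treat the int/float case the same as the string/float case, which is why it is too coarse a test here.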
can you merge master and let's have a look at this
pandas/core/reshape/merge.py
Outdated
name=join_index.name)
return join_index
# if values missing (-1) from target index, replace missing
# values by their column position or NA if not applicable
we don't want to dispatch on the index type here at all, other than calling a method on the index. This is just ripe for errors. Need to make this much more generic.
@@ -940,11 +941,56 @@ def test_merge_two_empty_df_no_division_error(self):
merge(a, a, on=('a', 'b'))

@pytest.mark.parametrize('how', ['right', 'outer'])
def test_merge_on_index_with_more_values(self, how):
@pytest.mark.parametrize('index,expected_index',
can you format this a bit better. start like
@pytest.mark.parametrize(
'index,expected_index',
[(......),
.....
@jreback I've modified the logic so the same behavior is used for all dtypes: missing indices are filled with an appropriate NA value
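The dtype-independent behavior described above can be sketched in plain Python. This is an illustration under assumptions, not the PR's implementation: `take_with_fill` is a hypothetical helper, and `None` stands in for whatever NA value suits the index. The point is that -1 in an indexer marks a missing position, whereas ordinary indexing would read it as "last element".

```python
def take_with_fill(values, indexer, fill_value=None):
    """Pick values[i] for each i in indexer, treating -1 as missing.

    Hypothetical helper for illustration only.
    """
    return [fill_value if i == -1 else values[i] for i in indexer]


left_index = [1, 2, 4]
indexer = [0, 1, 2, -1, -1, -1]

# Plain indexing silently maps every -1 to the last element (4).
naive = [left_index[i] for i in indexer]

# Treating -1 as a sentinel fills those positions with the NA stand-in.
filled = take_with_fill(left_index, indexer)
```

In pandas itself, `Index.take` accepts `allow_fill` and `fill_value` arguments that provide comparable semantics on real index objects.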
'index,expected_index',
[(CategoricalIndex([1, 2, 4]),
CategoricalIndex([1, 2, 4, None, None, None])),
(DatetimeIndex(['2001-01-01',
it's ok to put multiple values on a line, e.g. in the DTI and other constructions, to make this a bit shorter
(TimedeltaIndex(['1d',
'2d',
'3d']),
TimedeltaIndex(['1d',
e.g. here
pandas/core/reshape/merge.py
Outdated
if is_integer_dtype(index.dtype):
fill_value = np.nan
else:
fill_value = na_value_for_dtype(index.dtype)
use this for all, passing compat=False
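As a rough illustration of why `compat=False` makes the integer special case unnecessary, the sketch below calls `na_value_for_dtype` directly. It assumes the helper is importable from `pandas.core.dtypes.missing`, which is an internal module, so its location and return values may differ between pandas versions. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.core.dtypes.missing import na_value_for_dtype  # internal helper

int_dtype = np.dtype("int64")

# With the default compat=True, integer dtypes get a "compatible" 0.
compat_fill = na_value_for_dtype(int_dtype)

# With compat=False, integer dtypes get NaN instead, so no explicit
# is_integer_dtype branch is needed.
strict_fill = na_value_for_dtype(int_dtype, compat=False)

# Datetime-like dtypes get a NaT-style missing value either way.
datetime_fill = na_value_for_dtype(np.dtype("datetime64[ns]"), compat=False)
```

With this call, the fill value for every index dtype comes from a single function rather than from branching on the index type.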
can you merge master
@jreback done
@JustinZhengBC so we are calling this an internal impl change, this has no outward effects? IOW user code will be unchanged? or does this have any cases that now work that didn't?
@jreback #24916 had a whatsnew note like "bug in merge when merging by index name would sometimes result in an incorrectly numbered index," which is the same problem this addresses. There is an outwards change in that merge now fills in missing index values with NA values, whereas previously it would try to infer index values based on the other index. I have modified the whatsnew to clarify the new behaviour
lgtm. can you merge master; ping on green.
@jreback done
thanks @JustinZhengBC
git diff upstream/master -u -- "*.py" | flake8 --diff
Followup to #24916, addresses the case when the other index has an incompatible dtype, so we cannot take directly from it. Currently, this PR replaces the missing index values with the appropriate NA value (an earlier revision naively replaced them with the number of the rows in the other index that caused them). Still working on adding cases when it is possible to combine indices of sparse/categorical dtypes without densifying.