-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: merge_asof() has left_index/right_index and left_by/right_by (#14253) #14531
Conversation
I'll be the first to admit that there is some kludge in here, identified by comments in the source code. There are some code paths that can only be hit during a I think for pandas 2.0 we'll need to take a more holistic view of how to do merges. |
Current coverage is 85.20% (diff: 87.23%)
|
join_names.append(rk) | ||
else: | ||
# kludge for merge_asof(right_index=True) | ||
right_keys.append(right.index.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not safe as it can cause dtype conversions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which part? The .index.values
or the .append()
?
@@ -1008,7 +1072,7 @@ def _get_merge_keys(self): | |||
# validate tolerance; must be a Timedelta if we have a DTI | |||
if self.tolerance is not None: | |||
|
|||
lt = left_join_keys[self.left_on.index(self._asof_key)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _asof_key
is defined to be self.left_on[-1]
. But since left_on
is empty when left_index
is set, this line is invalid. Really we only need the time key, which is always at the end of left_join_keys
.
@@ -117,6 +117,75 @@ def test_basic_categorical(self): | |||
by='ticker') | |||
assert_frame_equal(result, expected) | |||
|
|||
def test_basic_left_index(self): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the github issue as a comment
raise MergeError("can only asof on a key for right") | ||
|
||
if self.left_index and isinstance(self.left.index, MultiIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you tests for these conditions that are errors?
Hmm, AppVeyor got a syntax error for a line that isn't mine:
Seems like everyone is failing with this. Filed issue #14603. |
@jreback Would this change be more appropriate for 0.19.2 or 0.20.0? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. need some more tests coverage and see if you can remove the need to say kludge :) (IOW see if you can remove the kludge)
^^^^^^^^^^^^^^^^^^ | ||
|
||
- ``pd.merge_asof()`` can take ``left_index``/``right_index`` and ``left_by``/``right_by`` (:issue:`14253`) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can take -> gained, and say arguments at the end
right_keys.append(right[rk]._values) | ||
else: | ||
# kludge for merge_asof(right_index=True) | ||
right_keys.append(right.index.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this here, we don't like doing things like right.index.values
, as that can cause dtype conversions. simply append right.index
should work.
left_keys.append(left[lk]._values) | ||
join_names.append(lk) | ||
else: | ||
# kludge for merge_asof(left_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a kludge? any way to make this less kludgy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unique to pd.merge_asof()
because on
and by
are separate parameters. So a user could, for example, request left_index
and left_by
.
In a regular pd.merge()
, users cannot specify both left_index
and left_on
. (Instead, users have a MultiIndex
). That means the left_on
in _get_merge_keys()
is always empty in a pd.merge()
, but a pd.merge_asof(left_index=True, left_by=...)
will result in a left_on
array with a None
in the middle of it. See _validate_specification()
for where this happens.
I put the kludge in there to handle this case. So perhaps it isn't a kludge at all and I should just put the above explanation at the top of the function. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so its simpler to handle all of the cases here (in basic merge) then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now it's simpler. It would take a massive overhaul of how generic merge works to fix this.
I'll add my notes as a comment to the function and resubmit.
|
||
if self.right_index and isinstance(self.right.index, MultiIndex): | ||
raise MergeError("right can only have one index") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add tests for each of these error conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests already exist:
https://github.com/pandas-dev/pandas/pull/14531/files#diff-e00646757b932a2684fda4588f99009cR163
Field name to group by in the right DataFrame. | ||
|
||
.. versionadded:: 0.19.2 | ||
|
||
suffixes : 2-length sequence (tuple, list, ...) | ||
Suffix to apply to overlapping column names in the left and right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some examples of using these parameters?
There is a tiny API change (in that some of the positional args to |
@jreback @jorisvandenbossche It's all green finally. |
@chrisaycock Thanks!
I personally have a slight preference to keep this for 0.20.0, but as this was only introduced in 0.19.0, it's not a breakpoint for me |
…ndas-dev#14253) (pandas-dev#14531) (cherry picked from commit 84cad61)
Version 0.19.2 * tag 'v0.19.2': (78 commits) RLS: v0.19.2 DOC: update release notes for 0.19.2 TST: skip gbq upload test as flakey DOC: clean-up v0.19.2 whatsnew DOC: update Pandas Cheat Sheet (GH13202) DOC: Pandas Cheat Sheet TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957) flake8 fix import Remove test - from 0.20.0 PR slipped in PERF: fix getitem unique_check / initialization issue cache and remove boxing (pandas-dev#14931) CLN: Resubmit of GH14700. Fixes GH14554. Errors other than Indexing… Clean up construction of Series with dictionary and datetime index BUG: .fillna() for datetime64 with tz is passing thru floats BUG: Patch read_csv NA values behaviour ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936) [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886) Fixed KDE Plot to drop the missing values (pandas-dev#14820) ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531) TST: correct url for test file on s3 (xref pandas-dev#14587) ...
* releases: (78 commits) RLS: v0.19.2 DOC: update release notes for 0.19.2 TST: skip gbq upload test as flakey DOC: clean-up v0.19.2 whatsnew DOC: update Pandas Cheat Sheet (GH13202) DOC: Pandas Cheat Sheet TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957) flake8 fix import Remove test - from 0.20.0 PR slipped in PERF: fix getitem unique_check / initialization issue cache and remove boxing (pandas-dev#14931) CLN: Resubmit of GH14700. Fixes GH14554. Errors other than Indexing… Clean up construction of Series with dictionary and datetime index BUG: .fillna() for datetime64 with tz is passing thru floats BUG: Patch read_csv NA values behaviour ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936) [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886) Fixed KDE Plot to drop the missing values (pandas-dev#14820) ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531) TST: correct url for test file on s3 (xref pandas-dev#14587) ...
git diff upstream/master | flake8 --diff