Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: merge_asof() has left_index/right_index and left_by/right_by (#14253) #14531

Merged
merged 9 commits into from
Dec 14, 2016
Merged

Conversation

chrisaycock
Copy link
Contributor

@chrisaycock
Copy link
Contributor Author

I'll be the first to admit that there is some kludge in here, identified by comments in the source code. There are some code paths that can only be hit during a pd.merge_asof() when left_index or right_index is set, so no existing functionality will be affected.

I think for pandas 2.0 we'll need to take a more holistic view of how to do merges.

@codecov-io
Copy link

codecov-io commented Oct 28, 2016

Current coverage is 85.20% (diff: 87.23%)

No coverage report found for master at 3552dc0.

Powered by Codecov. Last update 3552dc0...acad843

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Oct 31, 2016
join_names.append(rk)
else:
# kludge for merge_asof(right_index=True)
right_keys.append(right.index.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not safe as it can cause dtype conversions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part? The .index.values or the .append()?

@@ -1008,7 +1072,7 @@ def _get_merge_keys(self):
# validate tolerance; must be a Timedelta if we have a DTI
if self.tolerance is not None:

lt = left_join_keys[self.left_on.index(self._asof_key)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _asof_key is defined to be self.left_on[-1]. But since left_on is empty when left_index is set, this line is invalid. Really we only need the time key, which is always at the end of left_join_keys.

@@ -117,6 +117,75 @@ def test_basic_categorical(self):
by='ticker')
assert_frame_equal(result, expected)

def test_basic_left_index(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the github issue as a comment

raise MergeError("can only asof on a key for right")

if self.left_index and isinstance(self.left.index, MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you tests for these conditions that are errors?

@chrisaycock
Copy link
Contributor Author

chrisaycock commented Nov 7, 2016

Hmm, AppVeyor got a syntax error for a line that isn't mine:

from pip._vendor. import string_types

Seems like everyone is failing with this. Filed issue #14603.

@chrisaycock
Copy link
Contributor Author

@jreback Would this change be more appropriate for 0.19.2 or 0.20.0?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. need some more tests coverage and see if you can remove the need to say kludge :) (IOW see if you can remove the kludge)

^^^^^^^^^^^^^^^^^^

- ``pd.merge_asof()`` can take ``left_index``/``right_index`` and ``left_by``/``right_by`` (:issue:`14253`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can take -> gained, and say arguments at the end

right_keys.append(right[rk]._values)
else:
# kludge for merge_asof(right_index=True)
right_keys.append(right.index.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this here, we don't like doing things like right.index.values, as that can cause dtype conversions. simply append right.index should work.

left_keys.append(left[lk]._values)
join_names.append(lk)
else:
# kludge for merge_asof(left_index=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a kludge? any way to make this less kludgy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unique to pd.merge_asof() because on and by are separate parameters. So a user could, for example, request left_index and left_by.

In a regular pd.merge(), users cannot specify both left_index and left_on. (Instead, users have a MultiIndex). That means the left_on in _get_merge_keys() is always empty in a pd.merge(), but a pd.merge_asof(left_index=True, left_by=...) will result in a left_on array with a None in the middle of it. See _validate_specification() for where this happens.

I put the kludge in there to handle this case. So perhaps it isn't a kludge at all and I should just put the above explanation at the top of the function. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so its simpler to handle all of the cases here (in basic merge) then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now it's simpler. It would take a massive overhaul of how generic merge works to fix this.

I'll add my notes as a comment to the function and resubmit.


if self.right_index and isinstance(self.right.index, MultiIndex):
raise MergeError("right can only have one index")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add tests for each of these error conditions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Field name to group by in the right DataFrame.

.. versionadded:: 0.19.2

suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some examples of using these parameters?

@jreback
Copy link
Contributor

jreback commented Nov 15, 2016

Would this change be more appropriate for 0.19.2 or 0.20.0?

There is a tiny API change (in that some of the positional args to pd.merge_asof changed) because adding new ones, but think that is ok as we introduced this in 0.19.0, though to be conservative we can defer to 0.20.0

@jreback
Copy link
Contributor

jreback commented Nov 18, 2016

@jorisvandenbossche

@chrisaycock
Copy link
Contributor Author

@jreback @jorisvandenbossche It's all green finally.

@jreback jreback added this to the 0.19.2 milestone Nov 22, 2016
@jorisvandenbossche jorisvandenbossche merged commit 84cad61 into pandas-dev:master Dec 14, 2016
@jorisvandenbossche
Copy link
Member

@chrisaycock Thanks!

think that is ok as we introduced this in 0.19.0, though to be conservative we can defer to 0.20.0

I personally have a slight preference to keep this for 0.20.0, but as this was only introduced in 0.19.0, it's not a breakpoint for me

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
Version 0.19.2

* tag 'v0.19.2': (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
* releases: (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add left_by/right_by and left_index/right_index to merge_asof()
4 participants