Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

jonmmease · 2017-09-09T14:42:49Z

This PR implements the changes proposed and discussed with @jorisvandenbossche @shoyer @TomAugspurger @jreback in #14355.

These changes allow the on, left_on, and right_on parameters of DataFrame.merge to accept a combination of column names and index level names. Any common index levels that are merged on are preserved as index levels in the resulting DataFrame, while all other index levels are removed.

In the case of a conflict, the column takes precedence and an ambiguity FutureWarning is raised.

closes ENH: Support merging DataFrames on a combination of columns and index levels #14355
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
documentation updated
whatsnew entry

Note: The new df._get_column_or_level_values method introduced in this PR is the same method introduced in #17361 to support sorting DataFrames on a combination of columns and index levels. I will keep this method in sync between the two PRs during the review process.

gfyoung · 2017-09-09T19:12:19Z

doc/source/merging.rst

+.. note::
+
+   When DataFrames are merged on a string that matches an index level in both
+   frames then the index level is preserved as an index level in the resulting


"frames" --> "frames,"

gfyoung · 2017-09-09T19:12:30Z

doc/source/merging.rst

+
+.. note::
+
+   If a string matches both a column name and an index level name then a warning is


"index level name" --> "index level name,"

gfyoung · 2017-09-09T19:13:10Z

doc/source/whatsnew/v0.21.0.txt

@@ -109,6 +109,7 @@ Other Enhancements
 - :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
 - :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
+- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters allowing frames to be merged on a combination of columns and index levels (:issue:`14355`)


"parameters allowing" --> "parameters, allowing"

gfyoung · 2017-09-09T19:13:37Z

pandas/core/frame.py

@@ -3437,6 +3437,36 @@ def f(vals):

    # ----------------------------------------------------------------------
    # Sorting
+    def _get_column_or_level_values(self, key, axis=1,
+                                    op_description='retrieve'):


Add doc-string for this. Developers in the future will thank you. 😉

gfyoung · 2017-09-09T19:13:58Z

pandas/core/frame.py

+                    ("'%s' is both a column name and an index level.\n"
+                     "Defaulting to column but "
+                     "this will raise an ambiguity error in a "
+                     "future version") % key,


The percentile string replacement is being phased out in favor of .format. Please replace along with ALL other places where you use it.

gfyoung · 2017-09-09T19:14:39Z

pandas/core/reshape/merge.py

@@ -792,6 +815,10 @@ def _get_merge_keys(self):
        is_rkey = lambda x: isinstance(
            x, (np.ndarray, Series)) and len(x) == len(right)

+        def get_key_vals(df, key):


Add doc-string for this.

gfyoung · 2017-09-09T19:15:14Z

pandas/tests/reshape/test_merge.py

@@ -1350,6 +1358,131 @@ def f():
            household.join(log_return, how='outer')
        pytest.raises(NotImplementedError, f)

+    def test_merge_on_index_and_column(self):
+        # Construct DataFrames


Reference issue number.

codecov · 2017-09-09T20:12:15Z

Codecov Report

Merging #17484 into master will increase coverage by 0.09%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17484      +/-   ##
==========================================
+ Coverage   91.35%   91.45%   +0.09%     
==========================================
  Files         163      157       -6     
  Lines       49691    51451    +1760     
==========================================
+ Hits        45397    47052    +1655     
- Misses       4294     4399     +105

Flag	Coverage Δ
#multiple	`89.32% <100%> (+0.18%)`	⬆️
#single	`40.53% <11.11%> (+0.79%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/groupby.py	`92.04% <100%> (+0.01%)`	⬆️
pandas/core/frame.py	`97.81% <100%> (-0.09%)`	⬇️
pandas/core/reshape/merge.py	`94.4% <100%> (+0.14%)`	⬆️
pandas/core/generic.py	`95.9% <100%> (+0.17%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/io/clipboard/__init__.py	`56.86% <0%> (-19.33%)`	⬇️
pandas/util/testing.py	`81.89% <0%> (-18.11%)`	⬇️
pandas/io/json/json.py	`91.75% <0%> (-8.25%)`	⬇️
pandas/io/formats/style.py	`96.21% <0%> (-3.79%)`	⬇️
pandas/core/dtypes/missing.py	`90.24% <0%> (-0.5%)`	⬇️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c634352...f3b95fe. Read the comment docs.

jonmmease · 2017-09-10T01:11:52Z

Thanks for the quick feedback @gfyoung. I think I've incorporated all of it in this last commit. Let me know if I've missed anything or if there are additional changes you'd like to see.

jreback

lots of comments

jreback · 2017-09-10T14:27:33Z

doc/source/merging.rst

@@ -1120,6 +1122,53 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
          labels=['left', 'right'], vertical=False);
   plt.close('all');

+Merging on a combination of columns and index levels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. versionadded:: 0.21


need a blank line here

add a :ref: entry before the sub-section

jreback · 2017-09-10T14:29:51Z

pandas/core/frame.py

+
+        Parameters
+        ----------
+        key: int or object


This should only accept string keys referring to a named column or level, We do not want to have the integer/string mess like .ix again.

jreback · 2017-09-10T14:30:46Z

pandas/core/reshape/merge.py

@@ -717,7 +719,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
                if name in result:
                    result[name] = key_col
                else:
-                    result.insert(i, name or 'key_{i}'.format(i=i), key_col)
+                    if name and name in result.index.names:


you are adding an amazing amount of logic here. This must be made simpler.

jreback · 2017-09-10T14:31:34Z

pandas/core/reshape/merge.py

+            See Also
+            --------
+            DataFrame._get_column_or_level_values
+            """


if you actually need complexity, then hide it away here

jreback · 2017-09-10T14:31:43Z

pandas/core/reshape/merge.py

+
+            Returns
+            -------
+            values: array


array -> np.ndarray

jreback · 2017-09-10T14:31:48Z

pandas/core/reshape/merge.py

+            Parameters
+            ----------
+            df: DataFrame
+            key: int or object


jreback · 2017-09-10T14:32:07Z

pandas/core/frame.py

+        axis: int, default 0
+            Axis to retrieve values from
+
+        op_description: str,  default 'retrieve'


not sure what this parameter is meant to do, its completely non-intuitive.

jreback · 2017-09-10T14:34:40Z

We may need a common framework here to incorporate the .groupby functionaility of the same name. Adding this many code branches is making things pretty complex.

jonmmease · 2017-09-11T00:14:47Z

Thanks for the feedback @jreback. I'll address your comments and take another stab at driving down the complexity. If you, or anyone else, have some ideas for a more unified way to go about this (mixing columns and index levels in merge, sort_values, groupby, etc) I'd be happy to take a step back and consider a change of course.

jonmmease · 2017-09-11T18:30:31Z

Ok, after meditating on this section of the code again I think I found a simpler approach. I created a new set of DataFrame helpers to support this change and I refactored the existing groupby column/index logic (#14432) to use these as well. I haven't revisited my sort_values updates in #17361 yet, but I should be able to rely on these same helpers there too.

Does this general approach look reasonable @jreback, @jorisvandenbossche, @gfyoung, and @TomAugspurger? Thanks!

jonmmease · 2017-10-02T22:41:47Z

Just resolved conflicts with master. @jreback do you think my updates since your review last month are heading in the right direction? I'm happy to keep iterating as you have time to provide feedback.

@jorisvandenbossche it's been a while since we first talked about these changes. Does this approach look alright to you?

Thanks!

jonmmease · 2017-10-04T21:18:26Z

I'm having trouble interpreting the circleci failure above. Is this something my changes caused?

gfyoung · 2017-10-04T21:23:14Z

@jmmease : The Circle CI builds just died...let's give it another run to see if it passes.

jonmmease · 2017-10-04T23:10:09Z

Thanks, @gfyoung

jreback · 2017-11-22T00:27:47Z

I think we could put all of the added code that you put in generic.py in another module. maybe references and make it into a mix-in class (that you just include into generic).

actually this is ok for now. will think splitting generic.py up (in another issue).

jonmmease · 2017-11-22T15:26:25Z

No problem @jreback , thanks for all you're doing to keep pandas going strong!

Just let me know if you would like something moved before merging.

# Conflicts: # doc/source/whatsnew/v0.22.0.txt

jonmmease · 2017-11-23T13:11:21Z

Just merged to fix whatsnew conflict. Is this all set @jreback?

jreback · 2017-11-23T16:39:18Z

@jmmease small linting error (we just added this, it now looks for list-comprehensions that are not generators) :>

and want @jorisvandenbossche @TomAugspurger to have a look.

jonmmease · 2017-11-23T17:31:10Z

Ok, I think I fixed the lint issues. Thanks for pointing me in the right direction @jreback

I'll stand-by for any feedback from @jorisvandenbossche and @TomAugspurger

jorisvandenbossche

Nice work!

I added some minor comments.

General notes:

For the case that an index level and column level match, shouldn't we treat this the same as a duplicate column name (in the future)? (in the idea of seeing an index as a special column)
I don't think it changes anything right now (warning that for now column is used), but I would change it in the future to the same as what happens with duplicate columns.
You wrote the code dealing with selecting labels vs levels very generic. But do we actually have a use case for the axis=1 ? (selecting labels from index or levels from columns) I think merge is only column-wise?
If we don't have such a use case, I think the code would be simpler to just assume it is only for the column labels.
Should DataFrame.join support the same?

jorisvandenbossche · 2017-11-07T21:45:16Z

pandas/core/frame.py

-    None and not merging on indexes, then it merges on the intersection of
-    the columns by default.
+    Column or index level names to join on. These must be found in both
+    DataFrames. If on is None and not merging on indexes then this defaults to


Can you put single backticks around 'on' ? (that is typically done for parameters of the function)

jorisvandenbossche · 2017-11-07T21:46:05Z

pandas/core/frame.py

@@ -195,6 +196,11 @@

    .. versionadded:: 0.21.0

+Note


Note -> Notes (section names are fixed by numpydoc)

needs update here

Thanks for catching this @jreback. Fixed

jorisvandenbossche · 2017-11-07T21:49:34Z

pandas/core/generic.py

+                "_is_label_reference is not implemented for {type}"
+                .format(type=type(self)))
+
+        return (isinstance(key, compat.string_types) and


Is this restriction of string type needed?
Maybe it is safe to do this to start with (and it will cover most of the use cases), but in general pandas does not restrict column names to be strings.

Good point, I've changed this to check that the key is non-None and hashable

jorisvandenbossche · 2017-11-23T20:21:37Z

pandas/core/frame.py

-    vectors of the length of the DataFrame to use a particular vector as
-    the join key instead of columns
+    Column or index level names to join on in the left DataFrame. Can also
+    be a vector or list of vectors of the length of the left DataFrame.


now you are changing this anyway, can you change vector with array ?

jorisvandenbossche · 2017-11-23T20:24:15Z

pandas/core/frame.py

-    the join key instead of columns
+    Column or index level names to join on in the left DataFrame. Can also
+    be a vector or list of vectors of the length of the left DataFrame.
+    These vectors are treated as though they are columns.


I think 'as if' is easier language than 'as though' for non-native speakers

jorisvandenbossche · 2017-11-23T20:34:22Z

pandas/core/generic.py

+                          level_article=level_article,
+                          level_type=level_type,
+                          label_article=label_article,
+                          label_type=label_type), FutureWarning)


this warning needs a stacklevel

I would also put the creation of the message on a separate line before, I think that will make it cleaner:

msg = (....).format(..) warnings.warn(msg, FutureWarning, stacklevel=..)

jorisvandenbossche · 2017-11-23T20:53:33Z

pandas/tests/reshape/test_merge_index_as_string.py

+                         [['outer'], ['inner'],
+                          ['outer', 'inner'],
+                          ['inner', 'outer']])
+def test_merge_indexes_and_columns_on(left_df, right_df, on, how):


This single test adds actually 144 (3x3x4x4) tests, which feels a bit as overkill to me?

I trimmed it down by a factor of 4 by removing the how fixture and parameterizing how alongside on.

jonmmease · 2017-11-23T23:58:31Z

Thanks for looking things over @jorisvandenbossche. I'll address the specific comments in an update soon but wanted to follow up on a few of your general points.

Yes, I agree that it makes sense for ambiguous index/column combinations to eventually fail in the same way that duplicate columns now fail. Let me know if there's any language in the FutureWarning or in my comments in the code that you'd like me to change to make this future intention clearer.
I'm planning to use the axis=1 case for a couple of these new methods in my next update to ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361 (The sort_values analog of this PR) since sort_values supports that axis parameter. My hope was that these methods would make it easier to add mixed column/index support to other operations in the future, though sort_values is the only remaining operation on my personal wishlist at the moment.
I restricted index level references to strings at the moment partially because I wasn't clear on how to check for the full set of valid label/level reference types. I'm happy to open this up if there's a clear way to check this.
I'll need to think some more about join. Does anyone else have thoughts on this?

…trings)

Now we parameterize the how parameter alongside on. Reduces test-case count by a factor of 4.

Rename vector -> array

…to df.join Fixed errors that cropped up when using join on a combination of columns and index levels

jonmmease · 2017-11-26T02:34:55Z

@jorisvandenbossche I just pushed a new version that I believe addresses all of your inline comments (I've responded to a couple of these comments above).

After looking things over I realized that join was almost working already. I added some tests, fixed up a few issues, and updated the join docstring.

Let me know if there's anything else you'd like to see once you've had a chance to look things over again. Thanks!

jonmmease · 2017-11-29T15:46:54Z

Just following up. Did this last round of changes address your comments sufficiently @jorisvandenbossche?

@TomAugspurger is there anything more you'd like to see here?

cc: @jreback

Thanks!

jreback

tiny doc update

jreback · 2017-12-01T01:10:09Z

pandas/core/frame.py

@@ -195,6 +196,11 @@

    .. versionadded:: 0.21.0

+Note


needs update here

jonmmease · 2017-12-01T16:09:49Z

doc change pushed and all green @jreback @jorisvandenbossche @TomAugspurger

jreback · 2017-12-01T16:45:35Z

thanks @jmmease

if any fallout we can PR later!

jonmmease · 2017-12-01T17:28:58Z

Awesome! thanks @jreback
I'll be around and be working on #14353 next. Just ping me as any integration issues with this arise

Support merging frames on a combo of columns and index levels (GH 14355)

da94fdb

gfyoung added API Design Compat pandas objects compatability with Numpy or Python functions Enhancement labels Sep 9, 2017

gfyoung reviewed Sep 9, 2017

View reviewed changes

Cleanup for review

f8c8c53

jreback requested changes Sep 10, 2017

View reviewed changes

Jon M. Mease added 6 commits September 11, 2017 12:07

revert implementation (but keep documentation and tests)

368844a

Simplify and refactor column/level logic in merge

1c4699e

PEP8 cleanup

ac1189b

Extract column/level ambiguity warning logic into utility method

d90ed78

Add newline and add :ref: entry for new doc section

27b2d25

docstring / comment cleanup

de6f4b1

jonmmease mentioned this pull request Oct 2, 2017

ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Merged

5 tasks

Merge branch 'master' into enh_14355

39d0bba

Jon M. Mease added 2 commits November 23, 2017 08:01

Merge remote-tracking branch 'upstream/master' into enh_14355

cd17c42

Merge remote-tracking branch 'upstream/master' into enh_14355

1a4e3e4

# Conflicts: # doc/source/whatsnew/v0.22.0.txt

Fix generator/list lint issues

a49012c

jorisvandenbossche reviewed Nov 23, 2017

View reviewed changes

Jon M. Mease added 6 commits November 25, 2017 21:08

Allow non-None hashable objects to reference index levels (not just s…

6fd9760

…trings)

Reduce parameterized test case count by removing how fixture

f7e04f5

Now we parameterize the how parameter alongside on. Reduces test-case count by a factor of 4.

Refactor warning code and add stacklevel

cf8e654

Use single backticks to reference method params in docstrings

e874f04

Rename vector -> array

Add tests and docstring updates for using index levels as on param …

13ce87c

…to df.join Fixed errors that cropped up when using join on a combination of columns and index levels

PEP8

b5cb4c1

jreback requested changes Dec 1, 2017

View reviewed changes

pandas/core/frame.py Outdated

@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Note

Copy link

Contributor

jreback Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs update here

Fixed Note->Notes in docstring

f3b95fe

jreback approved these changes Dec 1, 2017

View reviewed changes

jreback merged commit d5ffb1f into pandas-dev:master Dec 1, 2017

jreback added this to the 0.22.0 milestone Dec 1, 2017

This was referenced Dec 3, 2017

ENH: Support merging Dask DataFrames on a combination of columns and index levels dask/dask#2950

Closed

ENH: Support merging Dask DataFrames on a combination of columns and index levels (GH2950) dask/dask#2960

Merged

jonmmease mentioned this pull request Jan 11, 2018

Cleanup for GH2950 dask/dask#3059

Merged

2 tasks

Dr-Irv mentioned this pull request Mar 22, 2018

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Closed


		.. note::

		If a string matches both a column name and an index level name then a warning is

Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Conversation

jonmmease commented Sep 9, 2017

gfyoung Sep 9, 2017 • edited Loading

Choose a reason for hiding this comment

gfyoung Sep 9, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Sep 9, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Sep 9, 2017 • edited Loading

Codecov Report

jonmmease commented Sep 10, 2017

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Sep 10, 2017

jonmmease commented Sep 11, 2017

jonmmease commented Sep 11, 2017

jonmmease commented Oct 2, 2017

jonmmease commented Oct 4, 2017

gfyoung commented Oct 4, 2017

jonmmease commented Oct 4, 2017

jreback commented Nov 22, 2017

jonmmease commented Nov 22, 2017

jonmmease commented Nov 23, 2017

jreback commented Nov 23, 2017

jonmmease commented Nov 23, 2017

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonmmease commented Nov 23, 2017

jonmmease commented Nov 26, 2017

jonmmease commented Nov 29, 2017

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonmmease commented Dec 1, 2017

jreback commented Dec 1, 2017

jonmmease commented Dec 1, 2017

gfyoung Sep 9, 2017 •

edited

Loading

gfyoung Sep 9, 2017 •

edited

Loading

gfyoung Sep 9, 2017 •

edited

Loading

codecov bot commented Sep 9, 2017 •

edited

Loading