Support merging DataFrames on a combo of columns and index levels (GH 14355) (#17484)
jonmmease authored and jreback committed Dec 1, 2017
1 parent d74ac70 commit d5ffb1f
Showing 10 changed files with 1,138 additions and 38 deletions.
68 changes: 62 additions & 6 deletions doc/source/merging.rst
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects:

- ``left``: A DataFrame object
- ``right``: Another DataFrame object
- ``on``: Columns (names) to join on. Must be found in both the left and
  right DataFrame objects. If not passed and ``left_index`` and
- ``on``: Column or index level names to join on. Must be found in both the left
  and right DataFrame objects. If not passed and ``left_index`` and
  ``right_index`` are ``False``, the intersection of the columns in the
  DataFrames will be inferred to be the join keys
- ``left_on``: Columns from the left DataFrame to use as keys. Can either be
  column names or arrays with length equal to the length of the DataFrame
- ``right_on``: Columns from the right DataFrame to use as keys. Can either be
  column names or arrays with length equal to the length of the DataFrame
- ``left_on``: Columns or index levels from the left DataFrame to use as
  keys. Can either be column names, index level names, or arrays with length
  equal to the length of the DataFrame
- ``right_on``: Columns or index levels from the right DataFrame to use as
  keys. Can either be column names, index level names, or arrays with length
  equal to the length of the DataFrame
- ``left_index``: If ``True``, use the index (row labels) from the left
  DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
  (hierarchical), the number of levels must match the number of join keys
@@ -563,6 +565,10 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.21.0

.. note::

   Support for specifying index levels as the ``on``, ``left_on``, and
   ``right_on`` parameters was added in version 0.22.0.

The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
and ``right`` is a subclass of DataFrame, the return type will still be
@@ -1121,6 +1127,56 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
          labels=['left', 'right'], vertical=False);
   plt.close('all');

.. _merging.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.22

Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables merging
``DataFrame`` instances on a combination of index levels and columns without
resetting indexes.

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   result = left.merge(right, on=['key1', 'key2'])

.. ipython:: python
   :suppress:

   @savefig merge_on_index_and_column.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=False);
   plt.close('all');

.. note::

   When DataFrames are merged on a string that matches an index level in both
   frames, the index level is preserved as an index level in the resulting
   DataFrame.
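
As a rough illustration of the note above, the sketch below rebuilds the two frames from this section and compares the mixed column/index-level merge with the older approach of resetting and restoring the index by hand. The ``expected`` name and the ``reset_index``/``set_index`` round trip are illustrative assumptions, not part of this commit.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1'))

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1'))

# 'key1' is an index level in both frames; 'key2' is a column in both.
result = left.merge(right, on=['key1', 'key2'])

# Hypothetical manual equivalent: move the level into the columns,
# merge on plain columns, then restore the level as the index.
expected = (left.reset_index()
                .merge(right.reset_index(), on=['key1', 'key2'])
                .set_index('key1'))

# The shared index level survives the merge as the result's index.
print(result.index.name)
same = (result.reset_index().to_dict('list')
        == expected.reset_index().to_dict('list'))
print(same)
```

Passing the level name directly avoids the round trip through ``reset_index`` while keeping ``key1`` as the index of the result.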

.. note::

   If a string matches both a column name and an index level name, then a
   warning is issued and the column takes precedence. This will result in an
   ambiguity error in a future version.
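
The ambiguous case described in the note can be sketched as follows. Because the behaviour depends on the installed pandas version (a warning in the release this commit targets, an error in later ones), the example records which of the two happens instead of assuming one. The frame contents and the ``key_idx`` name are hypothetical, chosen only for illustration.

```python
import warnings

import pandas as pd

# 'key' is both the index level name and a column name in `left`.
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]},
                    index=pd.Index(['K0', 'K1'], name='key'))
right = pd.DataFrame({'key': ['K0', 'K1'], 'B': [3, 4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    try:
        left.merge(right, on='key')
        # Older releases warn and fall back to the column.
        outcome = 'warning' if caught else 'no signal'
    except ValueError:
        # Later releases treat the ambiguity as an error.
        outcome = 'error'

print(outcome)

# One way to sidestep the ambiguity: rename the index level first.
fixed = left.rename_axis('key_idx').merge(right, on='key')
print(fixed)
```

Renaming the index level (or dropping it) before merging removes the ambiguity on any version, so code written this way does not depend on the warning-versus-error behaviour.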

Overlapping value columns
~~~~~~~~~~~~~~~~~~~~~~~~~

31 changes: 31 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
@@ -32,6 +32,37 @@ The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtyp
   pd.get_dummies(df, columns=['c'], dtype=bool).dtypes


.. _whatsnew_0220.enhancements.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Strings passed to :meth:`DataFrame.merge` as the ``on``, ``left_on``, and ``right_on``
parameters may now refer to either column names or index level names.
This enables merging ``DataFrame`` instances on a combination of index levels
and columns without resetting indexes. See the :ref:`Merge on columns and
levels <merging.merge_on_columns_and_levels>` documentation section.
(:issue:`14355`)

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   left.merge(right, on=['key1', 'key2'])
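
The same idea extends to ``left_on`` and ``right_on``, where the key may be an index level on one side and an ordinary column on the other. The following is a minimal sketch under that assumption; the column name ``k1`` and the ``pairs`` helper are hypothetical. The index of the result is deliberately not inspected, since its layout in this mixed case is not described here.

```python
import pandas as pd

# Left frame: the join key lives in the index level 'key1'.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']},
                    index=pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1'))

# Right frame: the join key lives in the ordinary column 'k1'.
right = pd.DataFrame({'k1': ['K0', 'K1', 'K2', 'K2'],
                      'C': ['C0', 'C1', 'C2', 'C3']})

# Match the left index level against the right column (inner join).
result = left.merge(right, left_on='key1', right_on='k1')

# Compare only the matched value pairs, independent of row order.
pairs = sorted(zip(result['A'], result['C']))
print(pairs)
```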


.. _whatsnew_0220.enhancements.other:

Other Enhancements
37 changes: 23 additions & 14 deletions pandas/core/frame.py
@@ -148,16 +148,17 @@
    * inner: use intersection of keys from both frames, similar to a SQL inner
      join; preserve the order of the left keys
on : label or list
    Field names to join on. Must be found in both DataFrames. If on is
    None and not merging on indexes, then it merges on the intersection of
    the columns by default.
    Column or index level names to join on. These must be found in both
    DataFrames. If `on` is None and not merging on indexes then this defaults
    to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
    Field names to join on in left DataFrame. Can be a vector or list of
    vectors of the length of the DataFrame to use a particular vector as
    the join key instead of columns
    Column or index level names to join on in the left DataFrame. Can also
    be an array or list of arrays of the length of the left DataFrame.
    These arrays are treated as if they are columns.
right_on : label or list, or array-like
    Field names to join on in right DataFrame or vector/list of vectors per
    left_on docs
    Column or index level names to join on in the right DataFrame. Can also
    be an array or list of arrays of the length of the right DataFrame.
    These arrays are treated as if they are columns.
left_index : boolean, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
@@ -196,6 +197,11 @@
    .. versionadded:: 0.21.0

Notes
-----
Support for specifying index levels as the `on`, `left_on`, and
`right_on` parameters was added in version 0.22.0

Examples
--------
@@ -5214,12 +5220,12 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
            Index should be similar to one of the columns in this one. If a
            Series is passed, its name attribute must be set, and that will be
            used as the column name in the resulting joined DataFrame
        on : column name, tuple/list of column names, or array-like
            Column(s) in the caller to join on the index in other,
            otherwise joins index-on-index. If multiples
            columns given, the passed DataFrame must have a MultiIndex. Can
            pass an array as the join key if not already contained in the
            calling DataFrame. Like an Excel VLOOKUP operation
        on : name, tuple/list of names, or array-like
            Column or index level name(s) in the caller to join on the index
            in `other`, otherwise joins index-on-index. If multiple
            values are given, the `other` DataFrame must have a MultiIndex. Can
            pass an array as the join key if it is not already contained in
            the calling DataFrame. Like an Excel VLOOKUP operation
        how : {'left', 'right', 'outer', 'inner'}, default: 'left'
            How to handle the operation of the two objects.
@@ -5244,6 +5250,9 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
        on, lsuffix, and rsuffix options are not supported when passing a list
        of DataFrame objects

        Support for specifying index levels as the `on` parameter was added
        in version 0.22.0
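
To make the `on` note concrete, here is a small sketch of `DataFrame.join` keyed on an index level of the caller instead of a column. The frames, the level names `key` and `n`, and the variable `joined` are illustrative assumptions and do not come from the docstring's own example.

```python
import pandas as pd

# Caller with a two-level index; only the 'key' level is used for the join.
caller = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2']},
    index=pd.MultiIndex.from_tuples(
        [('K0', 0), ('K1', 0), ('K0', 1)], names=['key', 'n']))

# `other` is indexed by the values that appear in the caller's 'key' level.
other = pd.DataFrame({'B': ['B0', 'B1']},
                     index=pd.Index(['K0', 'K1'], name='key'))

# `on` names an index level of the caller here, not a column.
joined = caller.join(other, on='key')
print(joined)
```

With the default `how='left'`, each caller row picks up the `B` value whose index label in `other` matches that row's `key` level.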

        Examples
        --------
        >>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],