Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Merged 43 commits on Dec 1, 2017
da94fdb
Support merging frames on a combo of columns and index levels (GH 14355)
Jul 19, 2017
f8c8c53
Cleanup for review
Sep 10, 2017
368844a
revert implementation (but keep documentation and tests)
Sep 11, 2017
1c4699e
Simplify and refactor column/level logic in merge
Sep 11, 2017
ac1189b
PEP8 cleanup
Sep 11, 2017
d90ed78
Extract column/level ambiguity warning logic into utility method
Sep 11, 2017
27b2d25
Add newline and add :ref: entry for new doc section
Sep 11, 2017
de6f4b1
docstring / comment cleanup
Sep 11, 2017
39d0bba
Merge branch 'master' into enh_14355
Oct 2, 2017
5b1b100
Documentation updates
Oct 9, 2017
dfc6cf7
Fix errors in _drop_columns_or_levels
Oct 9, 2017
03e3c2e
Refactor and parametrize test cases
Oct 9, 2017
bf5d349
Moved label/level helpers up to NDFrame, added axis support, and adde…
Oct 10, 2017
7da39aa
PEP8
Oct 11, 2017
f5a16ff
Revert accidental change to merging.rst
Oct 12, 2017
aa099ea
Use fixtures for new TestMergeColumnAndIndex tests
Oct 13, 2017
3be43a4
Merge branch 'master' into enh_14355
Oct 13, 2017
b655e30
Merge branch 'master' into enh_14355
Oct 20, 2017
b7e2cc2
Merge branch 'master' into enh_14355
Nov 1, 2017
e9f02b1
Update documentation for a 0.22 release
Nov 1, 2017
0cd4ef5
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 2, 2017
e029f7b
Documentation updates
Nov 6, 2017
fdddbd3
Moved test_label_or_level_utils to pandas/tests/generic
Nov 6, 2017
89061b9
Refactored level_or_level test cases to use fixtures
Nov 6, 2017
090b3e8
Moved label_or_level utils on Series and DataFrame to NDFrame
Nov 6, 2017
47ff8b8
fix test comment typo
Nov 6, 2017
59f2dce
PEP8ify
Nov 6, 2017
4c4dbd0
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 6, 2017
1d7e570
Moved column and index tests to new file
Nov 6, 2017
dd289a6
Remove test class and convert to using fixtures
Nov 6, 2017
313d2c3
Rename new test file
Nov 6, 2017
0b0397b
Documentation and testing review updates
Nov 7, 2017
bc53bef
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 7, 2017
cd17c42
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
1a4e3e4
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
a49012c
Fix generator/list lint issues
Nov 23, 2017
6fd9760
Allow non-None hashable objects to reference index levels (not just s…
Nov 26, 2017
f7e04f5
Reduce parameterized test case count by removing how fixture
Nov 26, 2017
cf8e654
Refactor warning code and add stacklevel
Nov 26, 2017
e874f04
Use single backticks to reference method params in docstrings
Nov 26, 2017
13ce87c
Add tests and docstring updates for using index levels as `on` param …
Nov 26, 2017
b5cb4c1
PEP8
Nov 26, 2017
f3b95fe
Fixed Note->Notes in docstring
Dec 1, 2017
68 changes: 62 additions & 6 deletions doc/source/merging.rst
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects:

- ``left``: A DataFrame object
- ``right``: Another DataFrame object
- ``on``: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and ``left_index`` and
- ``on``: Column or index level names to join on. Must be found in both the left
and right DataFrame objects. If not passed and ``left_index`` and
Contributor review comment:

Can you add a comment (here and for left_on/right_on) that index level merging is new in 0.22.0 (or maybe in a Note section below)?

``right_index`` are ``False``, the intersection of the columns in the
DataFrames will be inferred to be the join keys
- ``left_on``: Columns from the left DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``right_on``: Columns from the right DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``left_on``: Columns or index levels from the left DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``right_on``: Columns or index levels from the right DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``left_index``: If ``True``, use the index (row labels) from the left
DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
(hierarchical), the number of levels must match the number of join keys
@@ -563,6 +565,10 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.21.0

.. note::

Support for specifying index levels as the ``on``, ``left_on``, and
``right_on`` parameters was added in version 0.22.0.

The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
and ``right`` is a subclass of DataFrame, the return type will still be
@@ -1121,6 +1127,56 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
labels=['left', 'right'], vertical=False);
plt.close('all');

.. _merging.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.22

Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables merging
``DataFrame`` instances on a combination of index levels and columns without
resetting indexes.

.. ipython:: python

left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key2': ['K0', 'K1', 'K0', 'K1']},
index=left_index)

right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'key2': ['K0', 'K0', 'K0', 'K1']},
index=right_index)

result = left.merge(right, on=['key1', 'key2'])

.. ipython:: python
:suppress:

@savefig merge_on_index_and_column.png
p.plot([left, right], result,
labels=['left', 'right'], vertical=False);
plt.close('all');

.. note::

When DataFrames are merged on a string that matches an index level in both
frames, the index level is preserved as an index level in the resulting
DataFrame.
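As a rough illustration of the note above (a sketch, not part of this PR's diff): assuming a pandas version that includes this feature (0.22 or later), merging on ``['key1', 'key2']`` should give the same rows as the older reset-index approach while keeping ``key1`` as the index. The frames repeat the documentation example; ``expected`` is only an illustrative variable name.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1'))

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1'))

# 'key1' is an index level in both frames, 'key2' is a column in both.
result = left.merge(right, on=['key1', 'key2'])

# The older approach: turn the index into a column, merge, then restore it.
expected = (left.reset_index()
                .merge(right.reset_index(), on=['key1', 'key2'])
                .set_index('key1'))

print(result)
print(result.index.name)  # the shared index level survives the merge
```

The direct form avoids the ``reset_index`` / ``set_index`` round trip, which is the convenience this section describes.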

.. note::

If a string matches both a column name and an index level name, then a
warning is issued and the column takes precedence. This will result in an
ambiguity error in a future version.
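One way to avoid relying on the warning described in the note above is to check for clashing names before merging. The following is a minimal sketch under stated assumptions: ``ambiguous_keys`` is a hypothetical helper written for this illustration, not the utility method added in this PR, and renaming the index level is just one possible way to resolve the clash.

```python
import pandas as pd


def ambiguous_keys(df, keys):
    """Return the keys that name both a column and an index level of df.

    Hypothetical helper for illustration only; not part of pandas.
    """
    level_names = {name for name in df.index.names if name is not None}
    return [key for key in keys if key in df.columns and key in level_names]


# 'key1' is deliberately used as both a column name and the index name.
df = pd.DataFrame({'key1': ['K0', 'K1'], 'A': ['A0', 'A1']},
                  index=pd.Index(['K0', 'K1'], name='key1'))

clashes = ambiguous_keys(df, ['key1', 'A'])
print(clashes)

# Renaming the index level removes the ambiguity before any merge is attempted.
df_fixed = df.rename_axis('key1_level')
print(ambiguous_keys(df_fixed, ['key1', 'A']))
```

Checking up front keeps the merge call unambiguous regardless of whether the installed version warns or raises.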

Overlapping value columns
~~~~~~~~~~~~~~~~~~~~~~~~~

31 changes: 31 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
@@ -32,6 +32,37 @@ The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtyp
pd.get_dummies(df, columns=['c'], dtype=bool).dtypes


.. _whatsnew_0220.enhancements.merge_on_columns_and_levels:

Merging on a combination of columns and index levels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Strings passed to :meth:`DataFrame.merge` as the ``on``, ``left_on``, and ``right_on``
parameters may now refer to either column names or index level names.
This enables merging ``DataFrame`` instances on a combination of index levels
and columns without resetting indexes. See the :ref:`Merge on columns and
levels <merging.merge_on_columns_and_levels>` documentation section.
(:issue:`14355`)

.. ipython:: python

left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key2': ['K0', 'K1', 'K0', 'K1']},
index=left_index)

right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'key2': ['K0', 'K0', 'K0', 'K1']},
index=right_index)

left.merge(right, on=['key1', 'key2'])


.. _whatsnew_0220.enhancements.other:

Other Enhancements
37 changes: 23 additions & 14 deletions pandas/core/frame.py
@@ -147,16 +147,17 @@
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys
on : label or list
Field names to join on. Must be found in both DataFrames. If on is
None and not merging on indexes, then it merges on the intersection of
the columns by default.
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of
vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per
left_on docs
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Notes
-----
Support for specifying index levels as the `on`, `left_on`, and
`right_on` parameters was added in version 0.22.0
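The ``left_on`` / ``right_on`` descriptions in this docstring say that arrays of the same length as the frame are treated as if they were columns. A small sketch of that case, assuming current pandas behaviour; the frames and the ``left_keys`` array are made up for illustration:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': [3, 1, 2],
                      'B': ['B3', 'B1', 'B2']})

# An array with one entry per row of `left`; it plays the role of a key
# column even though `left` has no such column.
left_keys = np.array([1, 2, 3])

result = left.merge(right, left_on=[left_keys], right_on=['key'])
print(result)
```

Each row of ``left`` is matched to the ``right`` row whose ``key`` equals the corresponding entry of ``left_keys``.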

Examples
--------

@@ -5196,12 +5202,12 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other,
otherwise joins index-on-index. If multiples
columns given, the passed DataFrame must have a MultiIndex. Can
pass an array as the join key if not already contained in the
calling DataFrame. Like an Excel VLOOKUP operation
on : name, tuple/list of names, or array-like
Column or index level name(s) in the caller to join on the index
in `other`, otherwise joins index-on-index. If multiple
values given, the `other` DataFrame must have a MultiIndex. Can
pass an array as the join key if it is not already contained in
the calling DataFrame. Like an Excel VLOOKUP operation
how : {'left', 'right', 'outer', 'inner'}, default: 'left'
How to handle the operation of the two objects.

@@ -5226,6 +5232,9 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
on, lsuffix, and rsuffix options are not supported when passing a list
of DataFrame objects

Support for specifying index levels as the `on` parameter was added
in version 0.22.0
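To illustrate the updated ``on`` description for ``join``, here is a hedged sketch in which ``on`` names an index level of the caller rather than a column. It assumes a pandas version with this feature; the frames and the level name ``n`` are invented for the example.

```python
import pandas as pd

# The caller has a two-level index; 'key' is an index level, not a column.
caller = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2', 'A3']},
    index=pd.MultiIndex.from_tuples(
        [('K0', 0), ('K0', 1), ('K1', 0), ('K2', 0)],
        names=['key', 'n']))

# `other` is indexed by the lookup key only.
other = pd.DataFrame({'B': ['B0', 'B1']},
                     index=pd.Index(['K0', 'K1'], name='key'))

# Look up 'B' in `other` by the caller's 'key' level (a left join by default).
result = caller.join(other, on='key')
print(result)
```

Rows whose ``key`` has no match in ``other`` (here ``K2``) get a missing value in ``B``, in line with the default ``how='left'``.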

Examples
--------
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],