Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Merged on Dec 1, 2017 (43 commits)

Commits
da94fdb
Support merging frames on a combo of columns and index levels (GH 14355)
Jul 19, 2017
f8c8c53
Cleanup for review
Sep 10, 2017
368844a
revert implementation (but keep documentation and tests)
Sep 11, 2017
1c4699e
Simplify and refactor column/level logic in merge
Sep 11, 2017
ac1189b
PEP8 cleanup
Sep 11, 2017
d90ed78
Extract column/level ambiguity warning logic into utility method
Sep 11, 2017
27b2d25
Add newline and add :ref: entry for new doc section
Sep 11, 2017
de6f4b1
docstring / comment cleanup
Sep 11, 2017
39d0bba
Merge branch 'master' into enh_14355
Oct 2, 2017
5b1b100
Documentation updates
Oct 9, 2017
dfc6cf7
Fix errors in _drop_columns_or_levels
Oct 9, 2017
03e3c2e
Refactor and parametrize test cases
Oct 9, 2017
bf5d349
Moved label/level helpers up to NDFrame, added axis support, and adde…
Oct 10, 2017
7da39aa
PEP8
Oct 11, 2017
f5a16ff
Revert accidental change to merging.rst
Oct 12, 2017
aa099ea
Use fixtures for new TestMergeColumnAndIndex tests
Oct 13, 2017
3be43a4
Merge branch 'master' into enh_14355
Oct 13, 2017
b655e30
Merge branch 'master' into enh_14355
Oct 20, 2017
b7e2cc2
Merge branch 'master' into enh_14355
Nov 1, 2017
e9f02b1
Update documentation for a 0.22 release
Nov 1, 2017
0cd4ef5
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 2, 2017
e029f7b
Documentation updates
Nov 6, 2017
fdddbd3
Moved test_label_or_level_utils to pandas/tests/generic
Nov 6, 2017
89061b9
Refactored level_or_level test cases to use fixtures
Nov 6, 2017
090b3e8
Moved label_or_level utils on Series and DataFrame to NDFrame
Nov 6, 2017
47ff8b8
fix test comment typo
Nov 6, 2017
59f2dce
PEP8ify
Nov 6, 2017
4c4dbd0
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 6, 2017
1d7e570
Moved column and index tests to new file
Nov 6, 2017
dd289a6
Remove test class and convert to using fixtures
Nov 6, 2017
313d2c3
Rename new test file
Nov 6, 2017
0b0397b
Documentation and testing review updates
Nov 7, 2017
bc53bef
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 7, 2017
cd17c42
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
1a4e3e4
Merge remote-tracking branch 'upstream/master' into enh_14355
Nov 23, 2017
a49012c
Fix generator/list lint issues
Nov 23, 2017
6fd9760
Allow non-None hashable objects to reference index levels (not just s…
Nov 26, 2017
f7e04f5
Reduce parameterized test case count by removing how fixture
Nov 26, 2017
cf8e654
Refactor warning code and add stacklevel
Nov 26, 2017
e874f04
Use single backticks to reference method params in docstrings
Nov 26, 2017
13ce87c
Add tests and docstring updates for using index levels as `on` param …
Nov 26, 2017
b5cb4c1
PEP8
Nov 26, 2017
f3b95fe
Fixed Note->Notes in docstring
Dec 1, 2017
61 changes: 55 additions & 6 deletions doc/source/merging.rst
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects:

- ``left``: A DataFrame object
- ``right``: Another DataFrame object
- ``on``: Columns (names) to join on. Must be found in both the left and
right DataFrame objects. If not passed and ``left_index`` and
- ``on``: Column or index level names to join on. Must be found in both the left
and right DataFrame objects. If not passed and ``left_index`` and
Contributor

Can you add a comment (here and for left_on/right_on) noting that index level merging is new in 0.22.0 (or maybe in a Note section below)?

``right_index`` are ``False``, the intersection of the columns in the
DataFrames will be inferred to be the join keys
- ``left_on``: Columns from the left DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``right_on``: Columns from the right DataFrame to use as keys. Can either be
column names or arrays with length equal to the length of the DataFrame
- ``left_on``: Columns or index levels from the left DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``right_on``: Columns or index levels from the right DataFrame to use as
keys. Can either be column names, index level names, or arrays with length
equal to the length of the DataFrame
- ``left_index``: If ``True``, use the index (row labels) from the left
DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
(hierarchical), the number of levels must match the number of join keys
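
As a rough illustration of the updated ``left_on``/``right_on`` wording, the sketch below joins an index level on the left against a column on the right. The frame contents and the names `k_left`, `k_right`, `a`, and `b` are invented for this example, and it assumes a pandas release that already includes this change.

```python
import pandas as pd

# Hypothetical data: the left key lives in the index, the right key in a column.
left = pd.DataFrame({'a': [1, 2, 3]},
                    index=pd.Index(['x', 'y', 'z'], name='k_left'))
right = pd.DataFrame({'k_right': ['y', 'z', 'w'], 'b': [20, 30, 40]})

# 'k_left' names an index level of `left`; 'k_right' names a column of `right`.
merged = left.merge(right, left_on='k_left', right_on='k_right')

# Collect the matched (a, b) value pairs from the default inner join.
pairs = sorted(zip(merged['a'].tolist(), merged['b'].tolist()))
```

Only the rows whose key appears on both sides survive the default inner join.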
@@ -1120,6 +1122,53 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
labels=['left', 'right'], vertical=False);
plt.close('all');

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 0.21
Contributor

need a blank line here

add a :ref: entry before the sub-section

Contributor Author

done


Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables
the merging of DataFrames on a combination of index levels and columns without
resetting indexes.

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   result = left.merge(right, on=['key1', 'key2'])

.. ipython:: python
   :suppress:

   @savefig merge_on_index_and_column.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=False);
   plt.close('all');

.. note::

When DataFrames are merged on a string that matches an index level in both
frames then the index level is preserved as an index level in the resulting
Member

@gfyoung gfyoung Sep 9, 2017

"frames" --> "frames,"

DataFrame.
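
To make the note about the preserved index level concrete, here is a small sketch that reuses the frames from the documentation example and inspects the index of the merged result. It assumes a pandas release that already supports merging on index level names; the variables `index_name` and `index_values` are only for illustration.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1'))

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1'))

# 'key1' is an index level in both frames, 'key2' is a column in both.
result = left.merge(right, on=['key1', 'key2'])

# 'key1' stays in the index of the result instead of becoming a column.
index_name = result.index.name
index_values = list(result.index)
```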

.. note::

If a string matches both a column name and an index level name then a warning is
Member

@gfyoung gfyoung Sep 9, 2017

"index level name" --> "index level name,"

issued and the column takes precedence. This will result in an ambiguity error
in a future version.
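
The ambiguous case can be probed with a sketch like the one below. The frames are made up, and the outcome depends on the installed pandas version: the version in this diff is described as warning and falling back to the column, while the note says a later version is expected to raise. The code therefore records whichever of the two happens rather than assuming one.

```python
import warnings
import pandas as pd

# Hypothetical frame where 'key' is both the index level name and a column.
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]},
                    index=pd.Index(['a', 'b'], name='key'))
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    try:
        left.merge(right, on='key')
        outcome = 'merged on the column'
    except ValueError:
        outcome = 'ambiguity error'

# True only if the merge emitted a FutureWarning about the ambiguity.
saw_future_warning = any(issubclass(w.category, FutureWarning)
                         for w in caught)
```

Renaming either the index level or the column before merging avoids the ambiguity altogether.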

Overlapping value columns
~~~~~~~~~~~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -109,6 +109,7 @@ Other Enhancements
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters allowing frames to be merged on a combination of columns and index levels (:issue:`14355`)
Member

"parameters allowing" --> "parameters, allowing"

- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
30 changes: 30 additions & 0 deletions pandas/core/frame.py
@@ -3437,6 +3437,36 @@ def f(vals):

# ----------------------------------------------------------------------
# Sorting
    def _get_column_or_level_values(self, key, axis=1,
                                    op_description='retrieve'):
Member

Add doc-string for this. Developers in the future will thank you. 😉

        if (is_integer(key) or
                (axis == 1 and key in self) or
                (axis == 0 and key in self.index)):

            if axis == 1 and key in self.index.names:
                warnings.warn(
                    ("'%s' is both a column name and an index level.\n"
                     "Defaulting to column but "
                     "this will raise an ambiguity error in a "
                     "future version") % key,
Member

@gfyoung gfyoung Sep 9, 2017

Percent-style (`%`) string formatting is being phased out in favor of `.format`. Please replace it here, along with ALL other places where you use it.

                    FutureWarning, stacklevel=2)

            k = self.xs(key, axis=axis)._values
            if k.ndim == 2:

                # try to be helpful
                if isinstance(self.columns, MultiIndex):
                    raise ValueError('Cannot %s column "%s" in a multi-index. '
                                     'All levels must be provided explicitly'
                                     % (op_description, str(key)))

                raise ValueError('Cannot %s duplicate column "%s"' %
                                 (op_description, str(key)))
        elif key in self.index.names:
            k = self.index.get_level_values(key).values
        else:
            raise KeyError(key)
        return k
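
The lookup order above (column first, then index level, else `KeyError`) can be sketched as a stand-alone function that uses only public pandas API and `.format` for the message, in the spirit of the review comments. The function name `get_column_or_level_values` and its simplified signature are hypothetical; this is not the private method from the diff, and it skips the integer-key, `axis`, and duplicate-column handling.

```python
import warnings
import pandas as pd


def get_column_or_level_values(df, key):
    """Return the values of column `key`, or of index level `key`.

    Hypothetical helper: a column wins over an index level of the same
    name (with a FutureWarning); an unknown key raises KeyError.
    """
    in_columns = key in df.columns
    in_levels = key in df.index.names

    if in_columns:
        if in_levels:
            warnings.warn(
                "'{key}' is both a column name and an index level.\n"
                "Defaulting to column but this will raise an ambiguity "
                "error in a future version".format(key=key),
                FutureWarning, stacklevel=2)
        return df[key].values
    if in_levels:
        return df.index.get_level_values(key).values
    raise KeyError(key)


df = pd.DataFrame({'key2': [1, 2]},
                  index=pd.Index(['a', 'b'], name='key1'))
column_values = get_column_or_level_values(df, 'key2')
level_values = get_column_or_level_values(df, 'key1')
```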

@Appender(_shared_docs['sort_values'] % _shared_doc_kwargs)
def sort_values(self, by, axis=0, ascending=True, inplace=False,
80 changes: 73 additions & 7 deletions pandas/core/reshape/merge.py
@@ -524,6 +524,7 @@ def __init__(self, left, right, how='inner', on=None,
self.right_index = right_index

self.indicator = indicator
self.has_common_index_levels = False

if isinstance(self.indicator, compat.string_types):
self.indicator_name = self.indicator
@@ -650,6 +651,7 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
left_has_missing = None
right_has_missing = None

new_index_values = {}
keys = zip(self.join_names, self.left_on, self.right_on)
for i, (name, lname, rname) in enumerate(keys):
if not _should_fill(lname, rname):
@@ -717,7 +719,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
if name in result:
result[name] = key_col
else:
result.insert(i, name or 'key_{i}'.format(i=i), key_col)
if name and name in result.index.names:
Contributor

you are adding an amazing amount of logic here. This must be made simpler.

new_index_values[name] = key_col
else:
result.insert(
i, name or 'key_{i}'.format(i=i), key_col)

        if new_index_values:
            # Create new index for result
            index_arrays = [new_index_values[n]
                            if n in new_index_values
                            else result.index.get_level_values(i)
                            for (i, n) in enumerate(result.index.names)]

            if len(index_arrays) == 1:
                new_index = Index(index_arrays[0], name=result.index.name)
            else:
                new_index = MultiIndex.from_arrays(index_arrays,
                                                   names=result.index.names)
            result.index = new_index
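
The index rebuild at the end of `_maybe_add_join_keys` can be tried in isolation with a sketch like this one. The helper name `rebuild_index` and the sample data are hypothetical; the logic mirrors the block above by swapping in new values for the named levels and keeping the remaining levels as they are.

```python
import pandas as pd


def rebuild_index(index, new_values):
    """Return a copy of `index` with some levels replaced.

    `new_values` maps a level name to the replacement values for that
    level; levels that are not listed keep their current values.
    """
    arrays = [new_values[name] if name in new_values
              else index.get_level_values(i)
              for i, name in enumerate(index.names)]

    # A single level becomes a flat Index, several become a MultiIndex.
    if len(arrays) == 1:
        return pd.Index(arrays[0], name=index.name)
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


multi = pd.MultiIndex.from_arrays([['a', 'b'], [1, 2]], names=['k1', 'k2'])
rebuilt = rebuild_index(multi, {'k1': ['x', 'y']})

flat = rebuild_index(pd.Index(['a', 'b'], name='k'), {'k': ['x', 'y']})
```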

def _get_join_indexers(self):
""" return the join indexers """
@@ -760,7 +780,10 @@ def _get_join_info(self):
join_index = self.left.index.take(left_indexer)
right_indexer = np.array([-1] * len(join_index))
else:
join_index = Index(np.arange(len(left_indexer)))
if not self.has_common_index_levels:
join_index = Index(np.arange(len(left_indexer)))
else:
join_index = self.left.index.take(left_indexer)

if len(join_index) == 0:
join_index = join_index.astype(object)
@@ -792,6 +815,10 @@ def _get_merge_keys(self):
is_rkey = lambda x: isinstance(
x, (np.ndarray, Series)) and len(x) == len(right)

def get_key_vals(df, key):
Member

Add doc-string for this.

return df._get_column_or_level_values(key, axis=self.axis,
op_description="merge on")

# Note that pd.merge_asof() has separate 'on' and 'by' parameters. A
# user could, for example, request 'left_index' and 'left_by'. In a
# regular pd.merge(), users cannot specify both 'left_index' and
@@ -812,7 +839,7 @@ def _get_merge_keys(self):
join_names.append(None) # what to do?
else:
if rk is not None:
right_keys.append(right[rk]._values)
right_keys.append(get_key_vals(right, rk))
join_names.append(rk)
else:
# work-around for merge_asof(right_index=True)
@@ -821,7 +848,7 @@ def _get_merge_keys(self):
else:
if not is_rkey(rk):
if rk is not None:
right_keys.append(right[rk]._values)
right_keys.append(get_key_vals(right, rk))
else:
# work-around for merge_asof(right_index=True)
right_keys.append(right.index)
@@ -834,7 +861,7 @@ def _get_merge_keys(self):
else:
right_keys.append(rk)
if lk is not None:
left_keys.append(left[lk]._values)
left_keys.append(get_key_vals(left, lk))
join_names.append(lk)
else:
# work-around for merge_asof(left_index=True)
@@ -846,7 +873,7 @@ def _get_merge_keys(self):
left_keys.append(k)
join_names.append(None)
else:
left_keys.append(left[k]._values)
left_keys.append(get_key_vals(left, k))
join_names.append(k)
if isinstance(self.right.index, MultiIndex):
right_keys = [lev._values.take(lab)
@@ -860,7 +887,7 @@ def _get_merge_keys(self):
right_keys.append(k)
join_names.append(None)
else:
right_keys.append(right[k]._values)
right_keys.append(get_key_vals(right, k))
join_names.append(k)
if isinstance(self.left.index, MultiIndex):
left_keys = [lev._values.take(lab)
@@ -869,10 +896,49 @@ def _get_merge_keys(self):
else:
left_keys = [self.left.index.values]

        # Reset index levels that are not common to both DataFrames
        common_index_levels = [(li, ri) for (li, ri) in
                               zip(self.left_on, self.right_on) if
                               isinstance(li, compat.string_types) and
                               li not in self.left and
                               isinstance(ri, compat.string_types) and
                               ri not in self.right]

        if common_index_levels:
            # zip(*pairs) yields the left names first, then the right names
            common_levels_left, common_levels_right = (
                zip(*common_index_levels)
            )

            reset_left = [lev for lev in self.left.index.names
                          if lev not in common_levels_left]
            if reset_left:
                self.left.reset_index(
                    reset_left,
                    inplace=True)

            reset_right = [lev for lev in self.right.index.names
                           if lev not in common_levels_right]
            if reset_right:
                self.right.reset_index(
                    reset_right,
                    inplace=True)

            self.has_common_index_levels = True

        if left_drop:
            # Determine index levels to reset before dropping
            levels_to_reset = [level for level in left_drop
                               if level not in self.left]
            if levels_to_reset:
                self.left = self.left.reset_index(levels_to_reset)
            self.left = self.left.drop(left_drop, axis=1)

        if right_drop:
            # Determine index levels to reset before dropping
            levels_to_reset = [level for level in right_drop
                               if level not in self.right]
            if levels_to_reset:
                self.right = self.right.reset_index(levels_to_reset)
            self.right = self.right.drop(right_drop, axis=1)

        return left_keys, right_keys, join_names
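
The detection of index levels shared by both sides can be sketched outside the merge machinery as follows. The frames and the `left_on`/`right_on` lists are invented for illustration, and plain `str` stands in for `compat.string_types`. The point of interest is the order in which `zip(*pairs)` hands back the names: left names first, then right names.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1'], 'key2': ['K0', 'K1']},
                    index=pd.Index(['K0', 'K1'], name='key1'))
right = pd.DataFrame({'C': ['C0', 'C1'], 'key2': ['K0', 'K0']},
                     index=pd.Index(['K0', 'K1'], name='key1'))

left_on = ['key1', 'key2']
right_on = ['key1', 'key2']

# A key pair counts as a common index level when both names are strings
# and neither is a column of its frame (`name in df` checks the columns).
common_index_levels = [(li, ri) for li, ri in zip(left_on, right_on)
                       if isinstance(li, str) and li not in left
                       and isinstance(ri, str) and ri not in right]

if common_index_levels:
    # zip(*pairs) transposes [(l1, r1), (l2, r2)] into (l1, l2), (r1, r2).
    common_levels_left, common_levels_right = zip(*common_index_levels)
else:
    common_levels_left, common_levels_right = (), ()
```

Here only `key1` qualifies, because `key2` is a regular column in both frames.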