Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects: | |
|
||
- ``left``: A DataFrame object | ||
- ``right``: Another DataFrame object | ||
- ``on``: Columns (names) to join on. Must be found in both the left and | ||
right DataFrame objects. If not passed and ``left_index`` and | ||
- ``on``: Column or index level names to join on. Must be found in both the left | ||
and right DataFrame objects. If not passed and ``left_index`` and | ||
``right_index`` are ``False``, the intersection of the columns in the | ||
DataFrames will be inferred to be the join keys | ||
- ``left_on``: Columns from the left DataFrame to use as keys. Can either be | ||
column names or arrays with length equal to the length of the DataFrame | ||
- ``right_on``: Columns from the right DataFrame to use as keys. Can either be | ||
column names or arrays with length equal to the length of the DataFrame | ||
- ``left_on``: Columns or index levels from the left DataFrame to use as | ||
keys. Can either be column names, index level names, or arrays with length | ||
equal to the length of the DataFrame | ||
- ``right_on``: Columns or index levels from the right DataFrame to use as | ||
keys. Can either be column names, index level names, or arrays with length | ||
equal to the length of the DataFrame | ||
- ``left_index``: If ``True``, use the index (row labels) from the left | ||
DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex | ||
(hierarchical), the number of levels must match the number of join keys | ||
|
@@ -1120,6 +1122,53 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using | |
labels=['left', 'right'], vertical=False); | ||
plt.close('all'); | ||
|
||
Merging on a combination of columns and index levels | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
.. versionadded:: 0.21 | ||
Review comment: need a blank line here add a
Reply: done |
||
|
||
Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters | ||
may refer to either column names or index level names. This enables | ||
the merging of DataFrames on a combination of index levels and columns without | ||
resetting indexes. | ||
|
||
.. ipython:: python | ||
|
||
left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1') | ||
|
||
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], | ||
'B': ['B0', 'B1', 'B2', 'B3'], | ||
'key2': ['K0', 'K1', 'K0', 'K1']}, | ||
index=left_index) | ||
|
||
right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1') | ||
|
||
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], | ||
'D': ['D0', 'D1', 'D2', 'D3'], | ||
'key2': ['K0', 'K0', 'K0', 'K1']}, | ||
index=right_index) | ||
|
||
result = left.merge(right, on=['key1', 'key2']) | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
@savefig merge_on_index_and_column.png | ||
p.plot([left, right], result, | ||
labels=['left', 'right'], vertical=False); | ||
plt.close('all'); | ||
|
||
.. note:: | ||
|
||
When DataFrames are merged on a string that matches an index level in both | ||
frames then the index level is preserved as an index level in the resulting | ||
Review comment: "frames" --> "frames," |
||
DataFrame. | ||
|
||
.. note:: | ||
|
||
If a string matches both a column name and an index level name then a warning is | ||
Review comment: "index level name" --> "index level name," |
||
issued and the column takes precedence. This will result in an ambiguity error | ||
in a future version. | ||
|
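The first of the two notes above (a key that names an index level in both frames stays an index level in the result) can be checked directly. The following is a minimal sketch that reuses the frames from the documentation example in this diff; it assumes an installed pandas release that already includes this feature, and the exact printed layout may differ between versions.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"],
     "B": ["B0", "B1", "B2", "B3"],
     "key2": ["K0", "K1", "K0", "K1"]},
    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"],
     "D": ["D0", "D1", "D2", "D3"],
     "key2": ["K0", "K0", "K0", "K1"]},
    index=pd.Index(["K0", "K1", "K2", "K2"], name="key1"),
)

# "key1" is an index level in both frames; "key2" is a column in both.
result = left.merge(right, on=["key1", "key2"])

# The shared index level should still be the index of the result.
print(result.index.names)
print(result)
```

Only three (key1, key2) pairs occur in both frames, so the inner merge keeps three rows, and no call to `reset_index` is needed before merging.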
||
Overlapping value columns | ||
~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -109,6 +109,7 @@ Other Enhancements | |
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`) | ||
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`) | ||
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`) | ||
- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters allowing frames to be merged on a combination of columns and index levels (:issue:`14355`) | ||
Review comment: "parameters allowing" --> "parameters, allowing" |
||
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`). | ||
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`). | ||
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3437,6 +3437,36 @@ def f(vals): | |
|
||
# ---------------------------------------------------------------------- | ||
# Sorting | ||
def _get_column_or_level_values(self, key, axis=1, | ||
op_description='retrieve'): | ||
Review comment: Add doc-string for this. Developers in the future will thank you. 😉 |
||
if (is_integer(key) or | ||
(axis == 1 and key in self) or | ||
(axis == 0 and key in self.index)): | ||
|
||
if axis == 1 and key in self.index.names: | ||
warnings.warn( | ||
("'%s' is both a column name and an index level.\n" | ||
"Defaulting to column but " | ||
"this will raise an ambiguity error in a " | ||
"future version") % key, | ||
Review comment: The percentile string replacement is being phased out in favor of |
||
FutureWarning, stacklevel=2) | ||
|
||
k = self.xs(key, axis=axis)._values | ||
if k.ndim == 2: | ||
|
||
# try to be helpful | ||
if isinstance(self.columns, MultiIndex): | ||
raise ValueError('Cannot %s column "%s" in a multi-index. ' | ||
'All levels must be provided explicitly' | ||
% (op_description, str(key))) | ||
|
||
raise ValueError('Cannot %s duplicate column "%s"' % | ||
(op_description, str(key))) | ||
elif key in self.index.names: | ||
k = self.index.get_level_values(key).values | ||
else: | ||
raise KeyError(key) | ||
return k | ||
|
||
@Appender(_shared_docs['sort_values'] % _shared_doc_kwargs) | ||
def sort_values(self, by, axis=0, ascending=True, inplace=False, | ||
|
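The lookup order in `_get_column_or_level_values` above (column first, with a `FutureWarning` if the name is also an index level; then index level; otherwise `KeyError`) can be illustrated without pandas. This is a toy sketch only: `ToyFrame` and its attributes are hypothetical names invented for illustration, and the sketch ignores the integer-key, `axis`, and duplicate-column branches of the real helper.

```python
import warnings


class ToyFrame:
    """Hypothetical stand-in for a DataFrame: named columns plus named index levels."""

    def __init__(self, columns, index_levels):
        self.columns = dict(columns)            # column name -> list of values
        self.index_levels = dict(index_levels)  # level name -> list of values

    def get_column_or_level_values(self, key):
        if key in self.columns:
            if key in self.index_levels:
                # Ambiguous key: warn, then fall back to the column.
                warnings.warn(
                    "'{key}' is both a column name and an index level.\n"
                    "Defaulting to column but this will raise an "
                    "ambiguity error in a future version".format(key=key),
                    FutureWarning, stacklevel=2)
            return self.columns[key]
        if key in self.index_levels:
            return self.index_levels[key]
        raise KeyError(key)


frame = ToyFrame(columns={"key2": ["K0", "K1"], "both": [1, 2]},
                 index_levels={"key1": ["K0", "K2"], "both": [9, 9]})

print(frame.get_column_or_level_values("key2"))  # resolved as a column
print(frame.get_column_or_level_values("key1"))  # resolved as an index level
```

Looking up `"both"` on this toy frame returns the column values and emits the warning, which mirrors the precedence described in the documentation note in this PR.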
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -524,6 +524,7 @@ def __init__(self, left, right, how='inner', on=None, | |
self.right_index = right_index | ||
|
||
self.indicator = indicator | ||
self.has_common_index_levels = False | ||
|
||
if isinstance(self.indicator, compat.string_types): | ||
self.indicator_name = self.indicator | ||
|
@@ -650,6 +651,7 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer): | |
left_has_missing = None | ||
right_has_missing = None | ||
|
||
new_index_values = {} | ||
keys = zip(self.join_names, self.left_on, self.right_on) | ||
for i, (name, lname, rname) in enumerate(keys): | ||
if not _should_fill(lname, rname): | ||
|
@@ -717,7 +719,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer): | |
if name in result: | ||
result[name] = key_col | ||
else: | ||
result.insert(i, name or 'key_{i}'.format(i=i), key_col) | ||
if name and name in result.index.names: | ||
Review comment: you are adding an amazing amount of logic here. This must be made simpler. |
||
new_index_values[name] = key_col | ||
else: | ||
result.insert( | ||
i, name or 'key_{i}'.format(i=i), key_col) | ||
|
||
if new_index_values: | ||
# Create new index for result | ||
index_arrays = [new_index_values[n] | ||
if n in new_index_values | ||
else result.index.get_level_values(i) | ||
for (i, n) in enumerate(result.index.names)] | ||
|
||
if len(index_arrays) == 1: | ||
new_index = Index(index_arrays[0], name=result.index.name) | ||
else: | ||
new_index = MultiIndex.from_arrays(index_arrays, | ||
names=result.index.names) | ||
result.index = new_index | ||
|
||
def _get_join_indexers(self): | ||
""" return the join indexers """ | ||
|
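The index rebuild at the end of `_maybe_add_join_keys` above replaces only those index levels whose names appear in `new_index_values` and keeps the rest. A minimal pure-Python sketch of that selection step follows; plain lists stand in for index arrays, and the function name and sample data are hypothetical.

```python
def rebuild_index_levels(index_names, index_arrays, new_index_values):
    """Return one array per index level, preferring replacement values by name."""
    return [
        new_index_values[name] if name in new_index_values else current
        for name, current in zip(index_names, index_arrays)
    ]


index_names = ["key1", "other"]
index_arrays = [["K0", "K1", "K2"], [10, 11, 12]]
# Hypothetical merged key values for the shared level "key1".
new_index_values = {"key1": ["K0", "K1", None]}

rebuilt = rebuild_index_levels(index_names, index_arrays, new_index_values)
print(rebuilt)
```

In the diff, the resulting arrays are then wrapped in an `Index` when there is a single level, or passed to `MultiIndex.from_arrays` otherwise.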
@@ -760,7 +780,10 @@ def _get_join_info(self): | |
join_index = self.left.index.take(left_indexer) | ||
right_indexer = np.array([-1] * len(join_index)) | ||
else: | ||
join_index = Index(np.arange(len(left_indexer))) | ||
if not self.has_common_index_levels: | ||
join_index = Index(np.arange(len(left_indexer))) | ||
else: | ||
join_index = self.left.index.take(left_indexer) | ||
|
||
if len(join_index) == 0: | ||
join_index = join_index.astype(object) | ||
|
@@ -792,6 +815,10 @@ def _get_merge_keys(self): | |
is_rkey = lambda x: isinstance( | ||
x, (np.ndarray, Series)) and len(x) == len(right) | ||
|
||
def get_key_vals(df, key): | ||
Review comment: Add doc-string for this. |
||
return df._get_column_or_level_values(key, axis=self.axis, | ||
op_description="merge on") | ||
|
||
# Note that pd.merge_asof() has separate 'on' and 'by' parameters. A | ||
# user could, for example, request 'left_index' and 'left_by'. In a | ||
# regular pd.merge(), users cannot specify both 'left_index' and | ||
|
@@ -812,7 +839,7 @@ def _get_merge_keys(self): | |
join_names.append(None) # what to do? | ||
else: | ||
if rk is not None: | ||
right_keys.append(right[rk]._values) | ||
right_keys.append(get_key_vals(right, rk)) | ||
join_names.append(rk) | ||
else: | ||
# work-around for merge_asof(right_index=True) | ||
|
@@ -821,7 +848,7 @@ def _get_merge_keys(self): | |
else: | ||
if not is_rkey(rk): | ||
if rk is not None: | ||
right_keys.append(right[rk]._values) | ||
right_keys.append(get_key_vals(right, rk)) | ||
else: | ||
# work-around for merge_asof(right_index=True) | ||
right_keys.append(right.index) | ||
|
@@ -834,7 +861,7 @@ def _get_merge_keys(self): | |
else: | ||
right_keys.append(rk) | ||
if lk is not None: | ||
left_keys.append(left[lk]._values) | ||
left_keys.append(get_key_vals(left, lk)) | ||
join_names.append(lk) | ||
else: | ||
# work-around for merge_asof(left_index=True) | ||
|
@@ -846,7 +873,7 @@ def _get_merge_keys(self): | |
left_keys.append(k) | ||
join_names.append(None) | ||
else: | ||
left_keys.append(left[k]._values) | ||
left_keys.append(get_key_vals(left, k)) | ||
join_names.append(k) | ||
if isinstance(self.right.index, MultiIndex): | ||
right_keys = [lev._values.take(lab) | ||
|
@@ -860,7 +887,7 @@ def _get_merge_keys(self): | |
right_keys.append(k) | ||
join_names.append(None) | ||
else: | ||
right_keys.append(right[k]._values) | ||
right_keys.append(get_key_vals(right, k)) | ||
join_names.append(k) | ||
if isinstance(self.left.index, MultiIndex): | ||
left_keys = [lev._values.take(lab) | ||
|
@@ -869,10 +896,49 @@ def _get_merge_keys(self): | |
else: | ||
left_keys = [self.left.index.values] | ||
|
||
# Reset index levels that are not common to both DataFrames | ||
common_index_levels = [(li, ri) for (li, ri) in | ||
zip(self.left_on, self.right_on) if | ||
isinstance(li, compat.string_types) and | ||
li not in self.left and | ||
isinstance(ri, compat.string_types) and | ||
ri not in self.right] | ||
|
||
if common_index_levels: | ||
common_levels_left, common_levels_right = ( | ||
zip(*common_index_levels) | ||
) | ||
|
||
reset_left = [lev for lev in self.left.index.names | ||
if lev not in common_levels_left] | ||
if reset_left: | ||
self.left.reset_index( | ||
reset_left, | ||
inplace=True) | ||
|
||
reset_right = [lev for lev in self.right.index.names | ||
if lev not in common_levels_right] | ||
if reset_right: | ||
self.right.reset_index( | ||
reset_right, | ||
inplace=True) | ||
|
||
self.has_common_index_levels = True | ||
|
||
if left_drop: | ||
# Determine index levels to reset before dropping | ||
levels_to_reset = [level for level in left_drop | ||
if level not in self.left] | ||
if levels_to_reset: | ||
self.left = self.left.reset_index(levels_to_reset) | ||
self.left = self.left.drop(left_drop, axis=1) | ||
|
||
if right_drop: | ||
# Determine index levels to reset before dropping | ||
levels_to_reset = [level for level in right_drop | ||
if level not in self.right] | ||
if levels_to_reset: | ||
self.right = self.right.reset_index(levels_to_reset) | ||
self.right = self.right.drop(right_drop, axis=1) | ||
|
||
return left_keys, right_keys, join_names | ||
|
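The "common index levels" step in the hunk above pairs each left key with its right key and keeps the pairs where both are strings that are not column names; any other index level is then reset. The sketch below models that with plain sets and lists. The helper names are hypothetical, and `str` is used in place of `compat.string_types`, which assumes Python 3.

```python
def find_common_index_levels(left_on, right_on, left_columns, right_columns):
    """Pair up keys that are strings and are not columns on either side."""
    return [
        (lk, rk)
        for lk, rk in zip(left_on, right_on)
        if isinstance(lk, str) and lk not in left_columns
        and isinstance(rk, str) and rk not in right_columns
    ]


def levels_to_reset(index_names, common_levels):
    """Index levels that do not take part in the merge as shared levels."""
    return [name for name in index_names if name not in common_levels]


left_on = ["key1", "key2"]
right_on = ["key1", "key2"]
pairs = find_common_index_levels(left_on, right_on,
                                 left_columns={"A", "B", "key2"},
                                 right_columns={"C", "D", "key2"})

# Each pair is (left key, right key), so unzipping yields left names first.
common_left, common_right = zip(*pairs) if pairs else ((), ())

print(pairs)
print(levels_to_reset(["key1", "extra"], common_left))
```

Here `key2` is excluded because it is a column on both sides, while the hypothetical extra level `"extra"` would be reset because it is not a shared merge key.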
Review comment: Can you add a comment (here and left_on/right_on) that index level merging is new in 0.22.0 (or maybe in a Note section below)?