Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484
doc/source/merging.rst
Outdated
.. note::

   When DataFrames are merged on a string that matches an index level in both
   frames then the index level is preserved as an index level in the resulting
"frames" --> "frames,"
doc/source/merging.rst
Outdated
.. note::

   If a string matches both a column name and an index level name then a warning is
"index level name" --> "index level name,"
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -109,6 +109,7 @@ Other Enhancements
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters allowing frames to be merged on a combination of columns and index levels (:issue:`14355`)
"parameters allowing" --> "parameters, allowing"
pandas/core/frame.py
Outdated
@@ -3437,6 +3437,36 @@ def f(vals):

    # ----------------------------------------------------------------------
    # Sorting
    def _get_column_or_level_values(self, key, axis=1,
                                    op_description='retrieve'):
Add doc-string for this. Developers in the future will thank you. 😉
pandas/core/frame.py
Outdated
("'%s' is both a column name and an index level.\n"
 "Defaulting to column but "
 "this will raise an ambiguity error in a "
 "future version") % key,
The percent-style string replacement is being phased out in favor of .format. Please replace along with ALL other places where you use it.
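As a small illustration of the requested change, the sketch below builds the warning text from the hunk above twice: once with percent formatting and once with str.format. The names key, old_msg, and new_msg and the sample value are made up for the example; the final wording in the PR may differ.

```python
# Sample key; purely illustrative.
key = "key1"

# Percent formatting, as in the hunk under review.
old_msg = ("'%s' is both a column name and an index level.\n"
           "Defaulting to column but "
           "this will raise an ambiguity error in a "
           "future version") % key

# The same message built with str.format and a named placeholder.
new_msg = ("'{key}' is both a column name and an index level.\n"
           "Defaulting to column but "
           "this will raise an ambiguity error in a "
           "future version").format(key=key)

print(new_msg)
```

Named placeholders also make it harder to mix up arguments once a message takes more than one value.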
pandas/core/reshape/merge.py
Outdated
@@ -792,6 +815,10 @@ def _get_merge_keys(self):
        is_rkey = lambda x: isinstance(
            x, (np.ndarray, Series)) and len(x) == len(right)

        def get_key_vals(df, key):
Add doc-string for this.
pandas/tests/reshape/test_merge.py
Outdated
@@ -1350,6 +1358,131 @@ def f():
            household.join(log_return, how='outer')
        pytest.raises(NotImplementedError, f)

    def test_merge_on_index_and_column(self):
        # Construct DataFrames
Reference issue number.
Codecov Report
@@ Coverage Diff @@
## master #17484 +/- ##
==========================================
+ Coverage 91.35% 91.45% +0.09%
==========================================
Files 163 157 -6
Lines 49691 51451 +1760
==========================================
+ Hits 45397 47052 +1655
- Misses 4294 4399 +105
Continue to review full report at Codecov.
Thanks for the quick feedback @gfyoung. I think I've incorporated all of it in this last commit. Let me know if I've missed anything or if there are additional changes you'd like to see.
lots of comments
doc/source/merging.rst
Outdated
@@ -1120,6 +1122,53 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
       labels=['left', 'right'], vertical=False);
   plt.close('all');

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 0.21
need a blank line here
add a :ref: entry before the sub-section
done
pandas/core/frame.py
Outdated
Parameters
----------
key: int or object
This should only accept string keys referring to a named column or level. We do not want to have the integer/string mess like .ix again.
pandas/core/reshape/merge.py
Outdated
@@ -717,7 +719,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
                if name in result:
                    result[name] = key_col
                else:
                    result.insert(i, name or 'key_{i}'.format(i=i), key_col)
                    if name and name in result.index.names:
you are adding an amazing amount of logic here. This must be made simpler.
pandas/core/reshape/merge.py
Outdated
See Also
--------
DataFrame._get_column_or_level_values
"""
if you actually need complexity, then hide it away here
pandas/core/reshape/merge.py
Outdated
Returns
-------
values: array
array -> np.ndarray
pandas/core/reshape/merge.py
Outdated
Parameters
----------
df: DataFrame
key: int or object
see above
pandas/core/frame.py
Outdated
axis: int, default 0
    Axis to retrieve values from

op_description: str, default 'retrieve'
not sure what this parameter is meant to do, it's completely non-intuitive.
We may need a common framework here to incorporate the
Thanks for the feedback @jreback. I'll address your comments and take another stab at driving down the complexity. If you, or anyone else, have some ideas for a more unified way to go about this (mixing columns and index levels in merge, sort_values, groupby, etc) I'd be happy to take a step back and consider a change of course.
Ok, after meditating on this section of the code again I think I found a simpler approach. I created a new set of DataFrame helpers to support this change and I refactored the existing groupby column/index logic (#14432) to use these as well. I haven't revisited my Does this general approach look reasonable @jreback, @jorisvandenbossche, @gfyoung, and @TomAugspurger? Thanks!
Just resolved conflicts with master. @jreback do you think my updates since your review last month are heading in the right direction? I'm happy to keep iterating as you have time to provide feedback. @jorisvandenbossche it's been a while since we first talked about these changes. Does this approach look alright to you? Thanks!
I'm having trouble interpreting the circleci failure above. Is this something my changes caused?
@jmmease : The Circle CI builds just died...let's give it another run to see if it passes.
Thanks, @gfyoung
actually this is ok for now. will think splitting generic.py up (in another issue).
No problem @jreback , thanks for all you're doing to keep pandas going strong! Just let me know if you would like something moved before merging.
# Conflicts: # doc/source/whatsnew/v0.22.0.txt
Just merged to fix whatsnew conflict. Is this all set @jreback?
@jmmease small linting error (we just added this, it now looks for list-comprehensions that are not generators) :> and want @jorisvandenbossche @TomAugspurger to have a look.
Ok, I think I fixed the lint issues. Thanks for pointing me in the right direction @jreback I'll stand-by for any feedback from @jorisvandenbossche and @TomAugspurger
Nice work!
I added some minor comments.
General notes:
- For the case that an index level and column level match, shouldn't we treat this the same as a duplicate column name (in the future)? (in the idea of seeing an index as a special column)
  I don't think it changes anything right now (warning that for now column is used), but I would change it in the future to the same as what happens with duplicate columns.
- You wrote the code dealing with selecting labels vs levels very generic. But do we actually have a use case for the axis=1? (selecting labels from index or levels from columns) I think merge is only column-wise?
  If we don't have such a use case, I think the code would be simpler to just assume it is only for the column labels.
- Should DataFrame.join support the same?
pandas/core/frame.py
Outdated
None and not merging on indexes, then it merges on the intersection of
the columns by default.
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults to
Can you put single backticks around 'on' ? (that is typically done for parameters of the function)
pandas/core/frame.py
Outdated
@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Note
Note -> Notes (section names are fixed by numpydoc)
needs update here
Thanks for catching this @jreback. Fixed
pandas/core/generic.py
Outdated
"_is_label_reference is not implemented for {type}"
.format(type=type(self)))

return (isinstance(key, compat.string_types) and
Is this restriction of string type needed?
Maybe it is safe to do this to start with (and it will cover most of the use cases), but in general pandas does not restrict column names to be strings.
Good point, I've changed this to check that the key is non-None and hashable
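A minimal sketch of what a "non-None and hashable" check could look like, using only the standard library. The function name is_candidate_key is hypothetical; this is not the helper from the PR, only an illustration of the idea that any hashable object, not just a string, can serve as a column or level name.

```python
def is_candidate_key(key):
    """Return True if `key` could name a column or an index level.

    Hypothetical helper: accepts any hashable, non-None object.
    """
    if key is None:
        return False
    try:
        hash(key)
    except TypeError:
        # Lists, dicts, and other unhashable objects cannot be labels.
        return False
    return True


for candidate in ["key1", 0, ("a", 1), None, ["key1", "key2"]]:
    print(repr(candidate), is_candidate_key(candidate))
```

Under this check a list of keys is rejected as a single label, which leaves room for treating it as a sequence of labels elsewhere.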
pandas/core/frame.py
Outdated
vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be a vector or list of vectors of the length of the left DataFrame.
now you are changing this anyway, can you change vector with array ?
pandas/core/frame.py
Outdated
the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be a vector or list of vectors of the length of the left DataFrame.
These vectors are treated as though they are columns.
I think 'as if' is easier language than 'as though' for non-native speakers
pandas/core/generic.py
Outdated
level_article=level_article,
level_type=level_type,
label_article=label_article,
label_type=label_type), FutureWarning)
this warning needs a stacklevel
I would also put the creation of the message on a separate line before, I think that will make it cleaner:
msg = (....).format(..)
warnings.warn(msg, FutureWarning, stacklevel=..)
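The suggestion above can be sketched with the standard library alone. The function names below are hypothetical and the message text is borrowed from an earlier hunk in this review; the stacklevel value depends on how many frames sit between the user's call and warnings.warn, so 2 is only right for this toy layout.

```python
import warnings


def _warn_if_ambiguous(key):
    # Hypothetical helper: build the message first, then warn.
    msg = ("'{key}' is both a column name and an index level.\n"
           "Defaulting to column but "
           "this will raise an ambiguity error in a "
           "future version").format(key=key)
    # stacklevel=2 attributes the warning to the caller of this helper
    # rather than to the line inside the helper itself.
    warnings.warn(msg, FutureWarning, stacklevel=2)


def merge_like_entry_point(key):
    # Stand-in for the public function a user would call.
    _warn_if_ambiguous(key)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merge_like_entry_point("key1")

print(caught[0].category.__name__)
print(caught[0].message)
```

In a real library the helper is usually nested deeper, so the stacklevel has to be raised until the reported line is the user's own call.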
[['outer'], ['inner'],
 ['outer', 'inner'],
 ['inner', 'outer']])
def test_merge_indexes_and_columns_on(left_df, right_df, on, how):
This single test adds actually 144 (3x3x4x4) tests, which feels a bit as overkill to me?
I trimmed it down by a factor of 4 by removing the how fixture and parameterizing how alongside on.
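The arithmetic behind the reduction can be checked with a short standard-library sketch. The on values are taken from the hunk above; the three left and right variants stand in for the fixture parameters, and the particular pairing of how with on is an assumption made for illustration, not the pairing used in the PR.

```python
from itertools import product

# Stand-ins for the three parametrized variants of each fixture.
left_variants = ["left_a", "left_b", "left_c"]
right_variants = ["right_a", "right_b", "right_c"]

# `on` values from the test under review; `how` values are the join types.
on_values = [["outer"], ["inner"], ["outer", "inner"], ["inner", "outer"]]
how_values = ["inner", "left", "right", "outer"]

# Independent parametrization: every `on` is combined with every `how`.
full_grid = list(product(left_variants, right_variants, on_values, how_values))

# Paired parametrization: each `on` travels with exactly one `how`
# (the pairing here is arbitrary).
paired = list(zip(on_values, how_values))
trimmed_grid = list(product(left_variants, right_variants, paired))

print(len(full_grid), len(trimmed_grid))
```

Pairing keeps every on value and every how value exercised at least once while dropping the full cross product between them.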
Thanks for looking things over @jorisvandenbossche. I'll address the specific comments in an update soon but wanted to follow up on a few of your general points.
Now we parameterize the how parameter alongside on. Reduces test-case count by a factor of 4.
Rename vector -> array
…to df.join Fixed errors that cropped up when using join on a combination of columns and index levels
@jorisvandenbossche I just pushed a new version that I believe addresses all of your inline comments (I've responded to a couple of these comments above). After looking things over I realized that Let me know if there's anything else you'd like to see once you've had a chance to look things over again. Thanks!
Just following up. Did this last round of changes address your comments sufficiently @jorisvandenbossche? @TomAugspurger is there anything more you'd like to see here? cc: @jreback Thanks!
tiny doc update
pandas/core/frame.py
Outdated
@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Note
needs update here
doc change pushed and all green @jreback @jorisvandenbossche @TomAugspurger
thanks @jmmease if any fallout we can PR later!
This PR implements the changes proposed and discussed with @jorisvandenbossche @shoyer @TomAugspurger @jreback in #14355.
These changes allow the on, left_on, and right_on parameters of DataFrame.merge to accept a combination of column names and index level names. Any common index levels that are merged on are preserved as index levels in the resulting DataFrame, while all other index levels are removed.
In the case of a conflict, the column takes precedence and an ambiguity FutureWarning is raised.
git diff upstream/master -u -- "*.py" | flake8 --diff
Note: The new df._get_column_or_level_values method introduced in this PR is the same method introduced in #17361 to support sorting DataFrames on a combination of columns and index levels. I will keep this method in sync between the two PRs during the review process.
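A short sketch of the behaviour described above, assuming a pandas release that includes this feature. The frame contents and the names key1, key2, lval, and rval are invented for the example: key1 is an index level in both frames and key2 is an ordinary column.

```python
import pandas as pd

# `key1` is an index level, `key2` is a regular column, in both frames.
left = pd.DataFrame(
    {"key2": ["x", "y", "x"], "lval": [1, 2, 3]},
    index=pd.Index(["a", "b", "c"], name="key1"),
)
right = pd.DataFrame(
    {"key2": ["x", "y", "z"], "rval": [10, 20, 30]},
    index=pd.Index(["a", "b", "c"], name="key1"),
)

# Merge on a mix of an index level name and a column name.
result = left.merge(right, on=["key1", "key2"])

# `key1` is an index level in both inputs, so it stays an index level.
print(result.index.names)
print(result)
```

Only the rows whose key1 and key2 both match survive the default inner join, and key1 remains in the index rather than being moved into the columns.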