Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Merged
merged 43 commits into from
Dec 1, 2017

Conversation

jonmmease
Copy link
Contributor

This PR implements the changes proposed and discussed with @jorisvandenbossche @shoyer @TomAugspurger @jreback in #14355.

These changes allow the on, left_on, and right_on parameters of DataFrame.merge to accept a combination of column names and index level names. Any common index levels that are merged on are preserved as index levels in the resulting DataFrame, while all other index levels are removed.

In the case of a conflict, the column takes precedence and an ambiguity FutureWarning is raised.

Note: The new df._get_column_or_level_values method introduced in this PR is the same method introduced in #17361 to support sorting DataFrames on a combination of columns and index levels. I will keep this method in sync between the two PRs during the review process.

@gfyoung gfyoung added API Design Compat pandas objects compatability with Numpy or Python functions Enhancement labels Sep 9, 2017
.. note::

When DataFrames are merged on a string that matches an index level in both
frames then the index level is preserved as an index level in the resulting
Copy link
Member

@gfyoung gfyoung Sep 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"frames" --> "frames,"


.. note::

If a string matches both a column name and an index level name then a warning is
Copy link
Member

@gfyoung gfyoung Sep 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"index level name" --> "index level name,"

@@ -109,6 +109,7 @@ Other Enhancements
- :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
- :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
- :func:`DataFrame.merge` now accepts index level names as `on`, `left_on`, and `right_on` parameters allowing frames to be merged on a combination of columns and index levels (:issue:`14355`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"parameters allowing" --> "parameters, allowing"

@@ -3437,6 +3437,36 @@ def f(vals):

# ----------------------------------------------------------------------
# Sorting
def _get_column_or_level_values(self, key, axis=1,
op_description='retrieve'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc-string for this. Developers in the future will thank you. 😉

("'%s' is both a column name and an index level.\n"
"Defaulting to column but "
"this will raise an ambiguity error in a "
"future version") % key,
Copy link
Member

@gfyoung gfyoung Sep 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The percentile string replacement is being phased out in favor of .format. Please replace along with ALL other places where you use it.

@@ -792,6 +815,10 @@ def _get_merge_keys(self):
is_rkey = lambda x: isinstance(
x, (np.ndarray, Series)) and len(x) == len(right)

def get_key_vals(df, key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc-string for this.

@@ -1350,6 +1358,131 @@ def f():
household.join(log_return, how='outer')
pytest.raises(NotImplementedError, f)

def test_merge_on_index_and_column(self):
# Construct DataFrames
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference issue number.

@codecov
Copy link

codecov bot commented Sep 9, 2017

Codecov Report

Merging #17484 into master will increase coverage by 0.09%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17484      +/-   ##
==========================================
+ Coverage   91.35%   91.45%   +0.09%     
==========================================
  Files         163      157       -6     
  Lines       49691    51451    +1760     
==========================================
+ Hits        45397    47052    +1655     
- Misses       4294     4399     +105
Flag Coverage Δ
#multiple 89.32% <100%> (+0.18%) ⬆️
#single 40.53% <11.11%> (+0.79%) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.04% <100%> (+0.01%) ⬆️
pandas/core/frame.py 97.81% <100%> (-0.09%) ⬇️
pandas/core/reshape/merge.py 94.4% <100%> (+0.14%) ⬆️
pandas/core/generic.py 95.9% <100%> (+0.17%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/clipboard/__init__.py 56.86% <0%> (-19.33%) ⬇️
pandas/util/testing.py 81.89% <0%> (-18.11%) ⬇️
pandas/io/json/json.py 91.75% <0%> (-8.25%) ⬇️
pandas/io/formats/style.py 96.21% <0%> (-3.79%) ⬇️
pandas/core/dtypes/missing.py 90.24% <0%> (-0.5%) ⬇️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c634352...f3b95fe. Read the comment docs.

@jonmmease
Copy link
Contributor Author

Thanks for the quick feedback @gfyoung. I think I've incorporated all of it in this last commit. Let me know if I've missed anything or if there are additional changes you'd like to see.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of comments

@@ -1120,6 +1122,53 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
labels=['left', 'right'], vertical=False);
plt.close('all');

Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 0.21
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a blank line here

add a :ref: entry before the sub-section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


Parameters
----------
key: int or object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only accept string keys referring to a named column or level, We do not want to have the integer/string mess like .ix again.

@@ -717,7 +719,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
if name in result:
result[name] = key_col
else:
result.insert(i, name or 'key_{i}'.format(i=i), key_col)
if name and name in result.index.names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are adding an amazing amount of logic here. This must be made simpler.

See Also
--------
DataFrame._get_column_or_level_values
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you actually need complexity, then hide it away here


Returns
-------
values: array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array -> np.ndarray

Parameters
----------
df: DataFrame
key: int or object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

axis: int, default 0
Axis to retrieve values from

op_description: str, default 'retrieve'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what this parameter is meant to do, its completely non-intuitive.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2017

We may need a common framework here to incorporate the .groupby functionaility of the same name. Adding this many code branches is making things pretty complex.

@jonmmease
Copy link
Contributor Author

Thanks for the feedback @jreback. I'll address your comments and take another stab at driving down the complexity. If you, or anyone else, have some ideas for a more unified way to go about this (mixing columns and index levels in merge, sort_values, groupby, etc) I'd be happy to take a step back and consider a change of course.

@jonmmease
Copy link
Contributor Author

Ok, after meditating on this section of the code again I think I found a simpler approach. I created a new set of DataFrame helpers to support this change and I refactored the existing groupby column/index logic (#14432) to use these as well. I haven't revisited my sort_values updates in #17361 yet, but I should be able to rely on these same helpers there too.

Does this general approach look reasonable @jreback, @jorisvandenbossche, @gfyoung, and @TomAugspurger? Thanks!

@jonmmease
Copy link
Contributor Author

Just resolved conflicts with master. @jreback do you think my updates since your review last month are heading in the right direction? I'm happy to keep iterating as you have time to provide feedback.

@jorisvandenbossche it's been a while since we first talked about these changes. Does this approach look alright to you?

Thanks!

@jonmmease
Copy link
Contributor Author

I'm having trouble interpreting the circleci failure above. Is this something my changes caused?

@gfyoung
Copy link
Member

gfyoung commented Oct 4, 2017

@jmmease : The Circle CI builds just died...let's give it another run to see if it passes.

@jonmmease
Copy link
Contributor Author

Thanks, @gfyoung

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

I think we could put all of the added code that you put in generic.py in another module. maybe references and make it into a mix-in class (that you just include into generic).

actually this is ok for now. will think splitting generic.py up (in another issue).

@jonmmease
Copy link
Contributor Author

No problem @jreback , thanks for all you're doing to keep pandas going strong!

Just let me know if you would like something moved before merging.

@jonmmease
Copy link
Contributor Author

Just merged to fix whatsnew conflict. Is this all set @jreback?

@jreback
Copy link
Contributor

jreback commented Nov 23, 2017

@jmmease small linting error (we just added this, it now looks for list-comprehensions that are not generators) :>

and want @jorisvandenbossche @TomAugspurger to have a look.

@jonmmease
Copy link
Contributor Author

Ok, I think I fixed the lint issues. Thanks for pointing me in the right direction @jreback

I'll stand-by for any feedback from @jorisvandenbossche and @TomAugspurger

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

I added some minor comments.

General notes:

  • For the case that an index level and column level match, shouldn't we treat this the same as a duplicate column name (in the future)? (in the idea of seeing an index as a special column)
    I don't think it changes anything right now (warning that for now column is used), but I would change it in the future to the same as what happens with duplicate columns.

  • You wrote the code dealing with selecting labels vs levels very generic. But do we actually have a use case for the axis=1 ? (selecting labels from index or levels from columns) I think merge is only column-wise?
    If we don't have such a use case, I think the code would be simpler to just assume it is only for the column labels.

  • Should DataFrame.join support the same?

None and not merging on indexes, then it merges on the intersection of
the columns by default.
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put single backticks around 'on' ? (that is typically done for parameters of the function)

@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Note
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note -> Notes (section names are fixed by numpydoc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs update here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this @jreback. Fixed

"_is_label_reference is not implemented for {type}"
.format(type=type(self)))

return (isinstance(key, compat.string_types) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this restriction of string type needed?
Maybe it is safe to do this to start with (and it will cover most of the use cases), but in general pandas does not restrict column names to be strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've changed this to check that the key is non-None and hashable

vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be a vector or list of vectors of the length of the left DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now you are changing this anyway, can you change vector with array ?

the join key instead of columns
Column or index level names to join on in the left DataFrame. Can also
be a vector or list of vectors of the length of the left DataFrame.
These vectors are treated as though they are columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'as if' is easier language than 'as though' for non-native speakers

level_article=level_article,
level_type=level_type,
label_article=label_article,
label_type=label_type), FutureWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this warning needs a stacklevel

I would also put the creation of the message on a separate line before, I think that will make it cleaner:

msg = (....).format(..)
warnings.warn(msg, FutureWarning, stacklevel=..)

[['outer'], ['inner'],
['outer', 'inner'],
['inner', 'outer']])
def test_merge_indexes_and_columns_on(left_df, right_df, on, how):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This single test adds actually 144 (3x3x4x4) tests, which feels a bit as overkill to me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trimmed it down by a factor of 4 by removing the how fixture and parameterizing how alongside on.

@jonmmease
Copy link
Contributor Author

Thanks for looking things over @jorisvandenbossche. I'll address the specific comments in an update soon but wanted to follow up on a few of your general points.

  1. Yes, I agree that it makes sense for ambiguous index/column combinations to eventually fail in the same way that duplicate columns now fail. Let me know if there's any language in the FutureWarning or in my comments in the code that you'd like me to change to make this future intention clearer.

  2. I'm planning to use the axis=1 case for a couple of these new methods in my next update to ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361 (The sort_values analog of this PR) since sort_values supports that axis parameter. My hope was that these methods would make it easier to add mixed column/index support to other operations in the future, though sort_values is the only remaining operation on my personal wishlist at the moment.

  3. I restricted index level references to strings at the moment partially because I wasn't clear on how to check for the full set of valid label/level reference types. I'm happy to open this up if there's a clear way to check this.

  4. I'll need to think some more about join. Does anyone else have thoughts on this?

Jon M. Mease added 6 commits November 25, 2017 21:08
Now we parameterize the how parameter alongside on. Reduces test-case count by a factor of 4.
…to df.join

Fixed errors that cropped up when using join on a combination of columns and index levels
@jonmmease
Copy link
Contributor Author

@jorisvandenbossche I just pushed a new version that I believe addresses all of your inline comments (I've responded to a couple of these comments above).

After looking things over I realized that join was almost working already. I added some tests, fixed up a few issues, and updated the join docstring.

Let me know if there's anything else you'd like to see once you've had a chance to look things over again. Thanks!

@jonmmease
Copy link
Contributor Author

Just following up. Did this last round of changes address your comments sufficiently @jorisvandenbossche?

@TomAugspurger is there anything more you'd like to see here?

cc: @jreback

Thanks!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny doc update

@@ -195,6 +196,11 @@

.. versionadded:: 0.21.0

Note
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs update here

@jonmmease
Copy link
Contributor Author

doc change pushed and all green @jreback @jorisvandenbossche @TomAugspurger

@jreback jreback merged commit d5ffb1f into pandas-dev:master Dec 1, 2017
@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

thanks @jmmease

if any fallout we can PR later!

@jreback jreback added this to the 0.22.0 milestone Dec 1, 2017
@jonmmease
Copy link
Contributor Author

Awesome! thanks @jreback
I'll be around and be working on #14353 next. Just ping me as any integration issues with this arise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support merging DataFrames on a combination of columns and index levels
4 participants