Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop concat from attempting to sort mismatched columns by default #20613

Merged
merged 38 commits into from
May 1, 2018

Conversation

brycepg
Copy link
Contributor

@brycepg brycepg commented Apr 5, 2018

Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

@pep8speaks
Copy link

pep8speaks commented Apr 5, 2018

Hello @brycepg! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 01, 2018 at 00:20 Hours UTC

@brycepg brycepg force-pushed the master branch 2 times, most recently from f8484a3 to e3a2a34 Compare April 5, 2018 03:33
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a few more tests
replicate any tests from the original issue
needs a whatsnew note
will comment on the impl later

@@ -20,7 +20,7 @@

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True):
copy=True, mismatch_sort=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call this sort
put before copy

@codecov
Copy link

codecov bot commented Apr 5, 2018

Codecov Report

Merging #20613 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20613      +/-   ##
==========================================
+ Coverage   91.78%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49324    49348      +24     
==========================================
+ Hits        45272    45296      +24     
  Misses       4052     4052
Flag Coverage Δ
#multiple 90.18% <100%> (ø) ⬆️
#single 41.94% <71.42%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 92.55% <ø> (ø) ⬆️
pandas/core/panel.py 97.29% <ø> (ø) ⬆️
pandas/core/base.py 96.83% <100%> (ø) ⬆️
pandas/core/frame.py 97.22% <100%> (ø) ⬆️
pandas/core/reshape/pivot.py 96.97% <100%> (ø) ⬆️
pandas/core/reshape/concat.py 97.59% <100%> (ø) ⬆️
pandas/core/indexes/api.py 98.92% <100%> (+0.14%) ⬆️
pandas/core/indexes/timedeltas.py 91.15% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimes.py 95.73% <0%> (-0.04%) ⬇️
pandas/io/pytables.py 92.41% <0%> (ø) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4afc756...5e1b024. Read the comment docs.

@brycepg brycepg force-pushed the master branch 2 times, most recently from c859aab to 4c817c1 Compare April 5, 2018 04:29
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588
@jorisvandenbossche
Copy link
Member

Thanks for working on this!

  • We also need to check how this works in combination with other keywords (eg join)
  • Do we want to ignore the keyword if all columns match? (what you currently do?) I would also find that surprising that the keyword then does not work. And given that sort=False is the default, I think we can let sort=True sort always?

@brycepg
Copy link
Contributor Author

brycepg commented Apr 5, 2018

@jreback

I've make the requested changes

@jorisvandenbossche

  • Yeah it does look like concat is used a lot internally. I'll have a look at each location its called.
  • That sounds reasonable

@jorisvandenbossche
Copy link
Member

Yeah it does look like concat is used a lot internally. I'll have a look at each location its called.

Ah, yes, that might be needed as well. But what I meant was other keywords of concat itself. Eg does the sort keyword work for both join='inner' and join='outer' ?

@@ -1160,6 +1160,7 @@ Reshaping
- Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
- Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
- Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be sort=True by default to preserve backwards compatibility, right?

Or rather, I think the eventual goal is to have sort=False be the default, so for now it should be

  • sort=None is the default
  • If the default is passed, use sort=True and warn that the default is changing in the futrue
  • If True or False, no warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this needs a sub-section. this is a rather large change (even if its None by default). highliting it is best. pls show an example of previous and new

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger
If I do what @jorisvandenbossche suggests then, sort=True will not be backwards compatible because it will sort the axes in question regardless of whether the columns are mismatched.
I could have sort=None be the default, give a warning and revert to old behavior. In future versions this behavior of only sorting the axes sometimes would not be available because it doesn't make sense and concat could default to sort=False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brycepg I think we can do both (it might complicate the code a bit, but not too much I think, as in _get_combined_index those cases are already handled separately). As @TomAugspurger suggests, the default can be None for now, so we can raise a warning in the appropriate cases:

  • when all axes are equal (current case when there is already no sorting): no warning should be raised, as the future default of sort=False will not change anything, but add the ability to also sort the index with sort=True
  • when not all axes are equal (current case that the unwanted sorting happens): issue a warning that in the future this will no longer sort

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 5, 2018
@@ -1160,6 +1160,7 @@ Reshaping
- Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
- Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
- Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this needs a sub-section. this is a rather large change (even if its None by default). highliting it is best. pls show an example of previous and new

@@ -5982,7 +5982,8 @@ def infer(x):
# ----------------------------------------------------------------------
# Merging / joining methods

def append(self, other, ignore_index=False, verify_integrity=False):
def append(self, other, ignore_index=False,
verify_integrity=False, sort=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort before verify_integrity

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback why do you want sort before verify_integrity?

@@ -5995,6 +5996,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise ValueError on creating index with duplicates.
sort: boolean, default False
Sort columns if given object doesn't have the same columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a versionadded.

use does not

@@ -20,7 +20,7 @@

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually move before verify_integrity

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this an API breaking change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it more logical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so?

I'm OK with breaking API when necessary, but this seems unnecessary.

@@ -60,6 +60,8 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
verify_integrity : boolean, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation
sort : boolean, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simiar to above

dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])
result = pd.concat([dfa, dfb])
assert result.columns.tolist() == ['C', 'A', 'Z']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create an expected frame and use assert_frame_equal

df['a'] = [1, 2, 3]
df2 = pd.DataFrame({'a': [4, 5]})
df3 = pd.concat([df, df2])
assert df3.columns.tolist() == ['b', 'c', 'a']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

df['c'] = [1, 2, 3]
df['a'] = [1, 2, 3]
df2 = pd.DataFrame({'a': [4, 5]})
df3 = pd.concat([df, df2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use result =

@TomAugspurger
Copy link
Contributor

@brycepg can you update based on Jeff's comments? Doing a release candidate soon (tomorrow or Wednesday hopefully), and it'd be nice to have this in.

@brycepg
Copy link
Contributor Author

brycepg commented Apr 23, 2018 via email

API: Updated the default to be compatible and warn.

DOC: updated the whatsnew and concat docstring.
@TomAugspurger
Copy link
Contributor

@brycepg I pushed some changes to your branch.

  1. I updated the API to keep this backwards compat, but you'll get a warning.
  2. I updated the docs to explain that

Will you have time today to get go through and update the tests to pass sort=True where necessary? The build log will have a bunch of warnings when it finishes.

And I didn't address most of @jreback's comments yet.

I'd like to do a release candidate tomorrow, so if you can't get to it today let me know and I'll push more fixes here.

@TomAugspurger
Copy link
Contributor

Picking this up. Would be good to have for 0.23.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 26, 2018
@TomAugspurger
Copy link
Contributor

Can someone proofread the warning text? Specifically, does the term "non-concatenation axis" make sense? I could also call it the "non-expanding axis", or something else entirely.

@TomAugspurger
Copy link
Contributor

Ah, yes, that might be needed as well. But what I meant was other keywords of concat itself. Eg does the sort keyword work for both join='inner' and join='outer' ?

We'll need to address this.

In [2]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [1, 2]}, columns=['b', 'a', 'c'])

In [3]: df2 = pd.DataFrame({"a": [1, 2], 'c': [3, 4]}, index=[3, 4])

In [4]: pd.concat([df1, df2], join='inner')
Out[4]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4

In [5]: pd.concat([df1, df2], join='inner', sort=False)
Out[5]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4

I assume we want the same behavior as for join='outer'.

@TomAugspurger
Copy link
Contributor

Found one more issue in crosstab. Fixing now.

@@ -507,7 +507,7 @@ def is_any_frame():
for r in compat.itervalues(result))

if isinstance(result, list):
return concat(result, keys=keys, axis=1), True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thre are a bunch more concats in this same section. basically they control the resulting ordeing of the aggregation. I guess sort=True is fine here (for the other cases).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe these should be sort=False, not sure what is actually affected (as these might be aligned ops already)

@@ -1098,7 +1098,8 @@ def reset_identity(values):
group_names = self.grouper.names

result = concat(values, axis=self.axis, keys=group_keys,
levels=group_levels, names=group_names)
levels=group_levels, names=group_names,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, there are like 10 calls to concat. I think should be explicit about sort

@@ -89,13 +110,19 @@ def conv(i):
index = indexes[0]
for other in indexes[1:]:
if not index.equals(other):

if sort is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move _unique_indices from a nested function to module level (e.g. same as _union_indices (and maybe conform the spelling indices / indexes), and simply add a sort= kwarg (which you are already passing into fast_unique_multiple_lists). Then you can do the warning there. just makes this whole function a bit simpler.

index = _get_objs_combined_axis(data.values(), axis=axis,
intersect=intersect)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this always warns? shouldn't this be sort=False?

@TomAugspurger
Copy link
Contributor

K, I'll try to go through all the internal calls to concat and set explicitly.

One thing that's made this difficult is that we don't really have the old option of sort only if not aligned anymore. So just setting sort=True or sort=False won't necessarily reproduce the old output. Maybe all of these will be OK though.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 1, 2018

I'm going to push this to 0.23.1. I'm not confident enough that we have complete test coverage for all the cases of concat used internally.

I can pick it up after PyCon, or you're welcome to work on it whenever @brycepg.

@TomAugspurger TomAugspurger modified the milestones: 0.23.0, 0.23.1 May 1, 2018
@jorisvandenbossche
Copy link
Member

If we don't include it in 0.23, I think it has to wait for 0.24. We shouldn't introduce new deprecation warnings in a bug fix release.

However, I don't fully understand the problem to not merge it now. There are still some internal use cases that you need to check? (whether sort=True/False needs to be added?)
But as far as I see, there are no failing tests? Are there still warnings being raised?
I think in many cases in internal code, it's concatting aligned results (eg in groupby), and then there should be no change or warning.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 1, 2018 via email

@jorisvandenbossche
Copy link
Member

In what way was it broken? (there should for now mainly be a warning and new keyword, but not really change in default behaviour?)

I'm not confident enough that we have complete test coverage for all the cases of concat used internally.

I can't really say anything founded, as I didn't look enough into detail to what tests have broken throughout implementing this and what you needed to add, but since our tests are now passing, my gut feeling would say: let's use the rc period to get more real-world testing on that .. ;)

@TomAugspurger
Copy link
Contributor

In what way was it broken? (there should for now mainly be a warning and new keyword, but not really change in default behaviour?)

I don't recall the exact circumstances, but a MultiIndex was (or wasn't?) being sorting.

If you're OK with merging as is, then I can commit to adding more tests (and fixing bugs) between now and Friday. I just don't think it should hold up the release.

@jorisvandenbossche
Copy link
Member

As I said, difficult to say myself, but I am OK with relying on your judgement here.
(and I certainly approve the gist of the PR)

@TomAugspurger
Copy link
Contributor

Alright, let's do it.

#20909 for the followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants