Stop concat from attempting to sort mismatched columns by default #20613

brycepg · 2018-04-05T03:29:11Z

Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close BUG: concat unwantedly sorts DataFrame column names if they differ #4588
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
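
For illustration only, here is a minimal sketch of the behavior described above. The two frames are made up for this example, and the keyword is spelled `sort` (the name reviewers ask for later in this thread) rather than `mismatch_sort`. It is passed explicitly so the result should not depend on which default a given pandas version uses.

```python
import pandas as pd

# Hypothetical frames whose columns only partly overlap and are not sorted.
left = pd.DataFrame({'b': [1], 'a': [2]}, columns=['b', 'a'])
right = pd.DataFrame({'a': [3], 'c': [4]}, columns=['a', 'c'])

# sort=False keeps each column at the position where it first appears.
preserved = pd.concat([left, right], sort=False).columns.tolist()

# sort=True reproduces the old behavior of sorting the mismatched columns.
sorted_cols = pd.concat([left, right], sort=True).columns.tolist()

print(preserved)
print(sorted_cols)
```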

pep8speaks · 2018-04-05T03:29:15Z

Hello @brycepg! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 01, 2018 at 00:20 Hours UTC

jreback

needs a few more tests
replicate any tests from the original issue
needs a whatsnew note
will comment on the impl later

jreback · 2018-04-05T03:56:15Z

pandas/core/reshape/concat.py

@@ -20,7 +20,7 @@

 def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False,
-           copy=True):
+           copy=True, mismatch_sort=False):


just call this sort
put before copy

codecov · 2018-04-05T04:27:58Z

Codecov Report

Merging #20613 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20613      +/-   ##
==========================================
+ Coverage   91.78%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49324    49348      +24     
==========================================
+ Hits        45272    45296      +24     
  Misses       4052     4052

Flag	Coverage Δ
#multiple	`90.18% <100%> (ø)`	⬆️
#single	`41.94% <71.42%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`92.55% <ø> (ø)`	⬆️
pandas/core/panel.py	`97.29% <ø> (ø)`	⬆️
pandas/core/base.py	`96.83% <100%> (ø)`	⬆️
pandas/core/frame.py	`97.22% <100%> (ø)`	⬆️
pandas/core/reshape/pivot.py	`96.97% <100%> (ø)`	⬆️
pandas/core/reshape/concat.py	`97.59% <100%> (ø)`	⬆️
pandas/core/indexes/api.py	`98.92% <100%> (+0.14%)`	⬆️
pandas/core/indexes/timedeltas.py	`91.15% <0%> (-0.07%)`	⬇️
pandas/core/indexes/datetimes.py	`95.73% <0%> (-0.04%)`	⬇️
pandas/io/pytables.py	`92.41% <0%> (ø)`	⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4afc756...5e1b024.

jorisvandenbossche · 2018-04-05T06:58:31Z

Thanks for working on this!

We also need to check how this works in combination with other keywords (e.g. join).
Do we want to ignore the keyword if all columns match (which is what you currently do)? I would find it surprising if the keyword then had no effect. And given that sort=False is the default, I think we can let sort=True always sort?

brycepg · 2018-04-05T08:31:51Z

@jreback

I've made the requested changes.

@jorisvandenbossche

Yeah, it does look like concat is used a lot internally. I'll have a look at each location it's called.
That sounds reasonable

jorisvandenbossche · 2018-04-05T08:34:25Z

Yeah it does look like concat is used a lot internally. I'll have a look at each location its called.

Ah, yes, that might be needed as well. But what I meant was other keywords of concat itself. E.g., does the sort keyword work for both join='inner' and join='outer'?

TomAugspurger · 2018-04-05T12:13:26Z

doc/source/whatsnew/v0.23.0.txt

@@ -1160,6 +1160,7 @@ Reshaping
 - Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
 - Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
 - Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
+- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)


This should be sort=True by default to preserve backwards compatibility, right?

Or rather, I think the eventual goal is to have sort=False be the default, so for now it should be

sort=None is the default

If the default is passed, use sort=True and warn that the default is changing in the future

If True or False, no warning.

actually this needs a sub-section. this is a rather large change (even if it's None by default). highlighting it is best. pls show an example of previous and new

@TomAugspurger
If I do what @jorisvandenbossche suggests then, sort=True will not be backwards compatible because it will sort the axes in question regardless of whether the columns are mismatched.
I could have sort=None be the default, give a warning and revert to old behavior. In future versions this behavior of only sorting the axes sometimes would not be available because it doesn't make sense and concat could default to sort=False

@brycepg I think we can do both (it might complicate the code a bit, but not too much I think, as in _get_combined_index those cases are already handled separately). As @TomAugspurger suggests, the default can be None for now, so we can raise a warning in the appropriate cases:

when all axes are equal (current case when there is already no sorting): no warning should be raised, as the future default of sort=False will not change anything, but it adds the ability to also sort the index with sort=True

when not all axes are equal (current case that the unwanted sorting happens): issue a warning that in the future this will no longer sort
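
The two cases above can be sketched in plain Python. This is only an illustration of the proposed rule, not the code in `_get_combined_index`: the helper name `combine_columns`, the use of plain lists instead of Index objects, and the warning text are all assumptions made for this example.

```python
import warnings


def combine_columns(column_lists, sort=None):
    """Hypothetical helper: union of column labels for an outer join."""
    first = list(column_lists[0])
    all_equal = all(list(cols) == first for cols in column_lists[1:])

    if sort is None:
        if all_equal:
            # Nothing is sorted today, so the future default changes nothing.
            sort = False
        else:
            # Today's behavior sorts here; tell the caller that will change.
            warnings.warn(
                "Sorting because the non-concatenation axis is not aligned. "
                "A future version will not sort by default; pass sort=True "
                "or sort=False to silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
            sort = True

    # Order-preserving union: keep each label at its first appearance.
    union = list(dict.fromkeys(
        label for cols in column_lists for label in cols
    ))
    return sorted(union) if sort else union
```

With this sketch, `combine_columns([['b', 'a'], ['b', 'a']])` returns the labels unchanged and stays silent, `combine_columns([['C', 'A'], ['C', 'Z']])` warns and sorts, and passing `sort=True` or `sort=False` explicitly never warns.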

jreback · 2018-04-05T15:10:28Z

doc/source/whatsnew/v0.23.0.txt

@@ -1160,6 +1160,7 @@ Reshaping
 - Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
 - Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
 - Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
+- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)


actually this needs a sub-section. this is a rather large change (even if it's None by default). highlighting it is best. pls show an example of previous and new

jreback · 2018-04-05T15:10:50Z

pandas/core/frame.py

@@ -5982,7 +5982,8 @@ def infer(x):
    # ----------------------------------------------------------------------
    # Merging / joining methods

-    def append(self, other, ignore_index=False, verify_integrity=False):
+    def append(self, other, ignore_index=False,
+               verify_integrity=False, sort=False):


sort before verify_integrity

@jreback why do you want sort before verify_integrity?

jreback · 2018-04-05T15:11:11Z

pandas/core/frame.py

@@ -5995,6 +5996,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
            If True, do not use the index labels.
        verify_integrity : boolean, default False
            If True, raise ValueError on creating index with duplicates.
+        sort: boolean, default False
+            Sort columns if given object doesn't have the same columns


needs a versionadded.

use does not

jreback · 2018-04-05T15:11:41Z

pandas/core/reshape/concat.py

@@ -20,7 +20,7 @@

 def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False,
-           copy=True):


actually move before verify_integrity

pls do this

Why make this an API breaking change?

because it's more logical.

How so?

I'm OK with breaking API when necessary, but this seems unnecessary.
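
To make the concern concrete, here is a small sketch of why the position of a new keyword matters for callers that pass arguments positionally. The three functions are trimmed, hypothetical stand-ins and not the real `concat` signature. They only report how the arguments were bound.

```python
def concat_old(objs, axis=0, ignore_index=False,
               verify_integrity=False, copy=True):
    """Stand-in for the signature before the change (hypothetical)."""
    return {'verify_integrity': verify_integrity, 'sort': None}


def concat_sort_in_middle(objs, axis=0, ignore_index=False, sort=None,
                          verify_integrity=False, copy=True):
    """New keyword inserted before verify_integrity (hypothetical)."""
    return {'verify_integrity': verify_integrity, 'sort': sort}


def concat_sort_at_end(objs, axis=0, ignore_index=False,
                       verify_integrity=False, copy=True, sort=None):
    """New keyword appended after the existing ones (hypothetical)."""
    return {'verify_integrity': verify_integrity, 'sort': sort}


# An existing caller that passes verify_integrity=True positionally.
args = ([], 0, False, True)

before = concat_old(*args)
middle = concat_sort_in_middle(*args)
end = concat_sort_at_end(*args)

print(before)
print(middle)
print(end)
```

In the middle variant the fourth positional argument silently becomes `sort`, so the caller loses the integrity check without any error. Appending the keyword at the end leaves existing positional calls unchanged.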

jreback · 2018-04-05T15:11:48Z

pandas/core/reshape/concat.py

@@ -60,6 +60,8 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
    verify_integrity : boolean, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation
+    sort : boolean, default False


similar to above

jreback · 2018-04-05T15:12:36Z

pandas/tests/reshape/test_concat.py

+    dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
+    dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])
+    result = pd.concat([dfa, dfb])
+    assert result.columns.tolist() == ['C', 'A', 'Z']


create an expected frame and use assert_frame_equal
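
A sketch of what the requested test shape might look like, using the frames from the hunk above. It assumes the public `pandas.testing.assert_frame_equal` helper and passes `sort=False` explicitly. The expected values and index are written out by hand for this example and are not taken from the PR.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])

result = pd.concat([dfa, dfb], sort=False)

# Columns missing from one input are filled with NaN, and each input
# keeps its own row label 0 because ignore_index is not set.
expected = pd.DataFrame(
    {'C': [1, 5], 'A': [2, np.nan], 'Z': [np.nan, 6]},
    index=[0, 0],
    columns=['C', 'A', 'Z'],
)

# Compares values, dtypes, index, and column order in one assertion.
assert_frame_equal(result, expected)
```

Comparing whole frames checks the column order and the filled values together, which the `columns.tolist()` assertion alone does not.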

jreback · 2018-04-05T15:12:42Z

pandas/tests/reshape/test_concat.py

+    df['a'] = [1, 2, 3]
+    df2 = pd.DataFrame({'a': [4, 5]})
+    df3 = pd.concat([df, df2])
+    assert df3.columns.tolist() == ['b', 'c', 'a']


jreback · 2018-04-05T15:12:58Z

pandas/tests/reshape/test_concat.py

+    df['c'] = [1, 2, 3]
+    df['a'] = [1, 2, 3]
+    df2 = pd.DataFrame({'a': [4, 5]})
+    df3 = pd.concat([df, df2])


use result =

TomAugspurger · 2018-04-23T19:14:52Z

@brycepg can you update based on Jeff's comments? Doing a release candidate soon (tomorrow or Wednesday hopefully), and it'd be nice to have this in.

brycepg · 2018-04-23T19:34:52Z

Sure I’ll try to do it tonight


TomAugspurger · 2018-04-26T14:46:33Z

@brycepg I pushed some changes to your branch.

I updated the API to keep this backwards compat, but you'll get a warning.
I updated the docs to explain that

Will you have time today to go through and update the tests to pass sort=True where necessary? The build log will have a bunch of warnings when it finishes.

And I didn't address most of @jreback's comments yet.

I'd like to do a release candidate tomorrow, so if you can't get to it today let me know and I'll push more fixes here.

TomAugspurger · 2018-04-26T20:23:10Z

Picking this up. Would be good to have for 0.23.

TomAugspurger · 2018-04-26T21:17:02Z

Can someone proofread the warning text? Specifically, does the term "non-concatenation axis" make sense? I could also call it the "non-expanding axis", or something else entirely.

TomAugspurger · 2018-04-27T13:22:21Z

Ah, yes, that might be needed as well. But what I meant was other keywords of concat itself. Eg does the sort keyword work for both join='inner' and join='outer' ?

We'll need to address this.

In [2]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [1, 2]}, columns=['b', 'a', 'c'])

In [3]: df2 = pd.DataFrame({"a": [1, 2], 'c': [3, 4]}, index=[3, 4])

In [4]: pd.concat([df1, df2], join='inner')
Out[4]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4

In [5]: pd.concat([df1, df2], join='inner', sort=False)
Out[5]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4

I assume we want the same behavior as for join='outer'.

TomAugspurger · 2018-04-30T23:53:07Z

Found one more issue in crosstab. Fixing now.

jreback · 2018-05-01T00:10:25Z

pandas/core/base.py

@@ -507,7 +507,7 @@ def is_any_frame():
                           for r in compat.itervalues(result))

            if isinstance(result, list):
-                return concat(result, keys=keys, axis=1), True


There are a bunch more concats in this same section. Basically they control the resulting ordering of the aggregation. I guess sort=True is fine here (for the other cases).

or maybe these should be sort=False, not sure what is actually affected (as these might be aligned ops already)

jreback · 2018-05-01T00:12:00Z

pandas/core/groupby/groupby.py

@@ -1098,7 +1098,8 @@ def reset_identity(values):
                group_names = self.grouper.names

                result = concat(values, axis=self.axis, keys=group_keys,
-                                levels=group_levels, names=group_names)
+                                levels=group_levels, names=group_names,


same here, there are like 10 calls to concat. I think should be explicit about sort

jreback · 2018-05-01T00:15:19Z

pandas/core/indexes/api.py

@@ -89,13 +110,19 @@ def conv(i):
        index = indexes[0]
        for other in indexes[1:]:
            if not index.equals(other):
+
+                if sort is None:


I would move _unique_indices from a nested function to module level (e.g. same as _union_indices, and maybe conform the spelling indices / indexes), and simply add a sort= kwarg (which you are already passing into fast_unique_multiple_lists). Then you can do the warning there. That just makes this whole function a bit simpler.

jreback · 2018-05-01T00:15:51Z

pandas/core/panel.py

            index = _get_objs_combined_axis(data.values(), axis=axis,
-                                            intersect=intersect)


so this always warns? shouldn't this be sort=False?

TomAugspurger · 2018-05-01T00:26:59Z

K, I'll try to go through all the internal calls to concat and set explicitly.

One thing that's made this difficult is that we don't really have the old option of sort only if not aligned anymore. So just setting sort=True or sort=False won't necessarily reproduce the old output. Maybe all of these will be OK though.

TomAugspurger · 2018-05-01T10:53:34Z

I'm going to push this to 0.23.1. I'm not confident enough that we have complete test coverage for all the cases of concat used internally.

I can pick it up after PyCon, or you're welcome to work on it whenever @brycepg.

jorisvandenbossche · 2018-05-01T19:23:58Z

If we don't include it in 0.23, I think it has to wait for 0.24. We shouldn't introduce new deprecation warnings in a bug fix release.

However, I don't fully understand what prevents merging it now. Are there still some internal use cases that you need to check (whether sort=True/False needs to be added)?
But as far as I see, there are no failing tests? Are there still warnings being raised?
I think in many cases in internal code, it's concatting aligned results (eg in groupby), and then there should be no change or warning.

TomAugspurger · 2018-05-01T19:30:24Z

Yeah. The concern is about internal calls to concat. For example, `Panel({b: df, a: df})` was just broken on this branch for a while with certain conditions. None of the constructor tests caught that, it was just in a pytables test that it was caught. Though perhaps panels isn't the best example, not sure how coverage is on that in general.


jorisvandenbossche · 2018-05-01T19:38:18Z

In what way was it broken? (there should for now mainly be a warning and new keyword, but not really change in default behaviour?)

I'm not confident enough that we have complete test coverage for all the cases of concat used internally.

I can't really say anything well-founded, as I haven't looked in enough detail at which tests broke while implementing this and what you needed to add. But since our tests are now passing, my gut feeling says: let's use the rc period to get more real-world testing on that ;)

TomAugspurger · 2018-05-01T19:45:51Z

In what way was it broken? (there should for now mainly be a warning and new keyword, but not really change in default behaviour?)

I don't recall the exact circumstances, but a MultiIndex was (or wasn't?) being sorted.

If you're OK with merging as is, then I can commit to adding more tests (and fixing bugs) between now and Friday. I just don't think it should hold up the release.

jorisvandenbossche · 2018-05-01T19:51:08Z

As I said, difficult to say myself, but I am OK with relying on your judgement here.
(and I certainly approve the gist of the PR)

TomAugspurger · 2018-05-01T20:05:58Z

Alright, let's do it.

#20909 for the followup.

brycepg force-pushed the master branch 2 times, most recently from f8484a3 to e3a2a34 Compare April 5, 2018 03:33

jreback requested changes Apr 5, 2018

brycepg force-pushed the master branch from e3a2a34 to bcf835a Compare April 5, 2018 04:27

brycepg force-pushed the master branch 2 times, most recently from c859aab to 4c817c1 Compare April 5, 2018 04:29

brycepg force-pushed the master branch from 4c817c1 to 913723b Compare April 5, 2018 06:06

TomAugspurger reviewed Apr 5, 2018

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 5, 2018

jreback requested changes Apr 5, 2018

TomAugspurger added 2 commits April 26, 2018 08:38

Merge remote-tracking branch 'upstream/master' into brycepg-master

5da763f

Updates

02b2db9

API: Updated the default to be compatible and warn. DOC: updated the whatsnew and concat docstring.

TomAugspurger added this to the 0.23.0 milestone Apr 26, 2018

TomAugspurger added 3 commits April 26, 2018 15:46

Test fallout

a497763

Updated append

954a1b6

versionadded

2a20377

TomAugspurger added 2 commits April 27, 2018 06:49

Squash more test warnings

35570c4

py2 compat

983d0c1

TomAugspurger added 7 commits April 30, 2018 08:16

Merge remote-tracking branch 'upstream/master' into brycepg-master

95cdf67

Merge remote-tracking branch 'brycepg/master' into brycepg-master

d10f5bd

Prune tests

e47cbb9

Default sort

0182c98

Make both tests happy

7e58998

Explicit columns

5b58e75

List of series

074d03c

jreback requested changes May 1, 2018

test, fix pivot

5e1b024

TomAugspurger modified the milestones: 0.23.0, 0.23.1 May 1, 2018

TomAugspurger mentioned this pull request May 1, 2018

TST: Additional tests for concat(sort) argument #20909

Closed

TomAugspurger merged commit c4da79b into pandas-dev:master May 1, 2018

TomAugspurger modified the milestones: 0.23.1, 0.23.0 May 1, 2018

jsexauer mentioned this pull request May 2, 2018

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

adbull mentioned this pull request May 17, 2018

DOC: error in 0.23.0 concat sort warning? #21101

Closed

h-vetinari mentioned this pull request Nov 15, 2018

BUG: concat warning bubbling up through str.cat #23725

Merged

addisonlynch mentioned this pull request Feb 22, 2019

COMPAT: Added explicit sort parameter to pd.concat in Yahoo Actions pydata/pandas-datareader#613

Merged

3 tasks

jreback mentioned this pull request Nov 25, 2019

DEPR: deprecations log for removed issues #13777

Closed

jbrockmendel mentioned this pull request Dec 13, 2019

DEPR: change DataFrame.append default sort kwarg #30251

Merged
