Stop concat from attempting to sort mismatched columns by default #20613
Hello @brycepg! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 01, 2018 at 00:20 Hours UTC |
f8484a3
to
e3a2a34
Compare
needs a few more tests
replicate any tests from the original issue
needs a whatsnew note
will comment on the impl later
pandas/core/reshape/concat.py
Outdated
@@ -20,7 +20,7 @@

 def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
            keys=None, levels=None, names=None, verify_integrity=False,
-           copy=True):
+           copy=True, mismatch_sort=False):
just call this sort
put before copy
Codecov Report
@@            Coverage Diff            @@
##           master   #20613    +/-   ##
=========================================
+ Coverage   91.78%   91.78%   +<.01%
=========================================
  Files         153      153
  Lines       49324    49348     +24
=========================================
+ Hits        45272    45296     +24
  Misses       4052     4052
c859aab
to
4c817c1
Compare
Preserve column order upon concatenation to obey least astonishment principle. Allow old behavior to be enabled by adding a boolean switch to concat and DataFrame.append, mismatch_sort, which is by default disabled. Close pandas-dev#4588
Thanks for working on this!
|
I've made the requested changes
|
Ah, yes, that might be needed as well. But what I meant was other keywords of concat itself. Eg does the sort keyword work for both |
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1160,6 +1160,7 @@ Reshaping
 - Bug in :meth:`DataFrame.astype` where column metadata is lost when converting to categorical or a dictionary of dtypes (:issue:`19920`)
 - Bug in :func:`cut` and :func:`qcut` where timezone information was dropped (:issue:`19872`)
 - Bug in :class:`Series` constructor with a ``dtype=str``, previously raised in some cases (:issue:`19853`)
+- Stop :func:`concat` and ``Dataframe.append`` from sorting columns by default. Use ``sort=True`` to retain old behavior (:issue:`4588`)
This should be sort=True by default to preserve backwards compatibility, right?
Or rather, I think the eventual goal is to have sort=False be the default, so for now it should be:
- sort=None is the default
- If the default is passed, use sort=True and warn that the default is changing in the future
- If True or False, no warning.
actually this needs a sub-section. this is a rather large change (even if it's None by default). highlighting it is best. pls show an example of previous and new
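A before-and-after example of the kind requested here could look like the sketch below. It assumes a pandas release that already has the `sort` keyword on `concat`, and it borrows the two small frames used in this PR's tests; it is an illustration, not the text that ended up in the whatsnew file.

```python
import pandas as pd

dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])

# Previous behavior, now opt-in: the mismatched columns are sorted.
sorted_cols = pd.concat([dfa, dfb], sort=True).columns.tolist()

# New behavior: columns keep their order of first appearance.
unsorted_cols = pd.concat([dfa, dfb], sort=False).columns.tolist()

print(sorted_cols)    # ['A', 'C', 'Z']
print(unsorted_cols)  # ['C', 'A', 'Z']
```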
@TomAugspurger If I do what @jorisvandenbossche suggests, then sort=True will not be backwards compatible because it will sort the axes in question regardless of whether the columns are mismatched.
I could have sort=None be the default, give a warning and revert to old behavior. In future versions this behavior of only sorting the axes sometimes would not be available because it doesn't make sense, and concat could default to sort=False
@brycepg I think we can do both (it might complicate the code a bit, but not too much I think, as in _get_combined_index those cases are already handled separately). As @TomAugspurger suggests, the default can be None for now, so we can raise a warning in the appropriate cases:
- when all axes are equal (current case when there is already no sorting): no warning should be raised, as the future default of sort=False will not change anything, but add the ability to also sort the index with sort=True
- when not all axes are equal (current case that the unwanted sorting happens): issue a warning that in the future this will no longer sort
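The two cases above can be sketched in plain Python. `combine_labels` and its warning text are hypothetical stand-ins for the index-combining step, not the pandas implementation; the sketch only shows when the `sort=None` default would warn and what each setting returns.

```python
import warnings


def combine_labels(label_lists, sort=None):
    """Hypothetical helper: combine axis labels from several objects."""
    first = list(label_lists[0])
    if all(list(labels) == first for labels in label_lists[1:]):
        # Aligned axes: the old code never sorted here, so no warning.
        # Only an explicit sort=True reorders the labels.
        return sorted(first) if sort else first

    if sort is None:
        # Mismatched axes with the default: keep the old sorting for now,
        # but tell the user that the default will change.
        warnings.warn(
            "Sorting because the non-concatenation axis is not aligned. "
            "A future version will not sort by default; pass sort=True "
            "or sort=False to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        sort = True

    # Union of all labels in order of first appearance.
    union = list(dict.fromkeys(
        label for labels in label_lists for label in labels))
    return sorted(union) if sort else union
```

Passing `sort=True` or `sort=False` explicitly never warns, which matches the "If True or False, no warning" rule proposed earlier in the thread.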
pandas/core/frame.py
Outdated
@@ -5982,7 +5982,8 @@ def infer(x):
     # ----------------------------------------------------------------------
     # Merging / joining methods

-    def append(self, other, ignore_index=False, verify_integrity=False):
+    def append(self, other, ignore_index=False,
+               verify_integrity=False, sort=False):
sort before verify_integrity
@jreback why do you want sort before verify_integrity?
pandas/core/frame.py
Outdated
@@ -5995,6 +5996,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
             If True, do not use the index labels.
         verify_integrity : boolean, default False
             If True, raise ValueError on creating index with duplicates.
+        sort: boolean, default False
+            Sort columns if given object doesn't have the same columns
needs a versionadded.
use does not
@@ -20,7 +20,7 @@

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False,
           copy=True):
actually move before verify_integrity
pls do this
Why make this an API breaking change?
because it's more logical.
How so?
I'm OK with breaking API when necessary, but this seems unnecessary.
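To make the API-breaking concern concrete, here is a small sketch using simplified, hypothetical stand-ins for the `append` signature (no `self`, empty bodies). It shows how an existing positional call binds when `sort` is appended at the end versus inserted before `verify_integrity`.

```python
import inspect


def append_old(other, ignore_index=False, verify_integrity=False):
    pass


def append_sort_last(other, ignore_index=False, verify_integrity=False,
                     sort=None):
    pass


def append_sort_first(other, ignore_index=False, sort=None,
                      verify_integrity=False):
    pass


def bound_arguments(func, *args):
    """Return the name -> value mapping a positional call would produce."""
    return dict(inspect.signature(func).bind(*args).arguments)


# Existing user code that passes verify_integrity=True positionally.
call = ("other_frame", False, True)

old = bound_arguments(append_old, *call)
last = bound_arguments(append_sort_last, *call)
first = bound_arguments(append_sort_first, *call)

print(old)    # True is bound to verify_integrity
print(last)   # same binding as before: no behavior change
print(first)  # True is now bound to sort: a silent behavior change
```

Callers who pass these options by keyword are unaffected either way; only positional callers see a difference.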
pandas/core/reshape/concat.py
Outdated
@@ -60,6 +60,8 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
     verify_integrity : boolean, default False
         Check whether the new concatenated axis contains duplicates. This can
         be very expensive relative to the actual data concatenation
+    sort : boolean, default False
similar to above
pandas/tests/reshape/test_concat.py
Outdated
dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])
result = pd.concat([dfa, dfb])
assert result.columns.tolist() == ['C', 'A', 'Z']
create an expected frame and use assert_frame_equal
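One way to follow this suggestion is sketched below. It assumes a pandas release with the `sort` keyword and uses the public `pd.testing.assert_frame_equal` rather than whichever test helper the pandas test suite imported at the time; the expected values follow from the two input frames.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(columns=['C', 'A'], data=[[1, 2]])
dfb = pd.DataFrame(columns=['C', 'Z'], data=[[5, 6]])

result = pd.concat([dfa, dfb], sort=False)

# Columns in order of first appearance; cells missing from one input are
# NaN, and both rows keep their original index label 0.
expected = pd.DataFrame(
    {'C': [1, 5], 'A': [2, np.nan], 'Z': [np.nan, 6]},
    index=[0, 0],
)

# Raises AssertionError if values, dtypes, index or column order differ.
pd.testing.assert_frame_equal(result, expected)
```

Comparing whole frames checks the data and dtypes as well as the column order, which the `columns.tolist()` assertion alone does not.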
pandas/tests/reshape/test_concat.py
Outdated
df['a'] = [1, 2, 3]
df2 = pd.DataFrame({'a': [4, 5]})
df3 = pd.concat([df, df2])
assert df3.columns.tolist() == ['b', 'c', 'a']
same
pandas/tests/reshape/test_concat.py
Outdated
df['c'] = [1, 2, 3]
df['a'] = [1, 2, 3]
df2 = pd.DataFrame({'a': [4, 5]})
df3 = pd.concat([df, df2])
use result =
@brycepg can you update based on Jeff's comments? Doing a release candidate soon (tomorrow or Wednesday hopefully), and it'd be nice to have this in. |
Sure I’ll try to do it tonight
|
@brycepg I pushed some changes to your branch.
Will you have time today to go through and update the tests to pass? I didn't address most of @jreback's comments yet. I'd like to do a release candidate tomorrow, so if you can't get to it today let me know and I'll push more fixes here. |
Picking this up. Would be good to have for 0.23. |
Can someone proofread the warning text? Specifically, does the term "non-concatenation axis" make sense? I could also call it the "non-expanding axis", or something else entirely. |
We'll need to address this.
In [2]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [1, 2]}, columns=['b', 'a', 'c'])
In [3]: df2 = pd.DataFrame({"a": [1, 2], 'c': [3, 4]}, index=[3, 4])
In [4]: pd.concat([df1, df2], join='inner')
Out[4]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4
In [5]: pd.concat([df1, df2], join='inner', sort=False)
Out[5]:
   a  c
0  1  1
1  2  2
3  1  3
4  2  4
I assume we want the same behavior as for |
Found one more issue in crosstab. Fixing now. |
@@ -507,7 +507,7 @@ def is_any_frame():
for r in compat.itervalues(result))

if isinstance(result, list):
    return concat(result, keys=keys, axis=1), True
there are a bunch more concats in this same section. basically they control the resulting ordering of the aggregation. I guess sort=True is fine here (for the other cases).
or maybe these should be sort=False, not sure what is actually affected (as these might be aligned ops already)
@@ -1098,7 +1098,8 @@ def reset_identity(values):
 group_names = self.grouper.names

 result = concat(values, axis=self.axis, keys=group_keys,
-                levels=group_levels, names=group_names)
+                levels=group_levels, names=group_names,
same here, there are like 10 calls to concat. I think should be explicit about sort
@@ -89,13 +110,19 @@ def conv(i):
index = indexes[0]
for other in indexes[1:]:
    if not index.equals(other):

        if sort is None:
I would move _unique_indices from a nested function to module level (e.g. same as _union_indices, and maybe conform the spelling indices / indexes), and simply add a sort= kwarg (which you are already passing into fast_unique_multiple_lists). Then you can do the warning there. just makes this whole function a bit simpler.
index = _get_objs_combined_axis(data.values(), axis=axis,
                                intersect=intersect)
so this always warns? shouldn't this be sort=False?
K, I'll try to go through all the internal calls to concat and set explicitly. One thing that's made this difficult is that we don't really have the old option of sort only if not aligned anymore. So just setting |
I'm going to push this to 0.23.1. I'm not confident enough that we have complete test coverage for all the cases of concat used internally. I can pick it up after PyCon, or you're welcome to work on it whenever @brycepg. |
If we don't include it in 0.23, I think it has to wait for 0.24. We shouldn't introduce new deprecation warnings in a bug fix release. However, I don't fully understand the problem to not merge it now. There are still some internal use cases that you need to check? (whether |
Yeah. The concern is about internal calls to concat.
For example, `Panel({b: df, a: df})` was just broken on this branch for a
while with certain conditions. None of the constructor tests caught that,
it was just in a pytables test that it was caught. Though perhaps panels
isn't the best example, not sure how coverage is on that in general.
…On Tue, May 1, 2018 at 2:24 PM, Joris Van den Bossche < ***@***.***> wrote:
If we don't include it in 0.23, I think it has to wait for 0.24. We
shouldn't introduce new deprecation warnings in a bug fix release.
However, I don't fully understand the problem to not merge it now. There
are still some internal use cases that you need to check? (whether
sort=True/False needs to be added?)
But as far as I see, there are no failing tests? Are there still warnings
being raised?
I think in many cases in internal code, it's concatting aligned results
(eg in groupby), and then there should be no change or warning.
|
In what way was it broken? (there should for now mainly be a warning and new keyword, but not really change in default behaviour?)
I can't really say anything founded, as I didn't look enough into detail to what tests have broken throughout implementing this and what you needed to add, but since our tests are now passing, my gut feeling would say: let's use the rc period to get more real-world testing on that .. ;) |
I don't recall the exact circumstances, but a MultiIndex was (or wasn't?) being sorted. If you're OK with merging as is, then I can commit to adding more tests (and fixing bugs) between now and Friday. I just don't think it should hold up the release. |
As I said, difficult to say myself, but I am OK with relying on your judgement here. |
Alright, let's do it. #20909 for the followup. |
Preserve column order upon concatenation to obey
least astonishment principle.
Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.
git diff upstream/master -u -- "*.py" | flake8 --diff