-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 #21224
Conversation
9c26e30
to
b87fc41
Compare
Codecov Report
@@ Coverage Diff @@
## master #21224 +/- ##
==========================================
+ Coverage 92.05% 92.05% +<.01%
==========================================
Files 170 170
Lines 50708 50716 +8
==========================================
+ Hits 46677 46685 +8
Misses 4031 4031
Continue to review full report at Codecov.
|
pandas/core/frame.py
Outdated
if result is None: | ||
return self.apply(func, axis=axis, args=args, **kwargs) | ||
return result | ||
|
||
@Appender(NDFrame._aggregate.__doc__, indents=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this doc string here
pandas/tests/frame/test_apply.py
Outdated
np_func, str_func = cython_table_items | ||
expected = expected_dict[str_func] | ||
|
||
if isinstance(expected, type) and issubclass(expected, Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like mixing exceptions with good cases in a single test. can you split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in principle, but think will make the code less readable, as I explained in #21123 (comment).
I've made an version where the different types of expected values are split into separate test functions, but IMO it's not an improvement in legibility...Can this not be kept as an exception?
pandas/tests/frame/test_apply.py
Outdated
return | ||
|
||
result = frame.agg(np_func, axis=axis) | ||
result_str_func = frame.agg(str_func, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than doing an if like this, can you create another fixture (or 2), that splits the functions into 2 groups: aggregators and transformers
pandas/tests/series/test_apply.py
Outdated
|
||
if isinstance(expected, type) and issubclass(expected, Exception): | ||
with pytest.raises(expected): | ||
# e.g. Series('a b'.split()).cumprod() will raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as above
689ce83
to
9c13256
Compare
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -119,7 +119,7 @@ Offsets | |||
Numeric | |||
^^^^^^^ | |||
|
|||
- | |||
- :meth:`~DataFrame.agg` now handles built-in methods like ``sum`` in the same manner when axis=1 as when axis=0 (:issue:`21224`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say.....sum
with axis=1
. (no need for the rest of the sentence)
pandas/conftest.py
Outdated
@@ -170,3 +170,11 @@ def string_dtype(request): | |||
* 'U' | |||
""" | |||
return request.param | |||
|
|||
|
|||
@pytest.fixture(params=[0, 1], ids=lambda x: "axis {}".format(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually could also add 'index', 'columns'
here as well, may need to call this axis_frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may as well add axis_series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that this fixture could replace all instances where there's a parametrize
for the axes. There are quite a few in the tests. Many of those don't support "index" and "columns" and it would give some failures (29 failures when I ran it with "index" and "columns").
I could add a axis_all
fixture where params would be [0, 1, "index", "columns"]? Then people could differentiate between which one they need.
pandas/core/indexing.py
Outdated
@@ -1795,7 +1795,7 @@ def error(): | |||
error() | |||
raise | |||
except: | |||
error() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this need changin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old way hides the cause of the exception.
For example if the exception is in ax.contains
you won't see where the actual error was in the traceback, as Python thinks the exeption is in error
, while in reality is is somet´where else, hence the need to reraise.
Thought on closer inspection, just removing the two lines except: error()
would be even clearer than reraising.
pandas/tests/frame/test_apply.py
Outdated
List of three items (DataFrame, function, expected result) | ||
""" | ||
table = pd.core.base.SelectionMixin._cython_table | ||
if compat.PY36: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this first part a fixture in conftest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then not sure you really need this function, can you not compute this directly give the name of the function and table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this first part a fixture in conftest
Not possible; this function is not a test function, so does not take fixtures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then not sure you really need this function, can you not compute this directly give the name of the function and table?
I'd want the the functions to take the lefthand side functions in _cython_table
as input, so each one is tested individually.
An alternative would be to pass the strings into the test functions (e.g. "sum"), and then get the relevant functions inside the test functions. That would mean that each test would run several subtests, which is a different kind of issue/problem.
So I think you have to choose between different kinds of ugliness here. I'd personally prefer my original solution. I'f someone has an idea for improvement, I'm wiling to give it another run, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can easily BE a fixture, IOW you put this function in conftest.py and call it to create the params FOR a fixture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've almost dried my brain looking into how to do this, I just can't see it... The function _get_cython_table_params
can't take fixtures, as it's not a test function. Do you mean simply import it from conttest.py? (from pandas.conftest import cython_table
or similar) I assume that's not what you mean, tough.
Could you maybe spell it out how you see the implementation for me? I would be very grateful for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a fixture for cython table items. Can't use fixtures in _get_cython_table_params
, so do an import from conttest.py.
The build in CirceCi is a resourceerror, so unrelated to this PR.
9c13256
to
962fcb1
Compare
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 26, 2018 at 21:03 Hours UTC |
962fcb1
to
48e1a63
Compare
can you rebase |
48e1a63
to
17dc5b9
Compare
Rebased and green. |
Hold a bit pulling this in. I think I've found an issue that needs fixing. |
I discovered a bug in my
Very much fun! The most basic bug is that >>> from pandas.tests.frame.test_apply import TestDataFrameAggregate
>>> x = TestDataFrameAggregate()
>>> x.frame.apply([np.sqrt], axis=0) # ok
A B C D
sqrt sqrt sqrt sqrt
ukh1GVv7xX 1.503982 NaN 0.491756 0.787992
wIHkS3BsKA 1.111148 NaN 1.517181 NaN
YWAqLcsqUP NaN 0.771865 0.760905 0.883760
QhHqzICDVo 0.448304 1.278314 0.517722 0.948910
vEqhEx4vkR NaN NaN NaN 0.913545
4Q1LUyQq9C 0.998117 0.849454 NaN NaN
XvrwqrxIIK NaN 0.506467 0.366948 1.183702
xLAqsZsf4n 0.864088 NaN 0.825798 0.865628
Bx8AXfOzTv 1.025291 NaN 0.753349 0.108052
9LoWz0qpSu 0.724288 NaN 1.708485 0.107700
HZI74uGF4k NaN 0.875412 NaN NaN
t6Ds6vcRKU NaN NaN 0.651367 1.436792
SHSaccg7Wz 0.437198 NaN NaN NaN
iZe5ctx7w1 0.127975 1.119973 0.395028 NaN
3aoqjBGybQ NaN NaN NaN 0.772779
saCF7tPAnv 0.719474 NaN 1.045202 NaN
2uo0g5oocb 0.132254 0.563252 NaN NaN
3duZHu6SDk 0.628383 NaN 0.546263 NaN
p4ug60WPOR NaN 0.641512 0.406602 0.683798
NbKillWJPL NaN NaN 0.699454 NaN
Yc7hoe5odY 0.709012 NaN NaN NaN
ZdtVhBzFQZ 0.516489 NaN NaN 0.351688
SwaSfXJLVm NaN 0.780858 1.394233 1.076297
LFYhRyDnhZ 0.564244 0.925111 NaN 0.359451
JvfkNGiVkv 1.290368 NaN 0.416229 NaN
iyXPFeeDM2 1.062631 0.532897 NaN NaN
sy6MqqOmfN 1.443564 0.918141 0.963493 NaN
vkIoXhAka6 NaN NaN NaN 0.322189
6MDIrdaygD NaN NaN 0.610091 NaN
dhZgEA2cGP 1.505071 NaN NaN NaN
>>> x.frame.apply([np.sqrt], axis=1)
TypeError: ("'list' object is not callable", 'occurred at index J5bdGdv0g8') # master
# this PR below
A B C D
ukh1GVv7xX sqrt 1.503982 NaN 0.491756 0.787992
wIHkS3BsKA sqrt 1.111148 NaN 1.517181 NaN
YWAqLcsqUP sqrt NaN 0.771865 0.760905 0.883760
QhHqzICDVo sqrt 0.448304 1.278314 0.517722 0.948910
vEqhEx4vkR sqrt NaN NaN NaN 0.913545
4Q1LUyQq9C sqrt 0.998117 0.849454 NaN NaN
XvrwqrxIIK sqrt NaN 0.506467 0.366948 1.183702
xLAqsZsf4n sqrt 0.864088 NaN 0.825798 0.865628
Bx8AXfOzTv sqrt 1.025291 NaN 0.753349 0.108052
9LoWz0qpSu sqrt 0.724288 NaN 1.708485 0.107700
HZI74uGF4k sqrt NaN 0.875412 NaN NaN
t6Ds6vcRKU sqrt NaN NaN 0.651367 1.436792
SHSaccg7Wz sqrt 0.437198 NaN NaN NaN
iZe5ctx7w1 sqrt 0.127975 1.119973 0.395028 NaN
3aoqjBGybQ sqrt NaN NaN NaN 0.772779
saCF7tPAnv sqrt 0.719474 NaN 1.045202 NaN
2uo0g5oocb sqrt 0.132254 0.563252 NaN NaN
3duZHu6SDk sqrt 0.628383 NaN 0.546263 NaN
p4ug60WPOR sqrt NaN 0.641512 0.406602 0.683798
NbKillWJPL sqrt NaN NaN 0.699454 NaN
Yc7hoe5odY sqrt 0.709012 NaN NaN NaN
ZdtVhBzFQZ sqrt 0.516489 NaN NaN 0.351688
SwaSfXJLVm sqrt NaN 0.780858 1.394233 1.076297
LFYhRyDnhZ sqrt 0.564244 0.925111 NaN 0.359451
JvfkNGiVkv sqrt 1.290368 NaN 0.416229 NaN
iyXPFeeDM2 sqrt 1.062631 0.532897 NaN NaN
sy6MqqOmfN sqrt 1.443564 0.918141 0.963493 NaN
vkIoXhAka6 sqrt NaN NaN NaN 0.322189
6MDIrdaygD sqrt NaN NaN 0.610091 NaN
dhZgEA2cGP sqrt 1.505071 NaN NaN NaN Similar problem where in master when calling Likewise with many functions, now |
# dispatch to agg | ||
if isinstance(self.f, (list, dict)): | ||
return self.obj.aggregate(self.f, axis=self.axis, | ||
*self.args, **self.kwds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(self.f, (list, dict))
should also be called when axis=1, so moved up to FrameApply.get_results
.
|
||
return result | ||
|
||
cls.transform = transform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when transform was added by calling _add_series_or_dataframe_operations
that method shadowed a transform
method on DataFrame
. As transform doesnt need to be added in any special way, I just moved it to be a normal instance method.
f_abs = np.abs(self.frame) | ||
f_sqrt = np.sqrt(self.frame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having "absolute" come before "sqrt" maintains alphabetical ordering, and makes creating multindexes easier below.
4b5b2c3
to
18bdf54
Compare
1f50505
to
06f1df7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small change, otherwise lgtm. ping on green.
pandas/util/testing.py
Outdated
@@ -2826,3 +2826,28 @@ def skipna_wrapper(x): | |||
return alternative(nona) | |||
|
|||
return skipna_wrapper | |||
|
|||
|
|||
def _get_cython_table_params(ndframe, func_names_and_expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm didn't realize you are actually importing from conftest, so on 2nd though, move this to pandas.conftest
and import from there (you have to explicity import non-fixtures FYI)
green |
06f1df7
to
caaa912
Compare
caaa912
to
39ced29
Compare
thanks @topper-123 nice patch! |
* master: BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807) BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224) BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957) CLN: Vbench to asv conversion script (pandas-dev#22089) consistent docstring (pandas-dev#22066) TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099) CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740) DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058) BUG: rolling with MSVC 2017 build (pandas-dev#21813)
…s=1 than when axis=0 (pandas-dev#21224)
git diff upstream/master -u -- "*.py" | flake8 --diff
This is a splitoff from #21123, to only fix #16679. #19629 will be fixed in a separate PR afterwards.
Passing functions to
df.agg
,df.transform
anddf.apply
may use different methods whenaxis=1
, than when,axis=0
, and give different results when NaNs are supplied.Explanation
Passing the functions in
SelectionMixin._cython_table
todf.agg
should defer to use the relevant cython functions. This currently works as expected whenaxis=0
, but not whenaxis=1
.The reason for this difference is that
df.aggregate
currently defers todf._aggregate
whenaxis=0
, but defers todf.apply
, whenaxis=1
, and these may give different result when passed functions and the series/frame contains Nan values. I've solved this by transposing df inDataFrame._aggragate
whenaxis=1
, and passing the possibly transposed on to the super method.Also,
df.apply
delegates back todf.agg
, when given lists or dicts as inputs, but only works when axis=0. This PR fixes this, so axis=1 works the as axis=0.The tests have been heavily parametrized, helping ensure that various ways to call the methods now give correct results for both axes.
@WillAyd @jreback (reviewers of #21123)