Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 #21224

Merged
merged 7 commits into from
Jul 28, 2018

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented May 27, 2018

This is a splitoff from #21123, to only fix #16679. #19629 will be fixed in a separate PR afterwards.

Passing functions to df.agg, df.transform and df.apply may use different methods when axis=1, than when,axis=0, and give different results when NaNs are supplied.

Explanation

Passing the functions in SelectionMixin._cython_table to df.agg should defer to use the relevant cython functions. This currently works as expected when axis=0, but not when axis=1.

The reason for this difference is that df.aggregate currently defers to df._aggregate when axis=0, but defers to df.apply, when axis=1, and these may give different result when passed functions and the series/frame contains Nan values. I've solved this by transposing df in DataFrame._aggragate when axis=1, and passing the possibly transposed on to the super method.

Also, df.apply delegates back to df.agg, when given lists or dicts as inputs, but only works when axis=0. This PR fixes this, so axis=1 works the as axis=0.

The tests have been heavily parametrized, helping ensure that various ways to call the methods now give correct results for both axes.

@WillAyd @jreback (reviewers of #21123)

@codecov
Copy link

codecov bot commented May 27, 2018

Codecov Report

Merging #21224 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21224      +/-   ##
==========================================
+ Coverage   92.05%   92.05%   +<.01%     
==========================================
  Files         170      170              
  Lines       50708    50716       +8     
==========================================
+ Hits        46677    46685       +8     
  Misses       4031     4031
Flag Coverage Δ
#multiple 90.46% <100%> (ø) ⬆️
#single 42.35% <21.73%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/apply.py 96.75% <100%> (-0.03%) ⬇️
pandas/core/generic.py 96.47% <100%> (-0.01%) ⬇️
pandas/core/frame.py 97.21% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a2fbce...39ced29. Read the comment docs.

if result is None:
return self.apply(func, axis=axis, args=args, **kwargs)
return result

@Appender(NDFrame._aggregate.__doc__, indents=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this doc string here

np_func, str_func = cython_table_items
expected = expected_dict[str_func]

if isinstance(expected, type) and issubclass(expected, Exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do not like mixing exceptions with good cases in a single test. can you split.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in principle, but think will make the code less readable, as I explained in #21123 (comment).

I've made an version where the different types of expected values are split into separate test functions, but IMO it's not an improvement in legibility...Can this not be kept as an exception?

return

result = frame.agg(np_func, axis=axis)
result_str_func = frame.agg(str_func, axis=axis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing an if like this, can you create another fixture (or 2), that splits the functions into 2 groups: aggregators and transformers


if isinstance(expected, type) and issubclass(expected, Exception):
with pytest.raises(expected):
# e.g. Series('a b'.split()).cumprod() will raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as above

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Numeric Operations Arithmetic, Comparison, and Logical operations labels May 29, 2018
@topper-123 topper-123 force-pushed the axis_1_agg_funcs branch 6 times, most recently from 689ce83 to 9c13256 Compare June 9, 2018 21:51
@@ -119,7 +119,7 @@ Offsets
Numeric
^^^^^^^

-
- :meth:`~DataFrame.agg` now handles built-in methods like ``sum`` in the same manner when axis=1 as when axis=0 (:issue:`21224`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just say.....sum with axis=1. (no need for the rest of the sentence)

@@ -170,3 +170,11 @@ def string_dtype(request):
* 'U'
"""
return request.param


@pytest.fixture(params=[0, 1], ids=lambda x: "axis {}".format(x))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually could also add 'index', 'columns' here as well, may need to call this axis_frame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may as well add axis_series

Copy link
Contributor Author

@topper-123 topper-123 Jun 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that this fixture could replace all instances where there's a parametrize for the axes. There are quite a few in the tests. Many of those don't support "index" and "columns" and it would give some failures (29 failures when I ran it with "index" and "columns").

I could add a axis_all fixture where params would be [0, 1, "index", "columns"]? Then people could differentiate between which one they need.

@@ -1795,7 +1795,7 @@ def error():
error()
raise
except:
error()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this need changin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old way hides the cause of the exception.

For example if the exception is in ax.contains you won't see where the actual error was in the traceback, as Python thinks the exeption is in error, while in reality is is somet´where else, hence the need to reraise.

Thought on closer inspection, just removing the two lines except: error() would be even clearer than reraising.

List of three items (DataFrame, function, expected result)
"""
table = pd.core.base.SelectionMixin._cython_table
if compat.PY36:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this first part a fixture in conftest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then not sure you really need this function, can you not compute this directly give the name of the function and table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this first part a fixture in conftest

Not possible; this function is not a test function, so does not take fixtures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then not sure you really need this function, can you not compute this directly give the name of the function and table?

I'd want the the functions to take the lefthand side functions in _cython_table as input, so each one is tested individually.

An alternative would be to pass the strings into the test functions (e.g. "sum"), and then get the relevant functions inside the test functions. That would mean that each test would run several subtests, which is a different kind of issue/problem.

So I think you have to choose between different kinds of ugliness here. I'd personally prefer my original solution. I'f someone has an idea for improvement, I'm wiling to give it another run, though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can easily BE a fixture, IOW you put this function in conftest.py and call it to create the params FOR a fixture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've almost dried my brain looking into how to do this, I just can't see it... The function _get_cython_table_params can't take fixtures, as it's not a test function. Do you mean simply import it from conttest.py? (from pandas.conftest import cython_table or similar) I assume that's not what you mean, tough.

Could you maybe spell it out how you see the implementation for me? I would be very grateful for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a fixture for cython table items. Can't use fixtures in _get_cython_table_params, so do an import from conttest.py.

The build in CirceCi is a resourceerror, so unrelated to this PR.

@pep8speaks
Copy link

pep8speaks commented Jun 23, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 26, 2018 at 21:03 Hours UTC

@jreback jreback mentioned this pull request Jul 7, 2018
3 tasks
@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

can you rebase

@topper-123
Copy link
Contributor Author

Rebased and green.

@topper-123
Copy link
Contributor Author

Hold a bit pulling this in. I think I've found an issue that needs fixing.

@topper-123
Copy link
Contributor Author

topper-123 commented Jul 16, 2018

I discovered a bug in my df.aggregate implentation. Solving that unearthed several other bugs, as:

  • df.transform relies on df.aggregate
  • Fixing the bug for df.transform required tests for df.transform(func, axis=1), which unearthed a bug in df.apply(func, axis=1), which also needed fixing

Very much fun!

The most basic bug is that df.apply([func], axis=1) doesn't work in master. This fixes that, and subsequent bugs in df.transform and df.agg.

>>> from pandas.tests.frame.test_apply import TestDataFrameAggregate
>>> x = TestDataFrameAggregate()
>>> x.frame.apply([np.sqrt], axis=0)  # ok
                   A         B         C         D
                sqrt      sqrt      sqrt      sqrt
ukh1GVv7xX  1.503982       NaN  0.491756  0.787992
wIHkS3BsKA  1.111148       NaN  1.517181       NaN
YWAqLcsqUP       NaN  0.771865  0.760905  0.883760
QhHqzICDVo  0.448304  1.278314  0.517722  0.948910
vEqhEx4vkR       NaN       NaN       NaN  0.913545
4Q1LUyQq9C  0.998117  0.849454       NaN       NaN
XvrwqrxIIK       NaN  0.506467  0.366948  1.183702
xLAqsZsf4n  0.864088       NaN  0.825798  0.865628
Bx8AXfOzTv  1.025291       NaN  0.753349  0.108052
9LoWz0qpSu  0.724288       NaN  1.708485  0.107700
HZI74uGF4k       NaN  0.875412       NaN       NaN
t6Ds6vcRKU       NaN       NaN  0.651367  1.436792
SHSaccg7Wz  0.437198       NaN       NaN       NaN
iZe5ctx7w1  0.127975  1.119973  0.395028       NaN
3aoqjBGybQ       NaN       NaN       NaN  0.772779
saCF7tPAnv  0.719474       NaN  1.045202       NaN
2uo0g5oocb  0.132254  0.563252       NaN       NaN
3duZHu6SDk  0.628383       NaN  0.546263       NaN
p4ug60WPOR       NaN  0.641512  0.406602  0.683798
NbKillWJPL       NaN       NaN  0.699454       NaN
Yc7hoe5odY  0.709012       NaN       NaN       NaN
ZdtVhBzFQZ  0.516489       NaN       NaN  0.351688
SwaSfXJLVm       NaN  0.780858  1.394233  1.076297
LFYhRyDnhZ  0.564244  0.925111       NaN  0.359451
JvfkNGiVkv  1.290368       NaN  0.416229       NaN
iyXPFeeDM2  1.062631  0.532897       NaN       NaN
sy6MqqOmfN  1.443564  0.918141  0.963493       NaN
vkIoXhAka6       NaN       NaN       NaN  0.322189
6MDIrdaygD       NaN       NaN  0.610091       NaN
dhZgEA2cGP  1.505071       NaN       NaN       NaN
>>> x.frame.apply([np.sqrt], axis=1)
TypeError: ("'list' object is not callable", 'occurred at index J5bdGdv0g8')  # master
# this PR below
                        A         B         C         D
ukh1GVv7xX sqrt  1.503982       NaN  0.491756  0.787992
wIHkS3BsKA sqrt  1.111148       NaN  1.517181       NaN
YWAqLcsqUP sqrt       NaN  0.771865  0.760905  0.883760
QhHqzICDVo sqrt  0.448304  1.278314  0.517722  0.948910
vEqhEx4vkR sqrt       NaN       NaN       NaN  0.913545
4Q1LUyQq9C sqrt  0.998117  0.849454       NaN       NaN
XvrwqrxIIK sqrt       NaN  0.506467  0.366948  1.183702
xLAqsZsf4n sqrt  0.864088       NaN  0.825798  0.865628
Bx8AXfOzTv sqrt  1.025291       NaN  0.753349  0.108052
9LoWz0qpSu sqrt  0.724288       NaN  1.708485  0.107700
HZI74uGF4k sqrt       NaN  0.875412       NaN       NaN
t6Ds6vcRKU sqrt       NaN       NaN  0.651367  1.436792
SHSaccg7Wz sqrt  0.437198       NaN       NaN       NaN
iZe5ctx7w1 sqrt  0.127975  1.119973  0.395028       NaN
3aoqjBGybQ sqrt       NaN       NaN       NaN  0.772779
saCF7tPAnv sqrt  0.719474       NaN  1.045202       NaN
2uo0g5oocb sqrt  0.132254  0.563252       NaN       NaN
3duZHu6SDk sqrt  0.628383       NaN  0.546263       NaN
p4ug60WPOR sqrt       NaN  0.641512  0.406602  0.683798
NbKillWJPL sqrt       NaN       NaN  0.699454       NaN
Yc7hoe5odY sqrt  0.709012       NaN       NaN       NaN
ZdtVhBzFQZ sqrt  0.516489       NaN       NaN  0.351688
SwaSfXJLVm sqrt       NaN  0.780858  1.394233  1.076297
LFYhRyDnhZ sqrt  0.564244  0.925111       NaN  0.359451
JvfkNGiVkv sqrt  1.290368       NaN  0.416229       NaN
iyXPFeeDM2 sqrt  1.062631  0.532897       NaN       NaN
sy6MqqOmfN sqrt  1.443564  0.918141  0.963493       NaN
vkIoXhAka6 sqrt       NaN       NaN       NaN  0.322189
6MDIrdaygD sqrt       NaN       NaN  0.610091       NaN
dhZgEA2cGP sqrt  1.505071       NaN       NaN       NaN

Similar problem where in master when calling df.transform([func], axis=1) and df.agg([func], axis=1).

Likewise with many functions, now df.agg(['mean', 'sum'], axis=1) is possible, while previously only df.agg(['mean', 'sum'], axis=0) was possible.

# dispatch to agg
if isinstance(self.f, (list, dict)):
return self.obj.aggregate(self.f, axis=self.axis,
*self.args, **self.kwds)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(self.f, (list, dict)) should also be called when axis=1, so moved up to FrameApply.get_results.


return result

cls.transform = transform
Copy link
Contributor Author

@topper-123 topper-123 Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when transform was added by calling _add_series_or_dataframe_operations that method shadowed a transform method on DataFrame. As transform doesnt need to be added in any special way, I just moved it to be a normal instance method.

f_abs = np.abs(self.frame)
f_sqrt = np.sqrt(self.frame)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having "absolute" come before "sqrt" maintains alphabetical ordering, and makes creating multindexes easier below.

@topper-123 topper-123 force-pushed the axis_1_agg_funcs branch 5 times, most recently from 4b5b2c3 to 18bdf54 Compare July 16, 2018 16:37
@topper-123 topper-123 force-pushed the axis_1_agg_funcs branch 3 times, most recently from 1f50505 to 06f1df7 Compare July 25, 2018 20:57
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small change, otherwise lgtm. ping on green.

@@ -2826,3 +2826,28 @@ def skipna_wrapper(x):
return alternative(nona)

return skipna_wrapper


def _get_cython_table_params(ndframe, func_names_and_expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm didn't realize you are actually importing from conftest, so on 2nd though, move this to pandas.conftest and import from there (you have to explicity import non-fixtures FYI)

@jreback jreback added this to the 0.24.0 milestone Jul 26, 2018
@topper-123
Copy link
Contributor Author

green

@jreback jreback merged commit 848b69c into pandas-dev:master Jul 28, 2018
@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

thanks @topper-123 nice patch!

minggli added a commit to minggli/pandas that referenced this pull request Jul 28, 2018
* master:
  BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  CLN: Vbench to asv conversion script (pandas-dev#22089)
  consistent docstring (pandas-dev#22066)
  TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  BUG: rolling with MSVC 2017 build (pandas-dev#21813)
topper-123 added a commit to topper-123/pandas that referenced this pull request Jul 29, 2018
topper-123 added a commit to topper-123/pandas that referenced this pull request Jul 29, 2018
jreback pushed a commit that referenced this pull request Jul 29, 2018
@topper-123 topper-123 deleted the axis_1_agg_funcs branch September 16, 2018 23:20
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agg method with list of functions does not work with axis=1
3 participants