BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 #21224

topper-123 · 2018-05-27T21:40:04Z

closes agg method with list of functions does not work with axis=1 #16679
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is a splitoff from #21123, to only fix #16679. #19629 will be fixed in a separate PR afterwards.

Passing functions to df.agg, df.transform and df.apply may use different methods when axis=1 than when axis=0, and give different results when the data contains NaNs.

Explanation

Passing the functions in SelectionMixin._cython_table to df.agg should defer to use the relevant cython functions. This currently works as expected when axis=0, but not when axis=1.

The reason for this difference is that df.aggregate currently defers to df._aggregate when axis=0, but defers to df.apply when axis=1, and these may give different results when passed functions and the series/frame contains NaN values. I've solved this by transposing df in DataFrame._aggregate when axis=1, and passing the possibly transposed frame on to the super method.

Also, df.apply delegates back to df.agg when given lists or dicts as inputs, but this only works when axis=0. This PR fixes this, so axis=1 works the same as axis=0.

The tests have been heavily parametrized, helping ensure that various ways to call the methods now give correct results for both axes.
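
As a rough illustration of the behaviour this PR aims for (not code from the patch), the sketch below uses a small made-up frame containing NaNs and checks that row-wise aggregation agrees with column-wise aggregation of the transposed frame. String function names are used on the assumption that they resolve to the NaN-skipping DataFrame methods in current pandas releases.

```python
import numpy as np
import pandas as pd

# Made-up data; NaNs are what exposed the axis=0 / axis=1 discrepancy.
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]},
    index=["x", "y", "z"],
)

# A single function name: axis=1 should agree with axis=0 on the transpose.
row_sums = df.agg("sum", axis=1)
col_sums_of_transpose = df.T.agg("sum", axis=0)
pd.testing.assert_series_equal(row_sums, col_sums_of_transpose)

# A list of function names: one result row per original row label.
row_stats = df.agg(["sum", "mean"], axis=1)
pd.testing.assert_frame_equal(row_stats, df.T.agg(["sum", "mean"], axis=0).T)
```

Transposing first and reusing the axis=0 code path is the idea described in the explanation above; the check here only compares results and says nothing about which internal method is called.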

@WillAyd @jreback (reviewers of #21123)

codecov · 2018-05-27T22:49:57Z

Codecov Report

Merging #21224 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21224      +/-   ##
==========================================
+ Coverage   92.05%   92.05%   +<.01%     
==========================================
  Files         170      170              
  Lines       50708    50716       +8     
==========================================
+ Hits        46677    46685       +8     
  Misses       4031     4031

Flag	Coverage Δ
#multiple	`90.46% <100%> (ø)`	⬆️
#single	`42.35% <21.73%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/apply.py	`96.75% <100%> (-0.03%)`	⬇️
pandas/core/generic.py	`96.47% <100%> (-0.01%)`	⬇️
pandas/core/frame.py	`97.21% <100%> (+0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7a2fbce...39ced29.

jreback · 2018-05-29T01:03:12Z

pandas/core/frame.py

        if result is None:
            return self.apply(func, axis=axis, args=args, **kwargs)
        return result

+    @Appender(NDFrame._aggregate.__doc__, indents=2)


I don't think you need this doc string here

jreback · 2018-05-29T01:04:11Z

pandas/tests/frame/test_apply.py

+        np_func, str_func = cython_table_items
+        expected = expected_dict[str_func]
+
+        if isinstance(expected, type) and issubclass(expected, Exception):


I really do not like mixing exceptions with good cases in a single test. can you split.

I agree in principle, but think it will make the code less readable, as I explained in #21123 (comment).

I've made a version where the different types of expected values are split into separate test functions, but IMO it's not an improvement in legibility... Can this not be kept as an exception?
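
For readers following the disagreement, here is a minimal sketch of what "splitting" could look like: one parametrized test for inputs that return a value and a separate one for inputs that raise. The test names and data are hypothetical, not taken from the PR; object dtype is requested explicitly on the assumption that a cumulative product over strings then raises TypeError.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("func, expected", [("sum", 6), ("max", 3), ("min", 1)])
def test_agg_returns_expected(func, expected):
    # Good cases only: every parameter set yields a comparable result.
    ser = pd.Series([1, 2, 3])
    assert ser.agg(func) == expected


@pytest.mark.parametrize("func", ["cumprod"])
def test_agg_raises_on_strings(func):
    # Failing cases only: no "is the expected value an exception?" branch.
    ser = pd.Series("a b".split(), dtype=object)
    with pytest.raises(TypeError):
        ser.agg(func)
```

The trade-off raised in the reply still applies: the split removes the isinstance branch but doubles the number of test functions and parameter lists to maintain.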

jreback · 2018-05-29T01:05:40Z

pandas/tests/frame/test_apply.py

+            return
+
+        result = frame.agg(np_func, axis=axis)
+        result_str_func = frame.agg(str_func, axis=axis)


rather than doing an if like this, can you create another fixture (or 2), that splits the functions into 2 groups: aggregators and transformers

jreback · 2018-05-29T01:05:54Z

pandas/tests/series/test_apply.py

+
+        if isinstance(expected, type) and issubclass(expected, Exception):
+            with pytest.raises(expected):
+                # e.g. Series('a b'.split()).cumprod() will raise


same comments as above

jreback · 2018-06-19T01:04:59Z

doc/source/whatsnew/v0.24.0.txt

@@ -119,7 +119,7 @@ Offsets
 Numeric
 ^^^^^^^

-
+- :meth:`~DataFrame.agg` now handles built-in methods like ``sum`` in the same manner when axis=1 as when axis=0 (:issue:`21224`)


just say.....sum with axis=1. (no need for the rest of the sentence)

jreback · 2018-06-19T01:05:52Z

pandas/conftest.py

@@ -170,3 +170,11 @@ def string_dtype(request):
    * 'U'
    """
    return request.param
+
+
+@pytest.fixture(params=[0, 1], ids=lambda x: "axis {}".format(x))


actually could also add 'index', 'columns' here as well, may need to call this axis_frame

may as well add axis_series

I'm thinking that this fixture could replace all instances where there's a parametrize for the axes. There are quite a few in the tests. Many of those don't support "index" and "columns" and it would give some failures (29 failures when I ran it with "index" and "columns").

I could add an axis_all fixture where params would be [0, 1, "index", "columns"]? Then people could differentiate between which one they need.
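
A sketch of how the two fixtures discussed here might sit side by side in a conftest.py. The fixture names follow the suggestions in this thread, but the bodies and the example test are assumptions for illustration, not the code that was merged.

```python
import pandas as pd
import pytest

AXIS_NUMBERS = [0, 1]
AXIS_ALL = [0, 1, "index", "columns"]


@pytest.fixture(params=AXIS_NUMBERS, ids=lambda x: "axis {!r}".format(x))
def axis_frame(request):
    # For tests whose code under test only accepts integer axes.
    return request.param


@pytest.fixture(params=AXIS_ALL, ids=lambda x: "axis {!r}".format(x))
def axis_all(request):
    # For tests that should also cover the string aliases.
    return request.param


def test_agg_sum_matches_sum(axis_all):
    # Hypothetical test: agg("sum") should agree with sum() for every axis spelling.
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    pd.testing.assert_series_equal(df.agg("sum", axis=axis_all), df.sum(axis=axis_all))
```

Keeping two fixtures lets existing tests that fail with "index" and "columns" opt into the narrower one without being rewritten.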

jreback · 2018-06-19T01:07:08Z

pandas/core/indexing.py

@@ -1795,7 +1795,7 @@ def error():
                    error()
                raise
            except:
-                error()


why did this need changing?

The old way hides the cause of the exception.

For example, if the exception is raised in ax.contains, you won't see where the actual error was in the traceback, as Python thinks the exception is in error, while in reality it is somewhere else, hence the need to reraise.

Though on closer inspection, just removing the two lines except: error() would be even clearer than reraising.
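
The effect can be sketched in isolation. Everything below is a simplified, hypothetical stand-in for the indexing code (the function names are invented): a bare except turns an unrelated failure into the generic KeyError, whereas without it the real exception reaches the caller. Python 3 does keep the original exception as `__context__`, but the type and message the caller sees are still the generic ones.

```python
def error():
    raise KeyError("the label is not in the index")


def contains(key):
    # Hypothetical buggy lookup: fails for a reason unrelated to a missing label.
    return key in None  # raises TypeError


def validate_masking(key):
    try:
        if not contains(key):
            error()
    except:  # noqa: E722 -- deliberately bare, mirrors the removed lines
        error()


def validate_propagating(key):
    # No bare except: whatever contains() raises reaches the caller unchanged.
    if not contains(key):
        error()


def exception_type(func, key):
    # Small helper that reports which exception type a call ends with.
    try:
        func(key)
    except Exception as exc:
        return type(exc)
    return None


print(exception_type(validate_masking, "a"))
print(exception_type(validate_propagating, "a"))
```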

jreback · 2018-06-19T01:09:16Z

pandas/tests/frame/test_apply.py

+        List of three items (DataFrame, function, expected result)
+    """
+    table = pd.core.base.SelectionMixin._cython_table
+    if compat.PY36:


can you make this first part a fixture in conftest

then not sure you really need this function, can you not compute this directly given the name of the function and table?

> can you make this first part a fixture in conftest

Not possible; this function is not a test function, so it does not take fixtures.

> then not sure you really need this function, can you not compute this directly given the name of the function and table?

I'd want the functions to take the left-hand side functions in _cython_table as input, so each one is tested individually.

An alternative would be to pass the strings into the test functions (e.g. "sum"), and then get the relevant functions inside the test functions. That would mean that each test would run several subtests, which is a different kind of issue/problem.

So I think you have to choose between different kinds of ugliness here. I'd personally prefer my original solution. If someone has an idea for improvement, I'm willing to give it another run, though.

this can easily BE a fixture, IOW you put this function in conftest.py and call it to create the params FOR a fixture.

I've almost dried my brain looking into how to do this, I just can't see it... The function _get_cython_table_params can't take fixtures, as it's not a test function. Do you mean simply importing it from conftest.py (from pandas.conftest import cython_table or similar)? I assume that's not what you mean, though.

Could you maybe spell it out how you see the implementation for me? I would be very grateful for that.

I've added a fixture for cython table items. Fixtures can't be used in _get_cython_table_params, so I do an import from conftest.py.
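
To make the pattern concrete, here is a hedged sketch of a plain helper that expands function names into parameter tuples at import time and feeds them to pytest.mark.parametrize. FUNC_TABLE and get_table_params are hypothetical stand-ins for the private cython table and the PR's helper; the real signatures may differ.

```python
import numpy as np
import pandas as pd
import pytest

# Hypothetical stand-in for the (callable, name) pairs of the private table.
FUNC_TABLE = [(sum, "sum"), (np.sum, "sum"), (max, "max"), (np.max, "max")]


def get_table_params(ndframe, names_and_expected):
    """Return (ndframe, func, expected) for each name and each callable mapped to it."""
    params = []
    for name, expected in names_and_expected:
        params.append((ndframe, name, expected))
        params += [
            (ndframe, func, expected)
            for func, func_name in FUNC_TABLE
            if func_name == name
        ]
    return params


FRAME = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
PARAMS = get_table_params(
    FRAME,
    [("sum", pd.Series({"A": 3, "B": 7})), ("max", pd.Series({"A": 2, "B": 4}))],
)


# How pandas treats the bare callables (alias vs. direct call) varies by version.
@pytest.mark.parametrize("frame, func, expected", PARAMS)
def test_agg_table_funcs(frame, func, expected):
    pd.testing.assert_series_equal(frame.agg(func), expected)
```

Because the helper is an ordinary function, it has to be imported explicitly wherever it is used; only fixtures are picked up from conftest.py automatically.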

The CircleCI build failure is a resource error, so unrelated to this PR.

pep8speaks · 2018-06-23T10:09:10Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 26, 2018 at 21:03 Hours UTC

jreback · 2018-07-07T22:53:23Z

can you rebase

topper-123 · 2018-07-15T16:58:49Z

Rebased and green.

topper-123 · 2018-07-15T21:11:02Z

Hold off a bit on pulling this in. I think I've found an issue that needs fixing.

topper-123 · 2018-07-16T14:09:04Z

I discovered a bug in my df.aggregate implementation. Solving that unearthed several other bugs, as:

df.transform relies on df.aggregate
Fixing the bug for df.transform required tests for df.transform(func, axis=1), which unearthed a bug in df.apply(func, axis=1), which also needed fixing

Very much fun!

The most basic bug is that df.apply([func], axis=1) doesn't work in master. This fixes that, and subsequent bugs in df.transform and df.agg.

>>> from pandas.tests.frame.test_apply import TestDataFrameAggregate
>>> x = TestDataFrameAggregate()
>>> x.frame.apply([np.sqrt], axis=0)  # ok
                   A         B         C         D
                sqrt      sqrt      sqrt      sqrt
ukh1GVv7xX  1.503982       NaN  0.491756  0.787992
wIHkS3BsKA  1.111148       NaN  1.517181       NaN
YWAqLcsqUP       NaN  0.771865  0.760905  0.883760
QhHqzICDVo  0.448304  1.278314  0.517722  0.948910
vEqhEx4vkR       NaN       NaN       NaN  0.913545
4Q1LUyQq9C  0.998117  0.849454       NaN       NaN
XvrwqrxIIK       NaN  0.506467  0.366948  1.183702
xLAqsZsf4n  0.864088       NaN  0.825798  0.865628
Bx8AXfOzTv  1.025291       NaN  0.753349  0.108052
9LoWz0qpSu  0.724288       NaN  1.708485  0.107700
HZI74uGF4k       NaN  0.875412       NaN       NaN
t6Ds6vcRKU       NaN       NaN  0.651367  1.436792
SHSaccg7Wz  0.437198       NaN       NaN       NaN
iZe5ctx7w1  0.127975  1.119973  0.395028       NaN
3aoqjBGybQ       NaN       NaN       NaN  0.772779
saCF7tPAnv  0.719474       NaN  1.045202       NaN
2uo0g5oocb  0.132254  0.563252       NaN       NaN
3duZHu6SDk  0.628383       NaN  0.546263       NaN
p4ug60WPOR       NaN  0.641512  0.406602  0.683798
NbKillWJPL       NaN       NaN  0.699454       NaN
Yc7hoe5odY  0.709012       NaN       NaN       NaN
ZdtVhBzFQZ  0.516489       NaN       NaN  0.351688
SwaSfXJLVm       NaN  0.780858  1.394233  1.076297
LFYhRyDnhZ  0.564244  0.925111       NaN  0.359451
JvfkNGiVkv  1.290368       NaN  0.416229       NaN
iyXPFeeDM2  1.062631  0.532897       NaN       NaN
sy6MqqOmfN  1.443564  0.918141  0.963493       NaN
vkIoXhAka6       NaN       NaN       NaN  0.322189
6MDIrdaygD       NaN       NaN  0.610091       NaN
dhZgEA2cGP  1.505071       NaN       NaN       NaN
>>> x.frame.apply([np.sqrt], axis=1)
TypeError: ("'list' object is not callable", 'occurred at index J5bdGdv0g8')  # master
# this PR below
                        A         B         C         D
ukh1GVv7xX sqrt  1.503982       NaN  0.491756  0.787992
wIHkS3BsKA sqrt  1.111148       NaN  1.517181       NaN
YWAqLcsqUP sqrt       NaN  0.771865  0.760905  0.883760
QhHqzICDVo sqrt  0.448304  1.278314  0.517722  0.948910
vEqhEx4vkR sqrt       NaN       NaN       NaN  0.913545
4Q1LUyQq9C sqrt  0.998117  0.849454       NaN       NaN
XvrwqrxIIK sqrt       NaN  0.506467  0.366948  1.183702
xLAqsZsf4n sqrt  0.864088       NaN  0.825798  0.865628
Bx8AXfOzTv sqrt  1.025291       NaN  0.753349  0.108052
9LoWz0qpSu sqrt  0.724288       NaN  1.708485  0.107700
HZI74uGF4k sqrt       NaN  0.875412       NaN       NaN
t6Ds6vcRKU sqrt       NaN       NaN  0.651367  1.436792
SHSaccg7Wz sqrt  0.437198       NaN       NaN       NaN
iZe5ctx7w1 sqrt  0.127975  1.119973  0.395028       NaN
3aoqjBGybQ sqrt       NaN       NaN       NaN  0.772779
saCF7tPAnv sqrt  0.719474       NaN  1.045202       NaN
2uo0g5oocb sqrt  0.132254  0.563252       NaN       NaN
3duZHu6SDk sqrt  0.628383       NaN  0.546263       NaN
p4ug60WPOR sqrt       NaN  0.641512  0.406602  0.683798
NbKillWJPL sqrt       NaN       NaN  0.699454       NaN
Yc7hoe5odY sqrt  0.709012       NaN       NaN       NaN
ZdtVhBzFQZ sqrt  0.516489       NaN       NaN  0.351688
SwaSfXJLVm sqrt       NaN  0.780858  1.394233  1.076297
LFYhRyDnhZ sqrt  0.564244  0.925111       NaN  0.359451
JvfkNGiVkv sqrt  1.290368       NaN  0.416229       NaN
iyXPFeeDM2 sqrt  1.062631  0.532897       NaN       NaN
sy6MqqOmfN sqrt  1.443564  0.918141  0.963493       NaN
vkIoXhAka6 sqrt       NaN       NaN       NaN  0.322189
6MDIrdaygD sqrt       NaN       NaN  0.610091       NaN
dhZgEA2cGP sqrt  1.505071       NaN       NaN       NaN

Similar problems occur in master when calling df.transform([func], axis=1) and df.agg([func], axis=1).

Likewise, df.agg(['mean', 'sum'], axis=1) now works with multiple functions, while previously only df.agg(['mean', 'sum'], axis=0) was possible.
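
A compact version of the same demonstration on a made-up two-row frame, for anyone who wants to try it without the test fixtures. The layout noted in the comments is what the output above shows and is assumed to hold in current pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 4.0], "B": [9.0, np.nan]}, index=["r1", "r2"])

# axis=0: the function name becomes a second level on the columns.
by_column = df.apply([np.sqrt], axis=0)

# axis=1: the function name becomes a second level on the row index.
by_row = df.apply([np.sqrt], axis=1)

# Several named aggregations per row.
row_stats = df.agg(["mean", "sum"], axis=1)

print(by_column)
print(by_row)
print(row_stats)
```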

topper-123 · 2018-07-16T14:11:49Z

pandas/core/apply.py

-        # dispatch to agg
-        if isinstance(self.f, (list, dict)):
-            return self.obj.aggregate(self.f, axis=self.axis,
-                                      *self.args, **self.kwds)


The if isinstance(self.f, (list, dict)) check should also run when axis=1, so I moved it up to FrameApply.get_results.

topper-123 · 2018-07-16T14:14:28Z

pandas/core/generic.py


-            return result
-
-        cls.transform = transform


When transform was added by calling _add_series_or_dataframe_operations, that method shadowed a transform method on DataFrame. As transform doesn't need to be added in any special way, I just moved it to be a normal instance method.
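
The shadowing can be reproduced in plain Python. The classes below are hypothetical miniatures, not pandas code: assigning a function onto the class after its body has run replaces a method of the same name defined in that body, while an ordinary inherited method can be overridden as usual.

```python
class Base:
    @classmethod
    def _add_operations(cls):
        # Attach a generic method to whichever class this is called on.
        def transform(self):
            return "generic transform"

        cls.transform = transform


class Frame(Base):
    def transform(self):
        return "frame-specific transform"


# The late assignment overwrites Frame.transform from the class body.
Frame._add_operations()
shadowed = Frame().transform()


class PlainBase:
    def transform(self):
        # Defined as a normal instance method instead.
        return "generic transform"


class PlainFrame(PlainBase):
    def transform(self):
        return "frame-specific transform"


overridden = PlainFrame().transform()
print(shadowed, "|", overridden)
```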

topper-123 · 2018-07-16T14:17:25Z

pandas/tests/frame/test_apply.py

            f_abs = np.abs(self.frame)
+            f_sqrt = np.sqrt(self.frame)


having "absolute" come before "sqrt" maintains alphabetical ordering, and makes creating MultiIndexes easier below.

jreback

small change, otherwise lgtm. ping on green.

jreback · 2018-07-26T12:44:55Z

pandas/util/testing.py

@@ -2826,3 +2826,28 @@ def skipna_wrapper(x):
            return alternative(nona)

    return skipna_wrapper
+
+
+def _get_cython_table_params(ndframe, func_names_and_expected):


hmm, didn't realize you are actually importing from conftest, so on 2nd thought, move this to pandas.conftest and import from there (you have to explicitly import non-fixtures FYI)

topper-123 · 2018-07-26T15:03:14Z

green

jreback · 2018-07-28T14:24:55Z

thanks @topper-123 nice patch!

* master:
  BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  CLN: Vbench to asv conversion script (pandas-dev#22089)
  consistent docstring (pandas-dev#22066)
  TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  BUG: rolling with MSVC 2017 build (pandas-dev#21813)

topper-123 force-pushed the axis_1_agg_funcs branch from 9c26e30 to b87fc41 Compare May 27, 2018 21:56

topper-123 mentioned this pull request May 28, 2018

ENH: add np.nan funcs to _cython_table #21123

Closed

3 tasks

jreback requested changes May 29, 2018

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Numeric Operations Arithmetic, Comparison, and Logical operations labels May 29, 2018

topper-123 force-pushed the axis_1_agg_funcs branch 6 times, most recently from 689ce83 to 9c13256 Compare June 9, 2018 21:51

jreback requested changes Jun 19, 2018

topper-123 force-pushed the axis_1_agg_funcs branch from 9c13256 to 962fcb1 Compare June 23, 2018 10:09

topper-123 force-pushed the axis_1_agg_funcs branch from 962fcb1 to 48e1a63 Compare June 23, 2018 10:13

jreback mentioned this pull request Jul 7, 2018

BUG: agg with axis=1 #19605

Closed

3 tasks

topper-123 force-pushed the axis_1_agg_funcs branch from 48e1a63 to 17dc5b9 Compare July 15, 2018 15:31

topper-123 commented Jul 16, 2018

topper-123 force-pushed the axis_1_agg_funcs branch 5 times, most recently from 4b5b2c3 to 18bdf54 Compare July 16, 2018 16:37

topper-123 force-pushed the axis_1_agg_funcs branch 3 times, most recently from 1f50505 to 06f1df7 Compare July 25, 2018 20:57

jreback requested changes Jul 26, 2018

jreback added this to the 0.24.0 milestone Jul 26, 2018

tp added 6 commits July 26, 2018 19:40

Fix bug where df.agg(..., axis=1) gives wrong result

3d95bfc

Fix tests for bug where df.agg(..., axis=1) gives wrong result

262bd3e

changed according to comments

ed43757

correct apply(axis=1) and related bugs

2be3747

'index' and 'columns' added to fixture and related changes.

b6382d4

add conftest cython_table_items + a few corrections

5ad024c

topper-123 force-pushed the axis_1_agg_funcs branch from 06f1df7 to caaa912 Compare July 26, 2018 18:41

clarified according to comments

39ced29

topper-123 force-pushed the axis_1_agg_funcs branch from caaa912 to 39ced29 Compare July 26, 2018 21:03

jreback approved these changes Jul 28, 2018

jreback merged commit 848b69c into pandas-dev:master Jul 28, 2018

topper-123 added a commit to topper-123/pandas that referenced this pull request Jul 29, 2018

fix for pandas-dev#21224 wrong sort order

48bbaae

topper-123 added a commit to topper-123/pandas that referenced this pull request Jul 29, 2018

fix for pandas-dev#21224 wrong sort order

0cddc82

This was referenced Jul 29, 2018

BUG: _cython_table bug fix #22110

Merged

ENH: add np.nan* funcs to cython_table #22109

Merged

jreback pushed a commit that referenced this pull request Jul 29, 2018

fix for #21224 wrong sort order (#22110)

415a01e

topper-123 mentioned this pull request Aug 1, 2018

BUG: _cython_table bug (part 2) #22156

Closed

This was referenced Sep 8, 2018

DOC: improve doc string for .aggregate and df.transform #22636

Closed

DOC: improve doc string for .aggregate and .transform #22641

Merged

topper-123 deleted the axis_1_agg_funcs branch September 16, 2018 23:20

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: df.agg, df.transform and df.apply use different methods when axi…

638b0ad

…s=1 than when axis=0 (pandas-dev#21224)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

fix for pandas-dev#21224 wrong sort order (pandas-dev#22110)

9c9bb06
