Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add np.nan funcs to _cython_table #21123

Closed

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented May 18, 2018

closes #19629
closes #21134

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This started as a copy of #19670 by @AaronCritchley, but has solved two bugs that the tests surfaced along the way.

Bug 1:
there is currently a bug in df.aggregate, where the method incorrectly defers to df.apply in a corner case. This only shows up in the result when using numpy < 1.13 and passing np.nan* functions to df.aggregate. This is the reason for the change in base.py line 571. (see #8383 for further details on the bug in numpy<1.13 and how it affects pandas.)

Bug 2:
Passing builtins to df.aggregate is ok when axis=0, but gives wrong result,when axis=1 (#21134). The reason for this difference is that df.aggregate defers to df._aggregate when axis=0, but defers to df.apply, when axis=1, and these give different result when passed funcions and the series/frame contains Nan values. This can be solved by transposing df and defering the transposed frame to its _aggragate method when axis=1.

The added tests have been heavily parametrized (this helped unearth the bugs above). Thet have been placed in series/test_apply.py and frame/test_apply, as a lot of other tests for ser/df.aggregate were already there.

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 0537897 to b0a7a0f Compare May 18, 2018 23:21
@codecov
Copy link

codecov bot commented May 18, 2018

Codecov Report

Merging #21123 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21123      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49505    49512       +7     
==========================================
+ Hits        45466    45473       +7     
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <100%> (ø) ⬆️
#single 41.88% <21.87%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.22% <100%> (-0.01%) ⬇️
pandas/core/groupby/groupby.py 92.66% <100%> (ø) ⬆️
pandas/core/base.py 96.87% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6abb61...580edcf. Read the comment docs.

pd.Series([1, 2, 3, 4, 5, 6]),
pd.DataFrame([[1, 2, 3], [4, 5, 6]])
])
def nan_test_object(request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add NA data to these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

if f and not args and not kwargs:
return getattr(self, f)(), None
if f:
return getattr(self, f)(*args, **kwargs), None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this equivalent? Wondering if there's any case where providing args / kwargs before would have routed the function to a different place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, but this is better :-). This fixes the bug in that was causing #19629 to fail.

The issue is subtle, but it has to do with a bug in numpy < 1.13 np.nan* functions. Numpy < 1.13 handles e.g. np.nanmin(pd_obj) incorrectly while it handles np.min(pd_obj) correctly. Numpy >= 1.13 handles both correctly. See #19753 for another issue regarding the same numpy problem.

Anyway, if f is a string, you should call getattr(self, f)(*args, **kwargs), so it was perhaps more luck than design that the previous version did work:-)

(np.min, np.nanmin),
])
def test_np_nan_functions(standard, nan_method, nan_test_object):
tm.assert_almost_equal(nan_test_object.agg(standard),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a silly question but can we not use the frame / series equals methods here? Don't think precision is that much of a factor with the fixtures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are ok as nan_test_object can be either a series or a frame. I think the name should be changed though, to e.g. series_or_frame.

@pep8speaks
Copy link

pep8speaks commented May 19, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 26, 2018 at 07:47 Hours UTC

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch 8 times, most recently from 2313461 to 075427d Compare May 19, 2018 16:36
^^^^^^^

- :meth:`~DataFrame.agg` now correctly handles numpy NaN-aware methods like :meth:`numpy.nansum` (:issue:`19629`)
- :meth:`~DataFrame.agg` now correctly handles built-in methods like ``sum`` when axis=1 (:issue:`19629`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you meant to add this in this PR?

Copy link
Contributor Author

@topper-123 topper-123 May 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, without those changes the tests in test_apply don't pass. At the same time, the tests in test_apply are sufficient for testing for this bug, so these two issues are very related...

Copy link
Contributor Author

@topper-123 topper-123 May 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixed bug looks like this, BTW:

>>> df = pd.DataFrame([[np.nan, 2], [3, 4.]])
>>> df.agg(sum, axis=1)
0    NaN  # should say 2.0
1    7.0
dtype: float64

result, how = self._aggregate(func, axis=0, *args, **kwargs)
except TypeError:
pass
df = self if not axis else self.T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing?

Copy link
Contributor Author

@topper-123 topper-123 May 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a bit. df.agg and df.apply with axis=1 give in many cases identical output, but not in all, as df.apply doesn't make the lookup in _cython_table. So, this is needed for some tests to pass.

AFAIK, transposition is cheap in numpy/pandas, so this is an ok approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you not do this, this is a quite hacky way of handling this, there is a small bug on the lower level i think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, as df._aggregate currently doesn't take an axis parameter. I could add an axis parameter to df._aggregate and do the transposition there instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try. this should be handled on a much lower level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the axis parameter to df._aggregate, so this is handled there.

pd.DataFrame([[np.nan, 2], [3, 4]]),
pd.DataFrame(),
])
@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm wonder if this would be better as a shared fixture - thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls make this a fixture (in conftest.py)

# GH21123
np_func, str_func = cython_table_items

if isinstance(test_input, pd.DataFrame):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this type check since the parameters are all DataFrames

Copy link
Contributor Author

@topper-123 topper-123 May 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I was preparing for eventual tests where test_input would be tuple([frame, args, kwargs]), but of course it looks silly now, when I could not find such tests that make sense.

I'll remove that unless someone comes up with a tests that requires args and/or kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the assert should now be tm.assert_frame_equal. I'll change that.

@topper-123
Copy link
Contributor Author

The travis failures say "Different tests were collected between gw1 and gw0. The difference is:..."

I don't think this has anything to do with my PR, anyone knows?

@topper-123 topper-123 changed the title WIP/ENH: add np.nan funcs to _cython_table ENH: add np.nan funcs to _cython_table May 19, 2018
@WillAyd
Copy link
Member

WillAyd commented May 19, 2018

I have not seen that before - I'd say let's just take a look after your next push and see if it repeats

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 075427d to 73048a1 Compare May 19, 2018 17:30
result, how = self._aggregate(func, axis=0, *args, **kwargs)
except TypeError:
pass
df = self if not axis else self.T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you not do this, this is a quite hacky way of handling this, there is a small bug on the lower level i think.

pd.DataFrame([[np.nan, 2], [3, 4]]),
pd.DataFrame(),
])
@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls make this a fixture (in conftest.py)

pd.Series(),
])
@pytest.mark.parametrize(
"cython_table_items",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Numeric Operations Arithmetic, Comparison, and Logical operations labels May 19, 2018
@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch 4 times, most recently from 2f0c0bc to 6ece143 Compare May 19, 2018 23:15
@topper-123
Copy link
Contributor Author

All the failures are of the type

Different tests were collected between gw0 and gw1. The difference is:
--- gw0
+++ gw1

Everything passes locally. Don't think this has anything to do with my PR. I can try push again, but this has been every time so far, so don't think that will help...

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch 3 times, most recently from 2e40325 to f353376 Compare May 21, 2018 00:25
@topper-123
Copy link
Contributor Author

topper-123 commented May 21, 2018

The failure above has been solved, it was a mistake concerning python3.5 dicts in xdist in conftest.py: python3.5 randomizes dict retrieval, and of course xdist doesnt like randomizing work distribution. I`ve made the params parameter in cython_table in ``conftest.py`` deterministic, so it doesnt happen again.

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from f353376 to 5ec7e18 Compare May 21, 2018 08:59
@@ -4086,7 +4086,10 @@ def _post_process_cython_aggregate(self, obj):
def aggregate(self, arg, *args, **kwargs):

_level = kwargs.pop('_level', None)
result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
_agg_kwargs = kwargs.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just list axis as a kwarg

Copy link
Contributor Author

@topper-123 topper-123 May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That breaks a test. The issue is axis can considered to be supplied twice and you may get (from the breaking test):

>>> _level, args, _agg_kwargs = None, (80,), {'axis': 0}
>>> self._aggregate(arg, _level=_level, *args, **_agg_kwargs)
TypeError: _aggregate() got multiple values for argument 'axis'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this - what test is breaking? Perhaps the test is configured incorrectly?

Copy link
Contributor Author

@topper-123 topper-123 May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both 80 and 0 may be the value for parameter axis: The function signature is def _aggregate(self, arg, axis=0, *args, **kwargs), so the second unnamed argument (80) will be considered to be axis, but this clashes with the parameter in kwargs ({'axis': 0}), causing the exception.

To avoid this we'd prefer the signature to be def _aggregate(self, arg, *args, axis=0, **kwargs), but this syntax is only supported in Python3...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is that test located?

Copy link
Contributor Author

@topper-123 topper-123 May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas\tests\groupby\test_groupby.py::test_pass_args_kwargs. It's the line agg_result = df_grouped.agg(np.percentile, 80, axis=0)

key=lambda x: x[0].__name__)),
ids=lambda x: "({}-{!r})".format(x[0].__name__, x[1]),
)
def cython_table_items(request):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add _fixture to the end of the name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# GH21123
np_func, str_func = cython_table_items

tm.assert_almost_equal(df.agg(np_func),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_frame_equal it provides stronger guarantees

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these aggregate to series (so assert_series_equal)), but cumprod and cumsum is in _cython_table, so in that case a DataFrame is returned.

I could add a conditional, so the more correct assert is used each time.

# GH21123
np_func, str_func = cython_table_items

tm.assert_almost_equal(series.agg(np_func),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_series_equal

def test_agg_function_input(self, df, cython_table_items):
# test whether the functions (keys) in
# pd.core.base.SelectionMixin._cython_table give the same result
# as the related strings (values) when used in df.agg. Examples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an example which actually tests (say for sum, nansum) axis=1, IOW contruct the resultant frame

this test doesn't actually tests that axis=1 works, just that it matches with a string (which doesn't have tests itself)

Copy link
Contributor Author

@topper-123 topper-123 May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, could you expand.

This tests if the result are the same if e.g. same result when np.sum is supplied as when string 'sum' is supplied. So it is correct that this doesn't verify the result. I considered that to the a different test, where you test against the string versions.

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 327f1a9 to 396b327 Compare May 25, 2018 20:36
@topper-123
Copy link
Contributor Author

I`ve rewritten the tests to now have expected results also.

@@ -316,13 +331,14 @@ def _try_aggregate_string_function(self, arg, *args, **kwargs):

raise ValueError("{arg} is an unknown string function".format(arg=arg))

def _aggregate(self, arg, *args, **kwargs):
def _aggregate(self, arg, axis=0, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this should be a separate PR. I know you mentioned there was some interweaving of test dependency, but I feel like we are injecting this keyword in here without any regard to existing test coverage for axis=1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into making separate PR. That will have to be pulled in before this one, so the tests of this PR won't break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If timing is a concern you can also xfail the axis=1 tests. Rebase thereafter would be minor

Copy link
Contributor Author

@topper-123 topper-123 May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been created as #21224

df = inputs[0]
expected = inputs[1][str_func]

if isinstance(expected, type) and issubclass(expected, Exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that raises should be done in a separate test, i.e. test_agg_function_input_raises

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in principle, but this test iterates over all items in _cython_table, of which some will fail on some inputs.

So I'd have to construct the tests quite a bit differently and probably the fixture in conftest.py couldn't be used (because it returns all combinations and I now have to select the relevant ones for each test method). So something like:

(builtins.sum, 'sum': 0),
(np.sum, 'sum', 0),
(np.nansum: 'sum', 0),
etc...

which will be very inelegant and repetitive IMO. Is it not possible to bend this rule on this one (or give hint on how to do it elegantly)?...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK understood. May make sense as an exception then - don't have anything off the top of my head to improve but will think more about it

df.agg(np_func, axis=axis)
df.agg(str_func, axis=axis)
elif str_func in ('cumprod', 'cumsum'):
tm.assert_frame_equal(df.agg(np_func, axis=axis), expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability / consistency with other tests create a variable called result and assign to it before the call to assert_frame_equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've uploaded a changed version.

tm.assert_series_equal(series.agg(np_func), expected)
tm.assert_series_equal(series.agg(str_func), expected)
else:
tm.assert_almost_equal(series.agg(np_func), expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be assert_series_equal no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

series.agg(np_func) resturns a scalar. I could use assert np.isclose(...)?

@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 9a053b8 to 3399bcd Compare May 26, 2018 02:06
@topper-123 topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 3399bcd to 580edcf Compare May 26, 2018 07:46
df = inputs[0]
expected = inputs[1][str_func]

if isinstance(expected, type) and issubclass(expected, Exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK understood. May make sense as an exception then - don't have anything off the top of my head to improve but will think more about it

@@ -4086,7 +4086,10 @@ def _post_process_cython_aggregate(self, obj):
def aggregate(self, arg, *args, **kwargs):

_level = kwargs.pop('_level', None)
result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
_agg_kwargs = kwargs.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this - what test is breaking? Perhaps the test is configured incorrectly?

try:
result, how = self._aggregate(func, axis=axis, *args, **kwargs)
except TypeError:
pass
if result is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the axis change, do we still hit this condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The axis change is related to #21134. So if I move that to a separate PR, this will move too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that is expected just wanted to see if we still needed it (regardless of the PR it appears in)

@jreback
Copy link
Contributor

jreback commented May 29, 2018

as @WillAyd indicates, can you split this up into a cython table PR and on top of that the agg fixes?

first should be straightforward and we can get it quickly. pls put tests, changes and whatsnew for that one (in this PR is fine), and issue another PR for other changes.


@pytest.fixture(
# params: Python 3.5 randomizes dict access and xdist doesn't like that
# in fixtures. In order to get predetermined values we need to sort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so would like to have this fixture in your other PR

@topper-123
Copy link
Contributor Author

Closing in favor of #22109.

@topper-123 topper-123 closed this Jul 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
4 participants