ENH: add np.nan funcs to _cython_table #21123
0537897
to
b0a7a0f
Compare
Codecov Report
@@ Coverage Diff @@
## master #21123 +/- ##
==========================================
+ Coverage 91.84% 91.84% +<.01%
==========================================
Files 153 153
Lines 49505 49512 +7
==========================================
+ Hits 45466 45473 +7
Misses 4039 4039
Continue to review full report at Codecov.
pandas/tests/test_nanops.py
Outdated
pd.Series([1, 2, 3, 4, 5, 6]),
pd.DataFrame([[1, 2, 3], [4, 5, 6]])
])
def nan_test_object(request):
Should we add NA data to these?
Added
pandas/core/base.py
Outdated
if f and not args and not kwargs:
    return getattr(self, f)(), None
if f:
    return getattr(self, f)(*args, **kwargs), None
Is this equivalent? Wondering if there's any case where providing args / kwargs before would have routed the function to a different place
It's not, but this is better :-). This fixes the bug that was causing #19629 to fail.
The issue is subtle, but it has to do with a bug in the np.nan* functions in numpy < 1.13. Numpy < 1.13 handles e.g. np.nanmin(pd_obj) incorrectly while it handles np.min(pd_obj) correctly. Numpy >= 1.13 handles both correctly. See #19753 for another issue regarding the same numpy problem.
Anyway, if f is a string, you should call getattr(self, f)(*args, **kwargs), so it was perhaps more luck than design that the previous version did work :-)
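The dispatch being discussed can be sketched in plain Python. This is a toy illustration, not the pandas implementation: ToySeries, its _table mapping and nan_safe_sum are hypothetical names, and only the getattr(self, f)(*args, **kwargs) call mirrors the line under review.

```python
import math


def nan_safe_sum(values):
    # Hypothetical stand-in for a NaN-aware function such as np.nansum.
    return sum(v for v in values if not math.isnan(v))


class ToySeries:
    # Hypothetical lookup table: known functions -> name of the method to call.
    _table = {sum: "sum", nan_safe_sum: "sum"}

    def __init__(self, values):
        self.values = list(values)

    def sum(self, skipna=True):
        # Drop NaN values only when skipna is requested.
        vals = [v for v in self.values if not (skipna and math.isnan(v))]
        return sum(vals)

    def agg(self, func, *args, **kwargs):
        f = self._table.get(func)
        if f:
            # Always forward args/kwargs to the named method.
            return getattr(self, f)(*args, **kwargs)
        return func(self.values, *args, **kwargs)


s = ToySeries([1.0, math.nan, 3.0])
result_default = s.agg(nan_safe_sum)        # routed to s.sum()
result_keep_nan = s.agg(sum, skipna=False)  # kwargs reach s.sum(skipna=False)
```

With a single branch, a call with extra arguments and a call without them take the same route, which is the behaviour the reply argues for.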
pandas/tests/test_nanops.py
Outdated
(np.min, np.nanmin),
])
def test_np_nan_functions(standard, nan_method, nan_test_object):
    tm.assert_almost_equal(nan_test_object.agg(standard),
Might be a silly question but can we not use the frame / series equals methods here? Don't think precision is that much of a factor with the fixtures
I think these are ok as nan_test_object can be either a series or a frame. I think the name should be changed though, to e.g. series_or_frame.
Hello @topper-123! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 26, 2018 at 07:47 Hours UTC
2313461
to
075427d
Compare
doc/source/whatsnew/v0.23.1.txt
Outdated
^^^^^^^

- :meth:`~DataFrame.agg` now correctly handles numpy NaN-aware methods like :meth:`numpy.nansum` (:issue:`19629`)
- :meth:`~DataFrame.agg` now correctly handles built-in methods like ``sum`` when axis=1 (:issue:`19629`)
Don't think you meant to add this in this PR?
Well, without those changes the tests in test_apply don't pass. At the same time, the tests in test_apply are sufficient for testing for this bug, so these two issues are very related...
The fixed bug looks like this, BTW:
>>> df = pd.DataFrame([[np.nan, 2], [3, 4.]])
>>> df.agg(sum, axis=1)
0 NaN # should say 2.0
1 7.0
dtype: float64
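Why the builtin and the NaN-aware reduction disagree can be shown without pandas. A minimal sketch in plain Python, using nested lists in place of the frame above; row_sums_skipna is a hypothetical helper that mimics the NaN-skipping default of the 'sum' aggregation, not a pandas function.

```python
import math

# Same values as the frame above, written as plain rows.
rows = [[math.nan, 2.0], [3.0, 4.0]]

# The builtin sum propagates NaN: any row containing NaN sums to NaN.
naive = [sum(row) for row in rows]


def row_sums_skipna(rows):
    # Hypothetical helper: ignore NaN entries before summing each row.
    return [sum(v for v in row if not math.isnan(v)) for row in rows]


skipna = row_sums_skipna(rows)
```

The first row is where the two approaches differ: the naive result is NaN, while the NaN-skipping result is 2.0.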
pandas/core/frame.py
Outdated
    result, how = self._aggregate(func, axis=0, *args, **kwargs)
except TypeError:
    pass
df = self if not axis else self.T
Same thing?
Yeah, a bit. df.agg and df.apply with axis=1 give in many cases identical output, but not in all, as df.apply doesn't make the lookup in _cython_table. So, this is needed for some tests to pass.
AFAIK, transposition is cheap in numpy/pandas, so this is an ok approach?
I would rather you not do this; it is quite a hacky way of handling this. There is a small bug on the lower level, I think.
I don't think so, as df._aggregate currently doesn't take an axis parameter. I could add an axis parameter to df._aggregate and do the transposition there instead?
you can try. this should be handled on a much lower level.
I've added the axis parameter to df._aggregate, so this is handled there.
pandas/tests/frame/test_apply.py
Outdated
pd.DataFrame([[np.nan, 2], [3, 4]]),
pd.DataFrame(),
])
@pytest.mark.parametrize(
Hmm wonder if this would be better as a shared fixture - thoughts?
yes pls make this a fixture (in conftest.py)
pandas/tests/frame/test_apply.py
Outdated
# GH21123
np_func, str_func = cython_table_items

if isinstance(test_input, pd.DataFrame):
Don't need this type check since the parameters are all DataFrames
You are right. I was preparing for eventual tests where test_input would be tuple([frame, args, kwargs]), but of course it looks silly now, since I could not find any such tests that make sense.
I'll remove that unless someone comes up with a test that requires args and/or kwargs.
Also, the assert should now be tm.assert_frame_equal. I'll change that.
The travis failures say "Different tests were collected between gw1 and gw0. The difference is:..." I don't think this has anything to do with my PR. Does anyone know?
I have not seen that before - I'd say let's just take a look after your next push and see if it repeats
075427d
to
73048a1
Compare
pandas/tests/series/test_apply.py
Outdated
pd.Series(),
])
@pytest.mark.parametrize(
    "cython_table_items",
Same here: please make this a fixture (in conftest.py).
2f0c0bc
to
6ece143
Compare
All the failures are of the type
Everything passes locally. I don't think this has anything to do with my PR. I can try pushing again, but this has happened every time so far, so I don't think that will help...
2e40325
to
f353376
Compare
The failure above has been solved, it was a mistake concerning python3.5 dicts in xdist in |
f353376
to
5ec7e18
Compare
@@ -4086,7 +4086,10 @@ def _post_process_cython_aggregate(self, obj):
def aggregate(self, arg, *args, **kwargs):

    _level = kwargs.pop('_level', None)
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    _agg_kwargs = kwargs.copy()
you can just list axis as a kwarg
That breaks a test. The issue is that axis can be considered to be supplied twice, and you may get (from the breaking test):
>>> _level, args, _agg_kwargs = None, (80,), {'axis': 0}
>>> self._aggregate(arg, _level=_level, *args, **_agg_kwargs)
TypeError: _aggregate() got multiple values for argument 'axis'
Not sure I understand this - what test is breaking? Perhaps the test is configured incorrectly?
Both 80 and 0 may be the value for parameter axis: The function signature is def _aggregate(self, arg, axis=0, *args, **kwargs), so the second unnamed argument (80) will be considered to be axis, but this clashes with the parameter in kwargs ({'axis': 0}), causing the exception.
To avoid this we'd prefer the signature to be def _aggregate(self, arg, *args, axis=0, **kwargs), but this syntax is only supported in Python3...
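The clash described above can be reproduced with two small stand-in functions. This is a sketch under the assumption that only the parameter order matters; the function names are hypothetical and the bodies merely echo what was received.

```python
def aggregate_positional_axis(arg, axis=0, *args, **kwargs):
    # axis sits before *args, so the second positional argument binds to it.
    return axis, args, kwargs


def aggregate_keyword_only_axis(arg, *args, axis=0, **kwargs):
    # Python 3 only: axis is keyword-only, so positionals all land in *args.
    return axis, args, kwargs


args, agg_kwargs = (80,), {"axis": 0}

try:
    aggregate_positional_axis("percentile", *args, **agg_kwargs)
    clash = None
except TypeError as exc:
    # 80 was bound to axis positionally and axis=0 arrived as a keyword.
    clash = str(exc)

axis, extra_args, extra_kwargs = aggregate_keyword_only_axis(
    "percentile", *args, **agg_kwargs
)
```

With the keyword-only form, 80 stays in the extra positional arguments and axis is taken from the keyword, which is the behaviour the comment says would be preferred if Python 2 did not have to be supported.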
Where is that test located?
pandas\tests\groupby\test_groupby.py::test_pass_args_kwargs. It's the line agg_result = df_grouped.agg(np.percentile, 80, axis=0)
pandas/conftest.py
Outdated
key=lambda x: x[0].__name__)),
ids=lambda x: "({}-{!r})".format(x[0].__name__, x[1]),
)
def cython_table_items(request):
add _fixture to the end of the name
ok
pandas/tests/frame/test_apply.py
Outdated
# GH21123
np_func, str_func = cython_table_items

tm.assert_almost_equal(df.agg(np_func),
use assert_frame_equal it provides stronger guarantees
Most of these aggregate to series (so assert_series_equal), but cumprod and cumsum are in _cython_table, so in that case a DataFrame is returned.
I could add a conditional, so the more correct assert is used each time.
pandas/tests/series/test_apply.py
Outdated
# GH21123
np_func, str_func = cython_table_items

tm.assert_almost_equal(series.agg(np_func),
use assert_series_equal
pandas/tests/frame/test_apply.py
Outdated
def test_agg_function_input(self, df, cython_table_items):
    # test whether the functions (keys) in
    # pd.core.base.SelectionMixin._cython_table give the same result
    # as the related strings (values) when used in df.agg. Examples:
Add an example which actually tests axis=1 (say for sum, nansum), IOW construct the resultant frame.
This test doesn't actually test that axis=1 works, just that it matches with a string (which doesn't have tests itself).
I don't understand, could you expand?
This tests whether the result is the same when e.g. np.sum is supplied as when the string 'sum' is supplied. So it is correct that this doesn't verify the result. I considered that to be a different test, where you test against the string versions.
327f1a9
to
396b327
Compare
I've rewritten the tests to now have expected results also.
@@ -316,13 +331,14 @@ def _try_aggregate_string_function(self, arg, *args, **kwargs):

raise ValueError("{arg} is an unknown string function".format(arg=arg))

def _aggregate(self, arg, *args, **kwargs):
def _aggregate(self, arg, axis=0, *args, **kwargs):
I still think this should be a separate PR. I know you mentioned there was some interweaving of test dependency, but I feel like we are injecting this keyword in here without any regard to existing test coverage for axis=1.
I can look into making a separate PR. That will have to be pulled in before this one, so the tests of this PR won't break.
If timing is a concern you can also xfail the axis=1 tests. Rebase thereafter would be minor
This has now been created as #21224
df = inputs[0]
expected = inputs[1][str_func]

if isinstance(expected, type) and issubclass(expected, Exception):
Anything that raises should be done in a separate test, i.e. test_agg_function_input_raises
I agree in principle, but this test iterates over all items in _cython_table, of which some will fail on some inputs.
So I'd have to construct the tests quite a bit differently and probably the fixture in conftest.py couldn't be used (because it returns all combinations and I would now have to select the relevant ones for each test method). So something like:
(builtins.sum, 'sum', 0),
(np.sum, 'sum', 0),
(np.nansum, 'sum', 0),
etc...
which will be very inelegant and repetitive IMO. Is it not possible to bend this rule on this one (or give a hint on how to do it elegantly)?...
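For illustration, the single-test approach under discussion can be sketched as a small helper that accepts either an expected value or an expected exception class. This is a plain-Python sketch, not the test code from the PR: check_agg is a hypothetical name, and only the isinstance/issubclass check is taken from the reviewed snippet.

```python
def check_agg(func, data, expected):
    """Return True if func(data) behaves as described by `expected`.

    `expected` is either the expected result or an exception class
    that the call is expected to raise.
    """
    if isinstance(expected, type) and issubclass(expected, Exception):
        try:
            func(data)
        except expected:
            return True
        return False
    return func(data) == expected


ok_value = check_agg(sum, [1, 2, 3], 6)          # plain result comparison
ok_raises = check_agg(sum, [1, "a"], TypeError)  # expected failure
```

The trade-off raised in the review still applies: one helper keeps the parametrization compact, at the cost of mixing passing and raising cases in the same test.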
Hmm OK understood. May make sense as an exception then - don't have anything off the top of my head to improve but will think more about it
pandas/tests/frame/test_apply.py
Outdated
    df.agg(np_func, axis=axis)
    df.agg(str_func, axis=axis)
elif str_func in ('cumprod', 'cumsum'):
    tm.assert_frame_equal(df.agg(np_func, axis=axis), expected)
For readability / consistency with other tests create a variable called result and assign to it before the call to assert_frame_equal
I've uploaded a changed version.
pandas/tests/series/test_apply.py
Outdated
    tm.assert_series_equal(series.agg(np_func), expected)
    tm.assert_series_equal(series.agg(str_func), expected)
else:
    tm.assert_almost_equal(series.agg(np_func), expected)
Should be assert_series_equal, no?
series.agg(np_func) returns a scalar. I could use assert np.isclose(...)?
9a053b8
to
3399bcd
Compare
3399bcd
to
580edcf
Compare
try:
    result, how = self._aggregate(func, axis=axis, *args, **kwargs)
except TypeError:
    pass
if result is None:
Related to the axis change, do we still hit this condition?
The axis change is related to #21134. So if I move that to a separate PR, this will move too.
Yep that is expected just wanted to see if we still needed it (regardless of the PR it appears in)
As @WillAyd indicates, can you split this up into a cython table PR and, on top of that, the agg fixes? The first should be straightforward and we can get it in quickly. Please put the tests, changes and whatsnew for that one here (in this PR is fine), and issue another PR for the other changes.

@pytest.fixture(
    # params: Python 3.5 randomizes dict access and xdist doesn't like that
    # in fixtures. In order to get predetermined values we need to sort
so would like to have this fixture in your other PR
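The sorting trick mentioned in the fixture comment can be sketched independently of pytest. The table below is a hypothetical stand-in for _cython_table built from stdlib functions; the key and the id format follow the snippet shown earlier in the review.

```python
import math

# Hypothetical stand-in for _cython_table: function -> method name.
table = {sum: "sum", max: "max", min: "min", math.fsum: "sum"}

# Sort by function name so every worker builds the same parameter list,
# whatever order the dict happens to iterate in.
params = sorted(table.items(), key=lambda item: item[0].__name__)

# Readable test ids in the same style as the fixture under review.
ids = ["({}-{!r})".format(func.__name__, name) for func, name in params]
```

Because the order depends only on the function names, parallel test workers that build this list separately should end up with identical parameters.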
Closing in favor of #22109.
closes #19629
closes #21134
git diff upstream/master -u -- "*.py" | flake8 --diff
This started as a copy of #19670 by @AaronCritchley, but has solved two bugs that the tests surfaced along the way.
Bug 1:
There is currently a bug in df.aggregate, where the method incorrectly defers to df.apply in a corner case. This only shows up in the result when using numpy < 1.13 and passing np.nan* functions to df.aggregate. This is the reason for the change in base.py line 571. (See #8383 for further details on the bug in numpy < 1.13 and how it affects pandas.)
Bug 2:
Passing builtins to df.aggregate is ok when axis=0, but gives the wrong result when axis=1 (#21134). The reason for this difference is that df.aggregate defers to df._aggregate when axis=0, but defers to df.apply when axis=1, and these give different results when passed functions and the series/frame contains NaN values. This can be solved by transposing df and deferring the transposed frame to its _aggregate method when axis=1.
The added tests have been heavily parametrized (this helped unearth the bugs above). They have been placed in series/test_apply.py and frame/test_apply.py, as a lot of other tests for ser/df.aggregate were already there.