Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: decorator for ops boilerplate #24282

Closed
wants to merge 12 commits into from

Conversation

jbrockmendel
Copy link
Member

There are a handful of places where this decorator is applied but commented-out. This is the "WIP" part of things.

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

# Note: using unpack_and_defer here doesn't break any tests, but the
# behavior here is idiosyncratic enough that I'm not confident enough
# to change it.
# @ops.unpack_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger are you the right person to ask about Categorical methods? In particular, why does this only return NotImplemented for ABCSeries and not for ABCDataFrame and ABCIndexClass?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Returning NotImplemented seems reasonable, as long as it ends up back here with the unboxed values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Categorical == DataFrame is weird both before and after

data = ["a", "b", 2, "a"]
cat = pd.Categorical(data)
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)
df = pd.DataFrame(cat)

master

>>> cat == df
array([[ True, False, False,  True],
       [False,  True, False, False],
       [False, False,  True, False],
       [ True, False, False,  True]])

PR

>>> cat == df
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/ops.py", line 2127, in f
    other = _align_method_FRAME(self, other, axis=None)
  File "pandas/core/ops.py", line 2021, in _align_method_FRAME
    right = to_series(right)
  File "pandas/core/ops.py", line 1983, in to_series
    given_len=len(right)))
ValueError: Unable to coerce to Series, length must be 1: given 4

If we transpose then we get all-True, the only difference being that in the PR the result is a DataFrame instead of an ndarray.

@@ -36,6 +36,8 @@
def _make_comparison_op(cls, op):
# TODO: share code with indexes.base version? Main difference is that
# the block for MultiIndex was removed here.

# @ops.unpack_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this here leads to recursion errors, related to the fact that this only returns NotImplemented for ABCDataFrame. I think this will be easier to resolve after the change to composition.

@@ -573,6 +564,8 @@ def _maybe_mask_result(self, result, mask, other, op_name):

@classmethod
def _create_arithmetic_method(cls, op):

# @ops.unpack_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback using the decorator here breaks a bunch of tests, specifically ones operating with DataFrame. Why does this return NotImplemented for Series/Index but not DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below, why use other = other.item() instead of item_from_zerodim?

@@ -51,6 +51,7 @@ def _period_array_cmp(cls, op):
opname = '__{name}__'.format(name=op.__name__)
nat_result = True if opname == '__ne__' else False

# @ops.unwrap_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger any idea why not returning NotImplemented for DataFrame?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure... I vaguely remember a broadcasting error, but that may have been user error.

@@ -1650,13 +1651,11 @@ def sparse_unary_method(self):

@classmethod
def _create_arithmetic_method(cls, op):

@ops.unpack_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both here and below this is technically a change since in the status quo this doesn't defer to DataFrame. Also doesn't call item_from_zerodim ATM.

if len(other) != len(self) and not is_timedelta64_dtype(other):
# Exclude timedelta64 here so we correctly raise TypeError
# for that instead of ValueError
raise ValueError("Cannot multiply with unequal lengths")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small change (reflected in a test below). tdarr * tdarr[:-1] checks for length mismatch before checking for dtype compat, so now raises ValueError instead of TypeError.

@@ -63,6 +63,8 @@ def _try_get_item(x):


def _make_comparison_op(op, cls):

# @ops.unpack_and_defer
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index.__eq__(Series) doesn't return NotImplemented, kind of inconvenient special case

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a good idea.

@codecov
Copy link

codecov bot commented Dec 14, 2018

Codecov Report

Merging #24282 into master will increase coverage by <.01%.
The diff coverage is 88.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24282      +/-   ##
==========================================
+ Coverage   92.22%   92.23%   +<.01%     
==========================================
  Files         162      162              
  Lines       51828    51785      -43     
==========================================
- Hits        47798    47763      -35     
+ Misses       4030     4022       -8
Flag Coverage Δ
#multiple 90.63% <88.67%> (ø) ⬆️
#single 43.06% <58.49%> (+0.06%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/datetimelike.py 96.44% <ø> (ø) ⬆️
pandas/core/arrays/period.py 98.5% <ø> (ø) ⬆️
pandas/core/arrays/datetimes.py 98.23% <ø> (-0.01%) ⬇️
pandas/core/arrays/categorical.py 95.31% <ø> (ø) ⬆️
pandas/core/arrays/integer.py 95.78% <100%> (+0.24%) ⬆️
pandas/core/arrays/sparse.py 92.03% <100%> (-0.06%) ⬇️
pandas/core/indexes/base.py 96.16% <100%> (-0.12%) ⬇️
pandas/core/indexes/range.py 97.29% <100%> (-0.04%) ⬇️
pandas/core/arrays/timedeltas.py 87.65% <80.95%> (+0.49%) ⬆️
pandas/core/ops.py 94.51% <90.9%> (+0.25%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 040f06f...2f929af. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 14, 2018

Codecov Report

Merging #24282 into master will increase coverage by 0.01%.
The diff coverage is 95.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24282      +/-   ##
==========================================
+ Coverage   92.22%   92.23%   +0.01%     
==========================================
  Files         162      162              
  Lines       51824    51774      -50     
==========================================
- Hits        47795    47756      -39     
+ Misses       4029     4018      -11
Flag Coverage Δ
#multiple 90.64% <95.91%> (+0.01%) ⬆️
#single 43.07% <71.42%> (+0.06%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/datetimelike.py 96.44% <ø> (ø) ⬆️
pandas/core/arrays/period.py 98.48% <ø> (ø) ⬆️
pandas/core/arrays/datetimes.py 98.23% <ø> (ø) ⬆️
pandas/core/arrays/timedeltas.py 88.41% <100%> (+1.25%) ⬆️
pandas/core/arrays/integer.py 95.78% <100%> (+0.24%) ⬆️
pandas/core/arrays/categorical.py 95.3% <100%> (-0.01%) ⬇️
pandas/core/arrays/sparse.py 92.03% <100%> (-0.06%) ⬇️
pandas/core/indexes/base.py 96.16% <100%> (-0.12%) ⬇️
pandas/core/indexes/range.py 97.29% <100%> (-0.04%) ⬇️
pandas/core/ops.py 94.51% <90.9%> (+0.25%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b0fa8e...4d48100. Read the comment docs.

@jreback jreback added Clean Internals Related to non-user accessible pandas implementation labels Dec 14, 2018
@jbrockmendel
Copy link
Member Author

After a bunch more poking at this, I think we should hold off on applying this decorator for the methods where it would change behavior, in particular for Categorical and PeriodArray comparisons. Implementing tests for these would be an afterthought, runs the risk of being half-baked.

@jbrockmendel
Copy link
Member Author

@jreback thoughts here? I'm increasingly leaning towards being even stricter, reverting usages that change behavior (basically everything outside of timedeltas) and incrementally re-implementing those in follow-ups with appropriate tests (i.e. things like ops with zero-dim arrays)

@jreback
Copy link
Contributor

jreback commented Dec 18, 2018

@jreback thoughts here? I'm increasingly leaning towards being even stricter, reverting usages that change behavior (basically everything outside of timedeltas) and incrementally re-implementing those in follow-ups with appropriate tests (i.e. things like ops with zero-dim arrays)

not sure exactly what you mean here. with the decorator?

@jbrockmendel
Copy link
Member Author

not sure exactly what you mean here. with the decorator?

The decorator is in fairly good shape, the question is where we want to actually use the decorator. AFAICT every place the decorator is used ATM implies some non-zero change to the behavior of the affected function. For e.g. TimedeltaArray.__mul__, div, etc the change is really small, just handles iterators more precisely. For Categorical comparisons on the other hand, the change is pretty big (with accompanying tests).

But pretty much all the cases in between make not-quite-trivial changes to the methods' behaviors with no accompanying tests. The decision is where to draw then line on those, and my current thought is to be relatively strict for now.

@jreback
Copy link
Contributor

jreback commented Dec 18, 2018

But pretty much all the cases in between make not-quite-trivial changes to the methods' behaviors with no accompanying tests. The decision is where to draw then line on those, and my current thought is to be relatively strict for now.

can you highlite this change via an example(s)?

@jbrockmendel
Copy link
Member Author

can you highlite this change via an example(s)?

IntNA

arr = np.arange(5)
arr_na = pd.core.arrays.integer_array(arr)

other = (x for x in arr_na)
result = arr_na == other

# PR
>>> result
array([ True,  True,  True,  True,  True])

# master
>>> result
array([False, False, False, False, False])

Sparse


arr = np.arange(5)
df = pd.DataFrame(arr)
arr_sparse = pd.core.arrays.SparseArray(arr)

# PR
>>> arr_sparse == np.array(4)
[False, False, False, False, True]
Fill: False
IntIndex
Indices: array([1, 2, 3, 4], dtype=int32)

# master
>>> arr_sparse == np.array(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/arrays/sparse.py", line 1712, in cmp_method
    if len(self) != len(other):
TypeError: len() of unsized object

# PR
>>> arr_sparse + df.T
   0  1  2  3  4
0  0  2  4  6  8

# master
>>> arr_sparse + df.T
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/arrays/sparse.py", line 1684, in sparse_arithmetic_method
    self=len(self), other=len(other))))
AssertionError: length mismatch: 5 vs. 1

@jbrockmendel
Copy link
Member Author

Also, thoughts on collecting categorical and sparse arithmetic/comparison tests in tests.arithmetic? I need to take a look at how to improve coverage for these without combinatorial explosion.

@TomAugspurger
Copy link
Contributor

When comparing with a generator, we want the behavior on master, right? That matches NumPy's behavior anyway.

@TomAugspurger
Copy link
Contributor

Also, thoughts on collecting categorical and sparse arithmetic/comparison tests in tests.arithmetic?

As an alternative, you could add an arithmetic pytest marker to the classes those tests are defined on. A dev might want "run all the sparse tests" or "run all the arithmetic tests". I'd favor just leaving them as is for now out of status quo bias.

@jbrockmendel
Copy link
Member Author

As an alternative, you could add an arithmetic pytest marker to the classes those tests are defined on.

That's not a bad idea. Regardless, this can wait until the next test-parametrization pass.

@jbrockmendel
Copy link
Member Author

When comparing with a generator, we want the behavior on master, right? That matches NumPy's behavior anyway.

If that's the case, then we probably need to revert that part of the decorator anyway right?

It isn't obvious to me that is the desired behavior. Is there a reason why would do unroll generators for arithmetic ops but not for comparisons?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 20, 2018 via email

@jbrockmendel
Copy link
Member Author

You could ask on the numpy mailing list to see if that
behavior is deliberate.

OK. It seems like this PR is going to be in a holding pattern for a while. I'll make a separate PR to implement the changes+tests for Categorical if you're OK with those.

@jbrockmendel
Copy link
Member Author

@TomAugspurger are you on board with the changes to Categorical comparison behavior? (easiest to look at the new test)

@TomAugspurger
Copy link
Contributor

IIUC the change in https://github.com/pandas-dev/pandas/pull/24282/files#diff-c859b3060cdd05f4d0f693b0a4b71fc5R130 is to raise on cat == df, making the behavior consistent with how numpy? If so, then +1

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Jan 4, 2019
@jbrockmendel
Copy link
Member Author

I still think this is worth doing, but I'm not comfortable with introducing lots of small changes in behavior without accompanying tests.

Closing now, will try a new approach after the RC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants