Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CLN] Dispatch (some) Frame ops to Series, avoiding _data.eval #22019

Merged
merged 37 commits into from
Oct 3, 2018
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
5195dad
avoid casting to object dtype in mixed-type frames
jbrockmendel Jul 22, 2018
1a66906
Dispatch to Series ops in _combine_match_columns
jbrockmendel Jul 22, 2018
a45abec
comment
jbrockmendel Jul 22, 2018
8594c48
docstring
jbrockmendel Jul 22, 2018
108550e
flake8 fixup
jbrockmendel Jul 22, 2018
07fb477
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 24, 2018
946e54a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 24, 2018
323f45e
dont bother with try_cast_result
jbrockmendel Jul 25, 2018
b3cedf1
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 25, 2018
a0708d1
revert non-central change
jbrockmendel Jul 25, 2018
5d3db89
simplify
jbrockmendel Jul 25, 2018
aa41de3
revert try_cast_results
jbrockmendel Jul 25, 2018
01c3720
revert non-central changes
jbrockmendel Jul 25, 2018
bcb6735
Fixup typo syntaxerror
jbrockmendel Jul 25, 2018
b3ef417
simplify assertion
jbrockmendel Jul 25, 2018
4b2e21a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 26, 2018
330a94a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 27, 2018
4c4f626
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Aug 10, 2018
757e2ae
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 8, 2018
ff96c0d
use dispatch_to_series in combine_match_columns
jbrockmendel Sep 8, 2018
17f33b6
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 12, 2018
c30f898
Pass unwrapped op where appropriate
jbrockmendel Sep 12, 2018
82a7928
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 12, 2018
52b7102
catch correct error
jbrockmendel Sep 12, 2018
453ae8e
whatsnew note
jbrockmendel Sep 12, 2018
93887cf
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 13, 2018
dd115c8
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 18, 2018
265ec78
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 18, 2018
db5ca89
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 27, 2018
f574c24
comment
jbrockmendel Sep 27, 2018
e6821a2
whatsnew section
jbrockmendel Sep 27, 2018
da2e32c
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 28, 2018
a6a7f58
remove unnecessary tester
jbrockmendel Sep 29, 2018
b21a8a2
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 29, 2018
858165f
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Oct 2, 2018
31d4089
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Oct 2, 2018
5832c2b
doc fixup
jbrockmendel Oct 2, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 13 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4922,21 +4922,30 @@ def _arith_op(left, right):
copy=False)

def _combine_match_index(self, other, func, level=None):
assert isinstance(other, Series)
left, right = self.align(other, join='outer', axis=0, level=level,
copy=False)
assert left.index.equals(right.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where this might no be true? If not, then let's remove it, as it could be somewhat expensive (though maybe not, since the output of align shares indices.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always holds. I added it largely so I wouldn't have to keep going back to core.ops to double-check what was being passed.

new_data = func(left.values.T, right.values).T
return self._constructor(new_data,
index=left.index, columns=self.columns,
copy=False)

def _combine_match_columns(self, other, func, level=None, try_cast=True):
assert isinstance(other, Series)
left, right = self.align(other, join='outer', axis=1, level=level,
copy=False)
assert left.columns.equals(right.index)

new_data = left._data.eval(func=func, other=right,
axes=[left.columns, self.index],
try_cast=try_cast)
return self._constructor(new_data)
# Note: we use iloc instead of loc for compat with
# case with non-unique columns; same reason why we pin columns
# to result below instead of passing it to the constructor.
new_data = {n: func(left.iloc[:, n], right.iloc[n])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason these are dispatched to internals is so that the proper casting can happen. why is moving operations code here a good idea, rather than internals? we want to ove code out of frame, not in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is moving operations code here a good idea, rather than internals? we want to ove code out of frame, not in

We want to move code out of internals. There are a ton of Issues about DataFrame operations that are caused by Block.eval not handling things correctly that will be fixed by using the (much more careful and well-tested) Series implementations (which BTW are the ones that correctly defer to EA and FooIndex implementations).

DataFrame ops are all over the place right now: op(DataFrame, DataFrame) and cmp(DataFrame, DataFrame) operate column-wise (dispatching to Series implementation). Other ops will either a) operate on self.values (doing expensive casting for mixed dtypes), or b) dispatch to _data.eval, which messes up on many dtypes and weird alignment behavior.

for n in range(len(left.columns))}

result = self._constructor(new_data, index=left.index, copy=False)
result.columns = left.columns
return result

def _combine_const(self, other, func, errors='raise', try_cast=True):
new_data = self._data.eval(func=func, other=other,
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/test_axis_select_reindex.py
Original file line number Diff line number Diff line change
Expand Up @@ -738,7 +738,7 @@ def test_align_int_fill_bug(self):

result = df1 - df1.mean()
expected = df2 - df2.mean()
assert_frame_equal(result, expected)
assert_frame_equal(result.astype('f8'), expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR needs a note in the what's new describing this change (as its effectievly an API change)


def test_align_multiindex(self):
# GH 10665
Expand Down
8 changes: 5 additions & 3 deletions pandas/tests/reshape/test_pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -1565,8 +1565,9 @@ def test_crosstab_normalize(self):
full_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='index'),
row_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='columns'),
col_normal)
tm.assert_frame_equal(
pd.crosstab(df.a, df.b, normalize='columns').astype('f8'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changing is not great, the point of these ops is that the dtypes won't change on the pivot

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is happening because ATM the casting is done at the Block-level, while with the PR casting is done at the column-level. pd.crosstab(df.a, df.b, normalize='columns') ATM returns

b    3    4
a          
1  0.5  0.0
2  0.5  1.0

and under the PR gives:

b    3    4
a          
1  0.5  0
2  0.5  1

i.e. the 4 column is int64. Note that the original inputs df.a and df.b are both int64 to begin with. I don't know crosstab well enough to comment on the ideal output, but it's hard to call this less-correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct way to answer this is ask "was the upcast from int to float the result of coercing to a single dtype for multiple columns?" If so, then we should preserve integer dtype where we can, i.e.

For an all-integer crosstab, we preserve integer dtype:

In [27]: a = pd.Series([1, 2])

In [28]: pd.crosstab(a, a)
Out[28]:
col_0  1  2
row_0
1      1  0
2      0  1

so this change seems good (and maybe worth explicitly testing rather than astyping the output, to ensure that we don't regress)?

col_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=1),
pd.crosstab(df.a, df.b, normalize='columns'))
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=0),
Expand Down Expand Up @@ -1599,7 +1600,8 @@ def test_crosstab_normalize(self):
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='index',
margins=True), row_normal_margins)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='columns',
margins=True), col_normal_margins)
margins=True).astype('f8'),
col_normal_margins)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=True,
margins=True), all_normal_margins)

Expand Down
7 changes: 5 additions & 2 deletions pandas/tests/series/test_operators.py
Original file line number Diff line number Diff line change
Expand Up @@ -1547,8 +1547,11 @@ def tester(a, b):
# this is an alignment issue; these are equivalent
# https://github.com/pandas-dev/pandas/issues/5284

pytest.raises(ValueError, lambda: d.__and__(s, axis='columns'))
pytest.raises(ValueError, tester, s, d)
with pytest.raises(TypeError):
d.__and__(s, axis='columns')

with pytest.raises(TypeError):
tester(s, d)

# this is wrong as its not a boolean result
# result = d.__and__(s,axis='index')
Expand Down