PERF: have series/panel arithmetic operators use expressions (numexpr) #3765

jreback · 2013-06-05T23:06:31Z

No description provided.

jtratner · 2013-06-09T04:28:40Z

I have this done for Series and implemented for scalar arithmetic in Panel. (less clear if it's possible / useful to accelerate combination operations with numexpr). @cpcloud @jreback could one of you write up how to create a Panel with >10K items with integer, floats, mixed integer and integers with zeros?

Basically, like these (which are what I'm using for test cases in test_expressions.py currently), but using Panel instead (don't need multiple, just one version each would be fine):

_frame  = DataFrame(np.random.randn(10000, 4), columns = list('ABCD'), dtype='float64')
_frame2 = DataFrame(np.random.randn(100, 4),   columns = list('ABCD'), dtype='float64')
_mixed  = DataFrame({ 'A' : _frame['A'].copy(), 'B' : _frame['B'].astype('float32'), 'C' : _frame['C'].astype('int64'), 'D' : _frame['D'].astype('int32') })
_mixed2 = DataFrame({ 'A' : _frame2['A'].copy(), 'B' : _frame2['B'].astype('float32'), 'C' : _frame2['C'].astype('int64'), 'D' : _frame2['D'].astype('int32') })
_integer  = DataFrame(np.random.randint(1, 100, size=(10001, 4)), columns = list('ABCD'), dtype='int64')

jreback · 2013-06-09T05:33:20Z

There are also examples in
pandas.util.testing.makePanel

In [17]: _frame2 = DataFrame(np.random.randn(100, 4),   columns = list('ABCD'), dtype='float64')

In [18]: df = DataFrame({ 'A' : _frame2['A'].copy(), 'B' : _frame2['B'].astype('float32'), 'C' : _frame2['C'].astype('int64'), 'D' : _frame2['D'].astype('int32') })

In [19]: Panel(dict(ItemA = df, ItemB = (df+1).astype('int64')))
Out[19]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 100 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemB
Major_axis axis: 0 to 99
Minor_axis axis: A to D

jtratner · 2013-06-09T19:30:19Z

@cpcloud FYI: numexpr isn't working with // right now (at least with our containers), produces the following exception within _evaluate_numexpr:

    Exception: TypeError("unsupported operand type(s) for //: 'VariableNode' and 'VariableNode'",)

cpcloud · 2013-06-09T20:00:00Z

Yup, check out my list on the eval thread; I think I note it there. Not sure what to do about it. What's weird is that they do support scalar floor division.

jtratner · 2013-06-09T20:01:24Z

@cpcloud maybe we could special case it to do true division and then convert it to int?

cpcloud · 2013-06-09T20:44:55Z

Yeah that should be fine.

cpcloud · 2013-06-09T20:49:48Z

Hm but let me think about it cuz for a complex expression how to cast without performing the op?
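
A minimal sketch of the special case being discussed, in plain Python rather than numexpr: floor division emulated as true division followed by a floor. The function name is hypothetical, and this is an illustration of the idea, not what pandas or numexpr does.

```python
import math


def floordiv_via_truediv(a, b):
    """Emulate elementwise a // b as floor(a / b) (hypothetical helper)."""
    return [math.floor(x / y) for x, y in zip(a, b)]


a = [7, -7, 9, 10]
b = [2, 2, 3, 5]

emulated = floordiv_via_truediv(a, b)
builtin = [x // y for x, y in zip(a, b)]
print(emulated == builtin)  # True

# Truncating instead of flooring gives a different answer for negatives.
print(int(-7 / 2), -7 // 2)  # -3 -4
```

Two caveats with this approach: a plain cast to int truncates toward zero, so the floor has to be explicit, and for floats the two forms can disagree (for example, `1.0 // 0.1` is `9.0` while `math.floor(1.0 / 0.1)` is `10`).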

jtratner · 2013-06-09T21:07:56Z

I set up Series to use numexpr. Is this speed increase at all useful? Unclear how good it actually would be...

In [10]: ser = Series(np.random.randn(1000000))

In [11]: ser2 = Series(np.random.randn(1000000))

In [12]: %%timeit
   ....: ser * ser2
   ....:
100 loops, best of 3: 6.87 ms per loop

In [13]: expr.set_use_numexpr(True)

In [14]: %%timeit
   ....: ser * ser2
   ....:
100 loops, best of 3: 6.76 ms per loop

In [15]: %%timeit
   ....: ser / ser2
   ....:
100 loops, best of 3: 7.89 ms per loop

In [16]: expr.set_use_numexpr(False)

In [17]: %%timeit
   ....: ser / ser2
   ....:
100 loops, best of 3: 10.4 ms per loop

jreback · 2013-06-09T21:12:37Z

In reality these are very similar to numpy operations and they probably cannot be parallelized, which is where a large part of the speedups come from; doing these same operations on frames yields pretty big speedups (and even better when we have eval, which will enable multiple operations to be sent to ne)

jtratner · 2013-06-09T21:58:15Z

It just occurred to me that I'm using a 2 core Macbook Air, so I might not notice as much of a difference or I'm not using the right test cases to see the difference. (frame appears to be accelerated by about 30%, huge panel very little and series very little)

jreback · 2013-06-09T22:20:10Z

build the eval branch that @cpcloud is working on

that's interesting because the more things u have in a term the faster it gets

cpcloud · 2013-06-09T22:23:47Z

@jtratner just fyi as of now only scalar arith ops with series/frame and already-aligned frames and series arith ops work as well...

cpcloud · 2013-06-09T22:24:55Z

it's also fun to make a huge frame and run htop to see all of my cores being used for ops :)

jtratner · 2013-06-12T01:36:24Z

hey, do you have any instructions on how to set up vbench/vb_suite? Not clear and I'd like to help write cases and/or check out performance after changes.

jreback · 2013-06-12T01:39:00Z

see this #3156

jtratner · 2013-06-15T00:57:13Z

@jreback I worked something up and would be interested in your thoughts on it. After I started with this, I realized that I was duplicating all the code to create the special arithmetic methods (i.e. __*__) and flex methods (e.g., pow, mul, etc.), so I pushed them all into two classmethods in pandas/core/generic.py (replicating how Panel handled aggregated methods). It definitely uses some magic to get it done and I'm not sure whether it adds more complexity than it's worth. It shouldn't add any extra cost performance-wise, since it still binds to the class at compile/import time. See jtratner/pandas@899576d

Upsides:

Fewer places to update arithmetic methods (and therefore less chance of errors like in BUG: Fix __truediv__ numexpr error #3764).
Creates a consistent arithmetic method interface-ish thing (previously I think Series and Panel were missing some of the operators).
Doesn't change existing behavior except to speed up certain operations
Allows SparsePanel to opt out of numexpr
Doesn't override methods defined in immediate superclass.
Doesn't add an extra layer of function calls...ultimately, everything is bound nearly the same way in the derived classes.
Points the way to more unification of arithmetic methods between Series, DataFrame, and Panel.

Downsides:

Adds more indirection, making it slightly harder to follow the code.
More difficult / impossible for static code analyzers to determine all the methods are defined on them (potentially could be remedied by putting stub methods on NDFrame). That said, this issue must crop up elsewhere with all the Appender usage on docstrings, right?
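
A minimal sketch of the pattern described above: one classmethod binds both the special methods (`__add__`, ...) and the flex methods (`add`, ...) onto a class at import time. `ToyFrame`, `ToySeries`, `_combine` and `_add_arith_methods` are hypothetical names for illustration; this is not the code in the linked commit.

```python
import operator

ARITH_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "truediv": operator.truediv,
}


def _make_method(op, name):
    """Build one binary method that delegates to the instance's _combine."""
    def method(self, other):
        return self._combine(other, op)
    method.__name__ = name
    return method


class ToyFrame:
    """Hypothetical stand-in for a shared base class."""

    def __init__(self, values):
        self.values = list(values)

    def _combine(self, other, op):
        # Elementwise against another container, or broadcast a scalar.
        if isinstance(other, ToyFrame):
            return type(self)(op(x, y) for x, y in zip(self.values, other.values))
        return type(self)(op(x, other) for x in self.values)

    @classmethod
    def _add_arith_methods(cls, ops=ARITH_OPS):
        for name, op in ops.items():
            for attr in ("__%s__" % name, name):
                # Leave alone anything the class already defines itself.
                if attr not in cls.__dict__:
                    setattr(cls, attr, _make_method(op, attr))


class ToySeries(ToyFrame):
    pass


# Bound once, at import time, so calls go straight to the method.
ToySeries._add_arith_methods()

s = ToySeries([1, 2, 3])
print((s + s).values)    # [2, 4, 6]
print(s.mul(10).values)  # [10, 20, 30]
```

Because the methods only appear after `_add_arith_methods()` runs, a static analyzer reading the class body will not see them, which is the downside noted above.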

jreback · 2013-06-15T01:05:58Z

I like it!

your downside points are correct but clearly IMHO outweighed by
less code and less chance of errors when things get modified

I am of the camp that you should only have code to do something in one place
but sometimes I jump thru hoops with subclass/magic to make it happen

great job!

can merge first thing in 0.12

helps consolidation - always a good thing

cpcloud · 2013-06-15T01:09:47Z

this is nice! great to have the behavior propagate down into the pandas object hierarchy with minimal fuss

jtratner · 2013-06-15T01:12:30Z

@jreback @cpcloud thanks! :) now I just need to add a testing mechanism to numexpr to check that this actually uses numexpr everywhere I'm claiming it does (going to follow @jreback's suggestions above). Right now it just checks that the numexpr speedups do not change any of the existing behavior on calculations and that it doesn't mess up default axes on DataFrame, etc.

jreback · 2013-06-15T01:16:49Z

@jtratner as an aside - my heuristic of when to use ne may be too conservative
u may want to try a couple of diff values to see where it helps vs it's overhead

I am talking about the 20k number of elements here (I forget what number is actually there)

jtratner · 2013-06-15T01:16:56Z

@cpcloud @jreback Blah, Travis is going to fail right now because of a URL error in unrelated package [this time it's a yahoo test] :-/ .

cpcloud · 2013-06-15T01:17:28Z

number is 10k
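
A sketch of the kind of size gate under discussion, using the 10k figure quoted in this thread. The function name, its parameters, and the strict `>` comparison are assumptions for illustration, not the actual check in core/expressions.py.

```python
_MIN_ELEMENTS = 10000  # threshold quoted in the thread


def can_use_numexpr(shape, enabled=True, min_elements=_MIN_ELEMENTS):
    """Return True when an array of this shape is big enough to bother.

    Hypothetical helper: small inputs stay on the plain numpy path because
    the fixed overhead of the accelerated path would outweigh any gain.
    """
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return enabled and n_elements > min_elements


print(can_use_numexpr((100, 4)))     # False: 400 elements
print(can_use_numexpr((10000, 4)))   # True: 40000 elements
```

Making `min_elements` a parameter keeps it easy to time a few different cutoffs against each other.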

cpcloud · 2013-06-15T01:17:52Z

what's with those anyway i keep getting google/yahoo failures

jtratner · 2013-06-15T01:19:37Z

Maybe Travis' IPs are getting flagged as spamming because of all the builds?

jreback · 2013-06-15T01:20:15Z

I thought there was a fix somewhere?

need a decorator to skip a network test that fails because of a urlopen failure (e.g. connection refused)
they exist in some of the yahoo/google tests but should be on all of them
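
A minimal sketch of such a decorator, written against the Python 3 standard library (the thread predates that, so the exception names here are an assumption). `skip_on_network_error` is a hypothetical name, not an existing pandas helper.

```python
import functools
import unittest
from urllib.error import URLError


def skip_on_network_error(test_func):
    """Turn connection failures inside a test into a skip (hypothetical helper)."""
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except (URLError, ConnectionError) as exc:
            # Covers refused/reset connections as well as urlopen failures.
            raise unittest.SkipTest("network unavailable: %s" % exc)
    return wrapper


@skip_on_network_error
def fake_network_test():
    # Stand-in for a test that hits a remote service.
    raise ConnectionRefusedError("connection refused")
```

Other exceptions (assertion failures, parsing errors) still propagate, so a genuinely broken test is not hidden by the skip.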

cpcloud · 2013-06-15T01:21:20Z

anyway let me not derail the conversation here

jtratner · 2013-06-20T02:02:06Z

@jreback @cpcloud this is 90% of the way done, I just have to get a test suite working for testing that evaluate actually successfully evaluated an expression. The only problem that I have with this is that it kills the ability for static code analyzers to note that add, div, mul, pow, etc. are methods and have real docstrings. Have you encountered this elsewhere? Do you have any suggestions on handling this? (or maybe it's not worth dealing with?) [btw - it's here: jtratner/pandas@899576d]

cpcloud · 2013-06-20T02:06:31Z

personally, i never use that feature of static code analyzers. i usually just want them to tell me, e.g., "hey you forgot to assign a variable here, dummy!" so i would say it's not worth it...expressions in pandas are a bit different anyway so the best way to get information about them short of reading the code, is in the online documentation.

jtratner · 2013-06-20T02:12:54Z

@cpcloud sorry, I was conflating two things. I mean that, in the original code, the static analyzer could at least tell you that the add, div, and mul methods existed.

separately - I'm trying to add some kind of testing functionality to core/expressions.py so that you can actually assert "Hey, this test case actually used numexpr - yay!". As it stands now, it's not really possible to tell if numexpr was successfully used or not. (except potentially by performance time)
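
One way this could look, as a rough sketch: a module-level test mode that records, for each call, whether the accelerated path was chosen. All names here (`set_test_mode`, `get_test_result`, the toy `evaluate` and its `min_elements` cutoff) are assumptions for illustration, and the "fast path" is only a stand-in that computes the same result.

```python
import operator

_TEST_MODE = False
_TEST_RESULT = []


def set_test_mode(enabled=True):
    """Switch recording on or off and clear earlier records."""
    global _TEST_MODE, _TEST_RESULT
    _TEST_MODE = enabled
    _TEST_RESULT = []


def get_test_result():
    """Return the recorded path choices and reset the record."""
    global _TEST_RESULT
    result = _TEST_RESULT
    _TEST_RESULT = []
    return result


def evaluate(op, a, b, min_elements=3):
    """Toy evaluate: decide on a path, record the decision, compute."""
    use_fast_path = len(a) >= min_elements
    if _TEST_MODE:
        _TEST_RESULT.append(use_fast_path)
    # Stand-in: both paths compute the same thing in this sketch.
    return [op(x, y) for x, y in zip(a, b)]


set_test_mode(True)
evaluate(operator.add, [1, 2, 3], [4, 5, 6])
evaluate(operator.add, [1], [2])
print(get_test_result())  # [True, False]
```

A test can then assert on the recorded list instead of inferring the path from timings.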

jreback · 2013-09-20T17:21:31Z

@cpcloud @jtratner I think eval fixes this...so close?

cpcloud · 2013-09-20T17:22:46Z

i think this will be closed by @jtratner's arith op refactor

jreback mentioned this issue Jun 5, 2013

BUG: Fix __truediv__ numexpr error #3764

Merged

jtratner mentioned this issue Jun 10, 2013

BUG: Determine cause of floating point exception with accelerated __mod__ #3819

Closed

jtratner mentioned this issue Jun 27, 2013

ENH: Use numexpr for all arithmetic operators + refactor arithmetic operations #4051

Closed

3 tasks

jtratner mentioned this issue Aug 14, 2013

CLN/ENH: Provide full suite of arithmetic (and flex) methods to all NDFrame objects. #4560

Closed

jtratner mentioned this issue Sep 29, 2013

CLN/ENH: Provide full suite of arithmetic (and flex) methods to all NDFrame objects. #5022

Merged

jtratner closed this as completed in #5022 Sep 29, 2013
