ENH: eval function #3393

jreback · 2013-04-18T17:44:05Z

Provide a top-level eval function, something like:

pd.eval(function_or_string,method=None, **kwargs)

to support things like:

out-of-core computation (locally) (see ENH: create out-of-core processing module #3202)
string evaluation which does not use python lazy evaluation
(so pandas can process efficiently)

pd.eval('df + df2',method='numexpr') (or maybe default to numexpr)

see also:
http://stackoverflow.com/questions/16527491/python-perform-operation-in-string

possible out-of-pandas evaluation
pd.eval('df + df2',method='picloud')
http://www.picloud.com/ (though they seem to have really old versions of pandas),
but I think they handle it anyhow
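For reference, a minimal sketch of what such a call can look like. pandas did later ship a top-level `pd.eval`, where the backend is selected with an `engine` keyword rather than the `method` keyword proposed here. The frame contents and the explicit `local_dict` are illustrative assumptions, and `engine="python"` is used so the sketch does not depend on numexpr being installed.

```python
import numpy as np
import pandas as pd

# Two small frames with matching labels (illustrative data only).
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.ones((3, 2)), columns=["a", "b"])

# Names used inside the string are passed explicitly rather than
# relying on lookup in the caller's scope.
names = {"df": df, "df2": df2}
result = pd.eval("df + df2", engine="python", local_dict=names)

print(result)
```

Switching to `engine="numexpr"` keeps the call the same and only changes the evaluation backend.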


hayd · 2013-05-19T14:35:12Z

To follow on from your comment, shouldn't we be using & and |? I think this may also have the benefit of all and any just working.

Also ~ for not/invert (since would make it the same as numpy).

I haven't got my head around numexpr yet, so I may be talking complete nonsense. (I've moved Term to expressions without breaking things, and changed the repr to eval back to itself; was there a reason for it not to?)

jreback · 2013-05-19T14:47:42Z

I agree about the operators (though I think you actually need to accept both), these are always going to be in a string expression in any event....because you need delayed evaluation
but since we actually DO want the & etc...you can just replace them (e.g. this is really a user interface issue), we are not actually going to evaluate them

e.g.

df[(df > 0) and (df < 10)]

vs

df['(df > 0) and (df < 10)']

cpcloud · 2013-05-24T15:03:19Z

i'm sure everyone involved in this thread knows this but just wanted to point out that the precedence of and and & is different. if i was a first time user i would think that df > 0 and df < 10 and df > 0 & df < 10 do the same thing, so if both are going to be supported i think precedence rules should be kept as close to python as possible meaning parens are required for & but not for and.
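The precedence difference can be checked with the standard-library `ast` module. This is a small sketch; `top_level` is a hypothetical helper name, and `a` is just a placeholder variable that is parsed but never evaluated.

```python
import ast


def top_level(expr):
    """Return the name of the outermost AST node type of an expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


# `and` binds more loosely than comparisons, so it ends up at the top.
print(top_level("a > 0 and a < 10"))    # BoolOp

# `&` binds more tightly than comparisons, so this parses as the
# chained comparison  a > (0 & a) < 10.
print(top_level("a > 0 & a < 10"))      # Compare

# With parentheses the `&` is the outermost operation again.
print(top_level("(a > 0) & (a < 10)"))  # BinOp
```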

jreback · 2013-05-24T15:27:45Z

@cpcloud this is in a string context, so in theory you can make them the same (this is a big confusion in operating in pandas, I think; people expect them to be the same, even though they are wrong)

cpcloud · 2013-05-24T15:41:21Z

@jreback sure. i was just semi-thinking-out-loud here, thought that it might warrant a discussion. this goes back to python core devs not wanting to provide the ability to overload not and and or so numpy was forced to overload bitwise operators for boolean operations (there's a youtube video of a discussion about this with GVR, there's even a patch to core python that allows you to do this). i really wish that pep went through sigh. i didn't realize there was a big confusion here, since this really has nothing to do with pandas, it's a language feature/bug. i was just thinking that adding more parsing rules to remember is annoying to users.
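A toy illustration of the language limitation described above: `&` dispatches to `__and__` and can be made element-wise, while `and` only asks an object for a single truth value. The `Mask` class is hypothetical and exists only for this sketch; raising from `__bool__` is an assumption chosen to mimic how array-like libraries typically refuse an ambiguous truth value.

```python
class Mask:
    """Toy element-wise boolean container (hypothetical, illustration only)."""

    def __init__(self, bits):
        self.bits = list(bits)

    def __and__(self, other):
        # `&` dispatches here, so it can be defined element-wise.
        return Mask(a and b for a, b in zip(self.bits, other.bits))

    def __bool__(self):
        # `and` / `or` / `not` only ever request one truth value.
        raise ValueError("truth value of a Mask is ambiguous")


left, right = Mask([True, False, True]), Mask([True, True, False])

print((left & right).bits)  # element-wise result

try:
    left and right          # calls bool(left) before looking at right
except ValueError as exc:
    print("and failed:", exc)
```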

jreback · 2013-05-24T15:46:20Z

it's a valid point

the purpose of eval is to facilitate multi expression parsing that we will evaluate in Numexpr
so we have to have a string input (to avoid python evaluation of the sub expressions)
or maybe there is a way to disable this (like how %timeit works in ipython)
but i think they r using an input hook and hence everything really is a string

cpcloud · 2013-05-24T15:56:38Z

@jreback u can do it with the cmd module too. i think ipython used to use that or maybe they still do. i think only macros would allow you do this without string literals. btw there is now a Python macros library. i haven't tried it out but it looks like fun. another possibility is to support numba as a method although first things first (numexpr). do u already have something going for this?

jreback · 2013-05-24T16:14:30Z

@hayd said he was giving a stab
Andy can u post a link to your branch?

jreback · 2013-05-24T16:18:35Z

@cpcloud numba is interesting, but the infrastructure requirement is high, and in any event, its basically using numexpr under the hood :) (as well as ctable for storage)

jreback · 2013-05-24T16:21:03Z

@cpcloud I reread your question

the issue is this: df[(df>0) & (df<10)] is evaluated as 3 separate sub-expressions, plus a boolean mask selection

while

df.eval('(df>0) & (df<10)') can be evaluated (after alignment) in a single numexpr pass (and then a boolean mask) to return the dataframe, so can be a massive speedup

that's the main reason for this function
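A small sketch of the two forms side by side, using the top-level `pd.eval` that pandas later shipped. The frame contents and the explicit `local_dict` are illustrative assumptions. On a frame this small there is no speedup to observe; the point is only that both forms produce the same mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-4.0, 16.0).reshape(5, 4), columns=list("abcd"))

# Eager form: each comparison and the & build a full temporary frame.
eager_mask = (df > 0) & (df < 10)

# String form: the whole expression is handed to the evaluator at once.
# With no explicit engine, pandas uses numexpr when it is installed
# and falls back to plain Python otherwise.
eval_mask = pd.eval("(df > 0) & (df < 10)", local_dict={"df": df})

# Boolean mask selection, as in df[(df>0) & (df<10)].
selected = df[eval_mask]
print(selected)
```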

cpcloud · 2013-05-24T16:24:24Z

@jreback that is pretty cool. i haven't done much with numexpr, i assumed that pandas uses it when it can...is that a fallacious assumption? should i be explicitly using numexpr?

jreback · 2013-05-24T16:46:14Z

it's used in pytables for query processing
and in most evaluations now as of 0.11
(you need a fairly big frame for it to matter)

see the core/expressions module

hayd · 2013-05-24T16:49:56Z

I haven't done much so far, I've moved Term to expressions and added some helper functions for that class, nor have I really looked into numexpr yet.

I kind of lost my way on the road map... and may be totally confused atm.

Am I way off here?

~~1. move term to expression~~
2. create class for "termset" (not sure what name, I was thinking this would be a list (possibly of termsets) with a flag whether it was all/any).
3. work out how to process termset strings with numexpr (is this the tricky part?)
4. create method for "termset" to strings which can be processed by numexpr e.g.
5. create parser for our DSL to termset e.g. '(df>0) & (df<10)' -> [Term(df, '>', 0), Term(df, '<', 10)]

jreback · 2013-05-24T17:16:11Z

so there are 3 goals here:

1. parser to turn:

 'df[(df>0)&(df<10)]'

into this (call this the parsed_expr)

df[_And(Term('df','>',0),Term('df','<',10))]

2. take a parsed_expr, align the variables (e.g. perform the alignment/reindexing steps that combine_frame, combine_series, combine_scalar do), call this the aligned_expr
3. take aligned_expr and turn this into a numexpr expression (like what Term does and the expressions module does, though it's very simple); this would be an expansion of expressions to take in aligned Terms with their boolean operators (e.g. _And/_Or/_Not and parens)

notes on each:

1. involves tokenizing/ast manip (kind of like numexpr.evaluate does) to form the Terms; I am not sure how tricky this is, so we were going to skip for now
2. this is straightforward: take the parsed_expr and substitute variables that are aligned (keep frames as frames), don't need to expand scalars at all, mainly just reindex things that need it, create the aligned_expr
3. this is straightforward too, just take the term expressions and generate the numexpr itself

so I think termset is really Term, plus the boolean operators, and a grouping operator (the parens)
these just allow easy expression manip (your 2)
your 3 (skip for now, that's my 1)

your 4 is my 3

I don't think you need 5

cpcloud · 2013-05-24T17:38:39Z

@jreback i know u said skip 1 but i can do that if u want (lots of nice python builtins for dealing with python source) while @hayd does 2 and 3. what would be allowed to be parsed? exprs in the python grammar? or just boolean exprs? could start with booleans for now and extend after that is working...

jreback · 2013-05-24T18:06:20Z

the more the merrier!

let's start with the example

df.eval('(df>0)&(df<10)')

This is really about the masks as that's where all the work is done

but I think it would be nice eventually to do something on the rhs as well:

pd.eval('df[(df>0)&(df<10)] = df * 2 + 1', engine='numexpr')

so we can support getitem and setitem and pass both the lhs and rhs to the evaluator

(imagine engine = 'multi-process' or 'out-of-core')......

to the heck with blaze! (well maybe engine=blaze is ok too)

hayd · 2013-05-24T19:11:42Z

I think I was worried that nested Terms wouldn't come for free with _And and _Or, but I'll put something together imminently and we can see whether it does. :)

hayd · 2013-05-24T19:12:29Z

We can just tell everyone it's blaze...

cpcloud · 2013-05-24T19:15:33Z

i've got it parsing nested and terms already :)

cpcloud · 2013-05-24T19:15:56Z

albeit they are strings right now and only & (parsing and is different), i haven't written the _And class yet

jreback · 2013-05-24T19:24:31Z

@cpcloud I would just use the &, |, and ~ for now (to keep consistent), can always add later

jreback · 2013-05-24T19:39:50Z

@cpcloud

the end goal is to create a numexpr expression (the functionality is in the Selection class in io/pytables.py); so the class that holds the parsed expression (the nested _And/_Or) should parse to this (and has to do type translation and such), also this class could do the alignment I think (which is the reason for having the parsed expression, so you can basically just iterate thru all of the terms and see what needs to be aligned)

e.g.

for t in term_expression:
      t.align()

Term align (pseudo codish)

def align(self):
      self.lhs
      self.op
      self.rhs

      if self.lhs is a DataFrame:
           if self.rhs is a Series....
                     is a Frame
                     is a Scalar

maybe return a new expression that is aligned
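The dispatch in the pseudocode above could look roughly like this as a free function. It is a sketch under assumptions: `align_operands` is a hypothetical name, the outer join is an arbitrary choice, and matching a Series against the frame's columns (`axis=1`) is assumed because that is how frame/series arithmetic lines up labels by default. `DataFrame.align` is existing pandas API.

```python
import pandas as pd


def align_operands(lhs, rhs):
    """Return (lhs, rhs) reindexed to shared labels; scalars pass through."""
    if isinstance(lhs, pd.DataFrame) and isinstance(rhs, pd.DataFrame):
        # Frame vs frame: align both the index and the columns.
        return lhs.align(rhs, join="outer")
    if isinstance(lhs, pd.DataFrame) and isinstance(rhs, pd.Series):
        # Frame vs series: match the series index to the frame columns.
        return lhs.align(rhs, join="outer", axis=1)
    # Scalars (and anything else) need no alignment.
    return lhs, rhs


a = pd.DataFrame({"x": [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({"x": [10.0, 20.0]}, index=[1, 2])

aligned_a, aligned_b = align_operands(a, b)
print(aligned_a)
print(aligned_b)
```

After this step both operands share the same labels, so their underlying arrays can be handed to an evaluator without further reindexing.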

cpcloud · 2013-05-24T19:57:57Z

ah i see. so an Expr class should hold the ands and ors which consist of terms (or nested expressions). Expr could have an align method which aligns and then passes to numexpr. is that correct?

jreback · 2013-05-24T20:08:28Z

I think you actually need 3 classes here:

1) Term, which holds lhs operator rhs (and prob a reference back to the top-level Expr for variable inference)
2) Termset, although maybe Expr, or maybe Terms? is better here (I mean a nested list of _and, _or, _not operators on the Terms)
3) Top-level, maybe Expression, which holds 2) the termset, and the engine and such

e.g.

pd.eval('df[(df>0)&(df<10)]')

yields

Expression():
    original string
    df[mask] (you need to keep this where)
    termset of the boolean expression
    engine
    maybe an environment points (this is like a closure) but we are not fancy here :)

    methods:
        parse (create the termset)
        align (have the termset align)
        convert_to_engine_format (return the converted termset)

Termset():
     _and(Term('df','>',0),Term('df','<',10))
     methods:
          align (maybe return a new termset that is aligned)
          convert_to_engine_format (return the converted to engine format,
              this would be a string)
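The three-class outline above can be sketched with dataclasses. The class names and `convert_to_engine_format` come from the outline; the fields, the flat `op` plus `terms` layout of `Termset`, and the exact string format are assumptions made for illustration. Parsing and alignment are left out.

```python
from dataclasses import dataclass


@dataclass
class Term:
    lhs: str
    op: str
    rhs: object

    def convert_to_engine_format(self):
        # One parenthesised comparison, e.g. "(df > 0)".
        return f"({self.lhs} {self.op} {self.rhs!r})"


@dataclass
class Termset:
    op: str      # "&" or "|"
    terms: list  # Terms or nested Termsets

    def convert_to_engine_format(self):
        parts = (t.convert_to_engine_format() for t in self.terms)
        return "(" + f" {self.op} ".join(parts) + ")"


@dataclass
class Expression:
    source: str          # the original string
    termset: Termset     # the parsed boolean expression
    engine: str = "numexpr"

    def convert_to_engine_format(self):
        return self.termset.convert_to_engine_format()


expr = Expression(
    source="df[(df>0)&(df<10)]",
    termset=Termset("&", [Term("df", ">", 0), Term("df", "<", 10)]),
)
print(expr.convert_to_engine_format())
```

Since every level exposes the same method, nested Termsets convert recursively without the top-level Expression needing to know how deep the tree is.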

cpcloud · 2013-05-24T20:19:29Z

lol gh doesn't like ur rst flavored monospace

hayd · 2013-05-24T20:24:23Z

This was where I was up to: https://github.com/hayd/pandas/tree/term-refactor

cpcloud · 2013-05-24T20:59:47Z

possible engines right now are 'numexpr' and 'pytables'?

jreback · 2013-05-24T21:05:39Z

well....pytables target is the same, numexpr; only difference is that the Terms need to do different alignment, as they are scalar type conditions, e.g. index>20130523, where index is a field in the table, and the date gets translated to i8. so we do need support for that (so yes you could use engine=pytables), but in pytables you need to have what I call the queryables dict passed in anyhow for validation, whereas in the case of a boolean expression you have the df passed in (or taken from the locals())
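To make the "date gets translated to i8" step concrete: pandas stores datetime values as 64-bit integers counting nanoseconds since the Unix epoch. A standard-library sketch of that translation follows; `to_i8` is a hypothetical helper, and treating the date string as UTC midnight is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_i8(date_string):
    """Parse YYYYMMDD and return nanoseconds since the Unix epoch."""
    moment = datetime.strptime(date_string, "%Y%m%d").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids float rounding.
    microseconds = (moment - EPOCH) // timedelta(microseconds=1)
    return microseconds * 1000


print(to_i8("20130523"))
```

A condition such as index>20130523 would then compare the stored integers against this value.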

cpcloud · 2013-05-24T21:40:22Z

@jreback @hayd fyi for some reason expressions.py has dos line endings while, for example, frame.py does not. isn't git supposed to take care of this? it's pretty annoying and will cause a billion and one merge conflicts...it's just that file: i just ran dos2unix on all of pandas and that's the only thing changed. i did this after a fresh clone

cpcloud · 2013-06-16T03:43:56Z

oh that is nice. still have the issue of the different behav tho

cpcloud · 2013-06-16T03:50:05Z

interesting and possibly alarming bit....

In [53]: df = DataFrame(randn(10000000, 10))

In [54]: df2 = DataFrame(randn(*df.shape))

In [55]: df3 = DataFrame(randn(*df.shape))

In [56]: s = 'df + df2 * 2 + df3 ** 2 * df * df + df2 ** 40'

In [57]: res = pd.eval(s)

In [58]: res2 = df + df2 * 2 + df3 ** 2 * df * df + df2 ** 40

In [59]: norm(res - res2)
Out[59]: 2411155374342516.5

In [60]: allclose(res, res2)
Out[60]: True

i'm guessing this is because of the large power term and because the arrays are big, but i don't see why the L2 norm should be that different (order of magnitude of difference is 10 ** 15). the L2 norm difference is much smaller with a power only half as big and basically disappears below this value.

In [63]: s = 'df + df2 * 2 + df3 ** 2 * df * df + df2 ** 20'

In [64]: res = pd.eval(s)

In [65]: res2 = df + df2 * 2 + df3 ** 2 * df * df + df2 ** 20

In [66]: norm(res - res2)
Out[66]: 1.4282464805728339

cpcloud · 2013-06-16T03:51:21Z

ideally the norm should be 0

cpcloud · 2013-06-16T04:06:28Z

i think numexpr might be unrolling integer power ops

cpcloud · 2013-06-16T04:19:23Z

well i don't think it's a bug in eval since this is easily replicable with straight numpy/numexpr

cpcloud · 2013-06-16T04:23:40Z

k not "dumb" loop unrolling maybe there is some other optimization technique or this is just a straight up bug

In [55]: df = DataFrame(randn(10000000, 10))

In [56]: x = df.values

In [57]: norm(ne.evaluate('x ** 40') - ne.evaluate(' * '.join(['x'] * 40)))
Out[57]: 3966006570425040.5

cpcloud · 2013-06-16T04:33:18Z

jreback · 2013-06-16T12:06:43Z

maybe some sort of overflow?

cpcloud · 2013-06-16T22:04:44Z

i believe the optimization is the cause of the divergence:

In [16]: x = randn(10000000, 10)

In [17]: norm(ne.evaluate('x ** 40', optimization='none') - ne.evaluate('x ** 40'))
Out[17]: 616872973144280.12
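A rough pure-Python illustration of how a huge norm and a passing allclose can coexist, assuming (as guessed earlier in the thread) that the optimized path expands an integer power into repeated multiplications. The value 4.7 is arbitrary; it stands in for the few large-magnitude entries a big random normal array will contain.

```python
import math

x = 4.7

direct = x ** 40        # one call to the power routine
product = 1.0
for _ in range(40):     # forty separate multiplications, each rounded
    product *= x

abs_diff = abs(direct - product)
rel_diff = abs_diff / direct

# Gap between adjacent float64 values near x ** 40 (math.ulp needs Python 3.9+).
spacing = math.ulp(direct)

print(f"abs diff: {abs_diff:.3e}")
print(f"rel diff: {rel_diff:.3e}")
print(f"spacing:  {spacing:.3e}")
```

Near 4.7 ** 40, neighbouring float64 values are more than 1e11 apart, so a disagreement of even one unit in the last place is enormous in absolute terms while staying negligible in relative terms. That would be consistent with allclose returning True while the norm of the difference is very large.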

cpcloud · 2013-06-16T23:00:26Z

@jreback @jtratner cast % nodes to float64? i think raising on floordiv with a message saying "pass engine=python or eval in python if you want to use floor division", thoughts?

jreback · 2013-06-16T23:02:04Z

I would just cast them

cpcloud · 2013-06-16T23:04:16Z

cast mod u mean right? can't really cast floordiv result as that would defeat the purpose of this...

jreback · 2013-06-16T23:08:55Z

yes

jreback · 2013-06-22T00:46:35Z

http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html

would be a good place for docs on eval

jtratner · 2013-06-22T03:26:37Z

I think it makes sense to cast to float, very simple to get back

cpcloud · 2013-06-22T03:27:48Z

yep this required a pretty large refactoring since to cast in a general way the op needs to know about the scope of its operands

cpcloud · 2013-06-22T03:28:14Z

must cast recursively down the parse tree

cpcloud · 2013-06-22T03:29:55Z

so that on eval the correct cast is performed...this will work unless there's floor division on both sides, but in that case you shouldn't be using eval anyway since that will run only on the python engine
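The idea of casting the operands of every modulus node, recursively, can be sketched on Python's own `ast` rather than on the pandas parse tree. This is an illustration only: `CastModOperands` and `cast_mod` are hypothetical names, wrapping operands in `float(...)` stands in for a float64 cast, and `ast.unparse` needs Python 3.9 or newer.

```python
import ast


class CastModOperands(ast.NodeTransformer):
    """Wrap both operands of every % node in float(...)."""

    def visit_BinOp(self, node):
        # Recurse first so that nested % nodes are rewritten too.
        self.generic_visit(node)
        if isinstance(node.op, ast.Mod):
            node.left = self._as_float(node.left)
            node.right = self._as_float(node.right)
        return node

    @staticmethod
    def _as_float(operand):
        return ast.Call(
            func=ast.Name(id="float", ctx=ast.Load()),
            args=[operand],
            keywords=[],
        )


def cast_mod(source):
    """Return the source of an expression with its % operands cast."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(CastModOperands().visit(tree))
    return ast.unparse(tree)


rewritten = cast_mod("a + b % (c % d)")
print(rewritten)
```

Only the `%` nodes are touched; the `+` at the top of the tree is left alone, which mirrors casting per operator rather than casting the whole expression.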

in other news...implementing an operator in numexpr is not trivial...i thought about doing it but it's kind of a beast...maybe i will anyway

cpcloud · 2013-06-22T03:43:22Z

eval is useful for two things as stated above:

there's now a parser for basic arithmetic expressions, useful for manip prior to performing ops
9-10x-ish speedup for long expressions containing huge frames that don't need alignment

cpcloud · 2013-06-25T22:43:30Z

@jreback is it intentional that df % series == series % df? E.g.,

In [20]: df = DataFrame(randn(10,10))

In [21]: df
Out[21]:
       0      1      2      3      4      5      6      7      8      9
0 -1.147 -0.175  0.867 -0.459 -0.751 -0.822 -0.927 -1.572  0.813 -0.558
1 -0.043 -1.416  1.420 -1.243 -0.656  0.726 -0.408  0.545 -2.712 -0.353
2  0.566  0.489  1.528  3.058  1.393  1.282 -0.276 -0.705 -0.183  1.386
3  0.679 -0.082  0.831 -2.167 -1.347 -0.178 -0.812 -0.465 -1.509 -0.337
4  0.031  0.975 -1.157 -0.613 -0.491 -0.478  1.763 -0.328 -0.897 -0.011
5  1.110 -0.088  0.162  0.061 -0.715  1.214  1.188  1.802 -0.841  1.435
6 -1.063 -0.447 -0.743 -0.567  1.492  0.468  2.043 -0.873 -0.803 -1.178
7  0.507 -2.446 -1.553  0.468 -0.148 -0.871 -0.207  1.386 -1.173  0.155
8 -1.267 -0.219 -0.021 -0.686  0.159 -0.868 -1.734  0.312 -1.460  0.864
9  1.800  0.751 -0.677 -2.029 -0.711 -0.748  0.555  1.060  0.493  1.842

In [22]: s = df[0]

In [23]: s
Out[23]:
0   -1.147
1   -0.043
2    0.566
3    0.679
4    0.031
5    1.110
6   -1.063
7    0.507
8   -1.267
9    1.800
Name: 0, dtype: float64

In [24]: allclose(df % s, s % df)
Out[24]: True

In [25]: allclose(df.values % s.values, s.values % df.values)
Out[25]: False

cpcloud · 2013-06-25T22:45:08Z

basically force the frame on the lhs of modulus is what's happening

cpcloud · 2013-06-25T23:02:30Z

i'll submit a pr to fix it

jreback mentioned this issue May 14, 2013

PyTables enhancements for selection #1996

Closed

cpcloud mentioned this issue Jun 26, 2013

WIP: add top-level eval function #4037

Closed

31 tasks

cpcloud mentioned this issue Jul 8, 2013

ENH: add expression evaluation functionality via eval #4162

Merged

35 tasks

jreback closed this as completed in #4162 Jul 8, 2013

cpcloud mentioned this issue Jul 8, 2013

ENH: add expression evaluation functionality via eval #4164

Merged

64 tasks

cpcloud reopened this Jul 8, 2013

cpcloud closed this as completed in #4164 Sep 16, 2013

danfrankj mentioned this issue Dec 23, 2015

filtration chain for DataFrames #11875

Closed

wesm unassigned cpcloud Oct 12, 2016
