
BUG: Fix __truediv__ numexpr error #3764

Merged
jreback merged 5 commits into pandas-dev:master from fix_division_with_numexpr on Jun 18, 2013

Conversation

@jtratner (Contributor) commented Jun 5, 2013

Fixes a number of items relating to accelerated arithmetic in frame.


__truediv__ had been set up to use numexpr, but this happened by passing the division operator as the string. numexpr's evaluate only checked two frames up, which meant that it picked up the division setting from frame.py and would do floor/integer division when both inputs were integers. You'd only see that issue with a DataFrame large enough to trigger numexpr evaluation (>10,000 cells).

This adds test cases to test_expressions.py that exhibit this failure under the Python 2.7 full-deps test. The test cases only cover Series and DataFrame (though it looks like neither Series nor Panel use numexpr). It doesn't fail under Python 3 because integer division is gone there entirely.

Now evaluate, _evaluate_standard and _evaluate_numexpr all accept extra keyword arguments, which are passed to numexpr.evaluate.
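
For illustration, here is roughly what that looks like at the numexpr level. This is a sketch rather than pandas code, and it assumes numexpr's evaluate accepts the truediv keyword, which is what the fix forwards through **eval_kwargs:

import numpy as np
import numexpr as ne

# two integer arrays, large enough that pandas would route them to numexpr
a = np.arange(1, 20001)
b = np.ones_like(a) * 3

# without an explicit flag, numexpr decides true vs. floor division by
# inspecting the calling frame, which is how frame.py's setting leaked
# into __truediv__
floor_div = ne.evaluate('a / b', truediv=False)  # integer/floor division
true_div = ne.evaluate('a / b', truediv=True)    # what __truediv__ should do

print(floor_div[:3])  # roughly [0 0 1]
print(true_div[:3])   # roughly [0.333 0.667 1.]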

The test case is currently a separate commit that fails. I wasn't sure whether I should have combined it with the bugfix commit or not. Happy to change it if that's more appropriate.

@jreback (Contributor) commented Jun 5, 2013

didn't realize ne could do truediv... good to know

fyi....easy to mod panel to use ne, series should be straightforward as well....

going to add an issue about that

@jreback (Contributor) commented Jun 5, 2013

see #3765

@jreback (Contributor) commented Jun 5, 2013

FYI, I just pushed this: c278ca6, because of the original failing test

numexpr is only installed on 1 travis build (the 3.2), BUT it is under numpy 1.6.1 which causes weirdness....hmmm

@jtratner (Contributor, Author) commented Jun 5, 2013

@jreback so why did it fail on 2.7? that test only runs if numexpr is installed. Also, probably a good idea to enable numexpr for one of the 2.x builds, right?

@jreback (Contributor) commented Jun 6, 2013

it failed on 2.7, weird (failed for 3.2 for me).....

...just testing enabling numexpr for all builds

@jtratner (Contributor, Author) commented Jun 6, 2013

@jreback check out the Travis CI link: https://travis-ci.org/jtratner/pandas/builds/7822069

@jtratner (Contributor, Author) commented Jun 6, 2013

(that link is for a different build, obviously). Just for clarity - what needs to happen to incorporate this fix? Or are you saying this still fails in 3.2?

@jreback (Contributor) commented Jun 6, 2013

that's your test though (numexpr IS enabled on that build)... look at the print_versions output at the very bottom (you have to click it)

@jreback (Contributor) commented Jun 6, 2013

I can't believe this never failed before...really weird https://www.travis-ci.org/jreback/pandas/jobs/7824130
fixed now though

@jtratner (Contributor, Author) commented Jun 6, 2013

@jreback fixed the test to not use div. Obviously test_expressions wasn't running in Python 3.

@jreback (Contributor) commented Jun 6, 2013

yep.....numexpr was only recently turned on....since we are now using it practically everywhere, going to turn it on for all builds

@jtratner (Contributor, Author) commented Jun 6, 2013

@jreback so yeah, I'm happy to help out. probably can't get to putting this into Series and Panel today, but hopefully by the end of the week. What's your position on putting in new tests with fixes? Should I have put the test together with the fix / after the fix? Just wondering so I have a sense of how to contribute well to this project.

@jreback (Contributor) commented Jun 6, 2013

I usually write the test first, make sure it breaks, then put the fix in
(it could be the same commit or not)...

if I get too many commits or I am 'trying' to fix something...then I just squash them at the end...

e.g. the HDFStore py3k fix was like 20 commits...squashed back into 5....had to try a bunch of things!

@jtratner (Contributor, Author) commented Jun 6, 2013

@jreback is there any way to confirm that numexpr was triggered? I added some print statements for my own testing, but I'm unclear whether test_expressions.py, as it stands now, would fail if the actual numexpr call wasn't working. In other words, it would still be getting the right numerical result, but it wouldn't use the speedup from numexpr.

@jreback (Contributor) commented Jun 6, 2013

put in some temporary print statements around the actual call to ne.evaluate (I think I may have had them at one time, but took them out)
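
A throwaway sketch of that kind of instrumentation, assuming the pandas 0.11-era module path pandas.core.expressions and that it binds numexpr as ne (the wrapper and counter names here are made up):

import numpy as np
import pandas as pd
import numexpr as ne
from pandas.core import expressions as expr  # location as of pandas 0.11/0.12

# a frame big enough to cross the >10,000-cell numexpr threshold
df = pd.DataFrame(np.random.randn(200, 100))

calls = {'n': 0}
_orig_evaluate = ne.evaluate

def _counting_evaluate(*args, **kwargs):
    # temporary instrumentation: count how often the numexpr path really runs
    calls['n'] += 1
    return _orig_evaluate(*args, **kwargs)

expr.ne.evaluate = _counting_evaluate  # patch the binding expressions.py uses
try:
    result = df + df
finally:
    expr.ne.evaluate = _orig_evaluate

print('numexpr calls:', calls['n'])  # 0 would mean only the fallback ran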

@jreback (Contributor) commented Jun 6, 2013

FYI...you might be interested in the long discussion about #3393; ne is going to get even heavier use going forward

@jtratner (Contributor, Author) commented Jun 6, 2013

@jreback just to be clear, what I mean is that if you set pandas.core.expressions to only use _evaluate_standard and _where_standard, none of the tests in test_expressions.py fail. Similarly, if you add a ValueError in the middle of _evaluate_numexpr that causes it to always fall back to _evaluate_standard, all the tests still pass. Any ideas on how to enforce that numexpr is actually used (or maybe that would come out with performance tests)? Just wondering if you think this is a problem.

def _evaluate_numexpr(op, op_str, a, b, raise_on_error = False, **eval_kwargs):
    result = None

    if _can_use_numexpr(op, op_str, a, b, 'evaluate'):
        try:
            a_value, b_value = a, b
            if hasattr(a_value,'values'):
                a_value = a_value.values
            if hasattr(b_value,'values'):
                b_value = b_value.values
            # This causes fallback to _evaluate_standard
            raise ValueError("unknown type object")
            result = ne.evaluate('a_value %s b_value' % op_str, 
                                 local_dict={ 'a_value' : a_value, 
                                              'b_value' : b_value }, 
                                 casting='safe', **eval_kwargs)
        except (ValueError), detail:
            if 'unknown type object' in str(detail):
                pass
        except (Exception), detail:
            if raise_on_error:
                raise TypeError(str(detail))

    if result is None:
        result = _evaluate_standard(op,op_str,a,b,raise_on_error)

    return result

@jreback (Contributor) commented Jun 6, 2013

@jtratner

2 different issues here

  1. numexpr is technically optional, so we always need to be able to fall back if numexpr is not installed; also fall back if it's not efficient to use it, or the dtypes are wrong
  2. tests: there is set_use_numexpr to explicitly force its use, which is for testing (in a similar vein, set_numexpr_threads is for perf testing; see the vb_suite) - set_use_numexpr will explicitly set/reset _evaluate to point to the right functions (which can actually STILL fall back if there is an exception in numexpr, always have to guard against that)

answer your question?
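
For reference, a minimal sketch of how those hooks get used from a test, assuming the pandas 0.11-era locations pandas.core.expressions and pandas.util.testing:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
from pandas.core import expressions as expr

df = pd.DataFrame(np.random.randn(200, 100))
other = pd.DataFrame(np.random.randn(200, 100))

# compute once with the plain-numpy path, once with numexpr forced on,
# and check that both paths agree (essentially what test_expressions.py does)
expr.set_use_numexpr(False)
expected = df / other

expr.set_use_numexpr(True)
result = df / other

tm.assert_frame_equal(result, expected)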

@jtratner (Contributor, Author) commented Jun 6, 2013

Thanks for digging into this with me - I appreciate you taking the time to explain!

I get what you're saying and it makes sense. However, test_expressions doesn't even get run unless numexpr is installed, so you can assert within that module that everything ought to be working correctly. What I'm wondering is how you test for the specific case where numexpr is installed but pandas's way of calling it fails. _evaluate_numexpr turns everything into a TypeError, which gets caught and handled by _arith_method, so I'm not sure how you check for that to make sure pandas is integrating correctly with numexpr.

@jreback (Contributor) commented Jun 6, 2013

numexpr rarely raises by itself (mainly on a type issue, which we catch); otherwise there's the catch-all on Exception for some other kind of error, I don't remember exactly. We only bail on the catch-all.

otherwise it is computed as normal (which can itself raise an exception, but only in a weird case I think)

so it's not tested per se that calling from a DataFrame works (and uses numexpr when it is installed).

I suppose you could add some functions to expressions which would set/unset a module-level flag for the last computation done (and maybe status info; you'd have to sort of 'record' which path was actually taken)

that way you could do something like this:

from pandas.core import expressions as e
e.set_test_mode(True)
result = big_df + 1
test_result = e.get_test_results()

and get_test_results can return (and invalidate) the module-level data which set_test_mode sets

and test_result could be say:

test_result = dict(computation = ...., result = ...., path = ....)

whatever
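
A rough sketch of what such a module-level hook could look like inside expressions.py; this is hypothetical code, and the names simply mirror the pseudo-code above:

# hypothetical additions to pandas/core/expressions.py, not an existing API
_TEST_MODE = False
_TEST_RESULT = None

def set_test_mode(v=True):
    global _TEST_MODE, _TEST_RESULT
    _TEST_MODE = v
    _TEST_RESULT = None

def _store_test_result(used_numexpr):
    # called from inside evaluate(): record only which path ran, never the
    # result itself, to avoid holding references to large arrays
    global _TEST_RESULT
    if _TEST_MODE:
        _TEST_RESULT = used_numexpr

def get_test_result():
    # return and invalidate the last recorded status
    global _TEST_RESULT
    res, _TEST_RESULT = _TEST_RESULT, None
    return res

evaluate() would then call _store_test_result(result is not None) right after trying the numexpr path, so a test can assert which branch actually ran.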

@jtratner (Contributor, Author) commented Jun 6, 2013

Sounds like it's not really worth the time/overhead to check though.

@jreback (Contributor) commented Jun 6, 2013

what do u mean?

@jtratner (Contributor, Author) commented Jun 6, 2013

Well, as in: would it slow things down much to check for a global variable each time? do you think it's actually worth checking for?

@jreback (Contributor) commented Jun 6, 2013

this was just for testing

you would have a single check inside evaluate that would set the test result

you could actually set the status each time, but you don't want to store the result at all because of the memory refs

easy enough to store the passed string and the status

@jtratner (Contributor, Author) commented Jun 8, 2013

I want to leave the test that checks that numexpr is actually used to be included with #3765, since that will probably turn into a more general check that has to happen for all expressions tests. Anything else I need to do before this can be merged? (Hopefully this gets into 0.11.1, since it reflects a numerical bug rather than a performance bug.)

@jtratner (Contributor, Author) commented Jun 9, 2013

Update on this: everything passes now, and I added all the standard arithmetic operators to the test cases. This exposed a regression where __mod__ was given a str_rep of * instead of %. Unfortunately, I kept getting floating point exceptions when I ran the tests for __mod__ accelerated by numexpr, so I just removed the str_rep so it falls back to operator.mod on the plain numpy path instead. (I think this was added on 5/14, so it doesn't need to go in RELEASE.rst.)

Accelerating // caused an inconsistency too (though I don't think it was actually set up): when dividing by zero, numpy returns 0.0000 whereas numexpr evaluates to inf. Not sure which is correct, so it's also not included.

In summary, this pull request adds test cases for expressions.py covering all the major operators, and tests both Series (which isn't currently accelerated) and DataFrame with >10,000 cells. It also fixes __truediv__ to actually do true division when evaluated under numexpr (previously it was doing integer/floor division) and fixes __mod__ by removing the incorrect str_rep being passed to numexpr.

@jreback (Contributor) commented Jun 9, 2013

@jtratner see here: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html (the mod/div section). I believe numpy is wrong in returning 0.0000, but numexpr is correct in returning np.inf; that's what I 'fixed' pandas to do (float division is correct on both numpy/ne for all cases), but for some reason numpy does weird things with ints....so we just make it consistent....

nice job on this...
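
For concreteness, the numpy behavior being described above (exact warnings vary by numpy version):

import numpy as np

print(np.array([1.0, 2.0, 3.0]) / 0.0)  # [inf inf inf] -- IEEE float semantics
print(np.array([1, 2, 3]) // 0)         # [0 0 0] -- the 'weird' integer case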

@jtratner (Contributor, Author) commented Jun 9, 2013

Thanks!

On mod/floordiv - can you show me where you implemented that fix? Is that with the com._fill_zeros?

@jreback (Contributor) commented Jun 9, 2013

exactly

@jtratner (Contributor, Author) commented Jun 9, 2013

I think I figured out the issue...didn't include the explicit requirement to fill zeros on all...will fix soon.

@jtratner (Contributor, Author) commented Jun 9, 2013

Also, any reason why __mod__ gets filled with nan?

    __mod__ = _arith_method(operator.mod, '__mod__', default_axis=None, fill_zeros=np.nan)

@jreback (Contributor) commented Jun 9, 2013

I was consistent with how numpy treats floats, so mod was filled with nan, div with inf (as you can see, the integers are screwed up; that's the reason we did it):

In [10]: np.array([1.0,2.0,3.0]) % 0
Out[10]: array([ nan,  nan,  nan])

In [11]: np.array([1,2,3]) % 0
Out[11]: array([0, 0, 0])

@jtratner (Contributor, Author) commented Jun 9, 2013

Okay, this should resolve everything. I have a separate raft of commits that adds truediv and floordiv to all the containers, as well as fixes up fill_zeros (which is still not consistent here). If it's okay with you, I'd rather handle all of those separately, because they make a logical group (and also because the multiple changes make it harder to cherry-pick frame fixes into here).

The key thing was to move the com._fill_zeros call outside of the try/except, because to be fully consistent it needs to apply to every result, even when numexpr didn't do the evaluation.
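
Schematically, the reordering looks like this. It is only a sketch: the surrounding names and the com._fill_zeros signature are simplified here, not the literal pandas source:

from pandas.core import expressions
import pandas.core.common as com

def _arith_op(op, op_str, a, b, fill_zeros=None, **eval_kwargs):
    try:
        result = expressions.evaluate(op, op_str, a, b, raise_on_error=True,
                                      **eval_kwargs)
    except Exception:
        result = op(a, b)  # plain fallback
    # the zero-division fill now runs on every result, whether it came from
    # numexpr or from the fallback path (previously it sat inside the try
    # block, so the fallback skipped it)
    if fill_zeros is not None:
        result = com._fill_zeros(result, b, fill_zeros)  # signature simplified
    return result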

@jreback (Contributor) commented Jun 9, 2013

ok by me

@jreback (Contributor) commented Jun 9, 2013

@jtratner this PR doesn't actually change the default to truediv, right? just makes ne do it if it's already enabled

otherwise will need a big API warning

@jtratner (Contributor, Author) commented Jun 9, 2013

No, div explicitly only does floor division (which is what it was doing before this change, since the lambda x, y: x / y was always executed in the non-from __future__ import division environment). I also didn't change the Python 3 version (where div does true division). It doesn't change the behavior from before (only fixes the __truediv__ and __mod__ methods to not be wrong).

From Python 2 --> Python 3 the behavior of div changes, but that was already the case beforehand.

@jtratner (Contributor, Author) commented:

This is failing because of an issue in #3814 (I think). But I do think it should be incorporated into 0.11.1, given that it fixes a bug (__truediv__ actually doing floor division with two integer DataFrames using numexpr). It doesn't change any behavior (except fixing the broken behavior), nor does it change the public API of anything.

Only add 'div' if not python 3

Also checks dtype in test cases for series by passing truediv=True to evaluate for __truediv__ and truediv=False for __div__.

`%` (modulus) intermittently causes floating point exceptions when used with numexpr.

`//` (floordiv) is inconsistent between unaccelerated and accelerated when dividing by zero (unaccelerated produces 0.0000, accelerated produces `inf`).
@jtratner (Contributor, Author) commented:

@jreback okay, this is definitely ready to go. It doesn't change anything API-wise, except for fixing a bug with using true division with integer arrays of greater than 10,000 cells with numexpr installed. (And we know it's a bug because it means that we'd end up with different output than numpy arrays, etc.)

jreback added a commit that referenced this pull request Jun 18, 2013
jreback merged commit da14c6e into pandas-dev:master on Jun 18, 2013
@jreback (Contributor) commented Jun 18, 2013

thanks @jtratner your contribs are great!

keep em comin!

@jtratner (Contributor, Author) commented:

thanks :)

jtratner deleted the fix_division_with_numexpr branch on September 21, 2013