Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resample with numpy.product gives unexpected results #5586

Closed
ruidc opened this issue Nov 25, 2013 · 12 comments · Fixed by #30444
Closed

resample with numpy.product gives unexpected results #5586

ruidc opened this issue Nov 25, 2013 · 12 comments · Fixed by #30444
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Resample resample method
Milestone

Comments

@ruidc
Copy link
Contributor

ruidc commented Nov 25, 2013

import datetime
import numpy
import pandas

index = pandas.DatetimeIndex(start=datetime.date(2012, 1, 31), freq='M', periods=12)

ts = pandas.Series(range(12), index=index)
df = pandas.DataFrame(dict(A=ts, B=ts+2))

print df

# Correct
print df.resample('M', how=numpy.sum)
# Incorrect
print df.resample('M', how=numpy.product)
# Correct
print df.resample('Q', how=numpy.sum)
# Incorrect
print df.resample('Q', how=numpy.product)

for print df.resample('Q', how=numpy.product), was expecting:

2012-03-31   2*1*0   4*3*2
2012-06-30   5*4*3   7*6*5
2012-09-30   8*7*6  10*9*8
2012-12-31  11*10*9  13*12*11
@jreback
Copy link
Contributor

jreback commented Nov 25, 2013

I think this just needs a mapping like np.prod has (which works correcly). These numpy functions are mapped internally to cython functions.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@dsm054
Copy link
Contributor

dsm054 commented Jun 21, 2014

I had just fixed this when I realized there are lots of other numpy functions which will behave in the same way, e.g.

>>> df.groupby("a").agg(np.max)
    b
a    
10  2
20  3
30  5

but

>>> df.groupby("a").agg(np.nanmax)
     b
a     
10  10
20  20
30  30

Is whitelisting them one by one really our best option?

@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

the nan ones are somewhat newer

I think there are only a few more to add
not really sure of a better way l

maybe could try some sort of regex matching on the name of the function?
(and see what comes up and just map them)

FYI since for example nanmean doesn't exist in older numpies have to conditionally add them

@dsm054
Copy link
Contributor

dsm054 commented Jun 21, 2014

Maybe we could go the other way, and blacklist the ones we haven't done. Admittedly that wouldn't catch .agg(lambda x: np.some_new_function(x)), but it would protect against .agg(np.some_new_function) if we hadn't handled it yet.

But couldn't we just catch anything else in numpy and pass it np.asarray(x), though?

First, check to see if it's been special cased. If so, use it.
Next, check to see if it's a numpy function we haven't whitelisted. If so, call it with np.asarray(x); more likely to give the expected answer.
If not, pass it x.

Vaguely like

>>> np_funcs = {f for n, f in inspect.getmembers(np) if callable(f)}
>>> f = np.product
>>> df.groupby("a").agg(np.prod)
     b
a     
10   2
20   3
30  20
>>> df.groupby("a").agg(np.product)
        b
a        
10    200
20     60
30  18000
>>> f = np.product
>>> df.groupby("a").agg(f if f not in np_funcs else (lambda x: f(np.asarray(x))))
     b
a     
10   2
20   3
30  20

@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

yes I think that how could definitly be more strict in that you could evaluate on cases:

  • if its in our white list map to a pandas fast function and go
  • if its on black list raise
  • if its a lamba pass thru
  • if its a numpy function that is not the above, the use np.asarray (and maybe show a warning?)

@dsm054
Copy link
Contributor

dsm054 commented Jun 21, 2014

I like the warning idea, though I'm not sure what the appropriate label would be: SemiImplementedWarning, maybe? :^)

SemiImplementedWarning: numpy function {} does not have a fast native pandas
equivalent yet and is not fully supported; coercing the groups to ndarrays

Whatever we do, I think the current approach of special-casing a few and letting the rest silently give incorrect results -- even though they're called in exactly the same way the ones which work do -- should be replaced with something noisier.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

you can say that you should use lambda x: np.function_that_we_do_fast as this will pass thru and just let the warning happen (until we put an explicity put it on the whitelist with either a fast version or a simple passthru)

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@mroeschke
Copy link
Member

Looks fixed in master. Could use a test.

@mroeschke mroeschke added Needs Tests Unit test(s) needed to prevent regressions good first issue and removed API Design Bug labels Sep 29, 2019
@jbrockmendel
Copy link
Member

@mroeschke what would we be testing? the df.resample("M", np.product) usage from the OP is no longer supported

@mroeschke
Copy link
Member

df.resample('M').apply(np.product) produces the correct result IIRC.

@baevpetr
Copy link
Contributor

So are tests required here now? if so, could I try to implement them?

@jbrockmendel
Copy link
Member

jbrockmendel commented Dec 18, 2019

@baevpetr yes, that applies to any issue with the "Needs Tests" label

@jreback jreback modified the milestones: Contributions Welcome, 1.0 Dec 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Resample resample method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants